SeqMiner Quick Tutorial

(Updated May 10, 2014)

To receive email notification of updated manual, please Sign up

SeqMiner (formerly, vcf2geno) is an R package that helps you efficiently extracting sequencing data in the VCF format files. We hope SeqMiner becomes a valuable tool for bioinformaticians, compuational genetists and statistical genetists.

In this tutorial, we will go over some basic SeqMiner functions of handling VCF files. These include gene-based annotations, GERP-score integration and genetic data extraction. These instructions are also available in SeqMiner demos, which can be invoked by > demo("workflow", package = "seqminer").

NOTE First thing to note, SeqMiner requires VCF/BCF files indexed by Tabix.

Gene-based annotation

We will need to prepare annotation parameters. At minimal, we need (1) a reference genome files (FASTA format) and its index (.fa.fai) and (2) gene definitation file (refFlat format)

library(seqminer)
## Loading required package: stringr
param <- list(reference = system.file("tabanno/test.fa", package = "seqminer"), 
    geneFile = system.file("tabanno/test.gene.txt", package = "seqminer"))
param <- makeAnnotationParameter(param)

We have an example VCF file input.demo.vcf and we will specify an output file out.vcf.gz

input <- system.file("tabanno/input.demo.vcf", package = "seqminer")
output <- paste0(getwd(), "/", "out.vcf.gz")
annotateVcf(input, output, param)
## Output file are written to [ /Library/Frameworks/R.framework/Versions/3.1/Resources/library/seqminer/tabanno/out.vcf.gz ].
## NULL

The above messages shows the annotation steps finish successfully.

Annotation and Integration

We will demonstrate how to perform gene-based annotation, region-based annotation and GERP-score integration in just one step. In order to integrate external bioinformatics databas (DB), we use a region-based file as example (BED format). Here are relevant parameters:

  1. we use 'bed=' option to specify a region-based database;
  2. we use 'tabix=' option to specify a tabix-based resource file;
  3. we use 'indexOutput=TRUE' to index outputted VCF file
setwd(system.file("tabanno", package = "seqminer"))
param <- list(reference = "test.fa", geneFile = "test.gene.txt", bed = "REGION=test.bed", 
    tabix = "test.dbNSFP.gz(SIFT=9,PolyPhen=10)", indexOutput = TRUE)
param <- makeAnnotationParameter(param)
input <- "input.demo.vcf"
output <- "out.vcf.gz"
annotateVcf(input, output, param)
## Output file are written to [ out.vcf.gz ].
## NULL

Query genetic data

Now we are able to query the VCF by region or by gene

Extract by region

A quick way to obtain gentoype matrix:

genotype <- readVCFToMatrixByRange(fileName = output, range = "1:1-10", annoType = "")
## 1 region to be extracted.
print(genotype)
## $`1:1-10`
##     NA12891 NA12892
## 1:3       1       0
## 1:5       1       0
## 1:7       1       0

You can also fine-control what fields from VCFs to extract:

genotypeList <- readVCFToListByRange(fileName = output, range = "1:1-10", annoType = "", 
    vcfColumn = c("CHROM", "POS"), vcfInfo = c("ANNO", "SIFT"), vcfIndv = "GT")
print(genotypeList)
## $CHROM
## [1] "1" "1" "1"
## 
## $POS
## [1] 3 5 7
## 
## $ANNO
## [1] "Normal_Splice_Site:GENE1|GENE3" "Nonsynonymous:GENE1|GENE3"     
## [3] "Monomorphic"                   
## 
## $SIFT
## [1] "0.0" "NA"  "NA" 
## 
## $GT
##      [,1]  [,2]  [,3] 
## [1,] "1/0" "1/0" "1/0"
## [2,] "0/0" "0/0" "0/0"
## 
## $sampleId
## [1] "NA12891" "NA12892"

Extract by gene

Another useful feature for seqminer is to extract data by gene. Similar to 'readVCFToMatrixByRange', we use 'readVCFToMatrixByGene':

geneFile = system.file("tabanno/test.gene.txt", package = "seqminer")
genotype <- readVCFToMatrixByGene(fileName = output, geneFile = geneFile, geneName = "GENE1", 
    annoType = "")
## 1 region to be extracted.
print(genotype)
## $GENE1
##      NA12891 NA12892
## 1:3        1       0
## 1:5        1       0
## 1:7        1       0
## 1:13       1       0
## 1:14       1       0
## 1:20       1       0
## 1:22       1       0
## 1:25       1       0
## 1:43       1       0
## 1:44       1       0
## 1:50       1       0
## 1:53       1       0
## 1:53       1       0
## 1:53       1       0
## 1:66       1       0

Similar to readVCFToMatrixByRange, we use readVCFToMatrixByGene:

genotypeList <- readVCFToListByGene(fileName = output, geneFile = geneFile, 
    geneName = "GENE1", annoType = "", vcfColumn = c("CHROM", "POS"), vcfInfo = "ANNO", 
    vcfIndv = "GT")
print(genotypeList)
## $CHROM
##  [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## 
## $POS
##  [1]  3  5  7 13 14 20 22 25 43 44 50 53 53 53 66
## 
## $ANNO
##  [1] "Normal_Splice_Site:GENE1|GENE3"       
##  [2] "Nonsynonymous:GENE1|GENE3"            
##  [3] "Monomorphic"                          
##  [4] "Synonymous:GENE1|GENE3"               
##  [5] "CodonGain:GENE1|GENE3"                
##  [6] "Frameshift:GENE1|GENE3"               
##  [7] "Insertion:GENE1|GENE2|GENE3"          
##  [8] "Stop_Loss:GENE1"                      
##  [9] "Normal_Splice_Site:GENE1|GENE3"       
## [10] "Normal_Splice_Site:GENE1|GENE3"       
## [11] "Essential_Splice_Site:GENE1|GENE3"    
## [12] "StructuralVariation:GENE1|GENE2|GENE3"
## [13] "Nonsynonymous:GENE1"                  
## [14] "Nonsynonymous:GENE1"                  
## [15] "Utr5:GENE3"                           
## 
## $GT
##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10] [,11]
## [1,] "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0"
## [2,] "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0"
##      [,12] [,13] [,14] [,15]
## [1,] "1/0" "1/0" "1/0" "1/0"
## [2,] "0/0" "0/0" "0/0" "0/0"
## 
## $sampleId
## [1] "NA12891" "NA12892"

Contact

SeqMiner is developed by Xiaowei Zhan and Dajiang Liu. We welcome your questions and feedbacks.