Center for Integrative Bioinformatics Vienna
Max F. Perutz Laboratories
Dr. Bohr Gasse 9
A-1030 Vienna, Austria
# R console: install.packages(pkgs="/home/me/downloads/SNP2GO_1.0.2.tar.gz", type="source") library(SNP2GO) #load the package ?SNP2GO #view the help-page of the package ?snp2go #view the help-page of the snp2go functionNote: The package depends on the packages goProfiles, hash and GenomicRanges. Please make sure that those packages are installed. If not, you can install them typing the following commands in your R console:
# R console: install.packages("hash") source("http://bioconductor.org/biocLite.R") biocLite("goProfiles") biocLite("GenomicRanges")
In the next example we use data from D. melanogaster, but the same work-flow can be applied for any other organism. SNP2GO needs a GFF file (or GTF) and a VCF file(s) containing the SNP coordinates.
A GFF file can be found at the Flybase website. This GFF file contains the gene coordinates and the GO terms associated with them. Alternatively, a GTF file can be download from Ensembl. The GTF format is slightly different to GFF's and do not provide the GO annotation. Therefore, if you are using a GTF file, you will also need to provide SNP2GO with a gene association file. To get a gene association file, go to Ensembl's MartView website and select the "Ensembl Genes" database and the "Drosophila melanogaster genes" dataset. The only attributes you need are the "Ensembl Gene ID" and "GO Term Accession". Download the data and request Ensembl to save it in a tsv-file. This protocol is very useful since many annotated genomes are provided in GTF format.
Finally, you need to provide SNP2GO with a list of SNPs. In our example well will use a single VCF file containing the SNPs. The file can be downloaded here, and consists in a list of SNPs sorted according to a P-value. The top 2,000 SNPs are described in the original study as candidate SNPs and the rest as non-candidate SNPs. You can try yourself to import them into R and run SNP2GO with the D. melanogaster GFF file.
You can copy and paste the next work-flow in your R console:
# R console: # load the SNP2GO package library(SNP2GO) # Read the VCF file and construct a GenomicRanges object: snps <- read.delim("BF15.vcf",header=FALSE,comment.char="#") snps[,2] <- as.numeric(snps[,2]) snps <- GRanges(seqnames=snps[,1],ranges=IRanges(snps[,2],snps[,2])) # Use the first 2000 SNPs as candidate SNPs and the rest as non-candidate SNPs: cand <- snps[1:2000] noncand <- snps[2001:length(snps)] # Case 1: Using a GFF file x <- snp2go(gff="dmel-all-no-analysis-r5.49.gff.gz", candidateSNPs=cand, noncandidateSNPs=noncand) # Case 2: Using a GTF file + gene association file y <- snp2go(gtf="Drosophila_melanogaster.BDGP5.70.gtf.gz", goFile="mart_export_dmel.txt", candidateSNPs=cand, noncandidateSNPs=noncand, FDR=0.05, runs=10000, extension=50) # Get all enriched GO terms of GFF analysis: gff.significant.terms <- x$enriched$GO # Get the first of the enriched GO terms: first.term <- gff.significant.terms # this is "GO:0051726" # Print all regions associated with at least one GO term: print(x$regions) # Print the regions associated with the first of the enriched GO terms: # There are two possibilities to do so: # version 1: print(x$regions[ x$go2ranges[[ "regions" ]][[ "GO:0051726" ]] ]) # version 2: print(x$regions[unlist(as.list(x$go2ranges[["regions"]][first.term]))]) # Although version 2 seems more complicated, it allows to get the regions # associated with more than one term. In the following example, all regions associated # with the first ten enriched GO terms (gff.significant.terms[1:10]) are printed: print(x$regions[unlist(as.list(x$go2ranges[["regions"]][gff.significant.terms[1:10]]))]) # Print the candidate SNPs associated with the first of the enriched GO terms: # Like for the GO regions, there are also two possibilities to do that: # version 1: print(cand[ x$go2ranges[["candidates"]][["GO:0051726"]] ]) # version 2: print(cand[unlist(as.list(x$go2ranges[["candidates"]][first.term]))]) # Print the noncandidate SNPs associated with the first of the enriched GO terms: # version 1: print(noncand[ x$go2ranges[["noncandidates"]][["GO:0051726"]] ]) # version 2: print(noncand[unlist(as.list(x$go2ranges[["noncandidates"]][first.term]))]) # Get number of informative candidates of the GTF analysis: z <- y$informative.candidate.snps # Print the list of all GO terms associated to at least one gene in the # GFF analysis: print(x$goterms) # Store the results in tab-seperated files write.table(file="snp2go_gff.tsv",x$enriched,sep="\t",row.names=F) write.table(file="snp2go_gtf.tsv",y$enriched,sep="\t",row.names=F)
Note: SNP2GO will by default print the significant GO terms having at least 10 genome regions associated to them.This behaviour can be changed by the min.regions parameter.Top
|Runs||Average running time (n=10)|