SNP2GO User Guide
You can download the program here and install it using R's install.packages function:
# R console:
library(SNP2GO) #load the package
?SNP2GO #view the help-page of the package
?snp2go #view the help-page of the snp2go function
Note: The package depends on the packages goProfiles, hash and GenomicRanges. Please make sure that these packages are installed. If they are not, you can install them by typing the following commands in your R console:
# R console:
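A minimal sketch of such commands, assuming a current R setup in which hash is installed from CRAN and goProfiles and GenomicRanges are installed from Bioconductor. The BiocManager helper package is an assumption of this sketch and is not named elsewhere in this guide:

```r
# hash is distributed on CRAN
install.packages("hash")

# goProfiles and GenomicRanges are Bioconductor packages.
# BiocManager is assumed here as the installer (not named in this guide).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("goProfiles", "GenomicRanges"))
```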
candidateSNPs: A GenomicRanges object containing the coordinates of the candidate SNPs.
noncandidateSNPs: A GenomicRanges object containing the coordinates of the non-candidate SNPs.
gff: Path to the GFF file. The GFF file must contain the attribute "Ontology_term".
gtf: Path to the GTF file.
goFile: Path to a file containing the gene ID --> GO ID associations. The gene IDs must be the same as the gene IDs used in the GTF file. This parameter is needed only when the "gtf" parameter is used.
runs: Number of times the program will repeat the random sampling. Default: 100,000.
FDR: Significance threshold for the Fisher's exact test after false discovery rate (FDR) correction. SNP2GO reports GO terms with FDR values lower than the value set by the user. Default: 0.05.
extension: Number of nucleotides upstream and downstream of the gene that are included in the extended definition of the genome region. Default: 0 (i.e., gene region = gene).
min.regions: Minimum number of regions associated with a GO term. SNP2GO reports only GO terms associated with at least min.regions regions. Default: 10.
verbose: Progress report. Default: TRUE.
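To illustrate what the extension parameter means, the following base R sketch extends toy gene coordinates on both sides and checks whether a SNP position falls inside a region. The coordinates, the helper names extend.regions and in.any.region, and the restriction to a single chromosome are assumptions made for illustration; SNP2GO itself works on GenomicRanges objects.

```r
# Toy gene coordinates on one chromosome (hypothetical values)
genes <- data.frame(id = c("geneA", "geneB"),
                    start = c(1000, 5000),
                    end = c(2000, 5600))

# Extend each gene by `extension` nucleotides on both sides
extend.regions <- function(genes, extension = 0) {
  genes$start <- genes$start - extension
  genes$end <- genes$end + extension
  genes
}

# TRUE if position `pos` lies inside at least one region
in.any.region <- function(pos, regions) {
  any(pos >= regions$start & pos <= regions$end)
}

snp.pos <- 2030  # 30 nucleotides downstream of geneA
in.any.region(snp.pos, extend.regions(genes, extension = 0))   # FALSE
in.any.region(snp.pos, extend.regions(genes, extension = 50))  # TRUE
```

With extension = 0 the SNP is not assigned to any gene region; with extension = 50 it falls into the extended region of geneA.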
SNP2GO returns a named list containing the following elements:
candidate.snps: Number of candidate SNPs provided by the user.
noncandidate.snps: Number of non-candidate SNPs provided by the user.
informative.candidate.snps: Number of candidate SNPs that can be associated with a GO term.
informative.noncandidate.snps: Number of non-candidate SNPs that can be associated with an annotated GO term.
FDR: FDR value set by the user.
runs: Number of times SNP2GO repeated the hypergeometric sampling.
extension: Number of nucleotides upstream and downstream of the gene used to extend the definition of the genome region.
goterms: The list of all GO terms analysed by SNP2GO.
regions: A GenomicRanges object that contains all regions of the input file that are associated with at least one GO term. These regions are extended by the number of nucleotides specified by the extension parameter.
go2ranges: A list that contains three hash tables that store the mappings of GO terms to (1) GO regions (i.e. genes), (2) candidate SNPs and (3) noncandidate SNPs:
go2ranges[["regions"]][[g]] contains the indices of all GO regions, go2ranges[["candidates"]][[g]] the indices of all candidate SNPs, and go2ranges[["noncandidates"]][[g]] the indices of all noncandidate SNPs associated with GO term g.
The indices of the GO regions refer to the regions element returned by the snp2go function; the indices of the candidate and noncandidate SNPs refer to the candidate and noncandidate SNPs passed as GenomicRanges objects to the snp2go function.
enriched: A data frame that reports the significant GO terms.
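The index mapping stored in go2ranges can be pictured with a small base R sketch. Plain data frames and named lists stand in for the GenomicRanges objects and hash tables that SNP2GO actually returns, and the GO IDs and coordinates are made up for illustration:

```r
# Stand-in for x$regions: three GO regions (hypothetical coordinates)
regions <- data.frame(chr = c("2L", "2L", "3R"),
                      start = c(100, 900, 400),
                      end = c(500, 1500, 800))

# Stand-in for the candidate SNPs passed to snp2go
cand <- data.frame(chr = c("2L", "2L", "3R", "3R"),
                   pos = c(120, 1000, 450, 9000))

# Stand-in for x$go2ranges: named lists instead of hash tables.
# Each (toy) GO term maps to integer indices into `regions` or `cand`.
go2ranges <- list(
  regions    = list("GO:toy1" = c(1L, 3L), "GO:toy2" = 2L),
  candidates = list("GO:toy1" = c(1L, 3L), "GO:toy2" = 2L)
)

g <- "GO:toy1"
# Data frames are subset by row here; a GenomicRanges object would be
# subset with a single index, e.g. x$regions[idx]
regions[go2ranges[["regions"]][[g]], ]   # rows 1 and 3 of `regions`
cand[go2ranges[["candidates"]][[g]], ]   # rows 1 and 3 of `cand`
```

The fourth toy SNP lies in no GO region, so no GO term refers to its index.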
The enriched data frame provides the following information for each significant GO term:
- GO: ID of the significant GO term.
- p.L: Proportion of iterations in which the hypergeometric sampling found at most as many candidate regions as observed.
- p.G: Proportion of iterations in which the hypergeometric sampling found at least as many candidate regions as observed.
- g: Number of genomic regions associated with the significant GO term having at least one candidate SNP.
- G: Total number of genomic regions associated with the significant GO term.
- nc: Number of candidate SNPs located in the G regions.
- mc: Number of non-candidate SNPs located in the G regions.
- P: P-value of the Fisher's exact test.
- FDR: Adjusted P-value after applying the Benjamini-Hochberg method.
- GO.def: Definition of the significant GO term.
- Child.GOs: Child terms that are also significantly enriched with candidate SNPs.
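The P and FDR columns can be illustrated with base R's fisher.test and p.adjust. The sketch below is not SNP2GO's code: the counts are made up, and both the layout of the 2x2 table and the one-sided alternative are assumptions about how such an enrichment test is typically set up.

```r
# Toy counts per GO term (hypothetical numbers):
# nc = candidate SNPs, mc = non-candidate SNPs inside the term's regions
terms <- data.frame(GO = c("GO:toy1", "GO:toy2", "GO:toy3"),
                    nc = c(40, 12, 5),
                    mc = c(400, 900, 450))

# Totals of informative candidate and non-candidate SNPs (hypothetical)
total.cand <- 200
total.noncand <- 20000

# One-sided Fisher's exact test per term on the assumed 2x2 table
#                   in term   not in term
#   candidate         nc      total.cand - nc
#   non-candidate     mc      total.noncand - mc
terms$P <- mapply(function(nc, mc) {
  tab <- matrix(c(nc, mc, total.cand - nc, total.noncand - mc), nrow = 2)
  fisher.test(tab, alternative = "greater")$p.value
}, terms$nc, terms$mc)

# Benjamini-Hochberg adjustment, then keep terms below the FDR threshold
terms$FDR <- p.adjust(terms$P, method = "BH")
enriched <- terms[terms$FDR < 0.05, ]
print(enriched)
```

In this toy data 20% of the candidate SNPs but only 2% of the non-candidate SNPs fall into the first term, so that term passes the FDR threshold.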
In the following example we use data from D. melanogaster, but the same workflow can be applied to any other organism. SNP2GO needs a GFF (or GTF) file and one or more VCF files containing the SNP coordinates.
A GFF file can be found at the FlyBase website. This GFF file contains the gene coordinates and the GO terms associated with them. Alternatively, a GTF file can be downloaded from Ensembl. The GTF format differs slightly from GFF and does not provide the GO annotation. Therefore, if you are using a GTF file, you also need to provide SNP2GO with a gene association file. To get a gene association file, go to Ensembl's MartView website and select the "Ensembl Genes" database and the "Drosophila melanogaster genes" dataset. The only attributes you need are "Ensembl Gene ID" and "GO Term Accession". Download the data and ask Ensembl to save it as a TSV file. This protocol is very useful because many annotated genomes are provided in GTF format.
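Before running SNP2GO with a GTF file, it can help to check that the gene association file really uses the gene IDs of the GTF file. The following base R sketch does this on inline toy data. The file content, the presence of a header line, the IDs and the variable names are assumptions for illustration, not a specification of what SNP2GO expects.

```r
# Inline stand-in for the downloaded TSV file (hypothetical content)
tsv <- "Ensembl Gene ID\tGO Term Accession
FBgn0000001\tGO:0000010
FBgn0000001\tGO:0000020
FBgn0000002\tGO:0000010
FBgn0000003\t"

assoc <- read.delim(text = tsv, header = TRUE, stringsAsFactors = FALSE,
                    col.names = c("gene", "go"))

# Drop genes that have no GO annotation (empty second column)
assoc <- assoc[!is.na(assoc$go) & assoc$go != "", ]

# Gene IDs as they appear in the GTF file (hypothetical)
gtf.genes <- c("FBgn0000001", "FBgn0000002", "FBgn0000004")

# Fraction of GTF genes with at least one GO term in the association file
mean(gtf.genes %in% assoc$gene)

# GO terms grouped by gene
split(assoc$go, assoc$gene)
```

A fraction close to zero would suggest that the two files use different gene ID schemes.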
Finally, you need to provide SNP2GO with a list of SNPs. In our example we will use a single VCF file containing the SNPs. The file can be downloaded here, and consists of a list of SNPs sorted by P-value. The top 2,000 SNPs are described in the original study as candidate SNPs and the rest as non-candidate SNPs. You can try importing them into R yourself and running SNP2GO with the D. melanogaster GFF file.
You can copy and paste the following workflow into your R console:
# R console:
# load the SNP2GO package
library(SNP2GO)
# Read the VCF file and construct a GenomicRanges object:
snps <- read.delim("BF15.vcf",header=FALSE,comment.char="#")
snps[,2] <- as.numeric(snps[,2])
snps <- GRanges(seqnames=snps[,1],ranges=IRanges(snps[,2],snps[,2]))
# Use the first 2000 SNPs as candidate SNPs and the rest as non-candidate SNPs:
cand <- snps[1:2000]
noncand <- snps[2001:length(snps)]
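# Case 1: Using a GFF file
x <- snp2go(gff="dmel-all-no-analysis-r5.49.gff.gz",
            candidateSNPs=cand,
            noncandidateSNPs=noncand)
# Case 2: Using a GTF file + gene association file
# ("gene_association.tsv" is a placeholder name for the TSV file saved from Ensembl)
y <- snp2go(gtf="Drosophila_melanogaster.BDGP5.70.gtf.gz",
            goFile="gene_association.tsv",
            candidateSNPs=cand,
            noncandidateSNPs=noncand)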
# Get all enriched GO terms of GFF analysis:
gff.significant.terms <- x$enriched$GO
# Get the first of the enriched GO terms:
first.term <- gff.significant.terms[1] # this is "GO:0051726"
# Print all regions associated with at least one GO term:
print(x$regions)
# Print the regions associated with the first of the enriched GO terms:
# There are two possibilities to do so:
# version 1:
print(x$regions[ x$go2ranges[[ "regions" ]][[ "GO:0051726" ]] ])
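# version 2 (collects the indices term by term):
print(x$regions[ unlist(lapply(gff.significant.terms[1], function(g) x$go2ranges[["regions"]][[as.character(g)]])) ])
# Although version 2 looks more complicated, it also works for more than one
# term. In the following example, all regions associated with the first ten
# enriched GO terms (gff.significant.terms[1:10]) are printed:
print(x$regions[ unique(unlist(lapply(gff.significant.terms[1:10], function(g) x$go2ranges[["regions"]][[as.character(g)]]))) ])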
# Print the candidate SNPs associated with the first of the enriched GO terms:
# Like for the GO regions, there are also two possibilities to do that:
# version 1:
print(cand[ x$go2ranges[["candidates"]][["GO:0051726"]] ])
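# version 2:
print(cand[ unlist(lapply(gff.significant.terms[1], function(g) x$go2ranges[["candidates"]][[as.character(g)]])) ])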
# Print the noncandidate SNPs associated with the first of the enriched GO terms:
# version 1:
print(noncand[ x$go2ranges[["noncandidates"]][["GO:0051726"]] ])
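# version 2:
print(noncand[ unlist(lapply(gff.significant.terms[1], function(g) x$go2ranges[["noncandidates"]][[as.character(g)]])) ])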
# Get number of informative candidates of the GTF analysis:
z <- y$informative.candidate.snps
# Print the list of all GO terms associated with at least one gene in the
# GFF analysis:
print(x$goterms)
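# Store the results in tab-separated files
# (the output file names are placeholders)
write.table(x$enriched, file="snp2go_gff.tsv", sep="\t", row.names=FALSE)
write.table(y$enriched, file="snp2go_gtf.tsv", sep="\t", row.names=FALSE)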
Note: By default, SNP2GO prints only the significant GO terms that have at least 10 genome regions associated with them. This behaviour can be changed with the min.regions parameter.
Running times were measured on one core of a PC with an Intel Core i5 processor and 8 GB of main memory. Computing time depends on the number of significant GO terms because a second test is carried out for these terms. The following table summarizes the average running times using 2,000 candidate SNPs and 1.6 million non-candidate SNPs distributed across the D. melanogaster genome.
|Runs|Average running time (n=10)|
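The second test, which yields the p.L and p.G columns, is repeated runs times for each significant GO term, so its cost grows with both numbers. The following base R sketch shows one way such a sampling test could look. It assumes that the candidate labels are redistributed at random over all SNPs of a term; the data and the helper sample.g are made up, and this is an illustration of the idea rather than SNP2GO's actual implementation.

```r
set.seed(1)  # reproducible toy run

# Toy GO term with G = 5 regions; each SNP of the term is labelled with
# the index of the region it lies in (hypothetical data)
snp.region <- rep(1:5, times = c(30, 10, 25, 5, 30))
is.cand <- rep(FALSE, length(snp.region))
is.cand[c(1:6, 31:32)] <- TRUE   # 8 candidates, all in regions 1 and 2

nc <- sum(is.cand)                            # candidate SNPs in the term
g.obs <- length(unique(snp.region[is.cand]))  # regions with >= 1 candidate

# Redistribute the nc candidates at random over the term's SNPs and
# count how many regions receive at least one candidate
sample.g <- function() {
  length(unique(sample(snp.region, nc)))
}

runs <- 10000
g.sim <- replicate(runs, sample.g())

p.L <- mean(g.sim <= g.obs)  # sampling found as many or fewer regions
p.G <- mean(g.sim >= g.obs)  # sampling found as many or more regions
```

Because the toy candidates are concentrated in two of the five regions, p.L comes out small and p.G close to 1. Runs in which the sampled count equals the observed one are counted in both proportions, so p.L and p.G always add up to at least 1.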