|
 |
Welcome to Bisulfite Sequencing Scorer
(BiSS)
Introduction:
BiSS(Bisulfite Sequencing Scorer) is
an analysis pipeline for the deep sequencing data after bisulfite
conversion (BS-Seq).
Mapping procedure:
using banded Smith-Waterman local alignment
algorithm
implementated in the Graphical Processing Unit based package
NextGenMap.
special asymetric scoring function
3.5 letter hashing table.
Downtream analysis (methylation calling):
using widely-used binomial test
using a comprehensive adaptive error estimate that
accounts for
sequencing errors
erroneous bisulfite conversion
and also
wrongly mapped reads.
Implemented in Perl and R scripts.
Availability:
Latest version BiSS-on-NextGenMap:
- The current version supports the multiple-core computing for
hashing process. Please refer to the old version here.
- Please read the User Manual carefully before using
NextGenMap the first time or when you upgrade the new version!
- The
current version of NextGenMap is currently only tested and released
under Linux.
Download:
How to use BiSS by an example:
Sample input data:
Reference genome in single/multiple FASTA format, e.g.: example_reference.fas
Bisulfite deep sequencing library in FASTA or FASTQ format,
e.g.: example_read.fa
(single-end 56bp short read library). Currently, the pair-end version is on testing phase and will
be released soon.
Download the sample data to your working folder directory.
For lookup table, you can download the default 12-mer
hashing table .
Or create your own k-mer by using a C++ bs_lookup.cpp program.
If you created your own table, you need to compile it first g++ bs_lookup.cpp -o bs_lookup
And then run the program as bs_lookup -k your_k_mer.
Two files bs_k-mer_CT.txt and bs_k-mer_GA.txt will be generated at the same folder.
Parameters:
You need to specify the parameters for NextGenMap as well as for BiSS in
the same parameter file.
You can download an example parameter file: example_parameters .
The order of parameters shown in the example file, the
odd line numbers are for comments (starting with #, for example
#wordsize) that briefly explain the parameter, and is followed by that specific parameter value.
Description of parameter file is as below (please pay attention on the bold parameters that are specficic
for BiSS!)
Parameters |
Description |
Default value |
Parameters |
Description |
Default value |
#word-size |
the k-mer for hashing procedure |
12 |
#Alignment-mode |
Alignment-type option (0:BiSS) |
0 |
#SW-match-score |
SW match score |
4 |
#BiSS-scoreT-T: |
SW match accounts for T-T matches |
4 |
#SW-mismatch-score |
SW mismatch score |
-2 |
#BiSS-scoreC-T: |
SW match accounts for bisulfite conversion C-T (reference-read) |
4 |
#SW-gap-score(read) |
SW penalty for gap in the read |
-10 |
#identity-cutoff(0..1): |
the identity cutoff for short read alignment |
0 |
#SW-gap-score(reference) |
SW penalty for gap in the read |
-10 |
#k-mer-hits_SWcutoff: |
cutoff for SW alignment |
2 |
#expanding-thres/corridor |
extension of the SW local alignment |
10 |
#GPU-ID: |
for setting up which GPU to use, it has to be unique |
0 |
#N-best-hits |
(others: disabled) |
1 |
#Forward/Backward: |
+ or - mapping direction |
2 |
#Minimal-Score |
SW score cutoff |
2 |
#BS-Seq-protocol: |
1: only forward direction (referring Lister et al,
2008), other: both strands |
1 |
#Max-reference-length-per-run |
maximal length of reference for each run |
10000010 |
#Multiple-Hits-allowed |
maximal multiple mapping allowed |
100 |
#Max-number-of-reads-per-run |
maximal number of reads for each run |
300000 |
#Output-format: |
(0== NextGenMap, 1== SAM) |
0 |
#K-mer-step-size |
|
2 |
#BiSS-Watson-lookup |
lookup table file for Watson genome |
lookup/bs_12-mer_CT.txt
|
|
|
|
#BiSS-Crick-lookup |
lookup table file for Crick genome |
lookup/bs_12-mer_GA.txt
|
Running BiSS on NextGenMap package
Please make sure that you have all necessary files (parameter
file, reference genome, short read library) in the working folder and following the below template.
NextGenMap_v0.0.1 <Parameter File> <Reference Genome> <Short read file>
NextGenMap_v0.0.1 example_parameters example_reference.fas
example_read.fa example.output |
Output:
The output format (e.g
example.output.) is NextGenMap output format as below (We also
support to procedure the SAM format (parameter #Output-format) for
visualization)
Example Output:
Seq ID: 0 Match ID: 0 Name: ref_W_35047_0:0:0_0/0 forward
matches at:
35046 at scaffold:
example_reference Score: 224
times: 1 identity: 0.732143 ++ Q_start: 0 Q_end: 0
TTTCTTCTCTCTCTCCTCTTTTGACAACTCCTATACACCAAAATAGGCAATTTAAC
TTTTTTTTCTTTTTTCTTTTTTGATAATTTTTATATATTAAAATAGGTAATTTAAT
For each short read, 4 output lines are reported:
Line 1 |
|
Seq ID: NextGenMap ID.
Match ID: index for each equal scored
alignment.
Name: short read name.
Strand: mapped strand (forward (+) or backward
(-)).
matches at: mapped location in reference
(0-base coordinates).
at scaffold: mapped chromosome.
Score: SW local alignment score.
|
Line 2 |
|
times: number of equally best scored
alignments.
identity: local alignment identity (excluding
any insertion or deletion site).
++/+-/-+/--: indicate the genome strand and
direction of mapping.
Q_start: start mapped location in the short
read.
Q_end: end mapped location in the short read.
|
Line 3 |
|
Aligned Reference: reference sequence of the
alignment.
|
Line 4 |
|
Aligned Read: short read sequence of the
alignment.
|
Downstream analysis
Parsing the NextGenMap output mapping to produce the
alignment profile for each cytosine by a Perl script biss_downstream.pl
perl biss_downstream.pl -r <reference_genome> -m
<output_alignment> -c <identity_cutoff>
(identity_cutoff is to remove the alignment whose less identity from short read SW alignment
excluding C-T(G-A) mismatches.)
perl biss_downstream.pl
-r example_reference.fas -m
example.output -c 85 |
example.output gives the alignment profile for cytosines (both strand, e.g example_W.profile and example_C.profile).
It consists of
4 columns:
(1) corresponding genomic coordinate,
(2) total number of mapped reads,
(3) total number of mapped reads with C,
(4) total number of mapped reads with T
at that coordinate.
Below is one example
(1) (2) (3)
(4)
1058 0
12 12
1068 0
9 9
1070 0
8 9
1088 1
8 9
1097 7
0 7
1101 2
4 7
1102 4
2 6
1107 1
4 5
1112 2
2 4
1118 1
1 2
1125 2
0 2
1132 1
0 1
1138 0
0 0
Performing the methylation calling test using a methylation_calling.R .
methylation_calling.R --profile
<alignment_profile_file>
--output <methylation_profile_file> --errrate
<default_error_rate> [0.04]
--cov <minimal_coverage>
( is the parameter for the error of binomial test used, default by 0.04)
R --vanilla <
methylation_calling.R --profile
outW.profile --output outW.meth -errrate 0.4 --cov 5 |
The program computes the methylation profile (e.g example_W.meth) for the input
alignment profile.
It includes 4 lines: (1) Genomic coordinate, (2) number of mapped reads
with C at that corrdinate,
(3) number of mapped reads with T at that corrdinate, (4) total
coverage, and (5) Benjamini-Hochberg FDR-corrected p.value
Below is one example
(1) (2)
(3) (4) (5)
1058 0
12 12 1
1068 0
9
9 1
1070 0
8
9 1
1088 1
8
9 1
1097 7
0
7 4.70988912692589e-09
1101 2
4
7 0.366540175861658
1102 4
2
6 0.000460400569779035
1107 1
4
5 1
1112 2
2
4 NA
1118 1
1
2 NA
1125 2
0
2 NA
1132 1
0
1 NA
1138 0
0
0 NA
You can customize the
sinlge-cytosine methylation profiles for your own purpose.
Note:
Please, let us know if you download or have questions about using this
pipeline by sending
an email to {fritz.sedlazeck,huy.dinh}@univie.ac.at.
Reference:
The manuscript is currently being prepared for submission:
Huy Q. Dinh, Manu Dubin, Fritz J. Sedlazeck, Nicole Lettner, Ortrun Mittelsten Scheid,
Arndt von Haesler. Advanced methylome analysis after bisulfite deep sequencing: an example in Arabidopsis
Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD,
Berry CC, Millar AH, Ecker JR. Highly integrated
single-base resolution maps of the epigenome in Arabidopsis Cell. 2008 May 2;133(3):523-36.
Supplementary data: An example on real data from floral-bud tissue Arabidopsis
thaliana:
You can download our results on re-analysis of floral bud Arabidopsis ( TAIR7 genome version ) that include the
comparison with previously published data. |
|
[an error occurred while processing this directive]
|
|
 |
 |
 |
|