NextGenMap (NGM) - The Next Generation of Mapping
The central element of NextGenMap is a banded alignment algorithm that allows the alignment scores to be
adjusted sensibly, for example when evolutionarily diverged reads, or reads with a suspected high error rate,
are mapped to a reference genome. NextGenMap supports both a Smith-Waterman (SW) and a Needleman-Wunsch (NW)
banded alignment computation, so the user can define the maximal number of admissible consecutive
insertions and deletions.
A lookup table is computed that assigns putative genomic positions to a read based on a short exact matching word.
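The idea of such a lookup table can be illustrated with a minimal Python sketch. It is not NGM's implementation: the function names, the dictionary layout and the use of every overlapping word are assumptions made for illustration only.

```python
from collections import defaultdict


def build_lookup_table(reference, k):
    """Map every word of length k in the reference to its 0-based start positions."""
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        table[reference[pos:pos + k]].append(pos)
    return dict(table)


def candidate_positions(read, table, k):
    """Return putative reference start positions for the read, one per word hit."""
    hits = []
    for offset in range(len(read) - k + 1):
        for pos in table.get(read[offset:offset + k], []):
            # Shift the hit back to where the read itself would start.
            hits.append(pos - offset)
    return hits


table = build_lookup_table("ACGTACGTTT", 4)
print(table["ACGT"])                          # positions of the word ACGT
print(candidate_positions("CGTT", table, 4))  # putative positions of a short read
```

Each exact word match proposes one position in the reference; positions proposed by several words of the same read are the most promising candidates for the alignment step.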
NextGenMap requires a PC running a Linux operating system with at least 4 GB of memory.
Furthermore, NGM requires a CUDA-enabled graphics card. See http://en.wikipedia.org/wiki/CUDA for further details.
We strongly suggest using a graphics card with more than 500 MB of memory, especially for 454 reads.
First, install CUDA on the PC on which NextGenMap will run.
To do so, go to the NVIDIA homepage (http://developer.nvidia.com/object/cuda_3_2_downloads.html#Linux).
Download and install the Developer Drivers for Linux. Note that you need root permissions for the installation.
Then extract the NGM package. Open a console, switch to the NGM directory, type make and press Enter.
After a few seconds the NextGenMap executable can be found in the bin folder, together with an example
parameter file.
If you have problems installing or running NextGenMap on your PC, please visit (http://www.cibiv.at/software/ngm).
Reference file: The file containing the reference sequences. This file has to be a (multi) fasta file containing all the sequences the reads should be mapped to.
Read file: The file containing all the reads that should be mapped to the reference. NGM supports fasta and fastq files as input.
Output file: The name of the file in which the mapping results are saved. All reads that could not be mapped to the reference are written to a fasta file named like the output file followed by _notmapped.fa.
A variety of parameters can be specified in the parameter file.
Word length k: Defines the length of the words that must be identical between the read and the reference genome. A small k implies many potential positions in the reference genome. The current version of NextGenMap supports k values from 6 to 12.
s: Width of the banded Smith-Waterman alignment. s specifies the maximal number of consecutive insertions or deletions in the SW alignment. The wobble parameter h, which defines the range of possible distances between two matching words, is directly tied to s and is defined as 1/3 of s.
Word matches: This value specifies the number of words that have to be present in both the read and a region of the reference before the two are aligned. The length of the region is defined as the length of the read plus the corridor width (s).
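How s, h and the word match threshold interact can be pictured with a short Python sketch. It is only an illustration: the integer division used for h, the use of diagonals (reference position minus read offset) to measure the distance between matching words, and all function names are assumptions, not NGM's actual code.

```python
def wobble(s):
    """Wobble h, taken here as one third of the corridor width (rounded down)."""
    return s // 3


def region_length(read_length, s):
    """Length of the reference region considered for one read."""
    return read_length + s


def passes_word_match_filter(hit_diagonals, anchor, s, min_word_matches):
    """Check whether enough word hits lie within h diagonals of the anchor.

    hit_diagonals holds, for each word hit, the reference position minus the
    offset of the word in the read.
    """
    h = wobble(s)
    supporting = [d for d in hit_diagonals if abs(d - anchor) <= h]
    return len(supporting) >= min_word_matches


print(wobble(30), region_length(36, 30))
print(passes_word_match_filter([100, 101, 99, 250], anchor=100, s=30, min_word_matches=3))
```

A wider corridor tolerates larger shifts between matching words, at the price of a longer reference region that has to be aligned.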
Max Scaffold: NextGenMap splits the reference sequences to adjust the search to the memory of the computer.
Default: 10,000,000 nucleotides
Max number of reads: NextGenMap partitions the read data to run the computations on computers with little memory.
Default: 3,000,000 reads
Match score: The value assigned to equal characters in the alignment.
Mismatch score: The value assigned to unequal characters in the alignment.
Deletion in read score: The value assigned to a shift event in the reference.
Insertion in read score: The value assigned to a shift event in the read.
Minimal Score: Any alignment whose Smith-Waterman score is larger than the minimal score is reported.
Word sliding: Defines the distance between the starting points of consecutive words on the read. If the first word starts at position 0 of the read, the second starts at position 0 + word sliding. This parameter strongly influences both the running time and the sensitivity of NextGenMap.
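The effect of the step size can be seen in a small Python sketch (the function name and its return format are hypothetical and serve only as an illustration):

```python
def read_words(read, k, word_sliding):
    """Return (start, word) pairs taken every word_sliding bases along the read."""
    if word_sliding < 1:
        raise ValueError("word_sliding must be at least 1")
    return [(start, read[start:start + k])
            for start in range(0, len(read) - k + 1, word_sliding)]


read = "ACGTACGTAC"
print(read_words(read, 4, 3))       # few words: faster, less sensitive
print(len(read_words(read, 4, 1)))  # every position: slower, more sensitive
```

A larger step produces fewer words per read, which means fewer lookups but also fewer chances to find an exact match in a diverged read.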
Identity: Sets the minimum identity of an alignment. A read that does not fulfill this criterion is discarded. Identity is defined as the number of identical nucleotide pairs divided by the alignment length of the corresponding read.
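As a worked example of this definition, the following Python sketch computes the identity of two aligned sequences. The gap character "-" and the choice to count gap columns in the alignment length are assumptions; the function name is hypothetical.

```python
def alignment_identity(aligned_ref, aligned_read, gap="-"):
    """Identical nucleotide pairs divided by the alignment length."""
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_ref:
        raise ValueError("alignment is empty")
    identical = sum(1 for r, q in zip(aligned_ref, aligned_read)
                    if r == q and r != gap)
    return identical / len(aligned_ref)


# 11 identical pairs in an alignment of length 12
identity = alignment_identity("ACGTACGTACGT", "ACGTACGAACGT")
print(round(identity, 6))
print(identity >= 0.9)  # would pass a minimum identity of 0.9
```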
GPU_Id: If your PC is equipped with one graphics card, this value is always 0. If it has more than one graphics card, you have to assign NextGenMap to the card it should run on. To do so, each instance of NextGenMap must have a different GPU ID. The IDs start at 0 and go up to (number of graphics cards) - 1.
Shortcut: If this value is set to 1, the reads are mapped to the plus strand only. If it is set to 2, the reads are mapped to both strands.
Equal scoring: Specifies the maximum number of equal scoring positions that are reported per read.
Format: NGM supports two output formats, its own and SAM. To enable SAM output, set this value to 1. A value of 0 selects the human readable NGM format.
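To show how the scoring parameters and the corridor width s work together, here is a minimal CPU sketch of a banded Smith-Waterman score computation in Python. It is not the CUDA implementation used by NGM: the band is assumed to be centred on the main diagonal, gaps are scored linearly, and the function name and default scores are chosen for illustration only.

```python
NEG_INF = float("-inf")


def banded_smith_waterman(read, ref, s, match=1, mismatch=-1,
                          deletion=-2, insertion=-2):
    """Best local alignment score using only cells with |i - j| <= s."""
    n, m = len(read), len(ref)
    # Row 0 (empty read prefix): only columns inside the band are reachable.
    prev = [0 if j <= s else NEG_INF for j in range(m + 1)]
    best = 0
    for i in range(1, n + 1):
        curr = [NEG_INF] * (m + 1)
        if i <= s:
            curr[0] = 0
        for j in range(max(1, i - s), min(m, i + s) + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                0,                       # start a new local alignment here
                prev[j - 1] + sub,       # match or mismatch
                prev[j] + insertion,     # extra base in the read
                curr[j - 1] + deletion,  # base missing from the read
            )
            best = max(best, curr[j])
        prev = curr
    return best


read = "ACGTACGT"
ref = "ACGTTACGT"  # one extra T compared to the read
print(banded_smith_waterman(read, ref, s=0))  # no indel fits into the band
print(banded_smith_waterman(read, ref, s=2))  # one deletion in the read is allowed
```

With s=0 only the main diagonal is evaluated, so the alignment stops at the extra base; with s=2 the deletion is bridged and the whole read is aligned. An alignment would then be reported if its score is larger than the Minimal Score.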
An example parameter file is given below:
Deletion in read
Insertion in read
Corridor width of the Alignment (s)
Minimal Score a read has to fulfill in order to be reported
Max Scaffold length per iteration (required because of limited memory)
Max number of reads per iteration (required because of limited memory)
Word sliding step size of the reads
Minimum identity (0-1)
Number of word matches before a read is subjected to an alignment
GPU_Id of the graphic card, which should be used (starting at 0)
Shortcut: if set to 1 just plus strand is calculated
Equal scoring: Maximum number of reported equal scoring positions per read.
Format: 0 == default NGM format, 1 == SAM format
A call of NGM looks like this:
For mapping single end Solexa or 454 reads:
./NextGenMap Parameter_file Reference_file Read_file Output_file
For mapping paired end Solexa sequences (both read files have to be fastq):
./NextGenMap Parameter_file -p Reference_file Read_file1 Read_file2 Output_file
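When NGM is started from a script, the two call forms above can be assembled as argument lists. The following Python sketch does only that; the helper name and the file names are hypothetical, and the argument order is copied from the calls shown above.

```python
def ngm_command(parameter_file, reference_file, read_files, output_file,
                executable="./NextGenMap"):
    """Build the argument list for a single end or paired end NGM call."""
    if len(read_files) == 1:
        return [executable, parameter_file, reference_file,
                read_files[0], output_file]
    if len(read_files) == 2:
        return [executable, parameter_file, "-p", reference_file,
                read_files[0], read_files[1], output_file]
    raise ValueError("expected one read file (single end) or two (paired end)")


# File names below are placeholders.
single = ngm_command("params.txt", "ref.fa", ["reads.fq"], "out.txt")
paired = ngm_command("params.txt", "ref.fa", ["r1.fq", "r2.fq"], "out.txt")
print(" ".join(single))
print(" ".join(paired))
# The list can be passed to subprocess.run(single, check=True) to start NGM.
```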
NGM supports its own human readable format and SAM format.
The output in NGM format is defined per read as follows:
Seq ID: Seq Match ID: id Name: read strand matches at: position at scaffold: chr Score: val
times: num identity: identity ++ Q_start: start Q_end: end
Seq: An artificial ID.
Id: An artificial ID for each equal scoring position.
Read: The name of the read, taken from the input read file.
Strand: Indicates the strand the read was mapped to (either forward (+) or backward (-)).
Position: The mapping position on the plus strand. This value is counted from 0.
Chr: The name of the chromosome or other source, parsed from the reference sequence input file.
Val: The alignment score given the scoring values specified in the parameter file.
Num: The number of equally good scoring positions. This number is counted from 1.
Identity: The number of matching positions divided by the length of the alignment, excluding any insertion or deletion sites.
Q_start: The number of 5’ bases of the read that are not aligned.
Q_end: The number of 3’ bases of the read that are not aligned.
Aligned Reference: The sequence of the reference region this read was aligned to.
Aligned Read: The sequence of the read as aligned to that region of the reference.
Seq ID: 2 Match ID: 0 Name: SRR002323.3.1 forward matches at: 13695 at scaffold: chr2 Score: 126
times: 1 identity: 0.916667 % ++ Q_start: 0 Q_end: 0
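A record like the one above can be read back with a few regular expressions. The following Python sketch is based only on this single example record, so the patterns are assumptions about the layout (for instance, that read and scaffold names contain no whitespace); the function and field names are hypothetical.

```python
import re

HEADER_RE = re.compile(
    r"Seq ID:\s*(?P<seq_id>\d+)\s+Match ID:\s*(?P<match_id>\d+)\s+"
    r"Name:\s*(?P<name>\S+)\s+(?P<strand>\S+)\s+matches at:\s*(?P<position>\d+)\s+"
    r"at scaffold:\s*(?P<scaffold>\S+)\s+Score:\s*(?P<score>-?\d+(?:\.\d+)?)"
)
STATS_RE = re.compile(
    r"times:\s*(?P<times>\d+)\s+identity:\s*(?P<identity>[\d.]+).*?"
    r"Q_start:\s*(?P<q_start>\d+)\s+Q_end:\s*(?P<q_end>\d+)"
)


def parse_ngm_record(text):
    """Parse the header and statistics lines of one NGM format record."""
    header = HEADER_RE.search(text)
    stats = STATS_RE.search(text)
    if header is None or stats is None:
        raise ValueError("text does not look like an NGM record")
    record = {
        "name": header["name"],
        "strand": header["strand"],
        "scaffold": header["scaffold"],
        "score": float(header["score"]),
        "identity": float(stats["identity"]),
    }
    for key in ("seq_id", "match_id", "position"):
        record[key] = int(header[key])
    for key in ("times", "q_start", "q_end"):
        record[key] = int(stats[key])
    return record


example = (
    "Seq ID: 2 Match ID: 0 Name: SRR002323.3.1 forward matches at: 13695 "
    "at scaffold: chr2 Score: 126\n"
    "times: 1 identity: 0.916667 % ++ Q_start: 0 Q_end: 0"
)
print(parse_ngm_record(example))
```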