Publications
Niko Popitsch and Arndt von Haeseler
NGC: lossless and lossy compression of aligned high-throughput sequencing data
Nucl. Acids Res., first published online October 12, 2012, doi:10.1093/nar/gks939
(html), (pdf)
Software
NGC can be downloaded here (for non-commercial use only!):
Start NGC with
java -jar -Xmx4G ngc-core-0.0.1-standalone.jar <params>
Examples
Example: compress a BAM file that was aligned to hg19, using standard compression parameters and 4 GB of dedicated RAM
java -jar -Xmx4G ngc-core-0.0.1-standalone.jar compress -i data.bam -o data.ngc -r hg19.fa
Please note that you can adjust the validation stringency of the Picard SAM/BAM parser used by NGC with the -validationStringency parameter. For example, set it to "SILENT" if NGC/Picard complains about "MAPQ not being zero for unmapped reads" or similar format inaccuracies.
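The stringency setting described above can be folded into a small wrapper. The following is only a sketch: the function name ngc_compress_silent and the variables NGC_JAR, NGC_MEM and DRY_RUN are hypothetical helpers introduced for illustration and are not part of NGC. The command line itself uses only the options shown in the examples on this page.

```shell
#!/bin/sh
# Sketch: compress a BAM with lenient Picard validation.
# NGC_JAR, NGC_MEM and DRY_RUN are hypothetical helper variables, not NGC options.
NGC_JAR="${NGC_JAR:-ngc-core-0.0.1-standalone.jar}"
NGC_MEM="${NGC_MEM:-4G}"

ngc_compress_silent() {
    # $1 = input BAM, $2 = output NGC file, $3 = reference FASTA
    # Build the command as the positional parameters so quoting is preserved.
    set -- java -jar "-Xmx${NGC_MEM}" "$NGC_JAR" compress \
        -i "$1" -o "$2" -r "$3" -validationStringency SILENT
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"   # dry run: only print the command
    else
        "$@"                 # run the command
    fi
}

# Usage (assumes the jar and the input files are in the current directory):
#   ngc_compress_silent data.bam data.ngc hg19.fa
```

Setting DRY_RUN=1 before the call prints the assembled command instead of running it, which may help when checking the parameters before a long compression run.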
Example: decompress the resulting NGC file
java -jar -Xmx4G ngc-core-0.0.1-standalone.jar decompress -i data.ngc -o data-decompressed.bam -r hg19.fa
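After a round trip with standard (lossless) parameters, it may be useful to compare the alignment records of the original and the decompressed file. The sketch below assumes that samtools is installed; the function name records_match and the DUMP_CMD hook are hypothetical and not part of NGC. Whether header lines, optional fields, and read order are reproduced byte for byte is not stated on this page, so a mismatch is a reason to inspect the files, not proof of an error.

```shell
#!/bin/sh
# Sketch: compare the alignment records of two SAM/BAM files by checksum.
# DUMP_CMD is a hypothetical hook; it defaults to `samtools view`
# (assumption: samtools is installed), which prints records without the header.
DUMP_CMD="${DUMP_CMD:-samtools view}"

records_match() {
    # $1, $2 = alignment files to compare
    # $DUMP_CMD is intentionally unquoted so "samtools view" splits into two words.
    a=$($DUMP_CMD "$1" | cksum)
    b=$($DUMP_CMD "$2" | cksum)
    [ "$a" = "$b" ]   # exit status 0 if both checksums agree
}

# Usage:
#   records_match data.bam data-decompressed.bam && echo "records identical"
```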
Example: compress a SAM file and decompress the resulting NGC file, using parameters for (i) various per-base quality quantization schemes, (ii) maximum (bzip2) compression, and (iii) run-length encoding (RLE) of quality values. The read names are dropped, and base qualities are preserved at the variant positions listed in the passed VCF file. Finally, the input SAM/BAM is not validated (fastest option). Please refer to the paper for details about the various quality quantization and compression strategies.
java -jar -Xmx4G ngc-core-0.0.1-standalone.jar compress -i data.sam -r hg19.fa -best -q1levels 30,50 -q2levels standard -qvalRleEncoding -truncateNames -variantList list.vcf -validationStringency SILENT
java -jar -Xmx4G ngc-core-0.0.1-standalone.jar decompress -i data.sam.ngc -r hg19.fa
Evaluation data
The following data sets were used in the NGC evaluation. The data were mapped with bwa, and unmapped reads were pruned. The resulting BAM files are also linked here:
- ChIP-Seq (mouse): SRX014899/SRR032209, [BAM]
- Reseq (human): SRX000376/SRR001471, [BAM]
- RNA-Seq (E. coli): ERX007969/ERR019653, [BAM]
- Reseq (E. coli, paired end): ERX008638/ERR022075, [BAM]
- Reseq (E. coli): SRX118029/SRR402891, [BAM]
- Reseq (A. thaliana): SRX011868/SRR029316, [BAM]
- Exome data set: please contact the author if you need access to the raw data. [BAM]
- Reseq/chr20 (human): NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam; be sure to log in as user gsapubftp-anonymous, see here.