Genome centers around the world have at last found something that we agree on. It seems like only a few months ago that I first heard of SAM (Sequence Alignment/Map) and his brother BAM (binary/compressed SAM). Yet those who produce next-gen sequencing data are rapidly adopting this universal short read data format as the de facto standard.
The Need for a Standard Short Read Data Format
From my work with short read aligners over the past year, it was quite obvious that we needed a standard format. More than a dozen alignment tools for next-gen data have been released – Maq, Bowtie, Novoalign, SeqMap, BFAST, and SHRiMP, just to name a few – and every one of them has its own custom format. The closest thing to a standard was Maq’s MAP format, a non-human-readable file that reached prominence only because Maq was the most widely used tool out there.
NGS Community Looks to Heng Li
The need for a standard is almost certainly not the only reason for the rapid adoption of the SAM/BAM specification. Another key factor is Heng Li, of Richard Durbin’s lab at Sanger. His work in developing Maq, and his reputation as a scientist, earned him wide credibility in the NGS community. While Maq was, and continues to be, a powerful tool for analyzing Illumina and ABI SOLiD data, newer algorithms based on the Burrows-Wheeler Transform (BWT) proved orders of magnitude faster at mapping reads. Bowtie was one of the first; soon after, the makers of SOAP released SOAP2, a revamped version of their short read aligner that uses Burrows-Wheeler and is no longer open source.
Heng’s group soon developed a sequel to Maq, called BWA, that leverages this algorithm to speed up read alignments. It also produces SAM format output by default, a key feature that encouraged friends of Maq to make the switch. Richard Durbin’s group worked with a number of institutions to develop the SAM/BAM specification. Perhaps even more important was the development (and recent publication) of SAMtools, a suite of programs for manipulating SAM/BAM files and calling variants. By working with a standard alignment format, SAMtools can serve as a generic, aligner-agnostic variant detection pipeline.
Adoption of SAM/BAM by the sequencing community has been surprisingly rapid and widespread. A few short read aligners like Maq and Bowtie already have tools for converting their native formats to SAM. The latest Novoalign has a SAM output option built in. There’s even a SAM/BAM converter for 454 data, which sounds like a dicey prospect given the homopolymer undercalling/overcalling problem. Large-scale sequencing efforts like the 1000 Genomes Project and The Cancer Genome Atlas now consider BAM the standard format for data exchange. Many variant calling algorithms, like those developed at the Broad Institute, now accept BAM input as well.
The rush to make SAM/BAM a standard surprised me, but I don’t think it’s a bad thing. We needed a way to share NGS data that could be easily compressed, but also had a human-readable form. My only concern is this: many people may not realize that the data in a SAM/BAM file is aligner-dependent. There’s no such thing as a SAM file for an Illumina lane – instead, there’s one file for Maq alignments, one file for Bowtie, one file for BWA, etc. The good news, however, is that we now have a common format for direct head-to-head comparisons of short read aligners.
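To make the "human-readable" and "aligner-dependent" points concrete, here is a minimal Python sketch of reading SAM text by hand. It assumes the eleven mandatory tab-separated columns as named in the current SAM specification, and it looks for the optional @PG header line, which is where an aligner can record its name. The function names, the example header, and the example read are made up for illustration; real files may omit @PG entirely, and for real work a dedicated library or SAMtools is the better choice.

```python
# Column names for the 11 mandatory SAM fields (per the current SAM spec).
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated columns")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything past column 11 is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = cols[len(SAM_FIELDS):]
    return record


def aligner_names(header_lines):
    """Return program names found on @PG header lines, if any are present."""
    names = []
    for line in header_lines:
        if not line.startswith("@PG"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields)
        # PN (program name) is optional; fall back to the ID tag.
        names.append(tags.get("PN", tags.get("ID", "unknown")))
    return names


# Hypothetical header and read, invented for this example.
example_header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@PG\tID:bwa\tPN:bwa",
]
example_read = "\t".join(
    ["read1", "0", "chr1", "100", "37", "36M", "*", "0", "0",
     "A" * 36, "I" * 36, "NM:i:0"]
)

record = parse_alignment(example_read)
print(aligner_names(example_header), record["RNAME"], record["POS"], record["CIGAR"])
```

Running the same check on the Maq, Bowtie, and BWA files for a single lane is one quick way to confirm which aligner produced which file before comparing them.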