As promised, I’m sharing the results of my short read aligners poster from AGBT 2009. There was a healthy amount of interest in this topic at the meeting; I spoke with numerous people who were mostly using Maq but wanted to explore other options for short read alignment. Some were interested in colorspace-capable aligners, presumably for SOLiD data, while others wanted a tool that could align across gaps (indels) in single-end mode.
The Short Read Aligner Comparison
In the end I presented a comparison of ten short read aligners: BFAST, Bowtie, CELL (CLCbio), cross_match, Maq, Novoalign (Novocraft), RMAP, SeqMap, SHRiMP, and SOAP. These are, of course, only a subset of the tools currently available – several people at AGBT asked about other aligners – but they’re a good sampling of different approaches to the same problem.
I focused on three data sets, all of which were based on 36-bp Illumina/Solexa paired-end libraries sequenced here as part of the 1000 Genomes Project:
- 1 million simulated read pairs from C. elegans
- 1 million simulated read pairs from Hs36
- 1 million real Illumina/Solexa read pairs from a YRI sample
Speed and Accuracy of Aligners
One trend was immediately apparent: the aligners that used Burrows-Wheeler Transform (BWT) indexing of the reference sequence (Bowtie and SOAP) were consistently faster, especially at the human genome scale. It’s not surprising that Heng Li is now focusing on BWA as opposed to Maq. Another trend, one that I find rather frightening, is this: when I introduced SNPs and indels, almost every aligner misplaced 18% of the reads, meaning it placed them uniquely but at the wrong location. This has important implications for variant detection, since even single base changes can have a dramatic effect.
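To make "misplaced" concrete, here is a minimal sketch of how reads from a simulation could be scored against their known origin. This is not the script behind the poster: the function names, the input layout, and the 5-bp position tolerance (to allow for start coordinates shifting around indels) are all assumptions chosen for illustration.

```python
from collections import Counter

def classify_read(true_pos, hits, tolerance=5):
    """Classify one simulated read (hypothetical helper).

    true_pos  -- (chrom, start) the read was simulated from
    hits      -- list of (chrom, start) placements an aligner reported
    tolerance -- assumed slack, in bp, for starts shifted by indels
    """
    if not hits:
        return "unplaced"
    if len(hits) > 1:
        return "ambiguous"  # not uniquely placed
    chrom, start = hits[0]
    true_chrom, true_start = true_pos
    if chrom == true_chrom and abs(start - true_start) <= tolerance:
        return "correct"
    return "misplaced"  # unique placement, wrong location

def summarize(truth, alignments, tolerance=5):
    """Count reads per category; reads absent from `alignments` are unplaced."""
    return Counter(
        classify_read(pos, alignments.get(name, []), tolerance)
        for name, pos in truth.items()
    )

# Toy data, made up purely for the example
truth = {
    "r1": ("chrI", 100),
    "r2": ("chrI", 500),
    "r3": ("chrII", 40),
    "r4": ("chrII", 900),
}
alignments = {
    "r1": [("chrI", 102)],                 # within tolerance
    "r2": [("chrIII", 500)],               # unique but wrong
    "r3": [("chrII", 40), ("chrV", 77)],   # two placements
}
counts = summarize(truth, alignments)
```

The misplaced category is the worrying one: those reads look like confident unique hits, so a downstream variant caller has no obvious reason to distrust them.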
Aligners that Disappointed
I noticed very quickly that SHRiMP was the slowest aligner, and given that it’s geared toward SOLiD (a platform we’re not heavily invested in), it was easy to drop SHRiMP from consideration. Another downer we knew about in advance was RMAP, whose authors abandoned their project after getting the publication out.
Aligners that Surprised Me
There’s a close relationship between our genome center and David Gordon / Phil Green, which no doubt accounts for why cross_match continues to be used. Months ago Gordon came and touted the latest cross_match revision as something that was faster than Maq. Of course, he pointed out that you had to adjust several parameters from their default settings to make this happen. Did it turn out to be true? Not necessarily: in a few tests cross_match was faster than Maq and in a few it was slower, though the two were always comparable. Where cross_match shone was sensitivity at mutated sites: in the C. elegans simulation, it successfully placed more reads in single-end mode than the other aligners.
Most Promising Aligners
Maq is our current tool of choice, but we’re looking closely at Novoalign (a Maq-like tool with more sensitivity), as well as Bowtie/SOAP for speed considerations. For details, see the “Short Read Aligners” section of my blog.