I’ll be attending the coveted Marco Island meeting early next month (February 4-8), where I’ll present a poster on my evaluations of short read aligners for next-gen sequencing data. As you might infer from our AML cancer genome paper, Maq has been the central alignment tool here for over a year. This may not always be the case, because the longer reads (75-100 bp) promised by Illumina/Solexa may eventually reach lengths where the Maq algorithm no longer holds the advantage. Heng Li, Maq’s developer, is already working on a new aligner, currently in beta, that uses the Burrows-Wheeler transform. Has anyone looked at it yet? I’m curious about it, but haven’t had time yet.
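For readers who haven’t met the Burrows-Wheeler transform, here is a minimal Python sketch of the transform itself. It says nothing about how Heng Li’s new aligner (or any other tool) actually uses it; the function name `bwt` and the `$` sentinel are my own choices for illustration, and sorting every rotation like this is only practical for toy inputs. Real tools build the transform from a suffix array instead.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text (naive, for illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    # Build every rotation of the string, sort them, and keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("banana"))
```

The appeal for alignment is that the transformed text groups similar contexts together, which allows a compact index of the reference genome.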
Alignment Programs Evaluated
Here’s a partial list of the short read aligners that I’m evaluating for this poster. I listed 10 aligners in my accepted AGBT abstract, but I expect there will be some changes.
- Maq – obviously we’re evaluating Maq, not only as a benchmark for other aligners, but to better understand the results we’re getting from it. While we run ELAND (Illumina’s aligner) as well, more and more of our runs are paired-end, an area in which Maq is far stronger. Also, have you ever tried to look at ELAND output? It’s incomprehensible to me. No, it’s safe to say that we decided to gamble on Maq over a year ago, and so far, the bet has paid off.
- Novoalign – this is Colin Hercus’ alignment tool, already at v2.0. Its speed is at worst comparable to Maq’s (in single-end mode), and it does offer paired-end alignment. High marks for usability and for allowing gaps in single-end alignments.
- Bowtie – an aligner from Steven Salzberg’s group that claims to be 35x faster than Maq. My colleague Todd Wylie has evaluated Bowtie in some depth. Sadly, no paired-end mode yet.
- cross_match – the classic pairwise aligner has seen some dramatic performance changes to address next-gen data. Still waiting for the usability and documentation to catch up.
- RMAP – one of the few aligners (other than Maq and Novocraft’s Novoalign) that makes use of quality scores during alignment, RMAP shows promise. Unfortunately, there have been no updates since the initial release, and I hear through the grapevine that the authors have abandoned the project.
- SOAP – this tool has seen the most dramatic changes since I began my evaluation. Initially, I had several problems with SOAP v1 (couldn’t get PE mode to work, for example), and its practice of scanning reads into memory was rather slow. However, SOAP v2 has significant performance improvements (PE works too), and I see that BGI is also developing SNP and indel callers. This is probably a tool to watch.
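Several entries above mention using quality scores during alignment. As a rough illustration of the idea, here is a simplified Python sketch that ranks ungapped placements by the sum of Phred qualities at mismatched bases, so a mismatch at a low-quality base costs less than one at a high-quality base. This is not the actual scoring code of Maq, Novoalign, or RMAP; the function names, the brute-force scan, and the toy data are all assumptions made for the example.

```python
def mismatch_quality_sum(read, quals, ref_window):
    """Sum the Phred qualities at positions where the read disagrees with the reference."""
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, quals and ref_window must have equal length")
    return sum(q for base, q, ref in zip(read, quals, ref_window) if base != ref)


def best_ungapped_hit(read, quals, reference):
    """Try every ungapped placement; return (position, score) with the lowest sum.

    Returns None if the read is longer than the reference. Ties go to the
    leftmost position.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        score = mismatch_quality_sum(read, quals, window)
        if best is None or score < best[1]:
            best = (pos, score)
    return best


if __name__ == "__main__":
    # Toy data: the read's last base has low quality (5), the others are high (30).
    print(best_ungapped_hit("ACGA", [30, 30, 30, 5], "ACGTTCGA"))
```

In the toy reference, positions 0 and 4 each differ from the read by a single base, but the mismatch at position 0 falls on the low-quality base, so that placement wins. An aligner that ignores qualities would see the two hits as equally good.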
Alignment Metrics and Comparisons
So what do I look for in a short read aligner? Obviously speed is a consideration, since we’re generating ever-more-overwhelming amounts of data. Usability and compatibility with our in-house platforms (notably Illumina/Solexa) are just as important. And because we have a pipeline in place already, I’m looking for aligners that can beat Maq – in performance, features, or sensitivity – and that’s not easy to do. Maq is fast and does quality-based alignment, single or paired-end, assembly, SNP calling… there’s a reason why the rest of the industry seems to be conforming to it. Furthermore, Maq is well documented and (thus far) consistently updated. The latter point is, I think, a very serious consideration. We have no use for a tool that was developed once just to get a publication and will never see future improvements.
The Advantage of Open Source
Maq is open source, too, which is certainly not a requirement for a next-gen aligner, though it’s a strong selling point. My former colleague Brian Dunford-Shore used to delve into the code of earlier Maq releases when we encountered a problem. Even now that the codebase seems to be more robust, it’s still useful to be able to look at the Maq code (and .map file format) to develop our own ancillary tools. It’s safe to say that no matter how good the aligner, we’ll almost certainly use more than one in order to build the most comprehensive pipeline.