I’ll be attending the coveted Marco Island meeting early next month (February 4-8), where I’ll present a poster on my evaluations of short read aligners for next-gen sequencing data. As you might infer from our AML cancer genome paper, Maq has been the central alignment tool here for over a year. This may not always be the case, because the longer reads (75-100 bp) promised by Illumina/Solexa may eventually reach lengths where the Maq algorithm is no longer superior. Heng Li, Maq’s developer, is already working on a new aligner, currently in beta, that uses the Burrows-Wheeler transform. Has anyone looked at it yet? I’m curious about it, but don’t yet have time.
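For readers who haven't met the Burrows-Wheeler idea before, here is a minimal Python sketch of the transform and its inverse. It uses the naive sorted-rotations construction, which is only practical for tiny strings. It is not how any real aligner builds its index, and the function names and the `$` end marker are my own choices for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    # Build every rotation of the string and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original text is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))                  # annb$aa
print(inverse_bwt(bwt("ACGTACGT")))   # ACGTACGT
```

The point of the transform is that it is reversible and tends to group identical characters together, which is what makes compact, searchable indexes of a reference sequence possible.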
Alignment Programs Evaluated
Here’s a partial list of the short read aligners that I’m evaluating for this poster. I listed 10 aligners in my accepted AGBT abstract, but I expect there will be some changes.
- Maq – obviously we’re evaluating Maq, not only as a benchmark for other aligners, but to better understand the results we’re getting from it. While we run ELAND (Illumina’s aligner) as well, more and more of our runs are paired-end, an area in which Maq is far stronger. Also, have you ever tried to look at ELAND output? It’s incomprehensible to me. No, it’s safe to say that we decided to gamble on Maq over a year ago, and so far, the bet has paid off.
- Novoalign – this is Colin Hercus’ alignment tool, already in v2.0. Its speed is at worst comparable to Maq’s (in single-ended mode), and it does offer paired-end alignment. High marks for usability and allowing gaps in single-end alignments.
- Bowtie – an aligner from Steven Salzberg’s group that claims to be 35x faster than Maq. My colleague Todd Wylie has evaluated Bowtie in some depth. Sadly, no paired-end mode yet.
- cross_match – the classic pairwise aligner has seen some dramatic performance changes to address next-gen data. Still waiting for the usability and documentation to catch up.
- RMAP – one of the few aligners (other than Maq and Novoalign) that makes use of quality scores during alignment, RMAP shows promise. Unfortunately, there have been no updates since the initial release, and I hear through the grapevine that the authors have abandoned the project.
- SOAP – this tool has seen the most dramatic changes since I began my evaluation. Initially, I had several problems with SOAP v1 (I couldn’t get PE mode to work, for example), and its practice of scanning reads into memory was rather slow. However, SOAP v2 has significant performance improvements (PE works too), and I see that BGI is also developing SNP and indel callers. This is probably a tool to watch.
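Several entries above mention using quality scores during alignment. As a rough illustration of the idea, here is a small Python sketch that ranks candidate placements by the sum of base qualities at mismatched positions rather than by mismatch count. This is a simplified stand-in, not the actual scoring code of Maq, RMAP, or Novoalign; the function name and the example values are made up.

```python
from typing import Sequence


def mismatch_quality_sum(read: str, quals: Sequence[int], ref_window: str) -> int:
    """Sum the Phred-style qualities of read bases that disagree with the reference.

    A lower sum means the disagreements fall on bases the sequencer was
    less sure about, so the placement is more plausible.
    """
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, quals and ref_window must have equal length")
    return sum(q for base, q, ref in zip(read, quals, ref_window) if base != ref)


# Hypothetical 4-base read: confident about the first two bases, not the last two.
read = "ACGT"
quals = [40, 40, 2, 2]

two_low_quality_mismatches = mismatch_quality_sum(read, quals, "ACAA")
one_high_quality_mismatch = mismatch_quality_sum(read, quals, "TCGT")

print(two_low_quality_mismatches)  # 4
print(one_high_quality_mismatch)   # 40
```

A plain mismatch counter would prefer the second placement (one mismatch instead of two). The quality-weighted score prefers the first, because both of its mismatches sit on low-confidence bases.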
Alignment Metrics and Comparisons
So what do I look for in a short read aligner? Obviously speed is a consideration, since we’re generating ever-more-overwhelming amounts of data. Usability and compatibility with our in-house platforms (notably Illumina/Solexa) are just as important. And because we have a pipeline in place already, I’m looking for aligners that can beat Maq – in performance, features, or sensitivity – and that’s not easy to do. Maq is fast and does quality-based alignment, single or paired-end, assembly, SNP calling… there’s a reason why the rest of the industry seems to be conforming to it. Furthermore, Maq is well documented and (thus far) consistently updated. The latter point is, I think, a very serious consideration. We have no use for a tool that was developed once just to get a publication and will never see future improvements.
The Advantage of Open Source
Maq is open source, too, which is certainly not a requirement for a next-gen aligner, though it’s a strong selling point. My former colleague Brian Dunford-Shore used to delve into the code of earlier Maq releases when we encountered a problem. Now that the codebase seems to be more robust, it’s still useful to be able to look at the Maq code (and .map file format) to develop our own ancillary tools. It’s safe to say that no matter how good the aligner, we’ll almost certainly use more than one in order to build the most comprehensive pipeline.
I looked at BWA and sadly it crashed. But I think it’s matured a fair bit, and if other Burrows-Wheeler-based aligners (Bowtie, SOAP2) are anything to go by, I would say it should turn out quite well.
Hello Dan,
I am the main author of Bowtie and I wanted to note that a version of Bowtie with initial paired-end functionality should be available in a few weeks. That’s currently my highest priority. I’m trying to get it done before AGBT, where I’ll present a poster on Bowtie.
Thanks! – great blog,
Ben
Are you able to comment on or share your experience with MAQ on paired-end data, especially its sv script for structural variants?
What is the confidence in the structural variants and indels detected by MAQ’s paired-end module?
Hi Dan,
I have some interesting ideas about use of MAQ. Can you please share your contact information? I would like to run my ideas by you and get your thoughts.
Thanks.
Sounds good! Sadly I won’t be attending AGBT this year. Will you make your poster and data available after the conference?
Thanks for the comments, everyone. Yes, I’ll undoubtedly post my results (and hopefully my entire poster) after AGBT.
AS, you can reach me at dkoboldt [at] genetics [dot] wustl [dot] edu.
Have you looked at MIRA, and if so, any impressions?
In all fairness to the BWA developer, I need to mention that I downloaded BWA 0.4.1 this week and gave it another try, and it worked beautifully.
I know you only use freeware in your work. Me, too. However, it is still worth trying non-free programs such as ZOOM and CLC Bio. There are reasons why they dare to ask for thousands of dollars in this competitive area. In my experience, ZOOM is really attractive software. It seems to me that ZOOM achieves a good balance between efficiency and accuracy. SOAP2 is fast, but it allows at most two mismatches at the moment; Novoalign is accurate, but it is a little bit slow.
You make a good point, MB – there are certainly commercial options for short read alignment out there. Trouble is, when freely available and open source tools like Maq are comparable to a commercial product, what’s the motivation for anyone to spend ever-more-limited funding on the latter? Nevertheless, I was offered a trial license to CLCbio’s CELL2 and hope to include the program in my poster.