This month I’ve come across some interesting statistics on the performance of Maq, Eland, and other short-read alignment tools as applied to Illumina/Solexa data. I took note because these programs are finally being evaluated against appropriate data sets, as opposed to simulated reads or tiny genomes. First, the disclaimers: all of these numbers came from people other than myself (see Credits, below), so please forgive any inaccuracies. Also, this entry reflects my personal, second-hand impressions of the different alignment tools and should not be considered an endorsement or criticism of any of them by the WashU GC.
Short-Read Data Sets at the WashU Genome Center
One of our data sets includes 100+ Solexa runs (non-paired) from the genomic DNA of a single individual. We’ve applied a number of alignment tools to these data: Eland (part of the Illumina suite), Maq (free/open source), SX Oligo Search (proprietary), SlimSearch (proprietary), and even BLAT. Our group (Medical Genomics) is currently leaning toward Maq for read mapping and SNP discovery purposes. There’s recently been a new release of Maq (0.6.5) which seems to run substantially faster:
| Metric | Maq 0.6.3 | Maq 0.6.5 |
| --- | --- | --- |
| Average alignment time for normal runs | 17.7 hours | 9.1 hours |
| Max alignment time for a normal run | 240 hours | 28.8 hours |
| Total number of jobs | 2168 | 1467 |
| Jobs that took longer than 1 day | 443 | 3 |
The developer of Maq, Heng Li, presented a poster describing the Maq algorithm at CSHL last week and also gave a small workshop talk on issues in short read mapping. He sent these links out to the Maq user list along with a benchmarking comparison of various read mapping tools.
Heng Li’s Comparison of Short-Read Aligners
For the comparison, Heng generated 1 million simulated read-pairs from chromosome X. The numbers themselves are a bit mind-boggling, but fortunately he summarized the results with these notes:
- Eland: eland is definitely the fastest, much faster than all the competitors. More importantly, eland reports the number of alternative placements, which makes it possible to learn about the repetitive structure of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best pieces of software I have ever used. It would be even better if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.
- RMAP: the strength of rmap is its use of base qualities to improve alignment accuracy. I believe it can produce better alignments than maq -se, because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors added support for fastq-like quality strings, which are now the standard at both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I learned a lot from the way it counts mismatches.
- SOAP: soap is a versatile program. It supports iteratively trimmed alignment, long reads, gapped alignment, tag alignment, and PE mode. Its PE mode is easier to use than eland’s. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments, and further investigation shows that most of these 442 are flagged as R? (repeat) by eland.
- SHRiMP: actually, I was not expecting that a program using seeding + Smith-Waterman could be that fast. So far as I know, none of the other software in this list does Smith-Waterman (maq does, but only for PE data), which is why they are fast. SHRiMP’s normodds score has a meaning similar to mapping quality; such a score helps to determine whether an alignment is reliable. The most obvious advantage is that SHRiMP can map long reads (454/capillary) with standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired-end information, and better still if it could calculate a true mapping quality; the current normodds score helps but is not exactly a mapping quality. In addition, I also modified the probcalc code, because in 1.04 underflow may occur for long reads and lead to “nan” normodds (see the log-space sketch after this list). Although my revision fixes the underflow, it may produce somewhat inaccurate normodds.
- Maq: at the moment maq is easier to use than eland. Support for SNP calling is maq’s huge gain. Its paired-end mode is also very helpful for recovering some repetitive regions. Maq’s random mapping, which is frequently misused by users who have not noticed mapping qualities (illustrated in the sketch below), may be useful to some people too; at least it helps to call SNPs at the edge of repeats.
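Both the normodds underflow Heng mentions for SHRiMP and the mapping qualities behind Maq’s random placements come down to the same probability arithmetic. Below is a minimal sketch, entirely my own illustration rather than code from Maq or SHRiMP, of deriving a phred-scaled mapping quality from per-hit alignment likelihoods; the log-sum-exp trick is what keeps long reads with tiny likelihoods from underflowing to “nan”:

```python
import math

def mapping_quality(log10_likelihoods):
    """Phred-scaled confidence that the best-scoring hit is the true one.

    log10_likelihoods: log10 P(read | position) for every candidate hit.
    A uniform prior over hits is assumed (a simplification)."""
    best = max(log10_likelihoods)
    # log-sum-exp in base 10: shift by the max so 10**(x - best) never underflows
    total = best + math.log10(sum(10 ** (x - best) for x in log10_likelihoods))
    p_wrong = 1.0 - 10 ** (best - total)   # posterior mass on the other hits
    if p_wrong <= 0.0:
        return 99                          # cap, as aligners typically do
    return min(99, int(round(-10 * math.log10(p_wrong))))

print(mapping_quality([-5.0, -30.0]))  # 99 (capped): effectively unambiguous
print(mapping_quality([-5.0, -5.0]))   # 3: a coin flip between two equal hits
```

A read placed by a coin flip among k identical hits gets quality ~3 for k=2 and approaches 0 as k grows, which is why randomly placed reads can be recognized and filtered by their mapping quality.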
What a nice guy! Here he is, comparing his own tool against several competitors and he manages to praise the strengths of each one. That takes humility.
More Comments from Heng Li
Ken Chen, a colleague of mine, happened to discuss the benchmarking with Heng at Cold Spring Harbor. According to Heng’s evaluation, the current version of the recently published SOAP may be somewhat buggy (it produced more mapping errors and crashed on paired-end alignment), but it is nevertheless promising because it supports gapped alignment and longer reads. Paired-end alignment is perhaps Maq’s greatest strength: its alignment error rate on paired-end data is significantly lower. Heng also mentioned that the upcoming release of Eland will support longer reads (>32 bp) and will also calculate mapping quality scores.
Unbiased Comparisons of Short-Read Aligners
In summary, there are a number of competing tools for short-read alignment, each with its own strengths, weaknesses, and caveats. It’s hard to trust any benchmarking comparison of tools like these, because it’s usually the developers of one of the tools who publish them. Here’s an idea: what if NHGRI, Illumina, or another group put together a short-read alignment contest? They would generate a few short-read data sets: real, simulated, with/without errors, with/without SNPs and indels, etc. Then the developers of each aligner would be invited to throw their best efforts at it. Each group would submit its results to a data coordinating center (DCC), which would score them in a simple, unbiased way: the number of reads placed correctly and incorrectly, and the number of SNPs/indels detected, missed, or falsely called (a sketch of such scoring follows below). The results would be published on a web site or in the literature for all to see. Yeah, I know, there are hurdles, like the fact that most proprietary tool developers would probably chicken out of an unbiased head-to-head comparison, given the stakes. But wouldn’t it be nice to know the results? Until that happens, I think Heng’s analysis is about as unbiased as it gets.
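To make that concrete, here is the sort of scoring script a neutral DCC could run. The tab-delimited truth/submission format (read_id, chrom, position, strand) and the 5 bp tolerance window are my inventions for illustration:

```python
import csv

def load_placements(path):
    """Read a hypothetical tab-delimited file: read_id, chrom, position, strand."""
    with open(path) as fh:
        return {row[0]: (row[1], int(row[2]), row[3])
                for row in csv.reader(fh, delimiter="\t")}

def score_submission(truth_path, submitted_path, tolerance=5):
    truth = load_placements(truth_path)
    submitted = load_placements(submitted_path)
    correct = wrong = 0
    for read_id, (chrom, pos, strand) in submitted.items():
        t = truth.get(read_id)
        if t and t[0] == chrom and t[2] == strand and abs(t[1] - pos) <= tolerance:
            correct += 1
        else:
            wrong += 1
    return {"correct": correct,
            "wrong": wrong,
            "unplaced": len(truth) - len(submitted)}

print(score_submission("truth.tsv", "aligner_x.tsv"))
```

SNP/indel scoring would follow the same pattern: intersect each caller’s list with the known spiked-in variants and report detected, missed, and false-positive counts.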
Credits
WashU GC Maq version comparisons were sent out by Jim Eldred on 5/01/2008. Heng Li’s benchmarking comparison was sent to the Maq user list on 5/12/2008. Additional comments from Heng Li were reported by Ken Chen on 5/12/2008.
I like the idea of an independent contest. I’m actually working on this sort of problem at the moment, and hopefully it will shape up into a nice little publication. I’d be happy to share some of my insights if you’re interested.
Have you tried DNAstar?
From what I can see, there are three high-level steps required for SNP calling: mapping, assembling, and SNP calling. Which tools are best for each component would be interesting to know. MAQ currently does them all, but so do ssaha_pileup and Mosaik. After using those two tools, I agree with Heng that MAQ is definitely much more user-friendly. Based on Heng’s poster, the FP SNP rate is practically zero for both PE and SE reads, but we have experienced a much higher FP rate using SE reads. Does anyone have data on MAQ’s true FP rate and on what can contribute to FPs? Some obvious causes would be contamination, amplification bias, and sequencing error (a toy error-rate calculation follows below).
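For intuition on the sequencing-error contribution alone, here is a toy binomial calculation. This is my own back-of-the-envelope model; it ignores quality values, mapping error, and everything else a real caller uses, so treat the numbers as rough intuition only:

```python
from math import comb

def p_false_snp(depth, k_support, error_rate=0.01):
    """P(>= k_support reads show the same specific wrong base at one site),
    assuming independent errors and that an error picks each of the three
    alternate bases with equal probability."""
    p = error_rate / 3.0
    return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
               for k in range(k_support, depth + 1))

# At 20x depth, requiring 2 vs. 4 concordant non-reference reads:
print(p_false_snp(20, 2))  # ~2e-3 per site: millions of candidates genome-wide
print(p_false_snp(20, 4))  # ~6e-7 per site: a few thousand genome-wide
```

This is one reason SE calls can show more FPs than PE calls: without the pairing constraint, mispl­aced reads add to the error pile-up at a site.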
Yes! You can run Eland independently – just get the binaries (eland_??, where ?? is the read length, plus squashGenome); the command-line usage is reasonably well documented. Cheers. It is definitely the fastest gun around.
The main contributor to FPs (the other is read quality) is repeats or near-repeats, which give a read alternative alignment locations. Sometimes this results in a single alignment that may turn out to be a FP. Sometimes we get multiple alignments with the same score. With MAQ, its habit of randomly choosing one alignment from a set of equal-scoring alignments introduces some FPs and some extra (random) TPs (a toy simulation follows below). Other tools just report the read as having multiple alignments, which some evaluations count as a FN.
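A toy simulation of that random-placement behaviour, entirely my own construction, just to make the arithmetic visible:

```python
import random

def simulate(reads=100_000, k=4, seed=1):
    """Each read has k equal-scoring candidate locations; slot 0 is the truth.
    Random assignment lands on the true location 1/k of the time."""
    random.seed(seed)
    tp = sum(random.randrange(k) == 0 for _ in range(reads))
    return tp, reads - tp

tp, fp = simulate()
print(f"random TP: {tp}, random FP: {fp}")  # ~25,000 TP, ~75,000 FP for k=4
```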
In the current version, eland itself does not do paired-end alignment, but there are other scripts in the GAPipeline that can achieve this by post-processing eland output. Another script calculates mapping qualities. Unfortunately, not many people know how to use them, as they have not been well documented, so far as I know.
In addition, it is easy to benchmark alignment, but it is quite hard to benchmark SNP calling with simulation. Many of the problems that cause wrong SNP calls can hardly be simulated accurately, because people are simply not aware of them.
I am under the impression that the random assignment protocol in MAQ for repeats results in a lower mapping quality score for such assigned reads.
Can anyone here point me in a direction to understand how SNP calling is conditioned on mapping quality score for MAQ?
I couldn’t find it on the MAQ page.
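I couldn’t find it spelled out on the MAQ page either. A common scheme, and roughly my understanding of what MAQ does (treat this as a hedged sketch, not MAQ’s documented model), is to cap each base’s quality at its read’s mapping quality, so bases from dubiously placed reads carry little or no weight as SNP evidence:

```python
def effective_base_quality(base_q, map_q):
    """Phred quality actually contributed as SNP evidence by one base
    (a sketch of the capping idea, not MAQ's exact model)."""
    return min(base_q, map_q)

print(effective_base_quality(30, 60))  # 30: unique read keeps its weight
print(effective_base_quality(30, 0))   # 0: randomly placed read adds nothing
```

Under such a scheme, the randomly assigned repeat reads discussed above, with mapping quality at or near zero, effectively drop out of SNP calling.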
“The main contributor to FPs (the other is read quality) is repeats or near-repeats, which give a read alternative alignment locations. Sometimes this results in a single alignment that may turn out to be a FP. Sometimes we get multiple alignments with the same score. With MAQ, its habit of randomly choosing one alignment from a set of equal-scoring alignments introduces some FPs and some extra (random) TPs. Other tools just report the read as having multiple alignments, which some evaluations count as a FN.”
We generally get around this by providing only uniquely aligned reads from Eland as input to MAQ (see the filter sketch below). The obvious drawback is that you may lose some reads; however, I haven’t seen depth drop by more than 1X except in highly repetitive regions. Fortunately, we did see some FPs go away with this approach. Does anyone see any drawbacks?
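For anyone wanting to try the same trick, here is a sketch of such a filter. It assumes the classic eland_result layout in which the third whitespace-delimited column holds the match code (U0/U1/U2 = unique with 0, 1, or 2 mismatches, R* = repeat); check the column index against your own pipeline’s output before relying on it:

```python
import sys

def keep_unique(in_path, out_path):
    """Pass through only reads Eland flagged as uniquely aligned."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) > 2 and fields[2] in ("U0", "U1", "U2"):
                dst.write(line)

if __name__ == "__main__":
    keep_unique(sys.argv[1], sys.argv[2])
```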
I’m still not so keen on using eland when my reads are shorter than 25 bp but longer than 20 bp, e.g. with microRNAs sequenced on Solexa. The FP question can only be reliably answered by testing various scenarios in which we actually know where our reads should map.
Something I’ve also seen is that a read may fail to align when there are ambiguous characters around the mid-section of the short sequence read.
Using paired-end will definitely push up FP, but with higher read error rates, indels, etc., the curve for PE alignment may look better.
A new short-read aligner, novoalign, has just been released. It’s free to use, you don’t need expensive hardware, and it works pretty well with Illumina reads.
Some features are:
* Gaps up to 7bp, affine gap penalties
* Can handle ambiguous codes in ref sequence.
* Quality-based scoring
* Adapter stripping for miRNA reads
* No heuristics – reports the best alignment
* Options for handling multiple alignments: report none, a random one, or all alignments.
* Alignment Quality scores
* Can use fasta, fastq, solexa fastq, prb input formats
* Paired end with full Needleman-Wunsch on both ends.
* Paired end accepts a structural variation penalty; the best alignment may be two independent ends if their combined score, with the SV penalty added, beats the best pair that fits the fragment-length distribution (see the sketch after this list).
* Supports variable read lengths
* Includes optional soft masking of repeats.
* Iterative read trimming
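My reading of that paired-end/SV feature, expressed as a small decision-rule sketch; this is not novoalign’s actual code, and the default penalty value is hypothetical:

```python
def choose_pe_alignment(best_pair_score, best_end1_score, best_end2_score,
                        sv_penalty=70):
    """All scores are penalties (lower is better). Report the best proper
    pair unless the two ends, aligned independently, win even after
    paying the structural variation penalty."""
    split_score = best_end1_score + best_end2_score + sv_penalty
    if split_score < best_pair_score:
        return "independent ends (possible SV)", split_score
    return "proper pair", best_pair_score

print(choose_pe_alignment(150, 10, 20))  # ('independent ends (possible SV)', 100)
print(choose_pe_alignment(90, 10, 20))   # ('proper pair', 90)
```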
Give it a whirl. In terms of performance it’s quite fast; some users on seqanswers.com have commented that it runs faster than SOAP.
Have a read at http://www.novocraft.com and download the executables for 64-bit Linux and Mac OS X 10.5.3.