Ten Favorite Things About Maq

August 26, 2008 by Dan Koboldt

Heng Li’s brilliant short read alignment tool finally went legit with a publication in Genome Research that came online this month. It’s an important milestone for the open source tool that, by most accounts, out-performs just about every next-gen alignment algorithm to come out.

To commemorate the occasion, I decided to put together this list of my Ten Favorite Things About Maq:

10. The map file. This single file is a one-stop shop. It keeps the alignments, sequences, everything you need to process Solexa data.

9. Random placement. Reads in repeats are assigned alignment scores of zero and randomly placed, which helps paint a more accurate picture of the sequencing coverage across a genome.

8. Conversion tools. Although you have to convert just about any input file 2-3 times, at least maq provides all of the conversion scripts.

7. No gaps, please. Maq generally won’t even try for gapped alignments for short reads, a decision that I wholeheartedly support.

6. The version. It’s widely used and well documented, yet the version’s not even to 0.7.

5. Binary files. You know a program’s fast when it won’t touch ASCII input.

4. Alignment qualities. The real reason maq is superior to most aligners: maq uses individual base qualities when searching for a read’s best alignment.

3. Read simulation. Maq will “train” itself on a real data set, then generate simulated Solexa reads from a reference sequence based on the “real” data characteristics.

2. Good docs. For once, software that comes with complete, usable documentation.

1. The name. You might think “maq” is confusing, but it’s better than the old name, mapASS.

It reminds me of this gem of dialogue from “The Princess Bride”:

Prince: Such an unusual name, “Latrine.” How did your family come by it?
Latrine: We changed it in the 9th century.
Prince: You mean you changed it TO “Latrine”?
Latrine: Yeah. Used to be “Shithouse.”

Maq is good stuff. Thanks, Brian, for showing me the light.

The False Positives in Deep Resequencing

August 22, 2008 by Dan Koboldt

At last the PNAS article previewed earlier this week by In Sequence is available on the journal’s site. “Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing” had two aspects that appealed strongly to me – the use of massively parallel sequencing to study leukemia, and a formalized algorithm to distinguish true variants from false-positives.

The authors set out to examine clonal evolution in cancer with next-generation sequencing of B-cell chronic lymphocytic leukemia (CLL) samples. CLL was an appealing model for this study because of its high mutation rate in the short stretch of DNA that encodes the IG heavy chain (IGH). The short size of the locus was ideal for 454 sequencing, and because single-molecule reads are generated, the authors were able to identify haplotypes of somatic hypermutations carried by individual leukemic cells.

A key part of this study was the characterization of sequencing error rates and their causes. Three patterns of sequence errors were apparent:

1. Errors found near runs of 4 or more bases of the same nucleotide (homopolymers). This well-known artifact of pyrosequencing accounted for many false indel calls, and created false SNP calls as well.
2. Errors near the end of the sequence. These arise from a reduced signal-to-noise ratio after about 200 bases have been read.
3. Polymerase misincorporation during PCR. These are not sequencing errors, but random polymerase errors that created a low rate of substitutions through the length of the amplicon.
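
To make the first two patterns concrete, here is a minimal Python sketch (not the authors’ algorithm) that labels a read position as error-prone if it sits near a homopolymer run of 4 or more bases or past base 200. The function names, the 2-base flank, and the hard cutoffs are assumptions for illustration only. PCR misincorporation can’t be flagged from sequence context like this, so it is left out.

```python
import re

HOMOPOLYMER_MIN = 4      # runs of 4 or more identical bases, per the text
LATE_CYCLE_START = 200   # signal-to-noise drops after about 200 bases
FLANK = 2                # assumption: how many bases away still counts as "near"

# One base followed by at least (HOMOPOLYMER_MIN - 1) copies of itself.
_RUN = re.compile(r"([ACGT])\1{%d,}" % (HOMOPOLYMER_MIN - 1))


def homopolymer_runs(read):
    """Return (start, end) spans of homopolymer runs in a read (0-based, end-exclusive)."""
    return [m.span() for m in _RUN.finditer(read.upper())]


def error_context(read, pos):
    """Label a 0-based read position with the error-prone contexts it falls in."""
    labels = []
    for start, end in homopolymer_runs(read):
        if start - FLANK <= pos < end + FLANK:
            labels.append("homopolymer")
            break
    if pos >= LATE_CYCLE_START:
        labels.append("read_end")
    return labels


if __name__ == "__main__":
    read = "ACGTAAAAACGT"          # AAAAA occupies positions 4-8
    print(error_context(read, 5))  # inside the run
    print(error_context(read, 0))  # far from the run
```

A variant caller built along these lines would down-weight or discard candidates that carry either label, rather than filtering them after the fact.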

Weeding out false-positives is one of the greatest challenges facing those of us who analyze massively parallel sequencing data. Often this issue is addressed *after* the sequencing is done, with concordance estimates, decision trees, and the like. What I like about this study is that the authors looked at sequencing errors first, to precisely classify the sources of false-positives, and then built their variant-calling algorithm around the results.

The evolutionary biology aspect of this study is fascinating as well. Cancer is a powerful micro-system to study evolution, since subclones of cells have a mixture of shared and private somatic mutations and compete with one another to grow. Subclones with the best evolutionary fitness will, in time, come to dominate the population. It’s Darwinian fitness at its best.

By identifying haplotypes from single-molecule reads, the authors were able to construct phylogenetic trees of the leukemic cells in a single patient, something that could only be done on the 454 platform. Intriguingly, the initiating driver mutation of leukemogenesis occurred before the earliest branching of the trees. Yet there were numerous different subclone haplotypes – one came to dominate, but the others persisted as well. This suggests that every subclone persisting in the population picked up at least one additional mutation that gave it a competitive advantage. Thus even the rare subclones carry driver mutations that contribute to cancer cell survival.

The more rare subclones we can detect, the more mutations we can find, and the better we can come to understand the complex set of disease mechanisms that play a role in cancer.

NextGen Aligner Focus Group

June 23, 2008 by Dan Koboldt

As our genome center makes the transition from capillary-based to massively parallel sequencing platforms, the development of automated pipelines for data processing has become a high priority. Last week we had a visit from Illumina’s informatics group to discuss several issues related to the GA (Solexa) platform, including image compression, data storage, workflow informatics, etc. There was also talk of a downstream analysis tool, called Bullfrog, that will perform SNP/indel/SV detection (though I got the impression that the software’s nowhere near release at present).

But Illumina is not the only platform, and Eland is certainly not the only aligner. Thus we’ve formed a focus group to evaluate the different programs for sequence alignment and variant detection in next-generation sequence data. We met last week and put together a list of aligners that work with Illumina (Solexa) and/or Roche (454) data. We also compiled, separately, a shorter list of external and internal programs that do SNP and indel detection on either platform. Some programs, like Maq, were in both lists because they do alignments and SNP detection. Some tools are feasible for short (Solexa-length) reads but not long (454-length) reads, and vice versa. In the end we had a list of 15 different aligners for Illumina/Roche data. Some are good, some are bad, and some we simply don’t know.

We agreed that the plan was to evaluate each aligner on the same data set, but decisions on which data set to use, and how to compare the different aligners, were matters of more intense debate. Should we work with human data, or focus on less complex genomes like C. elegans or E. coli? Performance metrics like CPU time, memory usage, disk space, and cost (some are non-free) are obvious points for comparison, but what about alignment accuracy? We need some way to determine if a read placement on the genome is correct or erroneous. How do we know? The question of alignment “truth” and how to determine it was not an easy one to answer.

After an hour of discussion, we tentatively agreed on a dataset – Illumina PE runs on the first human samples that we’ve already sequenced in-house for the 1000 Genomes Project. These runs come from one of the HapMap Project trios, which means that we can validate our SNP detection results against the known HapMap genotypes that were generated on a variety of platforms (and predominantly by other centers). Also, the 1000 Genomes Project DCC will be performing its own evaluation of alignment tools and sequence analysis using the same data, so we can compare notes.
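
Validating SNP calls against known genotypes boils down to a concordance calculation. The Python sketch below is one hypothetical way to do it: the function name, the dictionary layout, and the site keys are assumptions, not part of any HapMap or 1000 Genomes tooling.

```python
def genotype_concordance(calls, known):
    """Compare called genotypes with known genotypes at shared sites.

    Both arguments map a site key, e.g. ("chr1", 12345), to a genotype
    string such as "AG". Allele order is ignored, so "AG" matches "GA".
    Sites present in only one of the two sets are not counted.
    """
    shared = [site for site in calls if site in known]
    matches = sum(
        1 for site in shared if sorted(calls[site]) == sorted(known[site])
    )
    rate = matches / len(shared) if shared else None
    return {"shared": len(shared), "matches": matches, "concordance": rate}


if __name__ == "__main__":
    # Toy data for illustration only.
    calls = {("chr1", 100): "AG", ("chr1", 200): "CC", ("chr2", 50): "TT"}
    known = {("chr1", 100): "GA", ("chr1", 200): "CT", ("chrX", 5): "AA"}
    print(genotype_concordance(calls, known))
```

Concordance alone only covers sites that appear in both sets. Missed SNPs and novel calls would need to be tallied separately.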

We put together a short list, by platform, of the aligners to evaluate first. Some decisions here were easy – we’re obviously going to look at Maq and Eland for Solexa data, and we’re already evaluating BLAT and cross_match on some of our 454 data. Other decisions were more difficult – should we evaluate RMAP, whose authors [allegedly] don’t plan to continue development? What about SX OligoSearch, which we can currently only run on Itanium servers? We eventually had five or six aligners per platform that made the short list. This week, we’re putting together the data, and next week, the real work begins.

Short Read Aligners: Maq, Eland, and Others

May 14, 2008 by dkoboldt

This month I’ve come across some interesting statistics on the performance of Maq, Eland, and other short-read alignment tools as applied to Illumina/Solexa data. I took note because these programs are finally being evaluated against appropriate data sets, as opposed to simulated reads or tiny genomes. First the disclaimers: all of these numbers came from people other than myself (see Credits, below), so please forgive any inaccuracies. Also, this entry reflects my personal second-hand impressions of the different alignment tools, and should not be considered an endorsement or criticism of any of them by the WashU GC.

Short-Read Data Sets at the WashU Genome Center

One of our data sets includes 100+ Solexa runs (non-paired) from the genomic DNA of a single individual. We’ve applied a number of alignment tools to these data: Eland (part of the Illumina suite), Maq (free/open source), SX Oligo Search (proprietary), SlimSearch (proprietary), and even BLAT. Our group (Medical Genomics) is currently leaning toward Maq for read mapping and SNP discovery purposes. There’s recently been a new release of Maq (0.6.5) which seems to run substantially faster:

Metric	Maq 0.6.3	Maq 0.6.5
Average alignment time for normal runs	17.7 hours	9.1 hours
Max alignment time for a normal run	240 hours	28.8 hours
Total number of jobs	2168	1467
Jobs that took longer than 1 day	443	3
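
Because the two versions ran different numbers of jobs, the counts are easier to compare as ratios. This short Python snippet just redoes the arithmetic on the figures in the table; the variable names are illustrative.

```python
# Figures copied from the table above.
runs = {
    "0.6.3": {"avg_hours": 17.7, "max_hours": 240.0, "jobs": 2168, "over_1_day": 443},
    "0.6.5": {"avg_hours": 9.1, "max_hours": 28.8, "jobs": 1467, "over_1_day": 3},
}

old, new = runs["0.6.3"], runs["0.6.5"]

avg_speedup = old["avg_hours"] / new["avg_hours"]   # average run time ratio
max_speedup = old["max_hours"] / new["max_hours"]   # worst-case run time ratio
slow_old = old["over_1_day"] / old["jobs"]          # fraction of jobs over 1 day
slow_new = new["over_1_day"] / new["jobs"]

print(f"Average alignment time: {avg_speedup:.1f}x faster")
print(f"Longest alignment time: {max_speedup:.1f}x faster")
print(f"Jobs over 1 day: {slow_old:.1%} -> {slow_new:.1%}")
```

That works out to roughly a 1.9x speedup on average and 8.3x in the worst case, with the share of jobs running longer than a day falling from about 20.4% to about 0.2%.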

The developer of Maq, Heng Li, presented a poster describing the Maq algorithm at CSHL last week and also gave a small workshop talk on issues in short read mapping. He sent these links out to the Maq user list along with a benchmarking comparison of various read mapping tools.

Heng Li’s Comparison of Short-Read Aligners

For the comparison, Heng generated 1 million simulated read-pairs from chromosome X. The numbers themselves are a bit mind-boggling, but fortunately he summarized the results with these notes:

Eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.

RMAP: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq -se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.

SOAP: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.

SHRiMP: Actually I was not expecting that a program using seeding +Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP’s normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to “nan” normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.

Maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq’s huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq’s random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.

What a nice guy! Here he is, comparing his own tool against several competitors and he manages to praise the strengths of each one. That takes humility.

More Comments from Heng Li

Ken Chen, a colleague of mine, happened to discuss the benchmarking with Heng at Cold Spring Harbor. According to his evaluation, the current version of the recently published SOAP may be somewhat buggy (it had more mapping errors and crashed on paired-end alignment), but is nevertheless promising because it supports gapped alignment and longer reads. Paired-end alignment is perhaps Maq’s greatest strength; the alignment error rate from Maq for paired-end data is significantly reduced. Heng also mentioned that the upcoming new release of Eland will support longer read lengths (>32 bp) and will also calculate mapping quality scores.

Unbiased Comparisons of Short-Read Aligners

In summary, there are a number of competing tools for short read alignment, each with its own set of strengths, weaknesses, and caveats. It’s hard to trust any benchmarking comparison of tools like these because it’s usually the developers of one of the tools who publish it. Here’s an idea: what if NHGRI, Illumina, or another group put together a short-read-aligning contest? They generate a few short-read data sets: real, simulated, with/without errors, with/without SNPs and indels, etc. Then, the developers of each aligner are invited to throw their best efforts at it. Every group submits the results to a DCC, which analyzes them in a simple, unbiased way: the number of reads placed correctly or incorrectly, and the number of SNPs/indels detected, missed, or called as false positives. The results are published on a web site or in the literature for all to see. Yeah, I know, there are hurdles, like the fact that most proprietary tool developers would probably chicken out of an unbiased head-to-head comparison, given the stakes. But wouldn’t it be nice to know the results? Until that happens, however, I think Heng’s analysis is about as unbiased as can be.
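
For simulated reads, where the true origin of each read is known, the “placed correctly/incorrectly” tally is simple to compute. Here is a minimal Python sketch of such a scorer. The function name, the input layout, and the `tolerance` parameter are all hypothetical, and this is not how Heng’s benchmark or any DCC actually scores alignments.

```python
def score_alignments(truth, placements, tolerance=0):
    """Count reads placed correctly, incorrectly, or not at all.

    truth:      read name -> (chrom, pos) the read was simulated from
    placements: read name -> (chrom, pos) reported by the aligner;
                reads the aligner left unplaced are simply absent
    tolerance:  assumed slack, in bases, allowed on the reported position
    """
    counts = {"correct": 0, "incorrect": 0, "unplaced": 0}
    for name, (chrom, pos) in truth.items():
        hit = placements.get(name)
        if hit is None:
            counts["unplaced"] += 1
        elif hit[0] == chrom and abs(hit[1] - pos) <= tolerance:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


if __name__ == "__main__":
    # Toy data for illustration only.
    truth = {"r1": ("chrX", 1000), "r2": ("chrX", 5000), "r3": ("chrX", 9000)}
    placements = {"r1": ("chrX", 1000), "r2": ("chrX", 7500)}
    print(score_alignments(truth, placements))
```

Real data has no such truth table, which is why scoring it is the hard part of any head-to-head comparison.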

Credits

WashU GC Maq version comparisons were sent out by Jim Eldred on 5/01/2008. Heng Li’s benchmarking comparison was sent to the Maq user list on 5/12/2008. Additional comments from Heng Li were reported by Ken Chen on 5/12/2008.
