Short Read Aligner Results

February 13, 2009 by Dan Koboldt

As promised, I’m sharing the results of my short read aligners poster from AGBT 2009. There was a healthy amount of interest in this topic at the meeting; I spoke with numerous people who largely were using Maq, but wanted to explore other options for short read alignment. Some people were interested in colorspace-capable aligners, presumably for SOLiD data, while others wanted a tool that could align across gaps (indels) in single-end mode.

The Short Read Aligner Comparison

In the end I presented a comparison of ten short read aligners: BFAST, Bowtie, CELL (CLCbio), cross_match, Maq, Novoalign (Novocraft), RMAP, SeqMap, SHRiMP, and SOAP. These are, of course, only a subset of the tools currently available – several people at AGBT asked about other aligners – but they’re a good sampling of different approaches to the same problem.

I focused on three data sets, all of which were based on 36-bp Illumina/Solexa paired-end libraries sequenced here as part of the 1,000 Genomes Project:

1 million simulated read pairs from C. elegans
1 million simulated read pairs from Hs36
1 million real Illumina/Solexa read pairs from a YRI sample
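
For readers who want to build a similar simulated data set, here is a minimal sketch of how paired-end reads with a known origin might be generated. The 36-bp read length comes from the data sets above; the 200-bp fragment size, the SNP rate, the random toy reference, and every function name are assumptions for illustration, not the procedure used for the poster.

```python
import random

READ_LEN = 36       # from the post: 36-bp paired-end reads
INSERT_SIZE = 200   # assumption: the post does not give a fragment size

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def add_snp(read, rng):
    """Return the read with one base substituted at a random position."""
    pos = rng.randrange(len(read))
    alt = rng.choice([b for b in "ACGT" if b != read[pos]])
    return read[:pos] + alt + read[pos + 1:]


def simulate_pair(reference, rng, snp_rate=0.0):
    """Sample one fragment and return (true_start, read1, read2)."""
    start = rng.randrange(len(reference) - INSERT_SIZE + 1)
    fragment = reference[start:start + INSERT_SIZE]
    read1 = fragment[:READ_LEN]
    # The second read comes from the opposite strand at the far end.
    read2 = reverse_complement(fragment[-READ_LEN:])
    if rng.random() < snp_rate:
        read1 = add_snp(read1, rng)
    return start, read1, read2


rng = random.Random(42)
# Toy random reference standing in for a real genome sequence.
reference = "".join(rng.choice("ACGT") for _ in range(5000))
pairs = [simulate_pair(reference, rng, snp_rate=0.5) for _ in range(1000)]
```

Keeping the true start position with each pair is what makes it possible to score an aligner's placements afterwards.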

Speed and Accuracy of Aligners

One trend was immediately apparent: the aligners that used Burrows-Wheeler Transform indexing of the reference sequence (Bowtie & SOAP) were consistently faster, especially at the human genome scale. It’s not surprising that Heng Li is now focusing on BWA as opposed to Maq. Another trend that I find rather frightening is this: when I introduced SNPs and indels, almost every aligner misplaced (as in, placed uniquely but at the wrong location) 18% of the reads. This has important implications for variant detection, since even a single base change can have a dramatic effect on where a read is placed.
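
To make the "placed uniquely but to the wrong location" idea concrete, here is a small sketch of how placements could be scored against simulated reads whose true origin is known. The function names, the input layout (dictionaries keyed by read name), and the position tolerance are assumptions for illustration; this is not the scoring code behind the poster.

```python
from collections import Counter


def classify(true_pos, aligned_pos, tolerance=0):
    """Label one read as 'correct', 'misplaced', or 'unplaced'.

    Positions are (chromosome, start) tuples; aligned_pos is None when the
    aligner did not report a unique placement for the read.
    """
    if aligned_pos is None:
        return "unplaced"
    chrom_t, pos_t = true_pos
    chrom_a, pos_a = aligned_pos
    if chrom_t == chrom_a and abs(pos_t - pos_a) <= tolerance:
        return "correct"
    return "misplaced"


def summarize(truth, alignments, tolerance=0):
    """Return the fraction of simulated reads in each category."""
    counts = Counter(
        classify(truth[name], alignments.get(name), tolerance) for name in truth
    )
    total = len(truth)
    return {k: counts[k] / total for k in ("correct", "misplaced", "unplaced")}


# Invented toy data: four simulated reads, three of them uniquely placed.
truth = {
    "r1": ("chrI", 100),
    "r2": ("chrI", 500),
    "r3": ("chrII", 42),
    "r4": ("chrII", 900),
}
alignments = {
    "r1": ("chrI", 100),
    "r2": ("chrI", 7300),   # unique hit, wrong location
    "r3": ("chrII", 44),    # start shifted by 2 bp, e.g. near an indel
}
rates = summarize(truth, alignments, tolerance=2)
```

A small tolerance is useful because a read spanning an indel can be reported a base or two away from its simulated start and still be in the right place.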

Aligners that Disappointed

I noticed very quickly that SHRiMP was the slowest aligner, and given that it’s geared towards SOLiD (a platform we’re not heavily invested in), it was easy to drop SHRiMP from consideration. Another downer we knew about in advance was RMAP, whose authors abandoned their project after getting the publication out.

Aligners that Surprised Me

There’s a close relationship between our genome center and David Gordon / Phil Green, which no doubt accounts for why cross_match continues to be used. Months ago Gordon came and touted the latest CM revision as something that was faster than Maq. Of course, he pointed out that you had to adjust several parameters from their default settings to make this happen. Did it turn out to be true? Not necessarily – in a few tests CM was faster than Maq, in a few it was slower, though they were always comparable. Where cross_match shone was sensitivity at mutated sites – in the C. elegans simulation, it successfully placed more reads in single-end mode than other aligners.

Most Promising Aligners

Maq is our current tool of choice, but we’re looking closely at Novoalign (a Maq-like tool with more sensitivity), as well as Bowtie/SOAP for speed considerations. For details, see the “Short Read Aligners” section of my blog.

Short Read Aligners Update at AGBT

January 20, 2009 by Dan Koboldt

Marco Island: Feb 4-8, 2009

I’ll be attending the coveted Marco Island meeting early next month (February 4-8), where I’ll present a poster on my evaluations of short read aligners for next-gen sequencing data. As you might infer from our AML cancer genome paper, Maq has been the central alignment tool here for over a year. This may not always be the case, because the longer reads (75-100 bp) promised by Illumina/Solexa may eventually reach lengths where the Maq algorithm no longer has superiority. Heng Li, Maq’s developer, is already working on a new aligner, currently in beta, that uses the Burrows-Wheeler transform. Has anyone looked at it yet? I’m curious about it, but don’t yet have time.

Alignment Programs Evaluated

Here’s a partial list of the short read aligners that I’m evaluating for this poster. I listed 10 aligners in my accepted AGBT abstract, but I expect there will be some changes.

Maq – obviously we’re evaluating Maq, not only as a benchmark for other aligners, but to better understand the results we’re getting from it. While we run ELAND (Illumina’s aligner) as well, more and more of our runs are paired-end, an area in which Maq is far stronger. Also, have you ever tried to look at ELAND output? It’s incomprehensible to me. No, it’s safe to say that we decided to gamble on Maq over a year ago, and so far, the bet has paid off.
Novoalign – this is Colin Hercus’ alignment tool, already in v2.0. Its speed is at worst comparable to Maq’s (in single-ended mode), and it does offer paired-end alignment. High marks for usability and allowing gaps in single-end alignments.
Bowtie – an aligner from Steven Salzberg’s group that claims to be 35x faster than Maq. My colleague Todd Wylie has evaluated Bowtie in some depth. Sadly, no paired-end mode yet.
cross_match – the classic pairwise aligner has seen some dramatic performance changes to address nextgen data. Still waiting for the usability and documentation to catch up.
RMAP – one of the few aligners (other than Maq/NovoCraft) that makes use of quality scores during alignment, RMAP shows promise. Unfortunately, there have been no updates since the initial release, and I hear through the grapevine that the authors have abandoned the project.
SOAP – this tool has seen the most dramatic changes since I began my evaluation. Initially, I had several problems with SOAP v1 (couldn’t get PE mode to work, for example). And, the practice of scanning reads into memory was rather slow. However, SOAP v2 has significant performance improvements (PE works too) and I see that BGI is also developing SNP and indel callers. This is probably a tool to watch.

Alignment Metrics and Comparisons

So what do I look for in a short read aligner? Obviously speed is a consideration, since we’re generating ever-more-overwhelming amounts of data. Usability and compatibility with our in-house platforms (notably Illumina/Solexa) are just as important. And because we have a pipeline in place already, I’m looking for aligners that can beat Maq – in performance, features, or sensitivity – and that’s not easy to do. Maq is fast and does quality-based alignment, single or paired-end, assembly, SNP calling… there’s a reason why the rest of the industry seems to be conforming to it. Furthermore, Maq is well documented and (thus far) consistently updated. The latter point is, I think, a very serious consideration. We have no use for a tool that was developed once just to get a publication and will never see future improvements.

The Advantage of Open Source

Maq is open source, too, which is certainly not a requirement for a next-gen aligner, though it’s a strong selling point. My former colleague Brian Dunford-Shore used to delve into the code of earlier Maq releases when we encountered a problem. Now that the codebase seems to be more robust, it’s still useful to be able to look at the Maq code (and .map file format) to develop our own ancillary tools. It’s safe to say that no matter how good the aligner, we’ll almost certainly use more than one in order to build the most comprehensive pipeline.

Everybody was SV Detecting…

December 12, 2008 by Dan Koboldt

It seems like everyone is looking at structural variant detection these days.

We recently had a visit from Ben Raphael, a friend of the genome center whom we tried to recruit years ago when he was a postdoc. Now he heads a group at Brown University, where (by his own admission) they basically tap into some of the large datasets out there (like TSP and TCGA) and develop/apply their own algorithms. Ben gave a talk on structural variation in human and cancer genomes, in which he presented some of the work that he and colleagues have pioneered in End Sequence Profiling (ESP).

Who is this guy?

Ben’s main background is in mathematics and computer science. The cancer research came later, when (in 2003) a group at UCSF approached him with a cancer genome sequence that had seen massive rearrangement. They developed a way to reconstruct the tumor genome architecture and published the results in Bioinformatics in 2004. Incidentally, this work of Ben’s was profiled when he was named one of Tomorrow’s PIs by Genome Technology. The GT article came out in late 2006, a time when I was very interested in SV, and I remember thinking “who is this guy?”

End Sequence Profiling in Cancer

When I think of ESP, I tend to think of the Tuzun et al 2004 paper, as many people in the field do. There was, however, a study published a year earlier (in 2003) on ESP as an approach for sequence-based analysis of rearranged genomes. The idea is to sequence 500 bp at each end of clones (100-250kbp in size) and then apply a geometric clustering algorithm to look for rearrangements. Ben Raphael’s group applied this method to BRCA cell lines as well as primary tumors (breast, prostate, ovarian, and brain cancers). The principal goal was to identify fusion genes (like the widely known Philadelphia chromosome). In studies published this year, Ben’s group did find rearrangements that created fusion genes, though none appeared to be transcribed.
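
The core idea of ESP can be sketched in a few lines: flag clones whose two end sequences map in a way that is inconsistent with the expected clone size, then group flagged clones that support the same rearrangement. The sketch below uses a simple greedy grouping as a stand-in for the geometric clustering algorithm mentioned above; the tuple layout, the expected "+"/"-" orientation, the window size, and the function names are all assumptions for illustration.

```python
MIN_CLONE, MAX_CLONE = 100_000, 250_000  # clone size range from the post


def is_discordant(pair):
    """pair = (chrom1, pos1, strand1, chrom2, pos2, strand2).

    Assumes a normal clone maps with its ends on the same chromosome,
    in '+'/'-' orientation, and pos2 - pos1 within the clone size range.
    """
    c1, p1, s1, c2, p2, s2 = pair
    if c1 != c2:
        return True
    if (s1, s2) != ("+", "-"):
        return True
    return not (MIN_CLONE <= p2 - p1 <= MAX_CLONE)


def cluster(pairs, window=MAX_CLONE):
    """Greedily group pairs whose two ends both fall near a cluster's first pair."""
    clusters = []
    for pair in sorted(pairs, key=lambda p: (p[0], p[1])):
        for group in clusters:
            c1, p1, _, c2, p2, _ = group[0]
            if (pair[0] == c1 and pair[3] == c2
                    and abs(pair[1] - p1) <= window
                    and abs(pair[4] - p2) <= window):
                group.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters


# Invented toy end-sequence pairs.
end_pairs = [
    ("chr1", 1_000_000, "+", "chr1", 1_150_000, "-"),  # consistent with a normal clone
    ("chr1", 5_000_000, "+", "chr9", 2_000_000, "-"),  # ends on different chromosomes
    ("chr1", 5_040_000, "+", "chr9", 2_030_000, "-"),  # supports the same junction
    ("chr3", 700_000, "+", "chr3", 1_900_000, "-"),    # ends too far apart
]
discordant = [p for p in end_pairs if is_discordant(p)]
clusters = cluster(discordant)
```

A rearrangement supported by several independent clones is far more believable than one supported by a single clone, which is why the grouping step matters.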

ESP compared to CGH

Ben’s group compared their findings to comparative genomic hybridization (CGH) array results and found a “statistically significant” amount of overlap in rearrangements predicted by both methods (Agilent 244K CGH arrays and 150K ESPs). This past summer, they snagged some of our TCGA glioblastoma data and did the same comparison. In the case of GBM, Ben noted that they found far too many SVs for them to all be somatic; more likely, most of them are germline variants. As many as 5-20% of them were known inversion polymorphisms, which also seemed high. Nevertheless, I think the audience was impressed by their methods, and my guess would be that invitations to join in the next round of TCGA analysis may be forthcoming.

Dave and Decision Trees for NGS

October 15, 2008 by Dan Koboldt

My colleague David Larson just returned from CSHL’s Personal Genomes meeting, where he presented a poster on decision-tree filtering of variant predictions from Illumina/Solexa data. I don’t know much about machine learning, but I can see that it offers a useful approach in at least one aspect of next-generation sequencing.

From my basic understanding, a decision tree is a machine learning algorithm that you “train” on a dataset where the correct decisions are known, and then apply to another dataset in which decisions are not known.

A sample decision tree that uses weather attributes to determine if a game will be played or not. Image Credit: Wikipedia


For example, Dave’s poster described a decision tree that determines whether SNP predictions from Solexa are real (“Germline”) or false positives (“WildType”). As a training set, Dave used ~650 SNPs whose true status had previously been determined by 3730 sequencing. For each SNP, he provided several attributes (base quality, read count, etc.) as well as the correct “answer” (Germline or WildType) as determined by 3730. These inputs went into the C4.5 program, which generated a decision tree to distinguish Germline from WildType based on these characteristics.
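
To give a feel for what "training" means here, the sketch below picks the single best threshold on one numeric attribute by information gain, which is the basic step a tree builder repeats at every node. C4.5 itself goes further (it normalizes by split information to get a gain ratio, handles many attributes, and prunes the tree), so this is an illustration of the idea, not a reimplementation. The read counts and labels are invented toy data, not Dave's training set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting into value <= threshold and value > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try each observed value (except the largest) as a cut point."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Invented toy training set: supporting read count per SNP and its known status.
read_counts = [1, 2, 2, 3, 6, 8, 9, 12]
labels = ["WildType"] * 4 + ["Germline"] * 4

threshold = best_threshold(read_counts, labels)
gain = information_gain(read_counts, labels, threshold)
```

On this toy set the best cut separates the two classes completely; real training data is noisier, and the tree keeps splitting on other attributes until the gain is no longer worth it.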

Dave applied the decision tree to whole-genome Solexa data for an individual that we recently sequenced to over 10x coverage with Solexa fragmented reads. Maq had predicted ~5 million SNPs; the decision tree filter cut this number in half. Even more promising, Dave’s decision tree filter isolated a substantially better data set. Over 90% of the SNPs detected by array-based genotyping were among the Germline-classified SNPs. Concordance with dbSNP, which is one of our measures of specificity, was over 80% the last I heard.
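
The two numbers quoted above are simple set overlaps, and a short sketch shows how they might be computed. The function names and the (chromosome, position) representation are assumptions, and the sites below are invented toy data rather than anything from the actual genome.

```python
def fraction(hits, total):
    """Safe ratio that returns 0.0 for an empty denominator."""
    return hits / total if total else 0.0


def evaluate(germline_calls, array_snps, dbsnp_sites):
    """Return (sensitivity, dbsnp_concordance) for a filtered call set.

    sensitivity: share of array-genotyped SNPs found among the Germline calls.
    dbsnp_concordance: share of Germline calls that are known dbSNP sites.
    """
    germline_calls = set(germline_calls)
    array_snps = set(array_snps)
    sensitivity = fraction(len(germline_calls & array_snps), len(array_snps))
    dbsnp_concordance = fraction(
        len(germline_calls & set(dbsnp_sites)), len(germline_calls)
    )
    return sensitivity, dbsnp_concordance


# Invented toy sites as (chromosome, position) tuples.
germline = {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900), ("chr3", 7)}
array = {("chr1", 100), ("chr2", 40), ("chr3", 99)}
dbsnp = {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900)}

sensitivity, concordance = evaluate(germline, array, dbsnp)
```

Note that dbSNP concordance is only a proxy for specificity: a novel variant that is perfectly real will count against it.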

It occurred to me that the decision tree approach has numerous applications for next-generation sequencing analysis. It could be used to distinguish true variants from false positives, or somatic mutations from germline variants. Decision trees might also be informative for short read alignments, where a number of attributes (read length, alignment score, alignment quality, mismatches, etc.) could be used to determine whether or not a read was correctly placed.

After talking with Dave, I spent half a day building decision trees that might be useful for 454 variant detection. One thing I realized very quickly is the importance of the training data set. First, I tried a training set of ~75 variants sequenced by 3730. This was way too small, yielding a tree with one decision (allele frequency) to classify the data. Then, I tried a training set of ~400,000 454 read alignments with several attributes. This was far too much, yielding a massive tree with hundreds of branches. Also, I worry about the correctness of the “answers” in my data sets. While 3730 sequencing is a gold standard, it also has a tendency to miss certain kinds of variants that 454 might detect. Those would be real variants labeled as WildType in the training data set. I think I’ll have to find a larger, more reliable training set before decision trees bear fruit for 454 variant detection.
