Short Read Aligners: Maq, Eland, and Others

May 14, 2008 by dkoboldt

This month I’ve come across some interesting statistics on the performance of Maq, Eland, and other short-read alignment tools as applied to Illumina/Solexa data. I took note because these programs are finally being evaluated against appropriate data sets, as opposed to simulated reads or tiny genomes. First the disclaimers: all of these numbers came from people other than myself (see Credits, below), so please forgive any inaccuracies. Also, this entry reflects my personal second-hand impressions of the different alignment tools, and should not be considered an endorsement or criticism of the different alignment tools by the WashU GC.

Short-Read Data Sets at the WashU Genome Center

One of our data sets includes 100+ Solexa runs (non-paired) from the genomic DNA of a single individual. We’ve applied a number of alignment tools to these data: Eland (part of the Illumina suite), Maq (free/open source), SX Oligo Search (proprietary), SlimSearch (proprietary), and even BLAT. Our group (Medical Genomics) is currently leaning toward Maq for read mapping and SNP discovery purposes. There’s recently been a new release of Maq (0.6.5) which seems to run substantially faster:

Metric	Maq 0.6.3	Maq 0.6.5
Average alignment time for normal runs	17.7 hours	9.1 hours
Max alignment time for a normal run	240 hours	28.8 hours
Total number of jobs	2168	1467
Jobs that took longer than 1 day	443	3

The developer of Maq, Heng Li, presented a poster describing the Maq algorithm at CSHL last week and also gave a small workshop talk on issues in short read mapping. He sent these links out to the Maq user list along with a benchmarking comparison of various read mapping tools.

Heng Li’s Comparison of Short-Read Aligners

For the comparison, Heng generated 1 million simulated read-pairs from chromosome X. The numbers themselves are a bit mind-boggling, but fortunately he summarized the results with these notes:

Eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.

RMAP: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq -se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.

SOAP: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.

SHRiMP: Actually I was not expecting that a program using seeding +Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP’s normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to “nan” normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.

Maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq’s huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq’s random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.

What a nice guy! Here he is, comparing his own tool against several competitors and he manages to praise the strengths of each one. That takes humility.

More Comments from Heng Li

Ken Chen, a colleague of mine, happened to discuss the benchmarking with Heng at Cold Spring Harbor. According to his evaluation, the current version of recently-published SOAP may be somewhat buggy (it had more mapping errors and crashed on paired-end alignment), but is nevertheless promising because it supports gapped alignment and longer reads. Paired-end alignment is perhaps Maq’s greatest strength; the alignment error rate from Maq for paired-end data is significantly reduced. Heng also mentioned that the upcoming new release of Eland will support longer read lengths (>32 bp) and will also calculate mapping quality scores.

Unbiased Comparisons of Short-Read Aligners

In summary, there are a number of competing tools for short read alignment, each with its own set of strengths, weaknesses, and caveats. It’s hard to trust any benchmarking comparison on tools like these because usually, it’s the developers of one of the tools that publish them. Here’s an idea: what if NHGRI, Illumina, or another group put together a short-read-aligning contest? They generate a few short-read data sets: real, simulated, with/without errors, with/without SNPs and indels, etc. Then, the developers of each aligner are invited to throw their best efforts at it. Every group submits the results to a DCC, which analyzes the results in a simple, unbiased way: # of reads placed correctly/incorrectly. # of SNPs/indels detected, missed, or false-positives. The results are published on a web site or in the literature for all to see. Yeah, I know, there are hurdles, like the fact that most proprietary tool developers would probably chicken out of an unbiased head-to-head comparison, given the stakes. But wouldn’t it be nice to know the results? Unless that happens, however, I think Heng’s analysis is about as unbiased as can be.

Credits

WashU GC Maq version comparisons were sent out by Jim Eldred on 5/01/2008. Heng Li’s benchmarking comparison was sent to the Maq user list on 5/12/2008. Additional comments from Heng Li were reported by Ken Chen on 5/12/2008.

Landmark Study in Structural Variation

May 1, 2008 by dkoboldt

At last, some results from Evan Eichler’s SV project! The results of the first phase of the “Human Genome Structural Variation Project” were presented in today’s issue of Nature. I’ve been cognizant of this project for a couple of years and eager for the results, as it is really the first large-scale, sequence-based study of copy number and structural variants. As it happens, our Genome Center played a big role in the sequencing, and two of our researchers (Tina Graves and Rick Wilson) are among the authors.

In fairness, however, I should disclose another thing about the Evan Eichler project.

In 2006, just after three simultaneous papers in Nature Genetics brought structural variation to the forefront, my former lab began working on a grant proposal. In it, we proposed to mine existing trace data (from NCBI’s Trace Archive) for putative structural variants in the human genome. We were developing a sequence-based approach to identify reads spanning insertions, deletions, duplications, inversions, and translocations. It was an ambitious project and the timing was perfect, but, unfortunately, NIH sent our proposal back twice. Unscored. Later, I learned that a group headed by Evan Eichler pretty much locked up U.S. funding for this research in the form of a $40 million grant. With all of the NIH eggs in a single basket, I thought to myself, they had better deliver.

It looks like I won’t be disappointed. Dr. Eichler and colleagues constructed whole-genome libraries of ~1 million fosmids for each of 8 individuals whose samples were used in the HapMap Project. Four were African, two were CEPH European, one was Japanese, and one Han Chinese. They used the fosmid end sequencing approach (described by Tuzun et al, 2004) in which you sequence the ends of each fosmid and map the sequences to the reference genome. Altogether, about 6.1 million end-sequence pairs (ESPs) were uniquely mapped to the genome. Of these some 76,767 (~1.26%) were discordant by alignment distance or orientation, indicating a possible underlying structural variant. It’s a big paper (the Supplemental File alone was 57 pages) so I’ll hit you with the take-homes:

The human genome is still incomplete. The fosmid approach yielded a number of novel sequences not present in the current human genome assembly. The sequences range in size from 2-130 kbp, were randomly distributed among genic/nongenic regions, and were often (40% of the time) copy-number variable. There’s still more human genome out there, folks.
Current CNV databases are inflated. The higher resolution of sequence-based approaches served to refine the location of many previously-reported CNVs. Generally speaking, CNVs on a haplotype are smaller (in bp) than copy-number-variable regions as a whole, meaning that fewer genes are affected. Array-cGH platforms, which to-date are among the most widely-used approaches to assess CNV, greatly exaggerate the size of these variants. And less than half of the SVs in this study cannot adequately be genotyped with existing platforms.
NAHR is the driving force behind most structural variation. About half of the SVs detected in this study were flanked by segmental duplications. This is nothing new, but it somewhat contradicts the study by Korbel and colleagues last year, which used 454 paired-end sequencing and reported that NAHR-mediated events were rare. Eichler et. al. suggest that the association of NAHR with segmental duplications and the difficulties of short sequence reads to resolve such regions might together mean that next-generation ESP approaches are missing a lot of variation. I’m not sure I agree, but that’s what they said.

Overall it seems like an impressive publication. They planned and executed a very careful study, and as a result, we learned quite a bit about the landscape of structural variation in the human genome. Not bad, Dr. Eichler.

Happy DNA Day!

April 25, 2008 by dkoboldt

April 25th is National DNA Day in the U.S., an occasion that commemorates discovery of DNA’s structure by Watson and Crick (published in Nature on April 25th, 1953) and the completion of the Human Genome Project fifty years later in April 2003. NHGRI has set up a web site for DNA day featuring a welcome video from Francis Collins and a live chatroom where anyone can chat with “leading researchers in the field of genetics.” Strangely I haven’t yet gotten the call.

The WashU Genome Center outreach office is also sponsoring a number of activities. A number of DNA Day Ambassadors are visiting local high schools in St. Louis. A symposium on the medical campus will have student poster sessions and a seminar, “The Human Genome Sequence: A Foundation for Biological Inquiry”, given by our co-director Elaine Mardis.

By chance I happened to Google the King and Wilson 1975 paper the other day, and came across a very interesting site of Landmarks in the History of Genetics. While not up-to-date, it’s a nice story of key events in DNA’s history (and their implications) since 1745. Of course there’s Darwin’s publication of The Origin of Species in 1859, and Mendel’s Experiments in Plant Hybridisation just six years later. I recognize the name of Erwin Chargaff, whose insights (1950) into the relative incidence of A, C, G, and T nucleotides was not random, but perhaps a kind of code. Two years later came the Hershey and Chase experiments, which showed that viruses infect host cells by injecting their DNA, while the proteins generally remain outside the cell. Of course we know Watson and Crick (1953), as well as King and Wilson (1975). How about Barbara McClintock, whose discovery of transposable elements in maize in the 1940’s was not fully recognized for decades?

That rather sounds like Mendel, doesn’t it? I wonder, how many other important discoveries in biology have already been made, but not yet appreciated. It’s something to think about. Happy DNA Day everyone!

Cis-regs and Functional Noncoding Variation

April 24, 2008 by dkoboldt

On Tuesday I attended a very interesting thesis defense by Scott Doniger, a student in Justin Fay’s lab. I admit, I was lured in by the thesis title, “Comparing and Contrasting Cis-regulatory Sequences to Identify Functional Noncoding Sequence Variation.” While I do not know Scott personally, I’m certainly familiar with Justin Fay’s work on positive and negative selection in the human genome. His paper, in fact, is the foundation of my work on signatures of natural selection and the SNPseek project.

Scott proved a confident and articulate speaker, and laid the groundwork for his thesis by presenting three convincing motivations for this work:

The regulatory hypothesis of evolution. Despite the obvious phenotypic diversity of species on this planet, the DNA sequence diversity is surprisingly limited. More than twenty-five years before the completion of the human genome sequence, King and Wilson [1975] found that the chimpanzee and human genomes diverged by only 1.6%. From this seminal paper came the idea that regulation of gene expression, not differences in DNA sequence, drove phenotypic divergence.
The functional relevance of noncoding sequences. Despite the traditional view that functional variants in humans alter protein-coding sequence, it is becoming clear that the genetics underlying many traits extend into noncoding DNA, particularly for complex phenotypes like disease susceptibility and drug response.
The availability of numerous genome sequences. Draft genome sequences for at least 27 vertebrate species have been completed to date, and their availability has spurred wide interest in the field of comparative genomics.

Scott’s work is based on the reasonable premise that functional noncoding sequences are subject to purifying selection (fewer changes tolerated over time), and thus they should be conserved between genomes that share common ancestry. Thus, comparative genomics serves to guide us to functional variants, as SNPs in constrained positions are more likely to be deleterious. This works well for coding sequences in both humans and yeast (the Fay lab model organism). Scott looked at the 9 known quantitative trait nucleotides (QTNs) in yeast and sure enough, 8 of them were SNPs in highly conserved amino acid positions. Gravy.

Because deep sequence conservation approaches might not work for noncoding SNPs, they focused on a few closely related species of yeast, identifying 2,106 variant positions (13% of the total) that fell within conserved transcription factor binding sites (TFBS’s). Of those, 615 (29%) appear to be deleterious based on their conserved-nucleotide model. If I can extrapolate, by their approach about 3.8% of the SNPs between closely related yeast species are likely to be functional.

The Model-Free Approach: PhyloNet-SNP

All of Scott’s work to this point relies on having good annotations of cis-regulatory TFBS’s in your genome of interest. Because you can’t always count on that, they developed a “model-free” approach to evaluating SNPs. With some help from Gary Stormo’s group, they devised an algorithm (PhyloNet-SNP) that uses each SNP +/- 20 bp of flanking sequence in each direction as a query sequence to identify those within multi-copy conserved elements of a genome. By this approach, ~15% of the SNPs in their model system were called as functional.

The Experimental Backup: Allele-specific Expression

The brief wet-lab portion of the thesis work was an allele-specific expression experiment where the ability of SNPs to alter gene expression levels was evaluated in vivo. Among randomly-chosen SNPs about 8% had a regulatory effect. However, using sequence conservation and/or PhyloNet-SNP to select SNPs brought this up to 25%, suggesting that the conservation approach yields a three-fold enrichment of SNPs that affect gene expression.

At the conclusion, Scott admitted that while comparative genomics does help identify functional sequences and variation, it doesn’t explain everything. Indeed, recent findings from the ENCODE project cast doubt on whether many conserved noncoding sequences are important at all. Yet until we have a better understanding of the dark matter of the human genome, using sequence conservation to identify SNPs of interest seems like a good way to go.