In recent weeks I’ve had conversations with many people in the NGS community who are trying to accurately call variants in Illumina/Solexa data. Part of this interest stems from VarScan, my SNP and indel caller for next-gen sequencing data, which works with Bowtie, Novoalign, cross_match, and other aligners.
Another part of it stems from my involvement in 1,000 Genomes Pilot 3, for which several participants have applied their own variant detection pipelines to the same dataset. Last month, Goncalo Abecasis, with input from David Craig, Heng Li, Gerton Lunter, and Fiona Hyland, proposed an exercise comparing several read mappers on real and simulated ABI SOLiD and Illumina/Solexa data. The initial list of aligners – Maq, BWA, Stampy, BFAST, BioScope, and KARMA – demonstrated just how rapidly the field has grown since my aligner comparison last year at AGBT. I’d looked at Maq and BFAST, and knew about (but hadn’t tried) BWA, but the others on the list (Stampy, BioScope, and KARMA) were ones I’d never heard of.
I proposed adding three aligners to the list: Bowtie and Novoalign for Illumina data, and SHRiMP for SOLiD data. My suggestions were politely declined by Richard Durbin (WTSI), who said “In our hands Bowtie doesn’t seem accurate enough for variant calling. It is a great tool for fast assignment of reads for some other purposes. Novoalign is accurate and good, but perhaps a little slow. SHRiMP is also I think very slow.”
Personally, I think that Bowtie works very well for variant calling; I know of several groups using it for exactly that purpose. And while Novoalign *is* a bit slow, in my experience it’s just as fast as Maq, one of the two aligners from Durbin’s lab already on the list. Of course, Maq remains the most widely used tool for Illumina data (for now), and that’s an important consideration. Most NGS analysts know and love Maq as much as I do.
Balancing Speed and Sensitivity
However, these assessments bring into focus the key issue surrounding short read alignment for variant detection: finding the balance between speed and sensitivity. Bowtie and Novoalign exemplify this well. Bowtie is ultrafast – the fastest short read aligner I’ve used – and maps an entire lane (~15M reads) in just 1-2 hours. Yet in my experience it places slightly fewer reads than BWA/Maq, and it performs only ungapped alignments, so indels won’t be detected. In contrast, Novoalign typically maps more reads than Maq and BWA, seems very accurate, and remains one of the few aligners to allow gaps in fragment-end reads. In general, my comparisons show that Novoalign’s speed is comparable to Maq’s on typical datasets, though longer reads and lower-quality data can make Novoalign very slow indeed. The ultimate short read aligner, in my opinion, would have Bowtie-like speed, Novoalign-like sensitivity, and the widespread community support that Maq enjoys.
Ask the Guru: Heng Li
Heng Li, who led development of both Maq and BWA, told me that he’s not worried about sensitivity. “Most aligners nowadays are sensitive enough,” Heng wrote to me in an e-mail this week. “For detecting variations, specificity is of more importance. Nonetheless, how much wrong alignments may contribute to wrong SNPs is an open question. As long as alignment errors are random, more wrong alignments may not necessarily lead to worse SNP calls.” Clearly, he has already given some thought to these issues. If we’re lucky, Heng Li may begin to address these open questions in his new post at the Broad Institute.
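To see why the randomness matters, consider a toy Python simulation (mine, not Heng’s; every number is invented for illustration) comparing scattered versus clustered mis-placements under a naive depth-threshold caller:

```python
import random

# Toy illustration: the same number of wrong alignments hurts SNP
# calling far more when the errors pile up at a few loci than when
# they scatter randomly. All parameters are invented.
REF_LEN = 100_000     # hypothetical reference length (bp)
N_WRONG = 5_000       # misplaced reads, each contributing one mismatch
MIN_ALT_READS = 3     # naive caller: >= 3 mismatched reads => SNP call

def false_snps(positions):
    """Count loci where the mismatch pileup reaches the call threshold."""
    depth = {}
    for pos in positions:
        depth[pos] = depth.get(pos, 0) + 1
    return sum(1 for d in depth.values() if d >= MIN_ALT_READS)

random.seed(42)

# Random errors: mismatches land anywhere on the reference.
scattered = [random.randrange(REF_LEN) for _ in range(N_WRONG)]

# Systematic errors: reads from a paralog pile onto 50 loci.
hotspots = [random.randrange(REF_LEN) for _ in range(50)]
clustered = [random.choice(hotspots) for _ in range(N_WRONG)]

print("false SNPs, random errors:   ", false_snps(scattered))
print("false SNPs, clustered errors:", false_snps(clustered))
```

With these made-up parameters the scattered errors almost never stack three deep at any one position, while the clustered errors turn every hotspot into a false call.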
Underlying Causes of False Positives
Read mis-alignment would not be a serious problem if it occurred randomly across the genome. The trouble is that wrong alignments don’t seem to be random, at least in my experience. In projects like TCGA Ovarian, we see numerous false positives (particularly in tumors) that appear to arise from read mis-alignment. These typically manifest as clusters of variants supported by the same subset of reads, whose true origin is probably a paralogous region of the genome. It’s also possible that they’re caused by indels, which (as Kiran Garimella of the Broad Institute recently showed) sometimes manifest as clusters of substitutions at several nearby positions. We can aggressively filter these by looking for clusters of predicted SNPs, but it would be even better to remove the mis-alignments before variant calling begins.
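Here is a minimal sketch of the kind of cluster filter I mean; the window size and threshold are arbitrary placeholders, not values we have tuned:

```python
from collections import defaultdict

def filter_snp_clusters(snps, window=10, max_in_window=3):
    """Drop SNP calls that sit in dense clusters, which often mark
    mis-aligned reads or a nearby indel rather than real variants.
    `snps` is an iterable of (chrom, pos) tuples; returns the calls
    that survive.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)

    keep = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for pos in positions:
            # Count calls within `window` bp, including this one
            # (simple linear scan; use two pointers on real call sets).
            neighbors = sum(1 for p in positions if abs(p - pos) <= window)
            if neighbors < max_in_window:
                keep.append((chrom, pos))
    return keep
```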
Read Mis-Alignment and Paired-End Sequencing
Here at WashU, we have a growing concern that alignment scores for short reads are consistently over-estimated. Our manual reviewers often find that reads supporting false positives have mate pairs that align to a different chromosome altogether. In the absence of a translocation event, when this occurs, one of the two reads is incorrectly placed, and any variant it supports is probably not real. Personally, I’d rather remove both reads in such situations and rely on correctly mapped read pairs for detection of small variants.
The pervasive spread of paired-end sequencing is beginning to reveal just how often short read aligners get it wrong. The corollary is that taking read pair information into account during alignment is critically important, and those hopeful short read aligners that don’t do it yet (cross_match, for example) are destined for inferiority.
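As a sketch of the pair-based filter I have in mind, written against the pysam API (file names are placeholders, and a real pipeline would want to set evidence of genuine translocations aside rather than discard it):

```python
import pysam

# Keep only reads whose mate maps to the same chromosome, discarding
# both ends of inter-chromosomal pairs before variant calling.
with pysam.AlignmentFile("lane.bam", "rb") as bam, \
     pysam.AlignmentFile("concordant.bam", "wb", template=bam) as out:
    for read in bam:
        if not read.is_paired:
            out.write(read)      # unpaired reads pass through untouched
            continue
        if read.is_unmapped or read.mate_is_unmapped:
            continue             # cannot judge the pair; drop it
        # reference_id / next_reference_id index the chromosomes of the
        # read and its mate. A mismatch means one end is misplaced
        # (absent a real translocation); both ends fail this same test,
        # so both are removed.
        if read.reference_id != read.next_reference_id:
            continue
        out.write(read)
```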
High-Throughput Sequencing: Speed Matters
Yet what I’m learning from discussions with others in the community – particularly the growing surge of users making the leap from Maq to BWA – is that speed matters. With Illumina machines cranking out 20 gigabases in a single run, and projects like the 1,000 Genomes generating terabytes of sequence over the course of months, we can’t afford to be using the slower aligners, no matter their sensitivity. If necessary, we might apply a two-stage approach to alignment: rapidly map the reads that precisely match the reference, and pass only the variant-containing reads to a more sensitive aligner.
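To make the two-stage idea concrete, here is a toy Python sketch: stage one places reads by exact hash lookup against the reference, and only the leftovers would be handed to the slow, sensitive aligner. (A real pipeline would instead use something like Bowtie’s --un option to collect unaligned reads for Novoalign.)

```python
# Toy two-stage mapper. Stage 1 places reads that match the reference
# exactly, via a hash of every read-length window; only the leftovers
# would go to a slower, more sensitive aligner in stage 2.

def build_exact_index(reference, read_len):
    """Map each read_len-mer of the reference to its first position."""
    index = {}
    for i in range(len(reference) - read_len + 1):
        index.setdefault(reference[i:i + read_len], i)
    return index

def two_stage(reads, reference, read_len):
    index = build_exact_index(reference, read_len)
    placed, leftovers = {}, []
    for name, seq in reads:
        pos = index.get(seq)
        if pos is not None:
            placed[name] = pos             # stage 1: exact hit, done
        else:
            leftovers.append((name, seq))  # stage 2: sensitive aligner
    return placed, leftovers

reference = "ACGTACGTTTGACCAGT" * 3
reads = [("r1", reference[5:13]), ("r2", "ACGTTTGA"), ("r3", "AGGTTTGA")]
placed, leftovers = two_stage(reads, reference, 8)
print(placed)      # exact matches, with positions
print(leftovers)   # only these need the sensitive (slow) aligner
```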
Of course, as a colleague of mine recently joked, by the time we write the perfect aligner, Pac Bio will have come along and sequenced the entire genome, kilobases at a time.
Mitch Skinner says
Is it really that much of a joke? I’d certainly be happy to leave the short read stuff behind.
Actually, once longer-read (pac bio, oxford nanopore, whatever) sequencing becomes common, will there still be situations where short-read sequencing is the best tool?
Martin Gollery says
I agree: if the two reads of a mate pair map to different locations, they should both be removed.
In other news, perhaps we ought to accelerate Novoalign on a GPGPU: spending a few hundred bucks on a good graphics card, or a few thousand on a Tesla workstation, might be a better way to go than sacrificing quality.
-Marty
Steven Salzberg says
Although I have great respect for Richard Durbin, I think his comments about Bowtie are inaccurate and a bit irresponsible. Why am I not surprised that he is pushing the aligners out of his own lab (Maq and BWA) for the 1000 Genomes Project, for which he is the Chair of the Steering Committee?
In our hands, Bowtie is just as accurate as the other tools – more so, in many cases – and many times faster. (And note that “our” includes the developers of Bowtie – I freely admit our bias.) But you shouldn’t believe me – or Richard. Read our published papers, or do your own benchmarks. And for even more impressive results, try the new Crossbow system, a cloud-computing SNP caller based on Bowtie and SOAPsnp.
lh3 says
A lot of comments…
1. The novoalign developers claim that its speed has been considerably improved for long reads, although I have not tried it since then. I did not recommend novoalign for the 1000 Genomes Project (G1K) not because it is slow but because its free version does not support multithreading. As I recall, novoalign uses >6.8GB of memory. That is fine if the work is parallelized, but it will be a big problem at Sanger, where each CPU core typically has 2GB of memory.
2. Stampy was developed by Gerton, who is very careful about accuracy and indels. I have not tried it, but I would guess it achieves speed and accuracy comparable to novoalign, probably with a smaller memory footprint.
3. For the published NA18507 genome sequenced by Illumina, both Tony Cox and I called SNPs from properly paired reads only. I forget how the results compared with and without this requirement, though.
4. While it is an open question whether wrong alignments are a major source of false SNPs, it seems clear to me that wrong alignments will cause a lot of trouble for people who look for structural variations (SVs).
5. I see gapped alignment becoming more and more important with increasing read length. Probably 10-20% of short variants are indels, and some of them may have a more significant influence on gene function than SNPs. Also, both the false negative and false positive rates of SNP calls probably increase with longer reads if an aligner is not capable of gapped alignment.
6. I think that for end users, the development of short-read-specific aligners is largely done. Although we continue to see journal papers on such aligners, they are usually of more theoretical than practical importance. It takes time for an aligner to mature and get heavily used. Maq/bwa, soap/soap2, bowtie and novoalign were all developed when short-read data first became available.
7. In my view, developers who want to write something practically useful to end users should focus on long read alignment and long read de novo assembly. The problems there are different from those in short read alignment/assembly. Although for long reads we have good aligners like ssaha2/blat and assemblers like the Celera assembler/Arachne, they could probably be improved given the knowledge we have accumulated over the last few years.
lh3 says
@Steven & Dan
I should largely take the blame if anything Richard has said about aligners in the last three years is wrong. Most of the benchmarks so far were done by me, and he can only draw conclusions from what he sees.
I cannot say much about how bowtie compares with bwa in accuracy, for obvious reasons, but I do think the sequencing community (unlike the people who develop cDNA-to-genome mappers) does not have a tradition of carefully evaluating the specificity of read alignments. Most of the time we compare only speed and sensitivity, simply because they are easier to measure. After 8 years, do we know which aligner is more accurate for capillary/454 read mapping, SSAHA2 or BLAT?
Vincent Plagnol says
I am personally a very happy user of novoalign. Since I don’t work in a genome center (which is the case for most scientists), speed is not an issue, and novoalign runs overnight on one lane of data, which is all I need. I do use the commercial version, though, and the nodes on my computing cluster have a lot of RAM, so that probably helps.
It also has added features useful for RNA mapping – in particular, a differential penalty when both reads in a pair map to the same transcript, as opposed to different transcripts. And it does a very good job of picking up indels.
Jonathan says
While it uses a different technology than bowtie, bwa, and the like – and still has the issue of a non-integrated variant caller/mapping results file – did you have a look at ‘Segemehl’?
‘Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures’ (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2730575/)
It is highly accurate and quite fast!
MB says
There are actually several problems with the Segemehl paper:
1) Although accepted in August 2009, it uses very old versions of bowtie and bwa.
2) It compares aligners on chr21 only (Figure 3). The speed of maq/soap/patman is linear in the reference length, while bowtie and bwa scale better; they are mainly fast given a very long reference.
3) Bowtie should be faster than bwa. It is not in their benchmark largely because “--all” is in use, which is discouraged by the bowtie developers.
4) The authors are probably counting non-unique reads. That is why bwa and maq appear to have poor accuracy in Figure 3B.
5) Segemehl uses a lot of memory. Most aligners could be accelerated if we allowed them to use 100GB of memory.
6) Apparently Segemehl does not support paired-end alignment.
Bob Carpenter says
Many of the packages you describe do exactly what you suggest: try exact matches first, and when that fails, try more mismatches. You see this in Bowtie’s backtracking strategy and in BWA’s explicit exploration in order of 0-mismatches, 1-mismatches, etc. (see section 2.5 of Li and Durbin’s “Fast and accurate short read alignment” paper).
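To make that search order concrete, here is a stripped-down Python sketch (brute force over a toy reference, where the real tools walk a Burrows-Wheeler/FM index) that returns hits at the lowest mismatch count producing any:

```python
def hits_at_depth(read, reference, max_mm):
    """All positions where `read` aligns with at most max_mm mismatches."""
    hits = []
    for i in range(len(reference) - len(read) + 1):
        mm = sum(1 for a, b in zip(read, reference[i:i + len(read)])
                 if a != b)
        if mm <= max_mm:
            hits.append((i, mm))
    return hits

def iterative_deepening_map(read, reference, max_depth=2):
    # Cheapest explanation first: 0 mismatches, then 1, then 2, in the
    # same order BWA enumerates. The real tools prune as they search
    # rather than rescanning at each depth, as done here for clarity.
    for depth in range(max_depth + 1):
        hits = [h for h in hits_at_depth(read, reference, depth)
                if h[1] == depth]
        if hits:
            return depth, hits
    return None, []

print(iterative_deepening_map("ACGTT", "TTACGATACGTTGG"))  # (0, [(7, 0)])
```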
Which system is fastest will depend on the size of the genome, the number of reads, and, very importantly, the accuracy settings of the packages (e.g., how many mismatches to allow in the first N bases, maximum Smith-Waterman distance, etc.) and what the package computes (e.g., indels vs. same-length matches).
As for measuring precision, that’s pretty much impossible without knowing the true mappings, which people certainly try to establish with clone libraries. What you can always do is measure whether a package is computing its own scoring model accurately (as opposed to making search errors because of heavily weighted seeds, greedy choices in search, too few allowed mismatches in the prefix, etc.).
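As a sketch of that kind of self-consistency check (feasible only on toy-sized references; all names here are illustrative), one can exhaustively rescan for a placement that beats each reported one:

```python
def mismatches(read, reference, pos):
    """Mismatch count for `read` placed at `pos` (assumes it fits)."""
    return sum(1 for a, b in zip(read, reference[pos:pos + len(read)])
               if a != b)

def search_errors(reported_hits, reads, reference):
    """reported_hits: dict of read name -> claimed position; reads:
    dict of read name -> sequence. Returns names of reads for which a
    strictly better placement exists, i.e. where the aligner made a
    search error rather than a deliberate scoring decision."""
    errors = []
    for name, pos in reported_hits.items():
        read = reads[name]
        claimed = mismatches(read, reference, pos)
        best = min(mismatches(read, reference, i)
                   for i in range(len(reference) - len(read) + 1))
        if best < claimed:
            errors.append(name)
    return errors
```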