Comments on: Short Read Aligners and Variant Detection

By: Bob Carpenter

Bob Carpenter — Mon, 25 Jan 2010 20:27:20 +0000

Many of the packages you describe do exactly what you suggest: try exact matches first, and when that fails, try more mismatches. You see this in Bowtie’s backtracking strategy and in BWA’s explicit exploration in order of 0-mismatches, 1-mismatches, etc. (see section 2.5 of Li and Durbin’s “Fast and accurate short read alignment” paper).
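The "exact first, then more mismatches" order can be illustrated with a minimal Python sketch. It scans the reference naively and is not how Bowtie or BWA are implemented (both search an index rather than every window); the function names and the `max_mismatches` parameter are made up for this example.

```python
def hamming_within(a, b, limit):
    """Hamming distance of equal-length strings a and b, or None once it exceeds limit."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return None
    return mismatches


def align_escalating(read, reference, max_mismatches=2):
    """Try 0 mismatches first, then 1, and so on; stop at the first budget with a hit.

    Returns (mismatch_count, [positions]) or None if nothing fits the budget.
    """
    n = len(read)
    for budget in range(max_mismatches + 1):
        # Smaller budgets already failed, so any hit here has exactly `budget` mismatches.
        hits = [
            pos
            for pos in range(len(reference) - n + 1)
            if hamming_within(read, reference[pos:pos + n], budget) == budget
        ]
        if hits:
            return budget, hits
    return None


print(align_escalating("ACGT", "ACGTACGTTAGC"))  # (0, [0, 4])
print(align_escalating("ACGA", "ACGTACGTTAGC"))  # (1, [0, 4])
```

Reads that match exactly never pay for the more expensive inexact search, which is the point of ordering the search this way.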

Which system is fastest will depend on the size of the genome, the number of reads, and, very importantly, the accuracy settings on the packages (e.g. how many mismatches to allow in the first N bases, the maximum Smith-Waterman distance, etc.) and what the package computes (e.g. indels vs. same-length matches).

As for measuring precision, that’s pretty much impossible without knowing true mappings, which people certainly try with clone libraries. What you can always do is measure which package is computing its own model of scoring accurately (as opposed to making search errors because of heavy-weighted seeds, greedy choices in search, too few allowed mismatches in the prefix, etc.).
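One way to picture the "search error" measurement is the rough sketch below, which assumes a toy scoring model (ungapped alignment, a fixed penalty per mismatch) and a small reference that can be scored exhaustively. The names `best_penalty_exhaustive`, `search_error_rate`, `reported` and `max_penalty` are hypothetical and not part of any aligner.

```python
def best_penalty_exhaustive(read, reference, mismatch_penalty=1):
    """Score every ungapped placement of read and return the lowest penalty."""
    n = len(read)
    return min(
        mismatch_penalty * sum(x != y for x, y in zip(read, reference[pos:pos + n]))
        for pos in range(len(reference) - n + 1)
    )


def search_error_rate(reads, reference, reported, max_penalty):
    """Fraction of reads where the aligner missed its own model's optimum.

    reported maps read -> penalty of the alignment the aligner returned,
    or None if it returned nothing.
    """
    errors = 0
    considered = 0
    for read in reads:
        optimum = best_penalty_exhaustive(read, reference)
        if optimum > max_penalty:
            continue  # the model itself rejects this read, so a miss is not a search error
        considered += 1
        got = reported.get(read)
        if got is None or got > optimum:
            errors += 1
    return errors / considered if considered else 0.0
```

This needs no true mappings: it only asks whether the heuristic search found the best-scoring placement under the package's own scoring, which is feasible on a small reference or a sample of reads.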

By: MB

MB — Wed, 11 Nov 2009 17:24:10 +0000

There are actually several problems with the Segemehl paper:

1) As a paper accepted in August 2009, it uses very old versions of bowtie and bwa.

2) It compares aligners on chr21 only (Figure 3). While the speed of maq/soap/patman is linear in the reference length, bowtie and bwa scale better. They are fast mainly when given a very long reference.

3) Bowtie should be faster than bwa. It is not in their benchmark largely because “--all” is in use, which is discouraged by the bowtie developers.

4) The authors are probably counting non-unique reads. That is why bwa and maq have poor accuracy in Figure 3B.

5) Segemehl uses a lot of memory. Most aligners can be accelerated if we allow them to use 100GB of memory.

6) Apparently Segemehl does not support paired-end alignment.
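To see why point 4 matters, here is a small Python sketch with made-up simulated data. It says nothing about what the Segemehl authors actually computed; the `Call` record and its fields are hypothetical.

```python
from typing import NamedTuple


class Call(NamedTuple):
    true_pos: int      # where the simulated read was drawn from
    reported_pos: int  # where the aligner placed it
    n_best_hits: int   # number of equally good placements the aligner found


def accuracy(calls, unique_only=False):
    """Fraction of reads placed at their true position, optionally ignoring repeats."""
    kept = [c for c in calls if not unique_only or c.n_best_hits == 1]
    if not kept:
        return 0.0
    return sum(c.true_pos == c.reported_pos for c in kept) / len(kept)


calls = [
    Call(10, 10, 1),
    Call(50, 50, 1),
    Call(90, 400, 3),   # repeat: one of three equally good hits was reported
    Call(120, 700, 2),  # repeat: one of two equally good hits was reported
]
print(accuracy(calls))                    # 0.5
print(accuracy(calls, unique_only=True))  # 1.0
```

An aligner that reports one of several equally good hits for a repetitive read will look inaccurate if those reads are scored against a single true position.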

By: Jonathan

Jonathan — Tue, 10 Nov 2009 09:28:19 +0000

While it uses a different technology compared to bowtie, bwa and the like – and still has the issue of the non-integrated variant caller/mapping results file:

Did you have a look at ‘Segemehl’?
‘Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures’ (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2730575/)

It is highly accurate and quite fast!

By: Vincent Plagnol

Vincent Plagnol — Sat, 07 Nov 2009 14:13:57 +0000

I am personally a very happy user of novoalign. I do not work in a genome center, which is the case for most scientists, so speed is not an issue: novoalign runs overnight on one lane of data, which is all I need. I do use the commercial version, though, and the nodes on my computing cluster have a lot of RAM, so that probably helps.

It also has added features useful for RNA mapping, in particular a differential penalty when both reads in a pair map to the same transcript, as opposed to mapping to different transcripts. And it does a very good job at picking up indels.

By: lh3

lh3 — Fri, 06 Nov 2009 21:23:25 +0000

@Steven & Dan

I should largely take the blame if anything Richard said about aligners in the last three years is wrong. Most of the benchmarks so far were done by me, and he can only draw conclusions from what he sees.

I cannot say much about how bowtie compares with bwa in accuracy for obvious reasons, but I do think the sequencing community (not people who develop cDNA-to-genome mappers) does not have a tradition of carefully evaluating the specificity of read alignments. Most of the time, we only compare speed and sensitivity simply because they are easier to measure. After 8 years, do we know which aligner is more accurate on capillary/454 read mapping, SSAHA2 or BLAT?

By: lh3

lh3 — Fri, 06 Nov 2009 20:51:23 +0000

A lot of comments…

1. The novoalign developers claim that the speed of novoalign has been considerably improved for long reads, although I have not tried it since then. I did not recommend novoalign for the 1000 Genomes Project (G1K) not because it is slow but because its free version does not support multithreading. As I remember, novoalign uses >6.8GB of memory. That is fine if it is parallelized, but it will be big trouble at Sanger, where each CPU core typically has 2GB of memory.

2. Stampy is developed by Gerton, who is very careful about accuracy and indels. I have not tried it, but I guess it may achieve comparable speed and accuracy to novoalign, probably with a smaller memory footprint.

3. For the published NA18507 genome sequenced by Illumina, both Tony Cox and I called SNPs from properly paired reads only. I forget how the results compared with and without this requirement, though.

4. While whether wrong alignments are a major source of false SNPs is an open question, it seems clear to me that wrong alignments will cause a lot of trouble for people who look for structural variations (SVs).

5. I see gapped alignment becoming more and more important with increasing read length. Probably 10-20% of short variants are indels and some of them may have more significant influence on gene function than SNPs. Also, both false negative and false positive rates of SNPs probably increase with longer reads if an aligner is not capable of gapped alignment.

6. I think for end users, the development of short read specific aligners is largely done. Although we continue to see journal papers on such aligners, they are usually of more theoretical importance than practical. It takes time for an aligner to mature and become heavily used. Maq/bwa, soap/soap2, bowtie and novoalign were all developed when short-read data just became available.

7. In my view, developers who want to program something that is practically useful to end users should focus on long read alignment and long read de novo assembly. The problems there are different from those for short read alignment/assembly. Although for long reads we have good aligners like ssaha2/blat and assemblers like Celera assembler/Arachne, they could probably be improved given our accumulated knowledge in the last few years.
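Restricting SNP calling to properly paired reads, as in point 3, could look roughly like the Python sketch below if the alignments are in SAM text format. It relies on the FLAG bits 0x1 (read is paired) and 0x2 (mates properly aligned according to the aligner) from the SAM specification; the function names are hypothetical, and this is not a description of the pipeline actually used for NA18507.

```python
PAIRED = 0x1       # read is part of a pair
PROPER_PAIR = 0x2  # the aligner judged both mates to be properly aligned


def is_properly_paired(sam_line):
    """True if the FLAG field (column 2) has both the paired and proper-pair bits set."""
    flag = int(sam_line.rstrip("\n").split("\t")[1])
    return bool(flag & PAIRED) and bool(flag & PROPER_PAIR)


def properly_paired_only(sam_lines):
    """Yield header lines unchanged and only those alignments that are properly paired."""
    for line in sam_lines:
        if line.startswith("@") or is_properly_paired(line):
            yield line
```

What counts as "proper" (orientation, insert size range) is decided by each aligner, so the same filter can keep different reads depending on which package produced the file.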

By: Steven Salzberg

Steven Salzberg — Fri, 06 Nov 2009 20:45:35 +0000

Although I have great respect for Richard Durbin, I think his comments about Bowtie are inaccurate and a bit irresponsible. Why am I not surprised that he is pushing the aligners out of his own lab (Maq and BWA) for the 1000 Genomes Project, for which he is the Chair of the Steering Committee?

In our hands, Bowtie is just as accurate as the other tools – more so, in many cases – and many times faster. (And note that “our” includes the developers of Bowtie – I freely admit our bias.) But you shouldn’t believe me – or Richard. Read our published papers, or do your own benchmarks. And for even more impressive results, try the new CrossBow system, a cloud-computing SNP caller based on Bowtie and SOAPsnp.

By: Martin Gollery

Martin Gollery — Fri, 06 Nov 2009 18:32:01 +0000

I agree, if mate pairs are on different locations, they should both be removed.

In other news, perhaps we ought to accelerate Novoalign on a GPGPU: spending a few hundred bucks on a good graphics card or a few thousand on a Tesla workstation might be a better way to go than sacrificing quality.

-Marty

By: Mitch Skinner

Mitch Skinner — Fri, 06 Nov 2009 16:52:57 +0000

Is it really that much of a joke? I’d certainly be happy to leave the short read stuff behind.

Actually, once longer-read (pac bio, oxford nanopore, whatever) sequencing becomes common, will there still be situations where short-read sequencing is the best tool?