I’ve always been a fan of Bowtie, one of the first algorithms to leverage the Burrows-Wheeler Transform for short read alignment. When I first encountered it in 2008, it was incredibly fast: faster than Maq and Novoalign, two of the early popular algorithms for read mapping. Perhaps more importantly, it was ultra memory-efficient, enabling one to map millions of reads on a typical desktop computer. You’d still need the technical expertise to do anything with the alignments, but hey, it was a start. I liked it enough that the first version of VarScan included support for native Bowtie alignment formats (this was before the widespread adoption of the SAM/BAM format).
Early Bowtie Aligner Limitations
Despite these features, Bowtie had a few limitations. First, it required all reads to have the same length and imposed a maximum read length that made it essentially incompatible with Roche/454 data. This wasn’t a big problem, because there were other aligners for 454 data that could handle its moderate level of throughput.
Even though it was faster, Bowtie was less suitable for paired-end data than Maq because it didn’t leverage the mate pairing information to improve alignment. It simply attempted to map each read in the mate pair independently, then went back to calculate the distance between them. This was kind of a bummer, but it still left Bowtie quite suitable for fragment-end data, which made up the majority of sequencing data in 2008.
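To make the "map independently, then check the distance" idea concrete, here is a minimal sketch in Python. It is not Bowtie's code: the `Hit` record, the forward/reverse orientation rule, and the 100-500 bp fragment range are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one independently mapped read."""
    pos: int      # leftmost reference coordinate of the alignment
    length: int   # number of reference bases covered
    strand: str   # "+" or "-"


def fragment_length(a: Hit, b: Hit) -> int:
    """Span from the leftmost to the rightmost aligned base of the pair."""
    left = min(a.pos, b.pos)
    right = max(a.pos + a.length, b.pos + b.length)
    return right - left


def is_concordant(a: Hit, b: Hit, min_frag: int = 100, max_frag: int = 500) -> bool:
    """Check a pair after the fact; the pairing never guided the mapping.

    Assumes a forward/reverse library: one mate on "+", the other on "-",
    with the "+" mate upstream. The fragment range is an assumed default.
    """
    if a.strand == b.strand:
        return False
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    if fwd.pos > rev.pos:
        return False
    return min_frag <= fragment_length(a, b) <= max_frag
```

The point of the sketch is what it leaves out: a mate that maps ambiguously gets no help from its partner, because the pair is only examined once both placements are already fixed.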
Another Bowtie limitation was that it didn’t align reads with gaps. In other words, if a read contained an insertion or deletion relative to the reference sequence, Bowtie wouldn’t map it. Side note: This also would have prevented Bowtie from working on Roche/454 data (and later IonTorrent data) due to the known homopolymer-associated sequencing errors. At the time, however, everyone was still struggling with SNP detection in next-gen sequencing data, so ungapped alignments weren’t a dealbreaker.
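A toy comparison shows why a single indel is so damaging to an ungapped aligner. The sketch below is my own illustration, not anything from Bowtie: it scores the same read two ways, position by position (Hamming distance) and with gaps allowed (edit distance). The sequences are made up.

```python
def hamming(a: str, b: str) -> int:
    """Mismatches in an ungapped, position-by-position comparison."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimum substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # match or mismatch
                           prev[j] + 1,             # extra base in a
                           cur[j - 1] + 1))         # extra base in b
        prev = cur
    return prev[-1]


# Made-up example: the read is the reference with the base at index 5 deleted.
reference = "ACGTACGGTCAATGC"
read = reference[:5] + reference[6:]

ungapped = hamming(read, reference[:len(read)])
gapped = edit_distance(read, reference)
print(f"ungapped mismatches: {ungapped}, gapped edits: {gapped}")
```

Everything downstream of the deleted base shifts out of register, so the ungapped comparison sees a string of mismatches where the gapped one sees a single event. With a tight mismatch limit, that read simply goes unmapped.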
Indels and Gapped Alignments
In time, though, as our capability to detect insertion/deletion variants (indels) increased — due to algorithmic developments as well as longer reads — gapped alignment became more and more important. Benjamin Langmead, the developer and first author, once mentioned to me that it was the most-requested feature for Bowtie. The demand undoubtedly continued to increase as aligners such as BWA offered similar speed and memory performance, while making efforts to align reads across gaps. In paired-end data with one read anchored, BWA will even perform a more sensitive Smith-Waterman alignment to align its mate while allowing gaps. There was also Novoalign, a commercial aligner, which seemed the most sensitive to gaps in reads according to findings by Heng Li, myself, and others.
Interestingly, the Pindel algorithm, which identifies indels by splitting up the unmapped mate in a read pair where only one read mapped, nicely compensates for this limitation. In fact, the original Bowtie software paired with Pindel seems like it would be a powerful combination for efficient read mapping with indel detection.
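The split-read idea can be sketched in a few lines. To be clear, this is a deliberately naive illustration and not Pindel's actual pattern-growth implementation: the function name, the exact-match requirement, and the `min_anchor` parameter are all assumptions, and it only looks for a simple deletion.

```python
def find_deletion(read: str, ref: str, min_anchor: int = 5):
    """Naive split-read search for one deletion.

    Tries every split point in the read. If the left piece matches the
    reference exactly and the right piece matches exactly further
    downstream, the skipped reference bases are reported as a deletion.
    Returns (deletion_start, deletion_length) or None.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = ref.find(left)          # first exact hit only (toy)
        if left_pos == -1:
            continue
        left_end = left_pos + split
        right_pos = ref.find(right, left_end)
        if right_pos > left_end:           # a gap separates the two pieces
            return left_end, right_pos - left_end
    return None


# Made-up example: the read skips reference bases 18..22.
ref = "GATTACAGGCTTCAGTCCATGAACGTTGCACTAGGATCCA"
read = ref[8:18] + ref[23:33]
print(find_deletion(read, ref))
```

In a real workflow the mapped mate would restrict the search to a window near its own position; here the whole reference is searched just to keep the example short.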
Bowtie 2: Fast Alignment with Gaps
Several subsequent releases of Bowtie addressed some of the early limitations, and continued to increase its performance. And finally, we got the gapped alignment feature we were waiting for in Bowtie 2, which was just published in Nature Methods.
In the publication, Langmead and Salzberg describe a sort of hybrid algorithm that allows efficient gapped alignment of short reads. It essentially has four steps:
- “Seed” substrings, which are short segments that are likely to have unique matches in the genome, are extracted from each read
- Seeds are aligned to the reference genome in ungapped fashion using the compressed index
- Seed placements in the genome are prioritized to find the most likely map location(s)
- Seeds are extended into full alignments (allowing gaps) with a hardware-accelerated dynamic programming algorithm
Here, Bowtie leverages the speed of its “full-text minute index” for ungapped alignment to rapidly place seed segments without gaps, and then an accelerated algorithm to do the full read alignment with gaps. According to the authors, it’s a combination that allows for high speed, sensitivity, and accuracy.
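For readers who think better in code, the four steps can be sketched as a toy seed-and-extend aligner in Python. This is a loose illustration under heavy assumptions, not Bowtie 2's implementation: a plain dictionary of k-mers stands in for the compressed index, seeds are ranked by a simple vote count, and the extension is an ordinary unit-cost dynamic program with no hardware acceleration. All function names and parameters (`k`, `interval`, `pad`) are made up for the example.

```python
from collections import Counter, defaultdict


def build_index(ref, k):
    """Map each k-mer to its reference positions (stand-in for the FM index)."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def extract_seeds(read, k, interval):
    """Step 1: take length-k substrings of the read at a fixed interval."""
    return [(off, read[off:off + k])
            for off in range(0, len(read) - k + 1, interval)]


def candidate_starts(read, index, k, interval):
    """Steps 2-3: place seeds exactly, then rank implied read starts by votes."""
    votes = Counter()
    for off, seed in extract_seeds(read, k, interval):
        for pos in index.get(seed, []):
            votes[pos - off] += 1
    return [start for start, _ in votes.most_common()]


def gapped_distance(read, window):
    """Step 4: DP where the whole read must align but window ends are free."""
    prev = [0] * (len(window) + 1)       # alignment may start anywhere
    for i in range(1, len(read) + 1):
        cur = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            cost = 0 if read[i - 1] == window[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or mismatch
                         prev[j] + 1,         # read base with no partner
                         cur[j - 1] + 1)      # reference base skipped
        prev = cur
    return min(prev)                     # alignment may end anywhere


def align(read, ref, index, k=5, interval=3, pad=3):
    """Return (approximate_start, edit_distance) for the best candidate, or None."""
    best = None
    for start in candidate_starts(read, index, k, interval):
        lo = max(0, start - pad)
        hi = min(len(ref), start + len(read) + pad)
        dist = gapped_distance(read, ref[lo:hi])
        if best is None or dist < best[1]:
            best = (start, dist)
    return best


# Made-up example: a read carrying a one-base deletion relative to the reference.
ref = "GATTACAGGCTTCAGTCCATGAACGTTGCACTAGGATCCA"
index = build_index(ref, 5)
read = ref[8:18] + ref[19:28]
print(align(read, ref, index))
```

The division of labor is the thing to notice: the cheap exact lookups narrow the genome down to a handful of candidate locations, and the expensive gapped comparison only ever runs on small windows around them.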
The ability of this new Bowtie algorithm to align with gaps will also aid RNA-Seq analysis using the TopHat package, which utilizes Bowtie as its core aligner, because the gaps that are present in mature mRNA are likely to be better handled.
Bottom line, even if you’re using something else to align reads right now, Bowtie might be worth a look.
Download Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
References
Langmead B, & Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357-359. PMID: 22388286