As our genome center makes the transition from capillary-based to massively parallel sequencing platforms, the development of automated pipelines for data processing has become a high priority. Last week we had a visit from Illumina’s informatics group to discuss several issues related to the GA (Solexa) platform, including image compression, data storage, and workflow informatics. There was also talk of a downstream analysis tool, called Bullfrog, that will perform SNP/indel/SV detection (though I got the impression that the software’s nowhere near release at present).
But Illumina is not the only platform, and Eland is certainly not the only aligner. Thus we’ve formed a focus group to evaluate the different programs for sequence alignment and variant detection in next-generation sequence data. We met last week and put together a list of aligners that work with Illumina (Solexa) and/or Roche (454) data. We also compiled, separately, a shorter list of external and internal programs that do SNP and indel detection on either platform. Some programs, like Maq, were in both lists because they do alignments and SNP detection. Some tools are feasible for short (Solexa-length) reads but not long (454-length) reads, and vice versa. In the end we had a list of 15 different aligners for Illumina/Roche data. Some are good, some are bad, and some we simply don’t know.
We agreed that the plan was to evaluate each aligner on the same data set, but decisions on which data set to use, and how to compare the different aligners, were matters of more intense debate. Should we work with human data, or focus on less complex genomes like C. elegans or E. coli? Performance metrics like CPU time, memory usage, disk space, and cost (some are non-free) are obvious points for comparison, but what about alignment accuracy? We need some way to determine if a read placement on the genome is correct or erroneous. How do we know? The question of alignment “truth” and how to determine it was not an easy one to answer.
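The resource side of the comparison is the easy part to automate. Here is a minimal sketch, not our actual pipeline, of timing an arbitrary aligner command and recording its CPU time and peak memory. The function name `profile_command` is made up for illustration, the command line is whatever you pass in, and the `resource` module is Unix-only (`ru_maxrss` is in kilobytes on Linux and bytes on macOS).

```python
import resource
import subprocess
import sys
import time
from typing import Dict, List


def profile_command(argv: List[str]) -> Dict[str, float]:
    """Run a command and report wall time, CPU time, and peak memory.

    Hypothetical helper: RUSAGE_CHILDREN is cumulative over all child
    processes that have been waited for, so CPU time is taken as a
    before/after difference. ru_maxrss is the largest peak seen among
    children so far, not a per-run value, so profile one command per
    Python process if the memory figure matters.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "exit_code": completed.returncode,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        "peak_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Stand-in command; replace with a real aligner invocation.
    print(profile_command([sys.executable, "-c", "pass"]))
```

Disk space and cost would still have to be tracked separately, and none of this says anything about whether the placements are correct.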
After an hour of discussion, we tentatively agreed on a dataset – Illumina PE runs on the first human samples that we’ve already sequenced in-house for the 1000 Genomes Project. These runs come from one of the HapMap Project trios, which means that we can validate our SNP detection results against the known HapMap genotypes that were generated on a variety of platforms (and predominantly by other centers). Also, the 1000 Genomes Project DCC will be performing its own evaluation of alignment tools and sequence analysis using the same data, so we can compare notes.
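To make the validation idea concrete, here is a rough sketch of a genotype concordance check between SNP calls and known genotypes. The data layout (dictionaries keyed by chromosome and position, genotypes as two-letter strings) and the function names are assumptions for illustration only; real HapMap files and real SNP caller output would need parsing first.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position); assumed layout


def normalize(genotype: str) -> str:
    """Treat a genotype as an unordered allele pair, so "GA" equals "AG"."""
    return "".join(sorted(genotype.upper()))


def concordance(
    called: Dict[Site, str], known: Dict[Site, str]
) -> Tuple[int, int, float]:
    """Compare calls to known genotypes at sites present in both sets.

    Returns (sites compared, sites matching, concordance rate).
    Sites found in only one set are ignored here.
    """
    shared = set(called) & set(known)
    matches = sum(
        1 for site in shared if normalize(called[site]) == normalize(known[site])
    )
    rate = matches / len(shared) if shared else 0.0
    return len(shared), matches, rate


# Toy data, invented for the example.
known = {("chr1", 1000): "AG", ("chr1", 2000): "CC", ("chr2", 500): "TT"}
called = {("chr1", 1000): "GA", ("chr1", 2000): "CT", ("chr3", 42): "AA"}

print(concordance(called, known))  # (2, 1, 0.5)
```

Note that this only measures agreement at sites where both sets have a genotype. Known sites with no call at all (here chr2:500) would need to be counted separately to get at sensitivity.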
We put together a short list, by platform, of the aligners to evaluate first. Some decisions here were easy – we’re obviously going to look at Maq and Eland for Solexa data, and we’re already evaluating BLAT and cross_match on some of our 454 data. Other decisions were more difficult – should we evaluate RMAP, whose authors [allegedly] don’t plan to continue development? What about SX OligoSearch, which we can currently only run on Itanium servers? We eventually had five or six aligners per platform that made the short list. This week, we’re putting together the data, and next week, the real work begins.
My friends at GTO picked up this entry in their Daily Scan: http://www.genome-technology.com/issues/blog/general/147733-1.html
They noted that the aligners we’re testing “include Maq, Eland, BLAT, and cross_match.” In fairness, we’re also looking at a few others in the first evaluation. Here’s the list of prioritized aligners by sequencing platform:
Illumina/Solexa Data
-Maq
-Eland
-New CrossMatch
-SlimSearch
-Novocraft
-Mosaik
Roche/454 Data
-BLAT
-New CrossMatch
-Mosaik
-GsMapper
-SynaSearch
If possible, please include SeqMap (http://biogibbs.stanford.edu/~jiangh/SeqMap/) in the list. It works like Eland, but can handle 3 or more mismatches as well as insertions and deletions.
I’ve also heard Shrimp (http://compbio.cs.toronto.edu/shrimp/) gives good results (I am not affiliated and have not even tried it yet). I know people who are using RMAP, so I think it’s worth including.
A fairly good list! In addition, it seems to me that ZOOM (http://www.bioinformaticssolutions.com/products/ph/ZOOM.pdf) is another worthy candidate. It is developed by the group of people who wrote PatternHunter. In my view, it is a strong group.
ZOOM will be free for academic use and is going to be released in a few weeks, according to an email sent by that group.
In case anybody’s interested in trying Novocraft, the aligners (novopaired + novoalign) are free to use and can be downloaded at http://www.novocraft.com
We believe that we have a very strong assembler for short reads.
You can get the white paper from our website: http://www.clcbio.com