
    Short Read Aligners and Variant Detection

    November 6th, 2009

    In recent weeks I’ve had conversations with many people in the NGS community who are attempting to accurately call variants in Illumina/Solexa data.  Part of it stems from VarScan, my SNP and indel caller for next-gen sequencing data that works with Bowtie, Novoalign, cross_match, and other aligners.

    Another part of it stems from my involvement in 1,000 Genomes Pilot 3, for which several participants have applied their own variant detection pipelines to the same dataset.  Last month, Goncalo Abecasis, with input from David Craig, Heng Li, Gerton Lunter, and Fiona Hyland, proposed an exercise comparing several read mappers on real and simulated ABI SOLiD and Illumina/Solexa data.  The initial list of aligners – Maq, BWA, Stampy, BFAST, BioScope, and KARMA – demonstrated just how rapidly the field has grown since my aligner comparison last year at AGBT.  I’d looked at Maq and BFAST, and knew about (but hadn’t tried) BWA, but the others on the list (Stampy, BioScope, and KARMA) were ones I’d never heard of.

    I proposed adding three aligners to the list: Bowtie and Novoalign for Illumina data, and SHRiMP for SOLiD data.  My suggestions were politely declined by Richard Durbin (WTSI), who said “In our hands Bowtie doesn’t seem accurate enough for variant calling.  It is a great tool for fast assignment of reads for some other purposes.  Novoalign is accurate and good, but perhaps a little slow. SHRiMP is also I think very slow.”

    Personally, I think that Bowtie works very well for variant calling, and I know of several groups who are using it for that exact purpose. And while Novoalign *is* a bit slow, in my experience it’s just as fast as Maq, one of the two aligners out of Durbin’s lab that were already on the list.  Of course, Maq remains the most widely used tool for Illumina data (for now), and that’s an important consideration.  Most NGS analysts know and love Maq as much as I do.

    Balancing Speed and Sensitivity

    However, these assessments bring into focus the key issue surrounding short read alignment for variant detection – finding the balance between speed and sensitivity.  Bowtie and Novoalign exemplify this well.  Bowtie is ultrafast – the fastest short read aligner I’ve used – and maps an entire lane (~15 million reads) in just 1-2 hours.  Yet in my experience, it places slightly fewer reads than BWA/Maq.  And it performs only ungapped alignments, so indels won’t be detected.  In contrast, Novoalign typically maps more reads than Maq and BWA, seems very accurate, and remains one of the few aligners to allow gaps in fragment-end reads.  In general, my comparisons demonstrated that Novoalign speed is comparable to Maq on typical datasets.  However, longer reads and lower-quality data can make Novoalign very slow indeed.  The ultimate short read aligner, in my opinion, would have Bowtie-like speed, Novoalign-like sensitivity, and the widespread community support that Maq enjoys.

    Ask the Guru: Heng Li

    Heng Li, who led development of both Maq and BWA, told me that he’s not worried about sensitivity. “Most aligners nowadays are sensitive enough,” Heng wrote to me in an e-mail this week.  “For detecting variations, specificity is of more importance.  Nonetheless, how much wrong alignments may contribute to wrong SNPs is an open question. As long as alignment errors are random, more wrong alignments may not necessarily lead to worse SNP calls.”  Clearly, he has already given some thought to these issues.  If we’re lucky, Heng Li may begin to address these open questions in his new post at the Broad Institute.

    Underlying Causes of False Positives

    Read mis-alignment would not be a serious problem if it occurred randomly across the genome.  The trouble is that wrong alignments don’t seem to be random, at least in my experience.  In projects like TCGA Ovarian, we see numerous false positives (particularly in tumors) that seem to arise from read mis-alignment.  These typically manifest as clusters of variants, often present in each of a subset of reads whose true alignment is probably a paralogous region of the genome.  It’s also possible that they’re caused by indels, which (as Kiran Garimella of the Broad Institute recently showed) sometimes manifest as clusters of substitutions at several positions near one another.  We can aggressively filter these by looking for clusters of predicted SNPs, but even better would be to remove the mis-alignments before variant calling even begins.
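
    As a rough illustration of the cluster filter described above, here is a minimal Python sketch.  The function name, the 10 bp window, and the threshold of two neighboring calls are assumptions chosen for the example, not settings from VarScan or any production pipeline.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def filter_snp_clusters(snps, window=10, max_nearby=2):
    """Split SNP calls into (kept, flagged) based on local density.

    snps: iterable of (chrom, pos) tuples.
    A call is flagged when at least `max_nearby` other calls on the same
    chromosome lie within `window` bp of it. Both values are illustrative
    assumptions and would need tuning on real data.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)

    kept, flagged = [], []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for pos in positions:
            # Count calls inside [pos - window, pos + window] via binary search.
            lo = bisect_left(positions, pos - window)
            hi = bisect_right(positions, pos + window)
            nearby = hi - lo - 1  # exclude the call itself
            if nearby >= max_nearby:
                flagged.append((chrom, pos))
            else:
                kept.append((chrom, pos))
    return kept, flagged
```

    A filter like this only hides the symptom; it would also discard any genuine variants that happen to sit close together, which is one reason removing the mis-alignments upstream is preferable.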

    Read Mis-Alignment and Paired-End Sequencing

    Here at WashU, we have a growing concern that the alignment scores for short reads are consistently overestimated.  Often our manual reviewers find that reads supporting false positives have mate pairs that align to a different chromosome altogether.  In the absence of translocation events, when this occurs, one of the two reads is incorrectly placed, and any variant it supports is probably not real.  Personally, I’d rather remove both reads in such situations, and rely on correctly mapped read pairs for detection of small variants.
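
    One way to express that rule is sketched below in Python, assuming the alignments are available as SAM text records.  The function names are hypothetical; the field positions (FLAG, RNAME, RNEXT) and flag bits follow the SAM format, where an RNEXT of "=" means the mate is on the same reference sequence.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def mate_on_other_chromosome(sam_line):
    """Return True if both reads are mapped but to different references."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    if not flag & PAIRED or flag & UNMAPPED or flag & MATE_UNMAPPED:
        return False
    # "=" means same reference as RNAME; "*" means no mate information.
    return rnext not in ("=", "*", rname)


def drop_interchromosomal_pairs(sam_lines):
    """Yield header lines and every alignment whose mate is not on another chromosome.

    Each read of an inter-chromosomal pair points at the other chromosome
    in its own RNEXT field, so both reads of such a pair are dropped.
    """
    for line in sam_lines:
        if line.startswith("@") or not mate_on_other_chromosome(line):
            yield line
```

    This is deliberately blunt: it would also discard reads spanning real translocations, so it suits small-variant calling rather than structural variant detection.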

    The pervasive spread of paired-end sequencing is beginning to reveal just how often short read aligners can get it wrong.  The corollary here is that taking read pair information into account during alignment is of critical importance, and those hopeful short read aligners that don’t do it yet (cross_match, for example) are destined for inferiority.

    High-Throughput Sequencing: Speed Matters

    Yet what I’m learning from discussions with others in the community – particularly the growing surge of users making the leap from Maq to BWA – is that speed matters.  With Illumina machines cranking out 20 gigabases in a single run, and projects like the 1,000 Genomes generating terabytes of sequence over the course of months, we can’t afford to be using the slower aligners, no matter their sensitivity.  At worst, we might apply a two-stage approach to alignment that rapidly maps reads that precisely match the reference, and passes only the variant reads to a more sensitive aligner for mapping.
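
    To make the two-stage idea concrete, here is a toy Python sketch.  It assumes fixed-length reads and a reference small enough to index in memory with a plain dictionary; `sensitive_aligner` is a hypothetical callable standing in for whatever slower, gap-aware aligner handles the leftovers.  Real aligners use far more compact indexes than this.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def build_exact_index(reference, read_length):
    """Map every read-length substring of the reference to its start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - read_length + 1):
        index[reference[start:start + read_length]].append(start)
    return index


def two_stage_align(reads, reference, read_length, sensitive_aligner):
    """Place uniquely, exactly matching reads first; defer the rest.

    reads: iterable of (name, sequence) tuples.
    sensitive_aligner: hypothetical callable taking (leftover_reads, reference)
    and returning a dict of name -> placement.
    """
    index = build_exact_index(reference, read_length)
    placed, leftovers = {}, []
    for name, seq in reads:
        hits = [(pos, "+") for pos in index.get(seq, [])]
        hits += [(pos, "-") for pos in index.get(reverse_complement(seq), [])]
        if len(hits) == 1:
            placed[name] = hits[0]         # stage one: unique exact match
        else:
            leftovers.append((name, seq))  # no exact hit, or ambiguous
    # Stage two: only the leftover reads reach the slower aligner.
    placed.update(sensitive_aligner(leftovers, reference))
    return placed
```

    The payoff depends on what fraction of reads match the reference exactly; sequencing errors alone keep many reads out of the fast path.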

    Of course, as a colleague of mine recently joked, by the time we write the perfect aligner, Pac Bio will have come along and sequenced the entire genome, kilobases at a time.


    NGS Informatics: Hail to the Chief

    September 17th, 2009

    Bio-IT World’s Kevin Davies has a nice interview with David Dooling, who heads informatics here at the Genome Center and still finds time for his PolITiGenomics blog.  Dooling joined the center in 2001, as the Human Genome Project was wrapping up.  Now, he oversees about half of our informatics group – including IT personnel as well as the developers of our LIMS and automated data pipelines.

    All three groups, now that I think about it, have had to address significant challenges during our transition to a next-generation sequencing center.  Our LIMS deals with tens of millions of transactions per month, with a back-end database whose tables sometimes have billions of records.  Our automated pipeline (or APIPE) group develops all of the data pipelines that make whole-genome sequencing feasible – primary data analysis, alignment, coverage reporting, mutation detection, etc.  And the IT group must address the exponentially growing needs of data transfer and compute time for all of it – not an easy job.

    Despite these monumental tasks, under the leadership of David and others we’re currently “on a good path” to handle the current generation of sequencing tools.  Of course, that may change in the next couple of years, when technologies like Pac Bio’s SMRT platform begin cranking out single-molecule sequences 1,000 bases long or longer.

    In-House and Open Source

    Bio-IT World is heavily read by providers of commercial informatics tools, and this is reflected somewhat in the interview.  Davies often asks whether we’re working with any specific vendors, or considering any commercial tools.  Often enough we are – certainly for storage and data transfer systems, things that can’t be built from the ground up.  Yet whenever possible, we opt for the open-source solution.  Every workstation here, for example, is Linux.  We have but one Windows PC, and it’s not allowed to connect to the internet.  Most of our LIMS and many of our in-house tools were written in Perl.

    A Tough Nut for Commercial Vendors

    There are, of course, commercial alternatives to anything.  Yet vendors face significant hurdles in marketing products to large genome centers.  The tools that we use are often highly customized, and must continually evolve to address new technological developments.  Take aligners for example.  In the early days of Illumina sequencing, we licensed some commercial software – SLIMsearch and SXOG, for example – because there simply were no good alternatives to ELAND.  Then Maq came along, offering better functionality and performance in a free and open source program (offered, no less, by our trusted friends across the pond).  Exorbitantly priced licenses, needless to say, were quickly not renewed.

    Now there are numerous commercial solutions, and we’re often wooed by companies like CLC bio.  Yet for every commercial aligner there are half a dozen free/open-source alternatives, developed by academic groups that we respect and trust (Maq/BWA from Sanger, Bowtie from UMD, etc.), and many of these tools are pretty damn good.  A commercial option would have to be vastly superior to what’s currently available for us to consider a paid license.  With Bowtie and BWA mapping lanes of 15 million reads in just a couple of hours, the bar is already set pretty high.

    Outsourcing Sequencing?

    David offers, I think, a polite response to the question of whether we’d ever outsource our sequencing to a third party.  Personally, I can offer two reasons why this will probably never happen.  First, we’re already pretty happy with Illumina, a platform that can deliver whole human genomes at high coverage in just a few weeks.  All available evidence suggests that throughput will only continue to grow, and before long I expect we’ll be doing a genome on a single flowcell or less.  Of course, cost is a consideration (Illumina runs aren’t cheap).  It’s very possible that a company like Complete Genomics might be able to offer similar yields at a substantially reduced cost.  We do use companies like IDT and Agilent, for example, to synthesize oligo sequences that we might otherwise make in house.  They can make them cheaper, and faster, than we can.

    There is a second, and perhaps more compelling reason to keep sequencing in-house – because we’re in the business of research, and data is precious.  With our current capacity we can track the progress of sequencing runs in real-time, monitor error rates and alignment rates, and assess results the moment data is off of machines.  We maintain a forensics-lab-like “chain of custody” on the data from start to finish.  Doing so offers a certain sense of security, and confidence, when we use the results to tackle some of the most fundamental questions in biology.


    Maq, BWA, and Bowtie Compared

    July 30th, 2009

    Until recently, Maq has provided the central alignment/assembly/variant-detection functionality for our Illumina pipeline.  As technologies and algorithms evolve, however, we continue to investigate possible improvements.  Heng Li’s sequel to Maq, called BWA, uses an index based on the Burrows-Wheeler transform to reduce alignment time by orders of magnitude.  Also, BWA generates alignments in SAM/BAM format by default, which is convenient for our large-scale sequencing projects where BAM files are becoming the standard format.
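
    For readers curious about why a Burrows-Wheeler index makes lookups fast, the toy Python sketch below builds the transform for a short string and finds exact matches by backward search.  It is a teaching example under simplifying assumptions (naive suffix sorting, full occurrence tables, exact matching only) and is not how BWA or Bowtie are actually implemented.

```python
def bwt_index(text):
    """Build a tiny Burrows-Wheeler index for exact matching."""
    text += "$"  # sentinel that sorts before A, C, G and T
    # Naive suffix array; real tools use far more efficient construction.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i]: occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return suffix_array, first, occ


def exact_matches(pattern, index):
    """Return sorted start positions of pattern, found by backward search."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])
```

    The search loop takes one step per base of the read regardless of how long the reference is, which is where the speed comes from once the index is built.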

    These features, along with our impression that Heng Li and company do not plan future updates to Maq, lead me to infer that BWA is the heir-apparent for our Illumina pipeline.  Before the transition, however, we must compare Maq results with BWA results on the same dataset, to identify any differences that may affect downstream analysis.  Also, we are continuing to evaluate other aligners, especially Bowtie, which offer comparable or even better speed at short read alignment.

    Test Data: WGS and Targeted Sequencing of a Single Sample

    We have a sample in-house for which we performed whole genome sequencing (WGS) and subsequently validated numerous novel variants.  We also performed capture-based targeted resequencing (Illumina 2x75bp PE) of 6,000 genes in the same sample.  To compare the performance of BWA, Maq, and Bowtie, we aligned the capture data with each tool separately, and looked at about a dozen sites where we’d validated novel variants from WGS.

    Sensitivity – Total Reads Mapped

    Here’s a histogram of the read depth at each of the 12 variant sites by aligner:

    [Figure: read depth at each variant site, by aligner (bwa-maq-bowtie-coverage)]

    These results surprised me.  Based on previous experience, I’d guessed that Maq would yield the highest depth, followed by BWA, and then Bowtie.  Instead, with one exception, it was the other way around – Bowtie was more sensitive than BWA, which in turn was more sensitive than Maq.  Yet these differences were relatively minor; overall, the coverage seems very comparable across all three aligners.  I think that’s good news.

    Variant Frequency by Read Count

    Next, we looked at the observed variant frequencies, calculated as the relative fraction of reads supporting reference or variant alleles.
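
    For clarity, the calculation amounts to something like the following Python sketch.  The function name and the input format (a string of base calls observed at one site) are assumptions for the example; the point is simply that only reads supporting the reference or the variant allele enter the denominator.

```python
def variant_frequency(bases, ref_allele, var_allele):
    """Fraction of informative reads that support the variant allele.

    bases: base calls observed at one site, one character per read.
    Reads showing neither allele are ignored. Returns None when no read
    supports either allele.
    """
    calls = [b.upper() for b in bases]
    ref_count = calls.count(ref_allele.upper())
    var_count = calls.count(var_allele.upper())
    informative = ref_count + var_count
    if informative == 0:
        return None
    return var_count / informative
```

    For a heterozygous site in a diploid sample, this value would be expected to sit near 0.5, with some scatter at low depth.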

    [Figure: observed variant frequency at each variant site, by aligner (bwa-maq-bowtie-varfreq)]

    When it comes to variant frequency, Maq and BWA yield almost identical results (despite slight coverage disparities).  Bowtie yielded slightly higher frequencies in some cases, slightly lower frequencies in others.  Again, these were very minor differences from three very different alignment algorithms, suggesting that each of them yields fairly robust results.

    Farewell to Maq

    Unfortunately, the results of my analysis do not bode well for Maq, only because Maq took a few days to align data that BWA and Bowtie processed in a matter of hours.  So which Burrows-Wheeler aligner will prevail?  It’s difficult to say.  As far as SNP detection goes, BWA and Bowtie seem comparable.
