Random Post: Marco Island Meeting Preview
RSS 2.0
  • Home
  • About
  • Aligners
  • Genomes
  • Subscribe
  • VarScan
  •  

    NGS Informatics: Hail to the Chief

    September 17th, 2009

    Bio-IT World’s Kevin Davies has a nice interview with David Dooling, who heads informatics here at the Genome Center and still finds time for his PolITiGenomics blog.  Dooling joined the center in 2001, as the Human Genome Project was wrapping up.  Now, he oversees about half of our informatics group – including IT personnel as well as the developers of our LIMS and automated data pipelines.

    All three groups, now that I think about it, have had to address significant challenges during our transition to a next-generation sequencing center.  Our LIMS deals with tens of millions of transactions per month, with a back-end database whose tables sometimes have billions of records.  Our automated pipeline (or APIPE) group develops all of the data pipelines that make whole-genome sequencing feasible – primary data analysis, alignment, coverage reporting, mutation detection, etc.  And the IT group must address the exponentially growing needs of data transfer and compute time for all of it – not an easy job.

    Despite these monumental tasks, under the leadership of David and others we’re currently “on a good path” to handle the current generation of sequencing tools.  Of course, that may change in the next couple of years, when technologies like Pac Bio’s SMRT platform begin cranking out single-molecule sequences 1,000 bases long or longer.

    In-House and Open Source

    Bio-IT World is heavily read by providers of commercial informatics tools, and this is reflected somewhat in the interview.  Davies often asks whether we’re working with any specific vendors, or considering any commercial tools.  Often enough we are – certainly for storage and data transfer systems, things that can’t be built from the ground up.  Yet whenever possible, we opt for the open-source solution.  Every workstation here, for example, is Linux.  We have but one Windows PC, and it’s not allowed to connect to the internet.  Most of our LIMS system and many of our in-house tools were written in Perl.

    A Tough Nut for Commercial Vendors

    There are, of course, commercial alternatives to anything.  Yet vendors face significant hurdles in marketing products to large genome centers.  The tools that we use are often highly customized, and must continually evolve to address new technological developments.  Take aligners for example.  In the early days of Illumina sequencing, we licensed some commercial software – SLIMsearch and SXOG, for example – because there simply were no good alternatives to ELAND.  Then Maq came along, offering better functionality and performance in a free and open source program (offered, no less, by our trusted friends across the pond).  Exorbitantly priced licenses, needless to say, were quickly not renewed.

    Now there are numerous commercial solutions, and we’re often wooed by companies like CLC bio.  Yet for every commercial aligner there’s half a dozen free/open-source alternatives, developed by academic groups that we respect and trust (Maq/BWA from Sanger, Bowtie from UMD, etc.), and many of these tools are pretty damn good.  A commercial option would have to be so incredible, so vastly superior to what’s currently available for us to consider a paid license.  With Bowtie and BWA mapping lanes of 15 million reads in just a couple of hours, the bar is already set pretty high.

    Outsourcing Sequencing?

    David offers, I think, a polite response to the question of whether we’d ever outsource our sequencing to a third party.  Personally, I can offer two reasons why this will probably never happen.  First, we’re already pretty happy with Illumina, a platform that can deliver whole human genomes at high coverage in just a few weeks.  All available evidence suggests that throughput will only continue to grow, and before long I expect we’ll be doing a genome on a single flowcell or less.  Of course, cost is a consideration (Illumina runs aren’t cheap).  It’s very possible that a company like Complete Genomics might be able to offer similar yields at a substantially reduced cost.  We do use companies like IDT and Agilent, for example, to synthesize oligo sequences that we might make in house.  They can make them cheaper, and faster, than we can.

    There is a second, and perhaps more compelling reason to keep sequencing in-house – because we’re in the business of research, and data is precious.  With our current capacity we can track the progress of sequencing runs in real-time, monitor error rates and alignment rates, and assess results the moment data is off of machines.  We maintain a forensics-lab-like “chain of custody” on the data from start to finish.  Doing so offers a certain sense of security, and confidence, when we use the results to tackle some of the most fundamental questions in biology.

    AddThis Social Bookmark Button

    Maq, BWA, and Bowtie Compared

    July 30th, 2009

    Until recently, Maq has provided the central alignment/assembly/variant-detection functionality for our Illumina pipeline.  As technologies and algorithms evolve, however, we continue to investigate possible improvements.  Heng Li’s sequel to Maq, called BWA, utilizes the incredibly fast Burrows-Wheeler indexing algorithm to speed up alignment time by orders of magnitude.  Also, BWA generates alignments in SAM/BAM format by default, which is convenient for our large-scale sequencing projects where BAM files are becoming the standard format.

    These features, along with our impression that Heng Li and company do not plan future updates to Maq, lead me to infer that BWA is the heir-apparent for our Illumina pipeline.  Before the transition, however, we must compare Maq results with BWA results on the same dataset, to identify any differences that may affect downstream analysis.  Also, we are continuing to evaluate other aligners, especially Bowtie, which offer comparable or even better speed at short read alignment.

    Test Data: WGS and Targeted Sequencing of a Single Sample

    We have a sample in-house for which we performed whole genome sequencing (WGS) and subsequently validated numerous novel variants.  We also performed capture-based targeted resequencing (Illumina 2x75bp PE) of 6,000 genes in the same sample.  To compare the performance of BWA, Maq, and Bowtie, we aligned the capture data with each tool separately, and looked at about a dozen sites where we’d validated novel variants from WGS.

    Sensitivity – Total Reads Mapped

    Here’s a histogram of the read depth at each of the 12 variant sites by aligner:

    bwa-maq-bowtie-coverage

    These results surprised me.  Based on previous experience, I’d guessed that Maq would yield the highest depth, followed by BWA, and then Bowtie.  Instead, with one exception, it was the other way around – Bowtie was more sensitive than BWA, which in turn was more sensitive than Maq.  Yet these differences were relatively minor; overall, the coverage seems very comparable across all three aligners.  I think that’s good news.

    Variant Frequency by Read Count

    Next, we looked at the observed variant frequencies, calculated as the relative fraction of reads supporting reference or variant alleles.

    bwa-maq-bowtie-varfreq

    When it comes to variant frequency, Maq and BWA yield almost identical results (despite slight coverage disparities).  Bowtie yielded slighly higher frequencies in some cases, slightly lower frequencies in others.  Again, these were very minor differences from three very different alignment algorithms, suggesting that each of them yields fairly robust results.

    Farewell to Maq

    Unfortunately, the results of my analysis do not bode well for Maq, only because Maq took a few days to align data that BWA and Bowtie processed in a matter of hours.  So which Burrows-Wheeler aligner will prevail?  It’s difficult to say.  As far as SNP detection goes, BWA and Bowtie seem comparable.

    AddThis Social Bookmark Button

    SAM, BAM, Thank You Ma’am

    July 24th, 2009

    Genome centers around the world have at last found something that we agree on.  It seems like only a few months ago that I first heard of SAM (Sequence Alignment/Map) and his brother BAM (binary/compressed SAM).  Yet those who produce next-gen sequencing data are rapidly adopting this universal short read data format as the de facto standard.

    The Need for a Standard Short Read Data Format

    From my work with short read aligners over the past year, it was quite obvious that we needed a standard format.  More than a dozen alignment tools for next-gen data have been released – Maq, Bowtie, Novoalign, SeqMap, BFAST, and SHRiMP, just to name a few – and every one of them has its own custom format.  The closest thing to a standard was Maq’s MAP format, a non-human-readable file that reached prominence only because Maq was the most widely used tool out there.

    NGS Community Looks to Heng Li

    The need for a standard is almost certainly not the only reason for the rapid adoption of the SAM/BAM specification.  Another key factor is Heng Li, of Richard Durbin’s lab at Sanger.  His work in developing Maq, and his reputation as a scientist, established a wide credibility for Heng in the NGS community.  While Maq is, and continues to be, a powerful tool for analyzing Illumina and ABI SOLiD data, newer algorithms that leveraged Burrows-Wheeler Transformation (BWT) proved orders of magnitude faster at mapping reads. Bowtie was one of the first; soon after the makers of SOAP released SOAP2, a revamped version of their short read aligner that uses Burrows-Wheeler and is no longer open source.

    Heng’s group soon developed a sequel to Maq, called BWA, that leverages this algorithm to speed up read alignments.  It also produces BAM format output by default, a key feature that encouraged friends of Maq to make the switch.  Richard Durbin’s group worked with a number of institutions to develop the SAM/BAM specification.  Perhaps even more important was the development (and recent publication) of SAMtools, a suite of programs for manipulating SAM/BAM files and calling variants.  By working with a standard alignment format, SAMtools can serve as a generic, aligner-agnostic variant detection pipeline.

    SAM/BAM Convergence

    Adoption of SAM/BAM by the sequencing community has been surprisingly rapid and widespread.  A few short read aligners like Maq and Bowtie already have tools for converting their native formats to SAM.  The latest Novoalign has a SAM output option built-in.  There’s even a SAM/BAM converter for 454 data, which sounds like a dicey prospect given the homopolymer undercalling/overcalling problem.  Large-scale sequencing efforts like the 1,000 Genomes Project and The Cancer Genome Atlas now consider BAM the standard format for data exchange.  Many variant calling algorithms, like those developed at the Broad Institute, now accept BAM input as well.

    The rush to make SAM/BAM a standard surprised me, but I don’t think it’s a bad thing. We needed a way to share NGS data that could be easily compressed, but also had a human-readable form.  My only concern is this: many people may not realize that the data in a SAM/BAM file is aligner-dependent. There’s no such thing as a SAM file for an Illumina lane – instead, there’s on file for Maq alignments, one file for Bowtie, one file for BWA, etc.  The good news, however, is that now we will have a centralized format for direct head-to-head comparisons of short read aligners.

    AddThis Social Bookmark Button

    Variant Detection in Massively Parallel Sequencing

    June 22nd, 2009

    There is at last some evidence that I do real scientific work, and real scientific writing outside of Massgenomics.  Online today at Bioinformatics is our publication of variant detection in massively parallel sequencing of individual and pooled samples.  In it we present VarScan, the culmination of my work over the past two years to develop algorithms for calling SNPs and indels in deep resequencing data from the 454 and Illumina/Solexa platforms.  It’s one of the many cancer genomics tools that we have developed and made available to the research community, and the first to be published.

    VarScan variant detection in massively parallel sequencing

    The Challenge

    With next-generation sequencing, the problem of accurately calling variants faces two key challenges.  First, to distinguish between true variants and false positives in shorter, more error-prone sequencing reads.  Second, to handle the sheer volume of data, which might be millions of reads from a single run.  The 454 and Illumina platforms present their own particularly difficulties as well.  On the 454, it’s homopolymers – stretches of A’s, C’s, G’s, or T’s four or more bases long.  The exact number of bases in a homopolymeric stretch becomes difficult to resolve by pyrosequencing, and as a result, 454 sequencing tends to overcall or undercall bases in these regions.  This presents alignment difficulties and numerous artifact insertions/deletions relative to the reference sequence.  On the Illumina platform, reads are shorter, and the base qualities drop sharply after 28 or so bases.  Until recent improvements that have yielded 50, 75, and 100 bp read lengths, alignment of Illumina reads was particularly difficult given this base quality paradigm.

    Targeted Resequencing, Pooled Samples

    For whole genome sequencing of individual samples, we’ve used Maq with some success to map, assemble, and call variants from Illumina/Solexa reads.  However, for various projects we’re interested in sequencing only selected regions (targeted by PCR or capture), often in many samples.  Deep sequencing of pooled samples is one area, unfortunately, where Maq and RunMapper/Newbler continue to struggle.  Here, the individual read counts supporting reference and variant alleles become important in distinguishing true variants and assessing their frequencies in a pool.  Also, due to the high-priority of say, validating somatic variants in a tumor sample with 454 sequencing, we wanted something that could leverage faster aligners – parallelized BLAT for 454 and Bowtie for Illumina.

    VarScan: SNP and Indel Calling in BLAT, Bowtie, Novoalign…

    To address these challenges we developed VarScan, a program for calling SNPs and indels in next-gen sequencing data.  VarScan accepts read alignments in several native formats – BLAT and Newbler for 454 data, Bowtie, cross_match, and Novoalign for Illumina/Solexa data.  Given a file of read alignments, VarScan detects sequence variants, combines them by position and type, and then computes the read counts, average base quality, and number of strands supporting each allele.  As a final step, one can use VarScan to limit variant calls to a set of targeted positions, which removes variants arising from spurious read alignments.  Variant calls can also be filtered by base quality, read count, and variant frequency, which is useful for isolating variants at frequencies of say 1% or higher in a pool of samples.

    Proof of Principle: Individual 454 and Pooled Illumina Sequencing

    To demonstrate VarScan’s capabilities, we used data from a collaboration with Stephen Daiger and colleagues, who are resequencing candidate genes in a cohort of families with Retinitis Pigmentosa (RP).  The Genome Center had designed 1,000 amplicons targeting the exons of dozens of genes, and obtained PCR products for these for each sample.  Samples were sequenced individually on the 454 platform to ~70x coverage.  Then, samples from 42 individuals were pooled and sequenced on the Illumina platform to extremely deep coverage (~6000x, or ~125x per sample).  We used BLAT and Bowtie to align the 454 and Illumina reads, respectively, and applied VarScan to the resulting alignments.

    Sensitivity, Specificity, SNPs, and Indels

    I won’t ruin the whole story for you, but suffice it to say that most of the SNPs called by VarScan in the 454 sequencing were present in dbSNP or supported by VarScan’s Illumina/Solexa calls, or both.  Incredibly, when we plotted the estimated allele frequencies from deep Illumina sequencing of the pool to the known frequencies from individual 454 results, the correlation was extremely high (>0.95, Pearson’s).  This demonstrates VarScan’s capabilities not only to detect variants with good sensitivity/specificity, but to accurately estimate their frequency in a pool.  To evaluate VarScan’s performance on insertion/deletion (indel) variants, we identified 77 high-confidence short (1-5 bp) indels in the 454 data and attempted to validate them with Illumina data.  Since Maq, Bowtie, and most other Illumina aligners don’t allow gaps, we used another aligner, Novoalign, which claims to have higher read placement sensitivity and aligns reads with gaps in single-end libraries.  We used VarScan to call indels in the Novoalign output, and we were able to validate around 60% of the high-confidence indels from 454 with Novoalign alignments of Illumina data.  Taken together, these results support the utility of VarScan for variant detection in next-gen sequencing, and demonstrate the feasibility of large-scale mutational profiling by deep resequencing of pooled samples.

    References

    Koboldt, D., Chen, K., Wylie, T., Larson, D., McLellan, M., Mardis, E., Weinstock, G., Wilson, R., & Ding, L. (2009). VarScan: Variant detection in massively parallel sequencing of individual and pooled samples Bioinformatics DOI: 10.1093/bioinformatics/btp373

    AddThis Social Bookmark Button