February 22nd, 2010

The Advances in Genome Biology and Technology (AGBT) meeting begins this week at Marco Island. I’ll be there to present a poster on our somatic mutation detection pipeline, and also to learn about what’s to come in next-generation and next-next-generation sequencing.
Some of the companies are already ramping up. Last week Pac Bio announced the initial members of their partnership program to provide complete solutions for single-molecule real-time sequencing. Microfluidics company Caliper Life Sciences formed a scientific advisory board for next-gen sequencing that includes WashU’s own Vince Magrini. Other companies – Illumina, Complete Genomics, and RainDance Technologies, for example – are hosting workshops or other events at AGBT.
AGBT Sessions Not To Miss
Day 1 of the meeting will be very strong, with opening remarks from Len Pennacchio (JGI) followed by talks from Kelly Frazer (UCSD) on genomic enrichment, Mike Snyder (Stanford) on paired ends for SVs/assembly, and Barbara Wold on ChIP-Seq. On Day 2, Stacey Gabriel of the Broad Institute will discuss applications of new sequencing technology to medical and cancer genetics. Carlos Bustamante of Stanford will present the complete genome sequencing and analysis of African-American and Mexican-American individuals. WashU’s David Wang will give a talk on metagenomic approaches to pathogen discovery.
Some friends of mine are giving talks later that evening. Jeff Reid (Baylor College of Medicine) has what looks to be a very interesting talk on miRNA precursor variants in schizophrenia. Daniel MacArthur, of Sanger and Genetic Future fame, will present “Loss-of-Function Mutations in Healthy Human Genomes,” likely based on his work with the 1000 Genomes Project.
Cancer Genomics and Sequencing
I’m very excited about an entire session devoted to cancer genomics. Elliott Margulies (NHGRI) will discuss the sequencing and analysis of a melanoma genome. In what may be the first application of single-molecule sequencing to cancer, the sequencing of Ewing’s Sarcoma on a Heliscope instrument will be presented by Timothy Triche of Childrens Hospital Los Angeles. Two speakers from BC Cancer Agency will discuss rearrangements in follicular lymphoma and capture/transcriptome sequencing in lung cancer.
Whole Genome Sequencing
There will be big-picture sequencing talks as well. Genome center co-director Elaine Mardis will present “Single Molecule Sequencing to Detect and Characterize Somatic Mutations in Cancer Genomes.” Stan Nelson of UCLA will give a talk, presumably on his group’s recent publication – whole genome sequencing of a glioblastoma cell line on ABI SOLiD.
I’ll be there, and posting regular updates, as the latest and greatest in sequencing technologies unfolds at Marco Island.

Posted by Dan Koboldt
February 11th, 2010
Later this month, I’ll present our work on detecting somatic mutations using capture and Illumina sequencing at the Advances in Genome Biology and Technology meeting on Marco Island. Using an internally developed solution-phase capture technology (Washington University Capture, or WUCap), we selectively targeted coding regions of 6,000 genes in tumors and matched controls from 94 patients with ovarian cancer and sequenced them on the Illumina GAIIx.
Capture Somatic Mutation Detection Pipeline
My group developed a high-throughput, automated pipeline that identifies mutations and determines their somatic status (Germline, Somatic, or LOH) in large-scale capture datasets, using this one as our test case. Given BAM files for a tumor sample and its matched control, our pipeline does the following:
- Identifies variants (SNPs and indels) in each of the matched samples.
- Determines somatic status for each variant using probabilistic (glfSomatic) or statistical (VarScan) methods.
- Generates a list of putative somatic mutations.
- Removes known germline variants using dbSNP, the 1000 Genomes Project, and other sources.
- Annotates the filtered variants with gene structure and conservation information.
- Divides annotated variants into tiers according to predicted function class.
- Segregates the variants in each tier into high, moderate, and low confidence groups according to their supporting evidence.
The above is a simplified representation. In fact, the pipeline control module itself contains 28 sub-processes, and that number is still growing.
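To make the somatic-status step more concrete, here is a minimal sketch of how a variant might be labeled from read counts in the tumor and its matched control. It is an illustration under stated assumptions, not the pipeline's actual code: the function names, the frequency cutoffs, the p-value threshold, and the choice of Fisher's exact test as the statistical comparison are all assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def call_genotype(ref_reads, var_reads, min_var_freq=0.10, hom_freq=0.75):
    """Heuristic genotype call; both cutoffs are hypothetical defaults."""
    depth = ref_reads + var_reads
    if depth == 0:
        return "no_call"
    freq = var_reads / depth
    if freq < min_var_freq:
        return "ref"
    return "hom" if freq >= hom_freq else "het"


def somatic_status(normal_ref, normal_var, tumor_ref, tumor_var, p_threshold=0.05):
    """Label a variant Reference, Germline, Somatic, LOH, or Unknown."""
    normal = call_genotype(normal_ref, normal_var)
    tumor = call_genotype(tumor_ref, tumor_var)
    if "no_call" in (normal, tumor):
        return "Unknown"
    if normal == "ref" and tumor == "ref":
        return "Reference"
    if normal == tumor:
        return "Germline"  # same variant genotype in both samples
    # Genotypes differ: ask whether the read counts differ significantly.
    p = fisher_exact_two_sided(normal_ref, normal_var, tumor_ref, tumor_var)
    if p >= p_threshold:
        return "Germline" if normal != "ref" else "Unknown"
    if normal == "ref":
        return "Somatic"  # variant seen only in the tumor
    if normal == "het":
        return "LOH"  # heterozygous control, homozygous or reference tumor
    return "Unknown"
```

For example, `somatic_status(30, 0, 20, 15)` describes a site with no variant reads in the control and 15 of 35 in the tumor, which this sketch labels Somatic.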
Application to TCGA-Ovarian Capture Data
When we applied our pipeline to TCGA Ovarian data, we predicted thousands of putative somatic mutations across the 94 patients. Manual review, additional filters, and validation efforts whittled that list down to just over 1,000 validated somatic mutations to date.
Our collaborators at the Broad Institute and Baylor College of Medicine are also sequencing TCGA Ovarian samples using their own capture methods. All three centers have exchanged datasets a couple of times now. We’ve applied our capture somatic variant detection pipeline to data from both other centers with promising results. I’m not sure if I’ll be able to show any of their data in my poster, but the results suggest that our approach is applicable to other capture methods and sequencing platforms.
For more, you’ll have to find my poster at Marco Island.

Posted by Dan Koboldt
February 4th, 2010
Accurate variant detection in massively parallel sequencing data is a significant bioinformatics challenge. Not only do new sequencers offer unprecedented breadth (whole genome) and depth (30x or more), but they also suffer from coverage biases and error rates that make variant calling difficult. Last year, we published VarScan, our in-house algorithm for SNP and indel detection on next-generation platforms. NGS analysis has changed somewhat since that time; the SAM/BAM format has been widely adopted, for one thing, and data throughput has skyrocketed.
To address these issues as well as the requests of many users, we have released VarScan 2 on SourceForge.net. The new version features many improvements and enhancements, including:
- SAM/BAM compatibility. Rather than reading various native alignment formats, VarScan now accepts as input the “pileup” format of SAMtools. Since most widely used aligners either output SAM format directly or produce alignments that can be converted to it, this makes VarScan compatible with a wide array of tools.
- Java implementation. To increase speed and performance in the face of ever-increasing sequencing throughput, we’ve implemented the new VarScan in Java. This also means that it can run on any operating system through the Java virtual machine (VM).
- New filtering and comparison tools. VarScan 2 now has commands to limit variants to a list of positions or chromosomal regions, which is useful for targeted sequencing projects. It also has a comparison tool that intersects or merges two sets of variants.
- Somatic variant detection. This is the flagship feature of VarScan 2 – given pileup files from a tumor sample and matched control (normal), VarScan calls variants (SNPs and indels) and determines their somatic status (Germline, Somatic, LOH) using heuristic and statistical approaches.
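As a rough illustration of the pileup input and the comparison tool mentioned in the list above, the sketch below tallies the bases observed at one position of a SAMtools pileup line and then intersects or merges two variant sets. It assumes the simple six-column pileup layout (chromosome, position, reference base, depth, read bases, base qualities); the function names are hypothetical and this is not VarScan's own parser.

```python
from collections import Counter


def count_pileup_bases(read_bases):
    """Tally reference matches, mismatches, and indels in a pileup base string."""
    counts = Counter()
    i, n = 0, len(read_bases)
    while i < n:
        ch = read_bases[i]
        if ch == "^":
            i += 2  # read-start marker plus its mapping-quality character
        elif ch == "$":
            i += 1  # read-end marker carries no base
        elif ch in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            j = i + 1
            while j < n and read_bases[j].isdigit():
                j += 1
            length = int(read_bases[i + 1:j])
            counts[ch + read_bases[j:j + length].upper()] += 1
            i = j + length
        else:
            if ch in ".,":
                counts["ref"] += 1  # match on forward or reverse strand
            else:
                counts[ch.upper()] += 1  # mismatch base or '*' placeholder
            i += 1
    return counts


def parse_pileup_line(line):
    """Parse one six-column pileup line (assumed layout) into a dict."""
    chrom, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    return {
        "chrom": chrom,
        "position": int(pos),
        "ref_base": ref.upper(),
        "depth": int(depth),
        "counts": count_pileup_bases(bases),
    }


def compare_variants(first, second):
    """Intersect and merge two collections of (chrom, pos, ref, alt) tuples."""
    a, b = set(first), set(second)
    return {"intersect": a & b, "merge": a | b}
```

A call such as `parse_pileup_line("chr1\t1000\tA\t6\t.,^I.Gg+2ACt$\tIIIIII")` yields three reference-supporting reads, two reads with G, one with T, and one insertion of AC. A variant caller would then apply coverage and frequency thresholds to counts like these.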
Software and Documentation on SourceForge.net
VarScan joins the ranks of some of the most widely used tools for NGS analysis – Bowtie, Maq, BWA, SAMtools, and Picard – that are hosted on sourceforge.net. The download page, user’s manual, and Java documentation for VarScan are already online. There’s a new wiki site and discussion forum for VarScan as well, to help us developers keep in touch with users and the NGS community. The project page will have information about known issues and new software releases.
You’ll find it all at http://varscan.sourceforge.net.
References
Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, & Ding L (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics (Oxford, England), 25 (17), 2283-5 PMID: 19542151

Posted by Dan Koboldt