Whole Genome Sequencing Diagnostics

March 24, 2010 by Dan Koboldt

This month in the New England Journal of Medicine, James Lupski and colleagues report sequencing the complete genome of an individual with familial Charcot-Marie-Tooth (CMT) disease. The “individual” is Lupski himself: he not only led the study, but served as patient zero. From conversations with some of my colleagues at Baylor, it’s clear that Dr. Lupski has devoted much of his career to understanding CMT disease; his association with one of the big three genome centers was the driving force behind this project.

The clinical background is rather interesting. Dr. Lupski and three of his siblings were diagnosed with CMT that appeared to be autosomal recessive, since their parents and grandparents were unaffected. Intriguingly, however, their father and paternal grandmother both shared a less severe disorder, axonal neuropathy, that would later prove to arise from haploinsufficiency in the CMT disease-causing gene.

Why Must There Always Be A Problem?

CMT, like many Mendelian disorders, has proven to be a genetically heterogeneous disease. It can segregate in an autosomal dominant, recessive, or X-linked manner. Single nucleotide polymorphisms (SNPs) and/or copy number variants (CNVs) at some 39 loci confer susceptibility to the disease. However, when tested for some of the more common CMT gene mutations (PMP22, MPZ, PRX, GDAP1, and EGR2), the causative variant for the Lupski pedigree was not found.

Although his institution (BCM) is a leader in exome capture and sequencing, Lupski and colleagues decided on a whole-genome sequencing approach. It was probable, but not certain, that the disease-causing variant was in a coding region, and it might also have been a copy number change, which a capture approach would not detect. Thus, with about four runs on a Life Technologies SOLiD instrument, the Lupski genome at 30x was revealed.

Finding the Causative Variants

There were something like 3.4 million SNPs and small indels, which is right on the money for WGS of a single individual, but terribly daunting when searching for a single mutation. The authors whittled down the list of suspects in a series of steps: first they isolated intragenic variants (1.17 million), then prioritized nonsynonymous variants (9,069), and finally cross-referenced these with the ~40 genes linked to CMT (54 coding SNPs). Intriguingly, two of the 54 SNPs were in SH3TC2, a gene previously implicated in CMT in eastern European families. One, carried by Dr. Lupski’s mother, was a known nonsense mutation. The other, carried by his father and paternal grandmother, was a novel missense change that segregated with axonal neuropathy in the pedigree.
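
The whittling-down described above is essentially a chain of filters. Here is a minimal Python sketch of that idea, not the authors' actual pipeline: the `Variant` record, the effect labels, and the toy input are all hypothetical, and the coordinates are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical effect labels that count as nonsynonymous in this sketch.
NONSYNONYMOUS = {"missense", "nonsense"}


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int               # placeholder coordinate, not a real position
    gene: Optional[str]    # None when the variant falls outside any gene
    effect: str            # e.g. "missense", "nonsense", "synonymous"


def filter_funnel(variants, disease_genes):
    """Apply the three filters in order; return per-step counts and survivors."""
    genic = [v for v in variants if v.gene is not None]
    nonsyn = [v for v in genic if v.effect in NONSYNONYMOUS]
    candidates = [v for v in nonsyn if v.gene in disease_genes]
    counts = {
        "total": len(variants),
        "genic": len(genic),
        "nonsynonymous": len(nonsyn),
        "candidates": len(candidates),
    }
    return counts, candidates


# Toy input: five made-up variants, not data from the study.
toy = [
    Variant("chr5", 1, "SH3TC2", "nonsense"),
    Variant("chr5", 2, "SH3TC2", "missense"),
    Variant("chr1", 3, "GENE_X", "missense"),
    Variant("chr17", 4, "PMP22", "synonymous"),
    Variant("chr2", 5, None, "intergenic"),
]
counts, candidates = filter_funnel(toy, disease_genes={"SH3TC2", "PMP22"})
print(counts)
```

In the study itself, the same three steps took the list from 3.4 million variants to 1.17 million, then 9,069, then 54.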

Sequencing and Disease Pedigrees

The authors rightly conclude that this study demonstrates the diagnostic power of whole genome sequencing, and that “as a practical matter, the identification of rare, heterogeneous alleles by means of whole-genome sequencing may be the only way to definitively determine genetic contributions to the associated clinical phenotypes.” Another important take-home message from this study, however, is the critical importance of large, well-characterized pedigrees for the study of inherited disease. Indeed, in the absence of exhaustive functional validation, the best way to confirm that you’ve found the disease-causing mutation is to show that it segregates with the phenotype in the rest of the pedigree.

Perhaps the most important aspect of this study is the venue – the New England Journal – because it, like our study of AML2 last year, demonstrates the power of next-generation sequencing to a different audience: the clinicians and medical practitioners who interact directly with patients.

References

Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, Bainbridge M, Dinh H, Jing C, Wheeler DA, McGuire AL, Zhang F, Stankiewicz P, Halperin JJ, Yang C, Gehman C, Guo D, Irikat RK, Tom W, Fantin NJ, Muzny DM, & Gibbs RA (2010). Whole-Genome Sequencing in a Patient with Charcot-Marie-Tooth Neuropathy. The New England Journal of Medicine. PMID: 20220177

AGBT: Focus on Cancer Genomics

February 26, 2010 by Dan Koboldt

As usual, the quality of the scientific presentations at this meeting has been outstanding. The weather, too, has improved at last.

There are too many talks to cover (or even attend) completely, but one area with a strong focus this year is cancer genomics. Yesterday during plenary sessions, Stacey Gabriel of the Broad Institute of MIT and Harvard presented sequencing of multiple myeloma, a liquid tumor affecting 50,000 people in the US. Around 5,200 gigabases of sequence were generated across 26 tumor samples and matched controls, yielding ~30x average depth per genome. Their mutation detection pipeline achieved an admirable validation rate for somatic SNVs (95%). Short indels were more challenging (~50% validated), and candidate rearrangements even more so (30-50% validated). However, their study validated ~40 somatic mutations per tumor, implicating known MM genes (NRAS, KRAS, TP53) as well as novel ones (DIS3, FAM46C).
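
As a quick sanity check on those numbers, here is a back-of-the-envelope depth calculation in Python. It assumes 26 tumor/normal pairs (52 genomes) and a haploid genome size of about 3.1 gigabases; both are my assumptions rather than figures from the talk, and the result is raw depth before any reads are lost to filtering or failed alignment.

```python
def mean_depth(total_gigabases, n_genomes, genome_size_gb=3.1):
    """Average raw sequencing depth per genome (assumed genome size in Gb)."""
    return total_gigabases / (n_genomes * genome_size_gb)


# 26 tumors plus 26 matched controls is an assumed reading of the talk.
depth = mean_depth(5200, n_genomes=26 * 2)
print(f"{depth:.1f}x raw depth per genome")
```

That lands a little above 30x, which is consistent with the reported ~30x average once some sequence is discarded.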

Elliott Margulies on Melanoma

Last night, there was a concurrent session devoted to cancer genomics. Elliott Margulies (NIH/NHGRI) led the lineup with his work sequencing the tumor genome and matched normal of a melanoma patient. Using the Illumina platform (2×100 bp), his group achieved 36x and 43x haploid coverage for tumor and normal, respectively, with ~99% of the genome covered by at least one read. Much of the talk was devoted to their analysis pipeline, summarized as:

Initial alignment of Illumina reads with ELAND
Partitioning the reads into “genome” bins of several kilobases
Local realignment with cross_match in highly parallelized fashion
SNV calling with their “Most Probable Genotype” (MPG) method
Removal of variants with any evidence in the matched normal (germline) sample, or ones in dbSNP
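
The binning step is what makes the local realignment easy to parallelize: each bin can be handed to a separate job. Here is a minimal Python sketch of the partitioning idea only. The 5 kb bin size, the `(chrom, start)` read representation, and the function name are assumptions for illustration, not details from the talk.

```python
from collections import defaultdict


def partition_reads(reads, bin_size=5000):
    """Group reads by (chromosome, bin index) using their 0-based start.

    `reads` is an iterable of (chrom, start) tuples; bin_size is an
    assumed value standing in for "several kilobases".
    """
    bins = defaultdict(list)
    for chrom, start in reads:
        bins[(chrom, start // bin_size)].append((chrom, start))
    return dict(bins)


# Toy reads, not real alignments.
example_reads = [("chr1", 10), ("chr1", 4999), ("chr1", 5000), ("chr2", 10)]
example_bins = partition_reads(example_reads)
for key in sorted(example_bins):
    print(key, len(example_bins[key]))
```

A real implementation would also have to decide what to do with reads that span a bin boundary, which this sketch ignores.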

The 175,768 novel tumor-specific SNVs were classified as coding (807) or noncoding (174,961). Some 513 of the 807 coding variants were nonsynonymous. Of these, 101 were selected for validation; 84 got validation results and 75 somatic coding mutations (89%) were confirmed. Unsurprisingly, Dr. Margulies used his group’s expertise in comparative genomics to closely examine the noncoding variants as well. His group recently annotated “Chai” regions of the human genome, which bear evidence of evolutionary constraint that suggests functional relevance. Some 10,285 of the 174,961 noncoding variants fell within Chai regions, and among them were ~2,000 variants predicted to dramatically alter the local structure of DNA (suggesting regulatory changes).

Sequencing Pre- and Post-Treatment Lung Cancer

Ian Bosdet of BC Cancer Agency presented some very interesting work on mutational profiling of pre- and post-treatment lung cancer tumors. His group had the opportunity to participate in a clinical trial at BCCA in which carefully selected, treatment-naive NSCLC patients underwent a standard therapeutic program. First, each patient underwent a pre-treatment evaluation and biopsy. Next, they received erlotinib (an EGFR inhibitor) until the disease inevitably progressed. Then another biopsy was taken and sent for pathology review, as well as DNA/RNA extraction for sequencing. Transcriptome sequencing yielded some interesting findings. For example, one gene (IER5L or IER5C, it’s hard to read my own handwriting) was highly expressed in smokers who did not respond to treatment. A screen of unmapped transcript reads against viral genomes revealed the presence of Epstein-Barr Virus transcripts in one tumor that was later re-classified as EBV-positive lymphadenocarcinoma (?).

Mutational profiling for three patients was obtained via exome capture (Agilent) and sequencing of normal, pre-treatment tumor, and post-treatment tumor samples. Somatic mutations in PHACTR2 were seen only in pre-treatment samples. Mutations in a few genes (PRMT10, RanBP2) were found at both time points, but a few (YY1AP1, SNX9) were only present after treatment, suggesting a role for these genes in progressive disease.

Mutation Detection in Capture at AGBT

February 11, 2010 by Dan Koboldt

Later this month, I’ll present our work on detecting somatic mutations using capture and Illumina sequencing at the Advances in Genome Biology and Technology meeting on Marco Island. Using an internally developed solution-phase capture technology (Washington University Capture, or WUCap), we selectively targeted coding regions of 6,000 genes in tumors and matched controls from 94 patients with ovarian cancer and sequenced them on the Illumina GAIIx.

Capture Somatic Mutation Detection Pipeline

My group developed a high-throughput, automated pipeline that identifies mutations and determines their somatic status (Germline, Somatic, or LOH) in large-scale capture datasets, using this one as our test case. Given BAM files for a tumor sample and its matched control, our pipeline does the following:

Identifies variants (SNPs and indels) in each of the matched samples.
Determines somatic status for each variant using probabilistic (glfSomatic) or statistical (VarScan) methods.
Generates a list of putative somatic mutations.
Removes known germline variants using dbSNP, the 1000 Genomes Project, and other sources.
Annotates the filtered variants with gene structure and conservation information.
Divides annotated variants into tiers according to predicted function class.
Segregates the variants in each tier into high, moderate, and low confidence groups according to their supporting evidence.

The above is a simplified representation. In fact, the pipeline control module itself contains 28 sub-processes, and that number is still growing.
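
To give a feel for what "somatic status" means, here is a deliberately naive Python sketch that assigns one of the three labels by comparing genotype calls in the normal and tumor samples. It is an illustration under my own assumptions, not the logic of glfSomatic or VarScan: the genotype labels ("ref", "het", "hom") and the extra "Reference" and "Unknown" outcomes are hypothetical.

```python
def somatic_status(normal_gt, tumor_gt):
    """Classify a site from two genotype calls.

    Genotypes are 'ref' (homozygous reference), 'het' (heterozygous),
    or 'hom' (homozygous variant). These labels are assumptions made
    for this sketch.
    """
    if normal_gt == "ref" and tumor_gt == "ref":
        return "Reference"   # no variant in either sample
    if normal_gt == "ref" and tumor_gt in ("het", "hom"):
        return "Somatic"     # variant present only in the tumor
    if normal_gt == "het" and tumor_gt in ("ref", "hom"):
        return "LOH"         # tumor lost one of the two normal alleles
    if normal_gt == tumor_gt:
        return "Germline"    # same variant genotype in both samples
    return "Unknown"         # anything else needs a closer look


for pair in [("ref", "het"), ("het", "hom"), ("het", "het"), ("hom", "ref")]:
    print(pair, somatic_status(*pair))
```

Real data is messier than three clean genotype labels, which is why a production pipeline works from the supporting read evidence and applies further filters.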

Application to TCGA-Ovarian Capture Data

When we applied our pipeline to TCGA Ovarian data, we predicted thousands of putative somatic mutations across the 94 patients. Manual review, additional filters, and validation efforts whittled that list down to just over 1,000 validated somatic mutations to date.

Our collaborators at the Broad Institute and Baylor College of Medicine are also sequencing TCGA Ovarian samples using their own capture methods. All three centers have exchanged datasets a couple of times now. We’ve applied our capture somatic variant detection pipeline to data from both other centers with promising results. I’m not sure if I’ll be able to show any of their data in my poster, but the results suggest that our approach is applicable to other capture methods and sequencing platforms.

For more, you’ll have to find my poster at Marco Island.

VarScan 2 Released on SourceForge

February 4, 2010 by Dan Koboldt

Accurate variant detection in massively parallel sequencing data is a significant bioinformatics challenge. Not only do new sequencers offer unprecedented breadth (whole genome) and depth (30x or more), but they also suffer from coverage biases and error rates that make variant calling difficult. Last year, we published VarScan, our in-house algorithm for SNP and indel detection on next-generation platforms. NGS analysis has changed somewhat since that time; SAM/BAM format was widely adopted, for one thing, and data throughput has skyrocketed.

To address these issues as well as the requests of many users, we have released VarScan 2 on SourceForge.net. The new version features many improvements and enhancements, including:

SAM/BAM compatibility. Rather than reading various native alignment formats, VarScan now accepts as input the “pileup” format of SAMtools. Since most widely used aligners either output SAM directly or produce output that can be converted to it, this makes VarScan compatible with a wide array of tools.
Java implementation. To increase speed and performance in the face of ever-increasing sequencing throughput, we’ve implemented the new VarScan in Java. This also means that it can run on any operating system through the Java virtual machine (VM).
New filtering and comparison tools. VarScan 2 now has commands to limit variants to a list of positions or chromosomal regions, which is useful for targeted sequencing projects. It also has a comparison tool that intersects or merges two sets of variants.
Somatic variant detection. This is the flagship feature of VarScan 2 – given pileup files from a tumor sample and matched control (normal), VarScan calls variants (SNPs and indels) and determines their somatic status (Germline, Somatic, LOH) using heuristic and statistical approaches.
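
For readers wondering what a "statistical approach" to somatic status can look like, one common choice for comparing read counts between two samples is Fisher's exact test on a 2x2 table of variant-supporting and reference-supporting reads. The Python sketch below illustrates that general technique with made-up read counts. The function name is hypothetical, and nothing in this post should be read as saying this is exactly how VarScan 2 computes its calls.

```python
from math import comb


def somatic_p_value(tumor_var, tumor_ref, normal_var, normal_ref):
    """One-sided Fisher's exact test on read counts.

    Returns the probability of seeing at least `tumor_var` variant reads
    in the tumor if variant reads were spread between tumor and normal
    at random (hypergeometric tail).
    """
    tumor_total = tumor_var + tumor_ref
    normal_total = normal_var + normal_ref
    var_total = tumor_var + normal_var
    denominator = comb(tumor_total + normal_total, var_total)
    tail = 0
    for x in range(tumor_var, min(tumor_total, var_total) + 1):
        # math.comb returns 0 when k > n, so impossible tables add nothing.
        tail += comb(tumor_total, x) * comb(normal_total, var_total - x)
    return tail / denominator


# Made-up counts: 10 of 30 tumor reads support the variant, 0 of 30 normal reads do.
p_somatic_like = somatic_p_value(10, 20, 0, 30)
# Made-up counts: the variant is equally common in both samples.
p_germline_like = somatic_p_value(5, 25, 5, 25)
print(f"tumor-only variant: p = {p_somatic_like:.2e}")
print(f"shared variant:     p = {p_germline_like:.2f}")
```

A small p-value says the variant reads are concentrated in the tumor, which is what a somatic mutation looks like; a large one is what you would expect from a germline variant present in both samples.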

Software and Documentation on SourceForge.net

VarScan joins the ranks of some of the most widely used tools for NGS analysis – Bowtie, Maq, BWA, SAMtools, and Picard – that are hosted on sourceforge.net. The download page, user’s manual, and Java documentation for VarScan are already online. There’s a new wiki site and discussion forum for VarScan as well, to help us developers keep in touch with users and the NGS community. The project page will have information about known issues and new software releases.

You’ll find it all at http://varscan.sourceforge.net.

References

Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, & Ding L (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics (Oxford, England), 25 (17), 2283-5 PMID: 19542151
