Virtual Normals for Somatic Mutation Detection

In cancer genomics, we typically identify somatic alterations by sequencing DNA from both a tumor and a matched normal “control” sample from the same patient. The Cancer Genome Atlas and other large-scale efforts to characterize tumor genomes have generally used this approach, because it allows mutation callers (like VarScan 2) to distinguish between inherited variation and acquired (somatic) mutations.

Discriminating between acquired (somatic) mutations and inherited (germline) variants is critically important, because:

The vast majority of sequence variants in a tumor genome (typically >99%) are inherited germline variants.
Patterns of somatic mutations offer insight into tumor biology, clonality, and molecular subtype.
Some somatic mutations may render tumors vulnerable to certain therapies.

Unfortunately, sequencing a matched normal sample is not always possible. Often, matched DNA is simply not available because the patient is no longer alive, reachable, and/or willing to participate in further studies. Other times, it’s a budget decision, since sequencing a tumor plus its matched normal costs roughly twice as much as sequencing the tumor alone.

dbSNP Filtering Is Not A Solution

Some researchers have proposed sequencing tumor samples only, and then using public sequence databases like dbSNP to exclude likely germline variants. There are two fundamental problems with this approach:

False positives. A modest proportion of variants in every individual’s genome (~4-5%) are rare/private variants not yet represented in public databases. That means roughly 150,000 germline variants per genome will not be excluded by this approach and will instead be reported as somatic.
False negatives. Perhaps even more worrisome is the fact that dbSNP contains a number of somatic mutations. Recurrent mutations in common cancer types are therefore vulnerable to being filtered out.
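
To make the two failure modes concrete, here is a minimal Python sketch of tumor-only database filtering. The variants, their positions, and the helper names (`database_filter`, `residual_germline`) are hypothetical, and the figure of 3.5 million variants per genome is an assumption used only to show where a number like 150,000 could come from.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def database_filter(tumor_calls, database):
    """Keep tumor calls that are absent from the database (presumed somatic)."""
    return [v for v in tumor_calls if v not in database]


def residual_germline(variants_per_genome, fraction_not_in_database):
    """Estimate how many germline variants a database filter cannot remove."""
    return round(variants_per_genome * fraction_not_in_database)


# Toy variants, invented for illustration only.
common_germline = Variant("chr2", 2000, "C", "T")    # inherited, present in database
private_germline = Variant("chr1", 1000, "A", "G")   # inherited, absent from database
recurrent_somatic = Variant("chr7", 3000, "T", "A")  # acquired, but listed in database
novel_somatic = Variant("chr17", 4000, "G", "A")     # acquired, absent from database

database = {common_germline, recurrent_somatic}
tumor_calls = [common_germline, private_germline, recurrent_somatic, novel_somatic]

presumed_somatic = database_filter(tumor_calls, database)

# False positive: the private germline variant survives the filter.
# False negative: the recurrent somatic mutation is thrown away.
print(presumed_somatic)

# Assumption: 3.5 million variants per genome, 4.5% of them absent from the database.
print(residual_germline(3_500_000, 0.045))
```

In this toy case the filter keeps one true somatic mutation and one germline variant, and discards the recurrent somatic mutation, which is exactly the kind of call a cancer study would most want to retain.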

When it comes to non-SNV variation (small indels and structural variation, or SV), the results are even more disastrous because of how poorly and inaccurately such variants are represented in public databases.

Because of these limitations, I’m often skeptical of cancer studies that draw conclusions about somatic mutations when matched normals are not available.

Sequenced Cohorts as Virtual Normals

An article just out in Genome Research describes a rather innovative method for discriminating germline and somatic mutations using virtual normals. The authors propose to use whole-genome sequencing data from hundreds of healthy individuals as a “virtual normal” (VN) for somatic mutation calling when no matched normal is available. Admittedly, this approach will not be able to remove rare and private germline variants, but it has some key advantages.

First, it’s a relatively pure way to identify and remove germline variants without relying on a public database of dubious quality. Second, it allows one to remove many of the artifacts present in somatic mutation calls (e.g. false positives due to paralogous alignment, homopolymer-associated errors, etc.). Third, by matching the technology and variant detection algorithms, it’s possible to maximize the discrimination power of a large set of normals.
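
The basic idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors’ implementation: the function names, the `min_normals` threshold, and the toy variants are all hypothetical, and real pipelines would also need to normalize variant representation before comparing calls.

```python
from collections import Counter


def build_virtual_normal(normal_genomes):
    """Count how many normal genomes carry each variant."""
    counts = Counter()
    for variants in normal_genomes:
        counts.update(set(variants))  # count each variant once per genome
    return counts


def filter_with_virtual_normal(tumor_calls, vn_counts, min_normals=1):
    """Keep tumor calls seen in fewer than `min_normals` normal genomes."""
    return [v for v in tumor_calls if vn_counts[v] < min_normals]


# Toy variants as (chrom, pos, ref, alt) tuples, invented for illustration.
common_snp = ("chr2", 2000, "C", "T")             # common germline variant
homopolymer_artifact = ("chr5", 5000, "AT", "A")  # recurrent technical artifact
private_snp = ("chr1", 1000, "A", "G")            # private germline variant
somatic_snv = ("chr17", 4000, "G", "A")           # true somatic mutation

normal_genomes = [
    [common_snp, homopolymer_artifact],
    [common_snp],
    [homopolymer_artifact],
]
vn_counts = build_virtual_normal(normal_genomes)

tumor_calls = [common_snp, homopolymer_artifact, private_snp, somatic_snv]
candidates = filter_with_virtual_normal(tumor_calls, vn_counts)
print(candidates)
```

Here the common germline variant and the recurrent artifact are both removed because they show up in the normal cohort, while the private germline variant survives alongside the true somatic call. That matches the caveat above: a virtual normal cannot catch variants unique to the patient.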

As a proof of principle, the authors examine the performance of different normal sets on 4 tumor-normal pairs sequenced on the Complete Genomics (CGI) platform. They assembled “virtual normals” from two publicly available WGS datasets:

433 individuals sequenced by CGI for the company’s 2010 paper
498 samples sequenced on Illumina HiSeq by the GoNL Consortium.

Before we continue, I should point out some important caveats when interpreting the results of this study.

The results are based on a very small number of tumor-normal pairs (n=4) that were sequenced using only one technology (Complete Genomics), which many (including me) would deem inferior to state-of-the-art Illumina sequencing.
Somatic mutations were not independently validated. Rather, the authors relied on external sources (e.g. COSMIC) to identify which somatic mutations had been validated.

Still, the results were encouraging. Using virtual normals removed 96% of the germline events that were removed by true matched normals, and another ~8% of variants that likely represent false positives or missed germline events. An important strength of this paper is that the authors considered small indels and SVs, which are common types of alterations in cancer but often difficult to detect accurately.
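
For readers who want to run this kind of comparison on their own data, the overlap between two filters reduces to simple set arithmetic. The sketch below is hypothetical: `compare_filters` and the toy variant IDs are made up for illustration, and it does not reproduce the paper’s numbers or its exact definitions.

```python
def compare_filters(mn_removed, vn_removed):
    """Compare variants removed by a matched normal (MN) and a virtual normal (VN)."""
    mn_removed = set(mn_removed)
    vn_removed = set(vn_removed)
    if not mn_removed:
        raise ValueError("matched-normal filter removed no variants")
    shared = mn_removed & vn_removed
    return {
        # Share of MN-removed variants that the VN also removed.
        "fraction_of_mn_recovered": len(shared) / len(mn_removed),
        # Removed only by the VN: likely artifacts or germline calls the MN missed.
        "vn_only": vn_removed - mn_removed,
        # Removed only by the MN: likely private germline variants.
        "mn_only": mn_removed - vn_removed,
    }


# Toy variant IDs, invented for illustration.
mn_removed = {"germline_1", "germline_2", "germline_3", "private_germline"}
vn_removed = {"germline_1", "germline_2", "germline_3", "artifact_1"}

summary = compare_filters(mn_removed, vn_removed)
print(summary["fraction_of_mn_recovered"])
print(summary["vn_only"], summary["mn_only"])
```

The `mn_only` set is where private germline variants should land, and the `vn_only` set is where technology-specific artifacts should land, so inspecting both is a quick sanity check on a virtual normal panel.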

Despite the limitations of this study, I think it represents a promising approach for improving somatic mutation detection, even when a matched normal was sequenced. The authors couldn’t show the benefit in that scenario in their study, because they treated the maximally-filtered set (after VN, MN, and database filtering) as the truth set. Yet I think this might be an important alternate application of this method, because the benefits are largely the same: it’ll remove technology-specific artifacts and the occasional germline variant that slips past somatic mutation callers.

References
Hiltemann S, Jenster G, Trapman J, van der Spek P, Stubbs A (2015). Discriminating somatic and germline mutations in tumor DNA samples without matching normals. Genome Research. PMID: 26209359