Virtual Normals for Somatic Mutation Detection

In cancer genomics, we typically identify somatic alterations by sequencing DNA from both a tumor and a matched normal “control” sample from the same patient. The Cancer Genome Atlas and other large-scale efforts to characterize tumor genomes have typically used this approach, because it allows mutation callers (like VarScan 2) to distinguish between inherited variation and acquired (somatic) mutations.

Discriminating between acquired (somatic) mutations and inherited (germline) variants is critically important, because:

  1. The vast majority of sequence variants in a tumor genome (typically >99%) are inherited germline variants.
  2. Patterns of somatic mutations offer insight into tumor biology, clonality, and molecular subtype.
  3. Some somatic mutations may render tumors vulnerable to certain therapies.

Unfortunately, sequencing a matched normal sample is not always possible. Often, matched DNA is simply not available because the patient is no longer alive, reachable, and/or willing to participate in further studies. Other times, it’s a budget decision, since sequencing a matched normal sample for each tumor costs twice as much as sequencing single samples.

dbSNP Filtering Is Not A Solution

Some researchers have proposed sequencing tumor samples only, and then using public sequence databases like dbSNP to exclude likely germline variants. There are two fundamental problems with this approach:

  • False positives. A modest proportion of variants in every individual’s genome (~4-5%) are rare/private variants not yet represented in public databases. That means roughly 150,000 germline variants per genome will fail to be excluded by this approach.
  • False negatives. Perhaps even more worrisome is the fact that dbSNP contains a number of somatic mutations. Recurrent mutations in common cancer types are therefore vulnerable to being filtered out.
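The false-positive estimate is simple arithmetic. Here is a quick sketch of it, assuming a genome carries roughly 3 to 4 million variants (an assumed range for illustration) and using the 4-5% rare/private fraction quoted above:

```python
# Back-of-the-envelope estimate of germline variants that survive dbSNP filtering.
# Assumptions (illustrative): 3-4 million variants per genome, of which
# 4-5% are rare/private and therefore absent from public databases.
def unfiltered_germline(total_variants: int, private_fraction: float) -> int:
    """Number of germline variants a database filter would fail to remove."""
    return round(total_variants * private_fraction)

low = unfiltered_germline(3_000_000, 0.04)
high = unfiltered_germline(4_000_000, 0.05)
print(f"{low:,} to {high:,} germline variants remain after filtering")
```

The ~150,000 figure sits in the middle of that range.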

When it comes to non-SNV variation (small indels and structural variation, or SV), the results are even more disastrous because such variants are so poorly and inaccurately represented in public databases.

Because of these limitations, I’m often skeptical of cancer studies that draw conclusions about somatic mutations when matched normals are not available.

Sequenced Cohorts as Virtual Normals

An article just out at Genome Research describes a rather innovative method for discriminating germline and somatic mutations using virtual normals. The authors propose to use whole-genome sequencing data from hundreds of healthy individuals as a “virtual normal” (VN) for somatic mutation calling when no matched normal is available. Admittedly, this approach will not be able to remove rare and private germline variants, but it has some key advantages.

First, it’s a relatively pure way to identify and remove germline variants without relying on a public database of dubious quality. Second, it allows one to remove many of the artifacts present in somatic mutation calls (e.g. false positives due to paralogous alignment, homopolymer-associated errors, etc.). Third, by matching the technology and variant detection algorithms, it’s possible to maximize the discrimination power of a large set of normals.
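To make the idea concrete, here is a minimal sketch of virtual-normal filtering. It assumes variants are matched exactly on chromosome, position, and alleles; the paper's actual matching rules may well differ, and the function names and toy data are hypothetical.

```python
from collections import Counter

# A variant is keyed as (chromosome, position, ref allele, alt allele).
def build_virtual_normal(normal_callsets):
    """Count how many normal genomes carry each variant."""
    counts = Counter()
    for callset in normal_callsets:
        counts.update(set(callset))  # each genome counts at most once per variant
    return counts

def filter_with_virtual_normal(tumor_calls, vn_counts, min_normals=1):
    """Keep tumor variants seen in fewer than `min_normals` normal genomes."""
    return [v for v in tumor_calls if vn_counts[v] < min_normals]

# Toy data (hypothetical): two normal genomes and one tumor call set.
normals = [
    {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
    {("chr1", 1000, "A", "G")},
]
tumor = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr7", 42, "G", "T")]

vn = build_virtual_normal(normals)
candidates = filter_with_virtual_normal(tumor, vn)             # seen in no normal
strict = filter_with_virtual_normal(tumor, vn, min_normals=2)  # tolerate singletons
print(candidates)
print(strict)
```

Raising `min_normals` trades sensitivity for specificity: a variant seen in a single normal could be a rare germline allele or a one-off artifact.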

As a proof of principle, the authors examine the performance of different normal sets on 4 tumor-normal pairs sequenced on the Complete Genomics (CGI) platform. They assembled “virtual normals” from two publicly-available WGS datasets:

  • 433 individuals sequenced by CGI for the company’s 2010 paper
  • 498 samples sequenced on Illumina HiSeq by the GoNL Consortium.

Before we continue, I should point out some important caveats when interpreting the results of this study.

  1. The results are based on a very small number of tumor-normal pairs (n=4) that were sequenced using only one technology (Complete Genomics), which many (including me) would deem inferior to state-of-the-art Illumina sequencing.
  2. The authors did not experimentally validate somatic mutations. Rather, they relied on external sources (e.g. COSMIC) to identify which somatic mutations had been validated.

Still, the results were encouraging. Using virtual normals removed 96% of the germline events that were removed by true matched normals, and another ~8% of variants that likely represent false positives or missed germline events. An important strength of this paper is that the authors considered small indels and SVs, which are common types of alterations in cancer but often difficult to detect accurately.

Despite the limitations of this study, I think it represents a promising approach for improving somatic mutation detection, even when a matched normal has been sequenced. The authors couldn’t show the benefit in this scenario in their study, because they treated the maximally-filtered set (after VN, MN, and database filtering) as the truth set. Yet I think this might be an important alternate application of this method, because the benefits are largely the same: it’ll remove technology-specific artifacts and the occasional germline variant that slips past somatic mutation callers.

Hiltemann S, Jenster G, Trapman J, van der Spek P, & Stubbs A (2015). Discriminating somatic and germline mutations in tumor DNA samples without matching normals. Genome research PMID: 26209359

New Insights into Human De Novo Mutations

De novo mutations (sequence variants that are present in a child but absent from both parents) are an important source of human genetic variation. I think it’s reasonable to say that most of the 3-4 million variants in any individual’s genome arose, once upon a time, as de novo mutations in his or her ancestors. In the past few years, whole-genome sequencing (WGS) studies performed in families (especially parent-child trios) have offered some revelations about de novo mutations and their role in human disease.

A recent study in Nature Genetics provides the largest survey of de novo mutations to date. Laurent Francioli et al identified de novo mutations in 250 Dutch families that were sequenced to ~13x coverage as part of the Genome of the Netherlands (GoNL) project. Their findings confirm many of the observations from previous smaller studies, and offer some new insights into the patterns of de novo mutations throughout the human genome.

Identification of de novo Mutations

To make any global observations about de novo mutations, one generally needs unbiased whole-genome sequencing data for an individual and both parents. Even with those in hand, accurate identification of de novo mutations is challenging because they’re so exquisitely rare. Since the sequencing coverage in this study is a little bit light (13x, whereas most studies shoot for ~30x), I had some initial concerns about whether or not the mutation calls might hold up under scrutiny.
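For intuition, the naive starting point for de novo detection is a simple genotype rule: the child is heterozygous and both parents are homozygous reference. The sketch below illustrates only that rule, with hypothetical site names and genotypes; it is not what a trio-aware caller does, and real pipelines weigh genotype likelihoods and coverage precisely because this rule produces many false positives.

```python
# Genotypes are pairs of allele indices: 0 = reference, 1 = alternate.
HOM_REF = (0, 0)

def is_candidate_de_novo(child, mother, father):
    """Naive rule: child heterozygous, both parents homozygous reference."""
    return tuple(sorted(child)) == (0, 1) and mother == HOM_REF and father == HOM_REF

# Toy trio genotypes (hypothetical): site -> (child, mother, father)
trio_sites = {
    "chr1:12345": ((0, 1), (0, 0), (0, 0)),  # candidate de novo
    "chr2:67890": ((0, 1), (0, 1), (0, 0)),  # inherited from the mother
    "chr3:13579": ((1, 0), (0, 0), (0, 0)),  # candidate; allele order is irrelevant
}

candidates = [site for site, gts in trio_sites.items() if is_candidate_de_novo(*gts)]
print(candidates)
```

At low coverage, a parent who is truly heterozygous can easily look homozygous reference, which is why light sequencing depth is a concern for de novo calls.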

Delving into the online methods, I learned that the samples underwent Illumina paired-end sequencing (2x91bp, insert size 500bp). Alignment and variant calling followed GATK best practices (v2), and the mutations were called with the trio-aware GATK PhaseByTransmission. Next, the authors used a machine learning classifier trained on 592 true positive and 1,630 false positive de novo calls that had been validated experimentally. The net result was 11,020 high-confidence mutations in the 269 children, with an estimated accuracy of 92.2%.

The numbers are about right: if 92.2% of the calls are real, that’s 10,160 true mutations, or ~37.8 mutations per child. That’s very close to the estimated ~38 per genome. In other words, without experimentally validating all 11,000 mutations (an expensive and laborious task), this is as good as it gets.
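For anyone who wants to check the sanity math, here it is spelled out using only the figures quoted above (11,020 calls, 92.2% estimated accuracy, 269 children):

```python
# Sanity check on the de novo call set, using the figures quoted in the text.
calls = 11_020     # high-confidence de novo calls
accuracy = 0.922   # estimated fraction of calls that are real
children = 269     # number of offspring analyzed

true_calls = calls * accuracy
per_child = true_calls / children
print(f"{true_calls:.0f} true mutations, {per_child:.1f} per child")
```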

Parent-of-Origin and Replication Timing

[Figure: de novo mutations and paternal age. Credit: Francioli et al, Nature Genetics 2015]

The authors first examined whether the location of the observed mutations was correlated with any epigenetic variables. There was no significant correlation for most of the variables examined (chromatin accessibility, histone modifications, and recombination rate). With a linear regression model, however, they noted a significant association between replication timing and paternal age: mutations in the offspring of younger fathers (<28 years old) were strongly enriched in late-replicating regions, whereas mutations in offspring of older fathers were not.

To dig deeper, the authors looked at 2,621 mutations that could be unambiguously assigned to maternal or paternal origin. The method for this isn’t documented in the online methods, but presumably they looked for instances in which a mutation was in the same read or read pair as a variant unique to one parent. Notably, 1,991 of those origin-inferred mutations (76%) came from the father. After controlling for the number imbalance, the replication-timing-with-parent-age correlation was significant only for mutations of paternal origin.

This makes a certain kind of sense, since the stem cells in the paternal germ line undergo continuous cell division throughout a man’s life, whereas a woman is born with all of the eggs she’ll ever have.

The correlation between paternal age and replication timing is important from a reproductive health perspective, because late-replicating regions have lower gene density and expression levels than early ones. Since the mutations in offspring of younger fathers tend to occur in these regions, they’re less likely to have a functional impact. In support of this idea, on average, the offspring born to 40-year-old fathers had twice as many genic mutations as offspring born to 20-year-old fathers.

In other words, mutations in the offspring of older fathers are not only more numerous, but also more likely to have functional consequences.

Mutations in Functional Regions

Notably, the de novo mutation rates in this study were higher in exonic regions regardless of the paternal age. Overall, 1.22% of mutations were exonic, an enrichment of 28.7% over simulated models of random mutation distribution. Mutations were also enriched in DNase I hypersensitive sites (DHSs), which represent likely regulatory regions. The source of this “functional enrichment” likely has to do with sequence context: mutations often occur at CpG dinucleotides, which are themselves more prevalent in exons and DHSs.

Recent studies of somatic mutations in tumor cells revealed a fascinating phenomenon: a reduction in the mutation rate of highly transcribed regions, likely attributed to the fidelity conferred by transcription-coupled DNA repair mechanisms. In the current study of de novo mutations, however, the mutation rate in transcribed regions and DHSs did not appear to be reduced.

The implication here might be that transcription-coupled repair has less of an impact on de novo mutations, though the authors note that their study was only powered to detect a substantial difference (>17%) in mutation rate. That’s understandable, because while the individuals examined here harbored ~40 mutations genome-wide, a tumor specimen might have tens of thousands of somatic mutations (i.e. much better power to detect subtle differences in mutation rate).

Clustered de novo Mutations

One of the most interesting observations in this study was a clustering effect of de novo mutations. If all things were random, given the size of the genome (3.2 billion base pairs) and the number of mutations per individual (~40), we expect them to be pretty far apart. As in, one every 80 million base pairs.

Instead, the authors observed 78 instances in which there were “clusters” of 2-3 mutations within a 20kb window in the same individual. The 161 mutations involved showed no significant differences from the non-clustered mutations with regard to recombination rate (p=0.52) or replication timing (p=0.059), though I should point out that the latter might be approaching an interesting p-value.
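A rough sketch of how such clusters might be found is below. It assumes a cluster is a set of mutations in one individual, on the same chromosome, that all fall within 20 kb of the first member; the paper's exact definition may differ, and the coordinates are made up.

```python
from collections import defaultdict

# Expected spacing if ~40 mutations were spread evenly over a 3.2 Gb genome.
expected_spacing = 3_200_000_000 // 40

def find_clusters(mutations, window=20_000):
    """Return (chrom, positions) groups of 2+ mutations within `window` bp."""
    by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[0] <= window:   # still inside the window
                current.append(pos)
            else:                            # close this group, start a new one
                if len(current) > 1:
                    clusters.append((chrom, current))
                current = [pos]
        if len(current) > 1:
            clusters.append((chrom, current))
    return clusters

# Toy mutations for one individual (hypothetical coordinates).
mutations = [
    ("chr1", 1_000_000), ("chr1", 1_012_000), ("chr1", 90_000_000),
    ("chr2", 5_000), ("chr2", 40_000),
]
clusters = find_clusters(mutations)
print(expected_spacing, clusters)
```

Only the first two chr1 mutations (12 kb apart) qualify here; the chr2 pair is 35 kb apart and falls outside the window.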

Interestingly, however, the clustered mutations exhibited an unusual mutational spectrum, with a strong enrichment for C->G transversions compared to non-clustered mutations (p=1.8e-13).
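Mutation spectra like this one are conventionally tallied in six classes by reporting every change from the pyrimidine (C or T) side of the base pair, so that G->C on one strand counts as C->G. Here is a small sketch of that bookkeeping, with hypothetical input data:

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mutation_class(ref, alt):
    """Collapse a substitution to one of six classes (C>A, C>G, C>T, T>A, T>C, T>G)."""
    if ref in ("A", "G"):  # purine reference: report the complementary strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def spectrum(mutations):
    """Count mutations per class."""
    return Counter(mutation_class(ref, alt) for ref, alt in mutations)

# Toy substitutions (hypothetical): (reference base, alternate base)
toy = [("C", "G"), ("G", "C"), ("A", "G"), ("T", "C"), ("G", "A")]
counts = spectrum(toy)
print(counts)
```

Comparing the class counts for clustered versus non-clustered mutations (for example with a contingency-table test) is one way an enrichment such as the C->G excess could be assessed.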

[Figure: Mutation spectrum of de novo mutations. Credit: Francioli et al, Nature Genetics 2015]

Based on the nucleotide context, the authors suggest that a new mutational mechanism may be at work involving cytosine deamination of single-stranded DNA (presumably during replication). I don’t have strong enough chemistry to understand the proposed mechanism, but agree that this unusual pattern merits some more investigation.

Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, Genome of the Netherlands Consortium, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, Slagboom PE, Boomsma DI, Ye K, Guryev V, Arndt PF, Kloosterman WP, de Bakker PI, & Sunyaev SR (2015). Genome-wide patterns and properties of de novo mutations in humans. Nature genetics, 47 (7), 822-6 PMID: 25985141

How to Succeed at Clinical Genome Sequencing

Whole-genome sequencing holds enormous potential to improve the diagnosis and treatment of human diseases. Although this approach is the only way to capture the complete spectrum of genetic variation, its application in clinical settings has been slow compared to more targeted strategies (i.e. panel and exome sequencing). Everyone talks about cost as the main contributing factor for this, but compared to routine clinical genetics testing it’s actually inexpensive. Let’s be honest: the challenges of detecting and interpreting variants outside the exome are another consideration.

Before WGS is adopted as a routine clinical tool, we will need to demonstrate its diagnostic yield for patients with likely-yet-undiagnosed genetic disorders in a medical setting. A recent paper from across the pond offers a promising start. Jenny C. Taylor et al report preliminary results from the WGS500 program, which hopes to sequence the genomes of 500 patients with diverse genetic disorders who are referred by medical specialists.

So far, they’ve sequenced 217 individuals (156 probands plus some family members) to ~30x haploid coverage. About 21% of cases ended up with a confirmed genetic diagnosis; this goes up to 34% for Mendelian disorders and 57% for family trios. Their report highlights some of the factors influencing success, and offers some important guidelines for other groups hoping to adopt clinical WGS.

1. Joint variant calling improves accuracy

The researchers used a two-step variant calling strategy: first, identify genetic variants in all samples individually, and then perform joint consensus calling in all individuals at all variable sites. We use this strategy ourselves, because it has some important benefits:

  • Recovering variants that were “missed” in certain individuals due to coverage or allele representation
  • Reducing the rate of Mendelian inconsistencies among family members
  • Removing the vast majority of artifactual de novo mutation calls

Specifically, the researchers found that joint calling brought the number of de novo coding mutations in trios down from ~32.1 per child to the more realistic ~2.1 per child.

2. Filtering with variant databases is important

Private and rare variants represent one of the biggest challenges in human genetics. Every individual harbors a few hundred thousand variants that have never been seen before. This is problematic for clinical genome sequencing, since we expect that most of the pathogenic mutations that cause rare genetic disorders are also quite rare. These blend in, for lack of a better word, with the many rare-but-neutral variants in each genome.

The catalogues of human genetic variation that are generated by sequencing approaches (1000 Genomes, ESP, etc.) can help, since most of the individuals enrolled in those studies do not have severe genetic disorders. Thus, for severe and highly penetrant genetic disorders at least, these large catalogues help us identify and remove variants that have been seen before in presumably-healthy individuals.

You might ask, why don’t we just use dbSNP? It has all of the variants, right? The problem is that dbSNP is too inclusive: it contains variants from places like OMIM, which generally are not found in healthy individuals. It also has some number of somatic mutations from tumor genomes that were submitted before the COSMIC repository existed. In other words, one would have to use extreme care when filtering against dbSNP so that true pathogenic variants aren’t accidentally removed.

Another important strategy described in this paper is the use of internal data (from undiseased control samples) to filter sets of candidate causal variants. This is advantageous because the sequencing technology and variant calling are the same. In this study, the authors found that the vast majority of rare/novel variants that passed external filters could be discounted using data from other WGS500 samples.
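The two filters described in this section (external catalogues, then internal controls) can be sketched in a few lines. Everything here is illustrative: the variant keys, the 0.5% frequency cutoff, and the zero-carrier rule are assumptions for the example, not thresholds from the paper.

```python
def filter_candidates(candidates, population_af, internal_carriers,
                      max_af=0.005, max_carriers=0):
    """Drop variants that are common externally or seen in internal controls."""
    kept = []
    for variant in candidates:
        if population_af.get(variant, 0.0) > max_af:
            continue  # too common in public catalogues
        if internal_carriers.get(variant, 0) > max_carriers:
            continue  # seen in unaffected samples on the same pipeline
        kept.append(variant)
    return kept

# Toy inputs (hypothetical).
candidates = ["chr1:100:A>G", "chr2:200:C>T", "chrX:300:G>A"]
population_af = {"chr1:100:A>G": 0.12}   # allele frequency in a public catalogue
internal_carriers = {"chr2:200:C>T": 7}  # carriers among internal controls

remaining = filter_candidates(candidates, population_af, internal_carriers)
print(remaining)
```

The internal filter catches exactly the kind of variant the external one cannot: something absent from public catalogues but recurrent on the local platform and pipeline, which usually points to an artifact or a population-specific allele.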

3. Leverage multiple sources of annotation

Variant annotation — that is, predicting the likely functional impact of a sequence variant — remains an imperfect art. For any given variant, the annotation can change depending on the transcript database (NCBI or Ensembl), software tool (e.g. VEP versus ANNOVAR), or prioritization strategy. This inconsistency problem is probably the worst for loss-of-function variants, which are precisely the ones that interest us most in clinical sequencing.

In this study, for example, there was only 44% agreement on loss-of-function variant annotations between NCBI and Ensembl transcript sets. VEP and ANNOVAR only agreed on 66% of LOF annotations even when using the same transcript database. The most common discrepancies in this category were splicing variants, which (in my opinion) are better identified by VEP than ANNOVAR.
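One simple way to quantify this kind of disagreement is to ask what fraction of variants flagged as loss-of-function by either source are flagged by both. The sketch below uses that definition as an assumption; the paper may compute agreement differently, and the variant IDs are made up.

```python
def lof_agreement(calls_a, calls_b):
    """Fraction of variants called LOF by either source that both call LOF."""
    union = calls_a | calls_b
    if not union:
        return 1.0  # nothing called by either source: trivially in agreement
    return len(calls_a & calls_b) / len(union)

# Toy LOF call sets from two annotation runs (hypothetical variant IDs).
tool_a = {"var1", "var2", "var3"}
tool_b = {"var2", "var3", "var4"}

agreement = lof_agreement(tool_a, tool_b)
discordant = tool_a ^ tool_b  # variants only one source calls LOF
print(agreement, sorted(discordant))
```

Reviewing the discordant set by hand is usually more informative than the headline percentage, since it shows which consequence types (splice sites, for instance) drive the disagreement.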

4. Genetic evidence over biological plausibility

To identify candidate disease-causing variants, the authors used a combination of predicted functional impact, frequency in the population, and transmission within a family. If that sounds familiar, it’s because these are three of the four variant prioritization strategies described in the MendelScan paper. The authors here also leveraged “statistical evidence of association” when multiple independent cases for the same disorder were available, which is a fancy way of saying that they looked for recurrently mutated genes.

It’s always tempting, in studies like these, to eyeball a list of candidate causal variants and pick out the ones that seem most biologically plausible. We all love looking at the table of variants and pointing out our “favorite” genes. The authors did some work to demonstrate why that’s a dangerous game to play.

For example, they compiled a list of 83 genes linked to X-linked mental retardation (XLMR). Some 30 of 109 male cases with this phenotype (28%) carried at least one novel missense variant at a conserved residue in one of those genes. Yet only two of those were ultimately deemed to be pathogenic.

Interestingly, the authors found that as the strength of gene candidacy increased, the number of putative pathogenic variants actually decreased. It needs to be said, however, that patients with easily-obtained genetic diagnoses probably didn’t make it into this program. Granted, WGS did uncover a small number of genetic testing “misses” — four cases (2.5% of the cohort) were negative for a clinical genetic test but actually had a causal variant in the tested gene — but we should keep in mind that clear pathogenic mutations in known disease genes are almost certainly underrepresented in this study.

5. WGS reveals candidate pathogenic regulatory variants

One of the big selling points of WGS is that it can detect large-scale and/or complex variation, as well as variants in noncoding regulatory regions. The challenge, of course, is that such variants are often difficult to detect (SVs) or prove as causal (regulatory variants). In this study, the authors leveraged the discovery power of WGS to identify candidate regulatory variants for two conditions.

One was a complex rearrangement in a patient with X-linked hyperparathyroidism involving a deletion on the X-chromosome and insertion of 50kb of sequence from chromosome 2. This occurred about 80 kb downstream of SOX3, a strong candidate gene for the condition. It’s the perfect example of a variant that would never be detected by exome or targeted sequencing.

The other candidate pathogenic regulatory variant was a single base change at a conserved position in the 5′ UTR of EPO. This gene encodes erythropoietin, an essential factor for red blood cell formation. Whole-genome sequencing revealed the presence of that variant in two unrelated families with erythrocytosis, and in both of them, it segregated with the disease. This is a particularly compelling finding since increased levels of erythropoietin cause higher blood cell mass, which is a hallmark of erythrocytosis.

Findings like these, however anecdotal they may seem, add to the growing body of evidence that WGS (and not more targeted approaches) is the way to go for clinical sequencing.

6. Secondary incidental findings are rare

Another argument against whole genome (or even whole-exome) sequencing is the concern about incidental findings which might be unrelated to the referring diagnosis, but nevertheless represent important medical information that should be returned to the patient. At the moment, the American College of Medical Genetics has a very narrow view of the types of incidental findings that are returnable.

In other words, most cases undergoing clinical WGS won’t have a secondary finding under the current guidelines.

In support of this notion, while the authors of this study identified 32 variants in 18 genes on the “ACMG list” of 56 genes, a detailed literature review and curation removed all but 6. And the evidence supporting most of those as pathogenic is clinically weak. The strongest incidental finding (in my opinion) was a BRCA2 nonsense mutation; the rest had conflicting reports in ClinVar or were observed at appreciable frequencies in public databases.

So that’s 1 out of 156 cases with a bona fide incidental finding. Also known as a very small minority.

7. Collaboration is required

The authors make what I think is a very useful point in their discussion:

The identification of pathogenic variants, the exclusion of potential candidate variants and the identification of incidental findings relied on close collaboration between analysts, scientists knowledgeable about the disease and genes, and clinicians with expertise in the specific disorders.

In other words, a multi-disciplinary team with different branches of expertise (genetics, bioinformatics, clinical care, etc.) will almost certainly be required to achieve the full diagnostic potential of clinical genome sequencing.


Taylor JC, Martin HC, Lise S, Broxholme J, et al (2015). Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nature Genetics, 47 (7), 717-26 PMID: 25985138

The Updated Catalogue of Retinal Disease Genes

As you might guess, I’m keenly interested in the genetics of retinal diseases like retinitis pigmentosa and macular degeneration. It’s therefore a thrill when there’s an update to RetNet — the database of genes and loci causing retinal disease — that includes one of our recent discoveries.

For the last few years, we’ve been working with Steve Daiger, Sara Bowne, and Lori Sullivan at the University of Texas, Houston to find new genes for retinitis pigmentosa (RP), a retinal degenerative disorder affecting about 1 in 5,000 individuals in the United States. The disease usually manifests in childhood or adolescence with night blindness, followed by progressive loss of peripheral vision and eventually central vision.

RP is a Mendelian disorder (i.e. caused by mutations passed from one or both parents to a child) but is incredibly heterogeneous: it can be inherited in dominant, recessive, or X-linked fashion. About 20 genes have been linked to the dominant form, and if you screen them (e.g. with a capture panel) in a newly-diagnosed patient, you find the causal mutation about 50-75% of the time. Steve’s group has spent the last 20 years building a sample cohort of families in the remaining 25-50%.

As part of our collaboration, we sequenced the exomes of several individuals from a large dominant RP pedigree. It was so large that we actually treated it as two distinct families, because we thought there were two genes. But our variant analysis of the exome data revealed that there was one variant that was present in every affected, absent from every unaffected, and as-yet-unknown to dbSNP. A promising lead, but there were two issues:

  1. The variant was homozygous in one of the affected individuals, which is generally unexpected for rare dominant Mendelian disorders.
  2. The variant’s gene was hexokinase 1 (HK1), which catalyzes phosphorylation of glucose to glucose-6-phosphate and has no obvious connection to retina function.

The gene was highly expressed in the retina, which is consistent with many known RP genes. The final piece of evidence came from laborious screening of HK1’s exons in hundreds of families from the Daiger cohort. That turned up a second family with the same exact disease-causing mutation. Our publication of HK1 last year established it as a new disease gene for dominant RP and suggests a new pathway (glycolysis) that may be involved in retinal disease.

Growth of Known Retinal Disease Genes

Here’s the latest content of RetNet, with numbers compared to the last release (end of 2014).

  • 278 total retinal-disease genes have been mapped (up from 261).
  • 238 have been identified at a DNA level (up from 221).

At least 25% of RetNet genes are associated with complex developmental and/or cerebellar diseases that include incidental retinal findings.  One reason for their inclusion is that many also have mutations with ocular findings only.  However, panel screening of these genes is likely to detect mutations with severe non-ocular consequences.

The First Noncoding-RNA Retinal Disease Gene

This release of RetNet includes the first entry of a non-coding RNA gene associated with retinal disease. Conte et al applied linkage mapping and exome sequencing of a five-generation British family with dominant retinal degeneration and bilateral iris coloboma (“holes in the iris”). They identified a variant in the seed region of MIR204, a micro-RNA gene at chr9q21.12, which segregated with disease.

Subsequent experimental work demonstrated that miR-204 plays a role in ocular development and that the variant allele severely altered its targeting abilities. Very cool stuff.

The Award for Disease Diversity Goes to…

The PRPH2 gene, which encodes peripherin (a protein in rod photoreceptor outer segments), was cloned in 1990. Over the last 25 years, mutations in that gene have been linked to:

  • Dominant retinitis pigmentosa (accounts for 5% of cases)
  • Dominant macular dystrophy
  • Dominant cone-rod dystrophy
  • Dominant central areolar choroidal dystrophy
  • Dominant adult vitelliform macular dystrophy
  • Recessive Leber congenital amaurosis

It’s also been linked to a super-rare digenic form of retinal disease: heterozygous mutations in PRPH2 and another gene (ROM1) in the same individual can cause retinitis pigmentosa.

Other RetNet Highlights

Here are some of the other recent findings that made the latest release of RetNet which highlight the complexity of retinal disease genetics.

Syndromic Retinal Disease

Often, retinal disease manifests as one of several symptoms in a rare genetic syndrome. For example:

  • HGSNAT (8p11.21).  Recessive HGSNAT mutations cause non-syndromic RP but other mutations cause Sanfilippo syndrome, a mucopolysaccharidosis with central nervous system degeneration and retinal dystrophy.  The protein, lysosomal N-acetyltransferase, acetylates heparin and heparan sulfate. 
  • IFT172 (2p33.3).  Recessive mutations in IFT172 cause a range of disorders including non-syndromic RP, and Bardet-Biedl, Jeune or Mainzer-Saladino syndromes.  The protein is involved in intraflagellar transport and, as with the other IFT proteins, is a cause of variable ciliopathies.
  • LAMA1 (18p11.31-p11.23).  Mutations in LAMA1 cause recessive Poretti-Boltshauser syndrome with variable developmental abnormalities of the brain and retina.  The protein is a laminin, one of a family of proteins with critical roles in embryogenesis.
  • NR2F1 (5q15).  Mutations in NR2F1 cause dominant optic atrophy with intellectual disability and developmental delay, also known as Bosch-Boonstra optic atrophy.  The protein is a nuclear receptor involved in optic nerve and cerebellar development. 
  • PNPLA6 (19p13.2).  Recessive mutations in PNPLA6 cause variable disorders, such as Boucher-Neuhauser, Oliver-McFarlane or Gordon Holmes syndromes, involving spinocerebellar ataxia, hypogonadism and chorioretinal dystrophy.  The protein is involved in phosphatidylcholine metabolism. 


Genes Linked to Retinal Disease

Other genes updated in this release of RetNet are linked primarily to retinal disease, rather than a constellation of symptoms. For example:

  • DHX38 (16q22.2).  A homozygous missense mutation in DHX38 causes recessive RP and macular coloboma in a consanguineous family.  The protein is a pre-mRNA splicing helicase.
  • DRAM2 (1p13.3).  DRAM2 mutations in several families cause recessive, adult-onset retinal dystrophy with early macular involvement.  DRAM2 codes for a transmembrane protein which initiates autophagy with a role in photoreceptor disc recycling.  
  • KIZ (20p11.23).  Mutations in KIZ cause recessive rod-cone dystrophy and may account for 1% of recessive RP patients in some populations.  The protein is centrosome-associated, as are other ciliopathy proteins.
  • RDH11 (14q24.1).  RDH11 mutations cause recessive RP with developmental abnormalities in an Italian-American family.  The protein plays a role in oxidizing 11-cis-retinol to 11-cis-retinal in the visual cycle. 
  • TTLL5 (14q24.3).  Mutations in TTLL5 cause recessive cone and cone-rod dystrophies.  The protein is a tubulin glutamylase found in photoreceptor cilia and sperm flagella. 

The diversity of phenotypes, pathways, and gene functions associated with retinal disease continues to astonish me. As usual, we’ve made remarkable progress but there’s more work to do.