New Insights into Human De Novo Mutations

De novo mutations — sequence variants that are present in a child but absent from both parents — are an important source of human genetic variation. I think it’s reasonable to say that most of the 3-4 million variants in any individual’s genome arose, once upon a time, as de novo mutations in his or her ancestors. In the past few years, whole-genome sequencing (WGS) studies performed in families (especially parent-child trios) have offered some revelations about de novo mutations and their role in human disease.

A recent study in Nature Genetics provides the largest survey of de novo mutations to date. Laurent Francioli et al. identified de novo mutations in 250 Dutch families that were sequenced to ~13x coverage as part of the Genome of the Netherlands (GoNL) project. Their findings confirm many of the observations from previous smaller studies, and offer some new insights into the patterns of de novo mutations throughout the human genome.

Identification of de novo Mutations

To make any global observations about de novo mutations, one generally needs unbiased whole-genome sequencing data for an individual and both parents. Even with those in hand, accurate identification of de novo mutations is challenging because they’re so exquisitely rare. Since the sequencing coverage in this study is a bit light (13x, whereas most studies shoot for ~30x), I had some initial concerns about whether the mutation calls would hold up under scrutiny.

Delving into the online methods, I learned that the samples underwent Illumina paired-end sequencing (2x91bp, insert size 500bp). Alignment and variant calling followed GATK best practices (v2), and the mutations were called with the trio-aware GATK PhaseByTransmission. Next, the authors used a machine learning classifier trained on 592 true positive and 1,630 false positive de novo calls that had been validated experimentally. The net result was 11,020 high-confidence mutations in the 269 children, with an estimated 92.2% accuracy.
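The paper doesn’t spell out the classifier here, so as a rough illustration of the idea, here’s a toy version: a minimal logistic regression trained on simulated calls. The features (alt-allele balance and normalized depth) and their distributions are entirely invented, and I’ve balanced the toy classes rather than reproducing the paper’s 592/1,630 split.

```python
import math
import random

random.seed(0)

# Simulated "validated" de novo calls. Features invented for illustration:
# true het calls cluster near 50% alt-allele balance with normal depth;
# artifacts skew low on both.
def simulate_call(is_real):
    if is_real:
        return (random.gauss(0.5, 0.08), random.gauss(1.0, 0.2))
    return (random.gauss(0.2, 0.10), random.gauss(0.6, 0.2))

train = [(simulate_call(True), 1) for _ in range(500)] + \
        [(simulate_call(False), 0) for _ in range(500)]

# Minimal logistic regression fit by batch gradient descent
w, b, lr = [0.0, 0.0], 0.0, 1.0

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(500):
    gw, gb = [0.0, 0.0], 0.0
    for x, y in train:
        err = predict(x) - y
        gw[0] += err * x[0]
        gw[1] += err * x[1]
        gb += err
    n = len(train)
    w[0] -= lr * gw[0] / n
    w[1] -= lr * gw[1] / n
    b -= lr * gb / n

accuracy = sum((predict(x) > 0.5) == bool(y) for x, y in train) / len(train)
```

On toy data this separable, even a bare-bones model classifies most calls correctly; the real classifier presumably used many more features (mapping quality, strand bias, and so on).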

The numbers are about right: if 92.2% of the calls are real, that’s 10,160 true mutations, or ~37.7 mutations per child. That’s very close to the estimated ~38 per genome. In other words, without experimentally validating all 11,000 mutations (an expensive and laborious task), this is as good as it gets.
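The arithmetic is easy to check:

```python
calls = 11_020        # high-confidence de novo calls
precision = 0.922     # estimated accuracy of the call set
children = 269

true_mutations = calls * precision      # ≈ 10,160 true mutations
per_child = true_mutations / children   # ≈ 37.8 (the post's ~37.7, modulo rounding)
```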

Parent-of-Origin and Replication Timing

De novo mutations and paternal age. Credit: Francioli et al., Nature Genetics 2015.

The authors first examined whether the location of the observed mutations was correlated with any epigenetic variables. There was no significant correlation for most of the variables examined (chromatin accessibility, histone modifications, and recombination rate). With a linear regression model, however, they noted a significant interaction between replication timing and paternal age: mutations in the offspring of younger fathers (<28 years old) were strongly enriched in late-replicating regions, whereas mutations in offspring of older fathers were not.

To dig deeper, the authors looked at 2,621 mutations that could be unambiguously assigned to maternal or paternal origin. The method for this isn’t documented in the online methods, but presumably they looked for instances in which a mutation was in the same read or read pair as a variant unique to one parent. Notably, 1,991 of those origin-inferred mutations (76%) came from the father. After controlling for the number imbalance, the replication-timing-with-parent-age correlation was significant only for mutations of paternal origin.

This makes a certain kind of sense, since the stem cells in the paternal germ line undergo continuous cell division throughout a man’s life, whereas a woman is born with all of the eggs she’ll ever have.

The correlation between paternal age and replication timing is important from a reproductive health perspective, because late-replicating regions have lower gene density and expression levels than early ones. Since the mutations in offspring of younger fathers tend to occur in these regions, they’re less likely to have a functional impact. In support of this idea, on average, the offspring born to 40-year-old fathers had twice as many genic mutations as offspring born to 20-year-old fathers.

In other words, mutations in the offspring of older fathers are not only more numerous, but also more likely to have functional consequences.

Mutations in Functional Regions

Notably, the de novo mutation rates in this study were higher in exonic regions regardless of the paternal age. Overall, 1.22% of mutations were exonic, an enrichment of 28.7% over simulated models of random mutation distribution. Mutations were also enriched in DNase I hypersensitive sites (DHSs), which represent likely regulatory regions. The source of this “functional enrichment” likely has to do with sequence context: mutations often occur at CpG dinucleotides, which are themselves more prevalent in exons and DHSs.
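A quick way to see where a figure like 28.7% comes from is to simulate random mutation placement and compare the expected exonic fraction to the observed 1.22%. The exome size below (~0.95% of the genome) is a round-number assumption for illustration, not the paper’s actual annotation:

```python
import random

random.seed(42)
GENOME = 3_200_000_000
EXOME = 30_400_000   # assume exons cover ~0.95% of the genome (illustrative)

def simulated_exonic_fraction(n_mutations, n_trials):
    """Place mutations uniformly at random; return the mean exonic fraction."""
    total_hits = 0
    for _ in range(n_trials):
        total_hits += sum(1 for _ in range(n_mutations)
                          if random.randrange(GENOME) < EXOME)
    return total_hits / (n_mutations * n_trials)

expected = simulated_exonic_fraction(11_020, 100)   # ≈ 0.0095
observed = 0.0122
enrichment = observed / expected - 1.0              # ≈ 0.28, i.e. ~28%
```

The paper’s simulations would also model sequence context (those CpG dinucleotides), which is exactly why naive uniform placement understates the expected exonic rate.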

Recent studies of somatic mutations in tumor cells revealed a fascinating phenomenon: a reduction in the mutation rate of highly transcribed regions, likely attributed to the fidelity conferred by transcription-coupled DNA repair mechanisms. In the current study of de novo mutations, however, the mutation rate in transcribed regions and DHSs did not appear to be reduced.

The implication here might be that transcription-coupled repair has less of an impact on de novo mutations, though the authors note that their study was only powered to detect a substantial difference (>17%) in mutation rate. That’s understandable, because while the individuals examined here harbored ~40 mutations genome-wide, a tumor specimen might have tens of thousands of somatic mutations (i.e. much better power to detect subtle differences in mutation rate).

Clustered de novo Mutations

One of the most interesting observations in this study was a clustering effect of de novo mutations. If all things were random, given the size of the genome (3.2 billion base pairs) and the number of mutations per individual (~40), we expect them to be pretty far apart. As in, one every 80 million base pairs.

Instead, the authors observed 78 instances in which there were “clusters” of 2-3 mutations within a 20kb window in the same individual. The 161 mutations involved showed no significant differences from the non-clustered mutations with regard to recombination rate (p=0.52) or replication timing (p=0.059), though I should point out that the latter might be approaching an interesting p-value.
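A back-of-envelope calculation shows why 78 clusters is surprising. Assuming ~40 mutations placed uniformly at random per child, the expected number of same-child pairs within 20 kb across the whole cohort is tiny (this treats clusters as pairs and ignores chromosome edge effects, both simplifications):

```python
from math import comb

n_muts = 40                # approx. de novo mutations per child
genome = 3_200_000_000
window = 20_000
children = 269

# Probability that one particular pair of mutations lands within 20 kb
p_pair = 2 * window / genome

expected_per_child = comb(n_muts, 2) * p_pair     # ≈ 0.0098
expected_cohort = expected_per_child * children   # ≈ 2.6, versus 78 observed
```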

Interestingly, however, the clustered mutations exhibited an unusual mutational spectrum, with a strong enrichment for C->G transversions compared to non-clustered mutations (p=1.8e-13).

Mutation spectrum of de novo mutations. Credit: Francioli et al., Nature Genetics 2015.

Based on the nucleotide context, the authors suggest that a new mutational mechanism may be at work involving cytosine deamination of single-stranded DNA (presumably during replication). I don’t have a strong enough chemistry background to evaluate the proposed mechanism, but agree that this unusual pattern merits some more investigation.

Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, Genome of the Netherlands Consortium, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, Slagboom PE, Boomsma DI, Ye K, Guryev V, Arndt PF, Kloosterman WP, de Bakker PI, & Sunyaev SR (2015). Genome-wide patterns and properties of de novo mutations in humans. Nature Genetics, 47 (7), 822-6 PMID: 25985141

How to Succeed at Clinical Genome Sequencing

Whole-genome sequencing holds enormous potential to improve the diagnosis and treatment of human diseases. Although this approach is the only way to capture the complete spectrum of genetic variation, its application in clinical settings has been slow compared to more targeted strategies (i.e. panel and exome sequencing). Everyone talks about cost as the main contributing factor for this, but compared to routine clinical genetics testing it’s actually inexpensive. Let’s be honest: the challenges of detecting and interpreting variants outside the exome are another consideration.

Before WGS is adopted as a routine clinical tool, we will need to demonstrate its diagnostic yield for patients with likely but as-yet-undiagnosed genetic disorders in a medical setting. A recent paper from across the pond offers a promising start. Jenny C. Taylor et al. report preliminary results from the WGS500 program, which hopes to sequence the genomes of 500 patients with diverse genetic disorders who are referred by medical specialists.

So far, they’ve sequenced 217 individuals (156 probands plus some family members) to ~30x haploid coverage. About 21% of cases ended up with a confirmed genetic diagnosis; this goes up to 34% for Mendelian disorders and 57% for family trios. Their report highlights some of the factors influencing success, and offers some important guidelines for other groups hoping to adopt clinical WGS.

1. Joint variant calling improves accuracy

The researchers used a two-step variant calling strategy: first, identify genetic variants in all samples individually, and then perform joint consensus calling in all individuals at all variable sites. We use this strategy ourselves, because it has some important benefits:

  • Recovering variants that were “missed” in certain individuals due to coverage or allele representation
  • Reducing the rate of Mendelian inconsistencies among family members
  • Removing the vast majority of artifactual de novo mutation calls

Specifically, the researchers found that joint calling brought the number of de novo coding mutations in trios down from ~32.1 per child to a more realistic ~2.1 per child.
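To illustrate why the second pass matters, here’s a toy sketch (genotypes and site names invented): a variant the mother truly carries, but that per-sample calling missed, shows up as a false de novo; re-genotyping every sample at the union of variable sites removes it.

```python
# Toy illustration (all calls invented): per-sample calling missed mom's
# heterozygous variant at chr1:100, so a naive comparison flags it as de novo.
child_calls = {"chr1:100": "0/1", "chr2:500": "0/1"}
mom_calls   = {"chr2:500": "0/1"}     # chr1:100 het missed (low coverage)
dad_calls   = {}

# Naive approach: child variants absent from the parents' per-sample call sets
naive_de_novo = [s for s in child_calls
                 if s not in mom_calls and s not in dad_calls]

# Joint step: re-genotype every sample at the union of all variable sites.
# With the reads re-examined at chr1:100, mom's genotype comes back 0/1.
mom_joint = {"chr1:100": "0/1", "chr2:500": "0/1"}
dad_joint = {"chr1:100": "0/0", "chr2:500": "0/0"}

joint_de_novo = [s for s in child_calls
                 if mom_joint.get(s, "0/0") == "0/0"
                 and dad_joint.get(s, "0/0") == "0/0"]
```

The same logic underlies GATK’s joint genotyping workflow, though the real implementation works from genotype likelihoods rather than hard calls.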

2. Filtering with variant databases is important

Private and rare variants represent one of the biggest challenges in human genetics. Every individual harbors a few hundred thousand variants that have never been seen before. This is problematic for clinical genome sequencing, since we expect that most of the pathogenic mutations that cause rare genetic disorders are also quite rare. These blend in, for lack of a better word, with the many rare-but-neutral variants in each genome.

The catalogues of human genetic variation that are generated by sequencing approaches (1000 Genomes, ESP, etc.) can help, since most of the individuals enrolled in those studies do not have severe genetic disorders. Thus, for severe and highly penetrant genetic disorders at least, these large catalogues help us identify and remove variants that have been seen before in presumably-healthy individuals.

You might ask, why don’t we just use dbSNP? It has all of the variants, right? The problem is that dbSNP is too inclusive: it contains variants from places like OMIM, which generally are not found in healthy individuals. It also has some number of somatic mutations from tumor genomes that were submitted before the COSMIC repository existed. In other words, one would have to use extreme care when filtering against dbSNP so that true pathogenic variants aren’t accidentally removed.

Another important strategy described in this paper is the use of internal data (from undiseased control samples) to filter sets of candidate causal variants. This is advantageous because the sequencing technology and variant calling are the same. In this study, the authors found that the vast majority of rare/novel variants that passed external filters could be discounted using data from other WGS500 samples.
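A sketch of that two-stage filter (the thresholds and counts are made up; a real pipeline would read allele frequencies from annotated VCFs):

```python
def keep_candidate(pop_af, internal_count,
                   max_pop_af=0.001, max_internal=1):
    """Keep a variant as a candidate only if it is rare in public
    catalogues (1000 Genomes, ESP) AND rarely seen across internal
    control samples sequenced on the same platform and pipeline.
    pop_af of None means the variant is absent from public catalogues."""
    if pop_af is not None and pop_af > max_pop_af:
        return False   # too common in presumably-healthy populations
    if internal_count > max_internal:
        return False   # likely platform artifact or locally common variant
    return True

# A common SNP, a truly novel candidate, and a recurrent pipeline artifact:
results = [keep_candidate(0.05, 0),
           keep_candidate(None, 0),
           keep_candidate(None, 40)]
```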

3. Leverage multiple sources of annotation

Variant annotation — that is, predicting the likely functional impact of a sequence variant — remains an imperfect art. For any given variant, the annotation can change depending on the transcript database (NCBI or Ensembl), software tool (e.g. VEP versus ANNOVAR), or prioritization strategy. This inconsistency problem is probably the worst for loss-of-function variants, which are precisely the ones that interest us most in clinical sequencing.

In this study, for example, there was only 44% agreement on loss-of-function variant annotations between NCBI and Ensembl transcript sets. VEP and ANNOVAR only agreed on 66% of LOF annotations even when using the same transcript database. The most common discrepancies in this category were splicing variants, which (in my opinion) are better identified by VEP than ANNOVAR.
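The level of disagreement is easy to quantify once both tools’ calls are mapped onto a shared vocabulary. A toy version (variant IDs and calls invented; in practice the consequence terms themselves differ between tools, e.g. ANNOVAR’s “stopgain” versus VEP’s “stop_gained”, and must be harmonized first):

```python
# Sequence Ontology-style terms treated as loss-of-function for this sketch
LOF_TERMS = {"stop_gained", "frameshift_variant",
             "splice_donor_variant", "splice_acceptor_variant"}

def lof_agreement(anno_a, anno_b):
    """Among variants that either tool calls loss-of-function, the
    fraction where both tools agree on LoF status."""
    flagged = [v for v in anno_a
               if anno_a[v] in LOF_TERMS or anno_b.get(v) in LOF_TERMS]
    if not flagged:
        return 1.0
    agree = sum((anno_a[v] in LOF_TERMS) == (anno_b.get(v) in LOF_TERMS)
                for v in flagged)
    return agree / len(flagged)

tool_a = {"v1": "stop_gained", "v2": "splice_donor_variant", "v3": "missense_variant"}
tool_b = {"v1": "stop_gained", "v2": "missense_variant",     "v3": "missense_variant"}

score = lof_agreement(tool_a, tool_b)   # the tools disagree on v2
```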

4. Genetic evidence over biological plausibility

To identify candidate disease-causing variants, the authors used a combination of predicted functional impact, frequency in the population, and transmission within a family. If that sounds familiar, it’s because these are three of the four variant prioritization strategies described in the MendelScan paper. The authors here also leveraged “statistical evidence of association” when multiple independent cases for the same disorder were available, which is a fancy way of saying that they looked for recurrently mutated genes.

It’s always tempting, in studies like these, to eyeball a list of candidate causal variants and pick out the ones that seem most biologically plausible. We all love looking at the table of variants and pointing out our “favorite” genes. The authors did some work to demonstrate why that’s a dangerous game to play.

For example, they compiled a list of 83 genes linked to X-linked mental retardation (XLMR). Some 30 of 109 male cases with this phenotype (28%) carried at least one novel missense variant at a conserved residue in one of those genes. Yet only two of those were ultimately deemed to be pathogenic.

Interestingly, the authors found that as the strength of gene candidacy increased, the number of putative pathogenic variants actually decreased. It needs to be said, however, that patients with easily-obtained genetic diagnoses probably didn’t make it into this program. Granted, WGS did uncover a small number of genetic testing “misses” — four cases (2.5% of the cohort) were negative for a clinical genetic test but actually had a causal variant in the tested gene — but we should keep in mind that clear pathogenic mutations in known disease genes are almost certainly underrepresented in this study.

5. WGS reveals candidate pathogenic regulatory variants

One of the big selling points of WGS is that it can detect large-scale and/or complex variation, as well as variants in noncoding regulatory regions. The challenge, of course, is that such variants are often difficult to detect (SVs) or prove as causal (regulatory variants). In this study, the authors leveraged the discovery power of WGS to identify candidate regulatory variants for two conditions.

One was a complex rearrangement in a patient with X-linked hyperparathyroidism involving a deletion on the X-chromosome and insertion of 50kb of sequence from chromosome 2. This occurred about 80 kb downstream of SOX3, a strong candidate gene for the condition. It’s the perfect example of a variant that would never be detected by exome or targeted sequencing.

The other candidate pathogenic regulatory variant was a single base change at a conserved position in the 5′ UTR of EPO. This gene encodes erythropoietin, an essential factor for red blood cell formation. Whole-genome sequencing revealed the presence of that variant in two unrelated families with erythrocytosis, and in both of them, it segregated with the disease. This is a particularly compelling finding, since increased levels of erythropoietin drive higher red blood cell mass, the hallmark of erythrocytosis.

Findings like these, however anecdotal they may seem, add to the growing body of evidence that WGS (and not more targeted approaches) is the way to go for clinical sequencing.

6. Secondary incidental findings are rare

Another argument against whole genome (or even whole-exome) sequencing is the concern about incidental findings which might be unrelated to the referring diagnosis, but nevertheless represent important medical information that should be returned to the patient. At the moment, the American College of Medical Genetics has a very narrow view of the types of incidental findings that are returnable.

In other words, most cases undergoing clinical WGS won’t have a secondary finding under the current guidelines.

In support of this notion, while the authors of this study identified 32 variants in 18 genes on the “ACMG list” of 56 genes, a detailed literature review and curation removed all but 6. And the evidence supporting most of those as pathogenic is clinically weak. The strongest incidental finding (in my opinion) was a BRCA2 nonsense mutation; the rest had conflicting reports in ClinVar or were observed at appreciable frequencies in public databases.

So that’s 1 out of 156 cases with a bona fide incidental finding. Also known as a very small minority.

7. Collaboration is required

The authors make what I think is a very useful point in their discussion:

The identification of pathogenic variants, the exclusion of potential candidate variants and the identification of incidental findings relied on close collaboration between analysts, scientists knowledgeable about the disease and genes, and clinicians with expertise in the specific disorders.

In other words, a multi-disciplinary team with different branches of expertise (genetics, bioinformatics, clinical care, etc.) will almost certainly be required to achieve the full diagnostic potential of clinical genome sequencing.


Taylor JC, Martin HC, Lise S, Broxholme J, et al (2015). Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nature Genetics, 47 (7), 717-26 PMID: 25985138

The Updated Catalogue of Retinal Disease Genes

As you might guess, I’m keenly interested in the genetics of retinal diseases like retinitis pigmentosa and macular degeneration. It’s therefore a thrill when there’s an update to RetNet — the database of genes and loci causing retinal disease — that includes one of our recent discoveries.

For the last few years, we’ve been working with Steve Daiger, Sara Bowne, and Lori Sullivan at the University of Texas, Houston to find new genes for retinitis pigmentosa (RP), a retinal degenerative disorder affecting about 1 in 5,000 individuals in the United States. The disease usually manifests in childhood or adolescence with night blindness, followed by progressive loss of peripheral vision and eventually central vision.

RP is a Mendelian disorder (i.e. caused by mutations passed from one or both parents to a child) but is incredibly heterogeneous: it can be inherited in dominant, recessive, or X-linked fashion. About 20 genes have been linked to the dominant form, and if you screen them (e.g. with a capture panel) in a newly-diagnosed patient, you find the causal mutation about 50-75% of the time. Steve’s group has spent the last 20 years building a sample cohort of families in the remaining 25-50%.

As part of our collaboration, we sequenced the exomes of several individuals from a large dominant RP pedigree. It was so large that we actually treated it as two distinct families, because we thought there were two genes. But our variant analysis of the exome data revealed that there was one variant that was present in every affected, absent from every unaffected, and as-yet-unknown to dbSNP. A promising lead, but there were two issues:

  1. The variant was homozygous in one of the affected individuals, which is generally unexpected for rare dominant Mendelian disorders.
  2. The variant’s gene was hexokinase 1 (HK1), which catalyzes phosphorylation of glucose to glucose-6-phosphate and has no obvious connection to retina function.

The gene was, however, highly expressed in the retina, which is consistent with many known RP genes. The final piece of evidence came from laborious screening of HK1’s exons in hundreds of families from the Daiger cohort. That turned up a second family with the exact same disease-causing mutation. Our publication of HK1 last year established it as a new disease gene for dominant RP and suggested a new pathway (glycolysis) that may be involved in retinal disease.

Growth of Known Retinal Disease Genes

Here’s the latest content of RetNet, with numbers compared to the last release (end of 2014).

  • 278 total retinal-disease genes have been mapped (up from 261).
  • 238 have been identified at a DNA level (up from 221).

At least 25% of RetNet genes are associated with complex developmental and/or cerebellar diseases that include incidental retinal findings.  One reason for their inclusion is that many also have mutations with ocular findings only.  However, panel screening of these genes is likely to detect mutations with severe non-ocular consequences.

The First Noncoding-RNA Retinal Disease Gene

This release of RetNet includes the first entry of a non-coding RNA gene associated with retinal disease. Conte et al. applied linkage mapping and exome sequencing to a five-generation British family with dominant retinal degeneration and bilateral iris coloboma (“holes in the iris”). They identified a variant in the seed region of MIR204, a micro-RNA gene at chr9q21.12, which segregated with disease.

Subsequent experimental work demonstrated that miR-204 plays a role in ocular development and that the variant allele severely altered its targeting abilities. Very cool stuff.

The Award for Disease Diversity Goes to…

The PRPH2 gene, which encodes peripherin-2 (a protein in rod photoreceptor outer segments), was cloned in 1990. Over the last 25 years, mutations in that gene have been linked to:

  • Dominant retinitis pigmentosa (accounting for ~5% of cases);
  • Dominant macular dystrophy;
  • Dominant cone-rod dystrophy;
  • Dominant central areolar choroidal dystrophy;
  • Dominant adult vitelliform macular dystrophy;
  • Recessive Leber congenital amaurosis.

It’s also been linked to a super-rare digenic form of retinal disease: heterozygous mutations in PRPH2 and another gene (ROM1) in the same individual can cause retinitis pigmentosa.

Other RetNet Highlights

Here are some of the other recent findings that made the latest release of RetNet, which highlight the complexity of retinal disease genetics.

Syndromic Retinal Disease

Often, retinal disease manifests as one of several symptoms in a rare genetic syndrome. For example:

  • HGSNAT (8p11.21).  Recessive HGSNAT mutations cause non-syndromic RP but other mutations cause Sanfilippo syndrome, a mucopolysaccharidosis with central nervous system degeneration and retinal dystrophy.  The protein, lysosomal N-acetyltransferase, acetylates heparin and heparan sulfate. 
  • IFT172 (2p23.3).  Recessive mutations in IFT172 cause a range of disorders including non-syndromic RP, and Bardet-Biedl, Jeune or Mainzer-Saldino syndromes.  The protein is involved in intraflagellar transport and, as with the other IFT proteins, its loss causes variable ciliopathies.
  • LAMA1 (18p11.31-p11.23).  Mutations in LAMA1 cause recessive Poretti-Boltshauser syndrome with variable developmental abnormalities of the brain and retina.  The protein is a laminin; laminins have critical roles in embryogenesis. 
  • NR2F1 (5q15).  Mutations in NR2F1 cause dominant optic atrophy with intellectual disability and developmental delay, also known as Bosch-Boonstra-Schaaf optic atrophy syndrome.  The protein is a nuclear receptor involved in optic nerve and cerebellar development. 
  • PNPLA6 (19p13.2).  Recessive mutations in PNPLA6 cause variable disorders, such as Boucher-Neuhauser, Oliver-McFarlane or Gordon Holmes syndromes, involving spinocerebellar ataxia, hypogonadism and chorioretinal dystrophy.  The protein is involved in phosphatidylcholine metabolism. 


Genes Linked to Retinal Disease

Other genes updated in this release of RetNet are linked primarily to retinal disease, rather than a constellation of symptoms. For example:

  • DHX38 (16q22.2).  A homozygous missense mutation in DHX38 causes recessive RP and macular coloboma in a consanguineous family.  The protein is a pre-mRNA splicing helicase. 
  • DRAM2 (1p13.3).  DRAM2 mutations in several families cause recessive, adult-onset retinal dystrophy with early macular involvement.  DRAM2 codes for a transmembrane protein which initiates autophagy with a role in photoreceptor disc recycling.  
  • KIZ (20p11.23).  Mutations in KIZ cause recessive rod cone dystrophy and may account for 1% of recessive RP patients in some populations.  The protein is centrosome-associated as are other ciliopathy proteins. 
  • RDH11 (14q24.1).  RDH11 mutations cause recessive RP with developmental abnormalities in an Italian-American family.  The protein plays a role in oxidizing 11-cis-retinol to 11-cis-retinal in the visual cycle. 
  • TTLL5 (14q24.3).  Mutations in TTLL5 cause recessive cone and cone-rod dystrophies.  The protein is a tubulin glutamylase found in photoreceptor cilia and sperm flagella. 

The diversity of phenotypes, pathways, and gene functions associated with retinal disease continues to astonish me. As usual, we’ve made remarkable progress but there’s more work to do.

6 Realities of Genomic Research

The rise of next-generation sequencing has worked wonders for the field of genetics and genomics. It’s also generated a considerable amount of hype about the power of genome sequencing, particularly the possibility of individualized medicine based on genetic information. The rapid advances in technology — most recently, the Illumina HiSeq X Ten system — have made heretofore-impossible large-scale whole-genome sequencing studies feasible. I’ve already written about some of the possible applications of inexpensive genome sequencing.

I’m as excited about this as anyone (with the possible exception of Illumina). Even so, we should keep in mind that not everything is unicorns and rainbows when it comes to genomic research. Here are some observations I’ve made about sequencing-empowered genomic research over the past few years.

1. There is never enough power

“Power” is a term that’s being discussed more and more as we plan large-scale sequencing studies of common disease. In essence, it answers the question, “What fraction of the associated variants can we detect with this study design, given the number of samples, inheritance pattern, penetrance, etc.?” Several years ago, when ambitious genome-wide association studies (GWAS) became feasible, there was a hope that much of the heritability of common disease could be attributed to common variants with minor allele frequencies of, say, 5% or more.

If that were true, it would be very good news, because:

  1. We could test such variants in large cohorts using high-density SNP arrays, which are inexpensive.
  2. Our power to detect associations would be high, because many samples in each cohort would carry the variants.
  3. Associated common variants would “explain” susceptibility in more individuals, narrowing the scope of follow-up.

GWAS efforts have revealed thousands of replicated genetic associations. However, it’s clear that a significant proportion of common disease risk comes from rare variants, which might be specific to an individual, family, or population. To achieve power to detect association for these rare variants, you need massive sample sizes (10,000 or more). You also need to use sequencing, since many of these will not be on SNP arrays (even exome-chip); some might have never been seen before.

Despite the falling costs of sequencing, cohorts of that size require a considerable investment.
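To put rough numbers on that, here’s a simple power sketch using the normal approximation for comparing carrier frequencies between cases and controls. The frequencies and the genome-wide alpha are illustrative assumptions, not figures from any particular study:

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_prop(p_case, p_ctrl, n_per_group, z_alpha=5.45):
    """Approximate power of a two-sided two-proportion test.
    z_alpha ~ 5.45 corresponds to genome-wide alpha of 5e-8."""
    se = sqrt(p_case * (1 - p_case) / n_per_group
              + p_ctrl * (1 - p_ctrl) / n_per_group)
    return norm_cdf(abs(p_case - p_ctrl) / se - z_alpha)

# A rare risk allele carried by 1% of cases vs. 0.5% of controls:
low = power_two_prop(0.01, 0.005, 1000)     # essentially zero power
mid = power_two_prop(0.01, 0.005, 10000)    # still under 10% power
high = power_two_prop(0.01, 0.005, 50000)   # finally well-powered
```

Even 10,000 cases and 10,000 controls leave us badly underpowered at genome-wide significance for a variant this rare, which is the whole point: rare-variant association studies need enormous cohorts.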

2. There will be errors, both human and technical

If the power calculations call for sequencing 10,000 samples, you’d better pad that number in the production queue. Some samples will fail due to technical reasons, such as library failure or contamination. Others may fall victim to human or machine errors. We can address some failures (such as a sample swap) with computational approaches, but others will mean that a sample gets excluded.

The challenge of a 10,000 sample study is that, even with very low error/failure rates, the number of samples that must ultimately be excluded from the study might be a little shocking.
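A quick illustration of how small failure rates compound (the per-step rates below are hypothetical, not from any specific sequencing center):

```python
# Back-of-envelope attrition for a 10,000-sample study
n_target = 10_000
failure_rates = {
    "sample QC / consent issues": 0.02,
    "library construction": 0.03,
    "sequencing coverage QC": 0.02,
    "identity / swap checks": 0.01,
}

survival = 1.0
for rate in failure_rates.values():
    survival *= 1.0 - rate       # assume failures are independent

surviving = n_target * survival  # ≈ 9,223 samples make it through
needed = n_target / survival     # ≈ 10,843 to *end up* with 10,000
```

Four modest failure rates quietly cost you nearly 800 samples, so the production queue needs to be padded by close to a thousand.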

3. Signal to noise problems increase

One of the greatest advantages of whole genome sequencing is that it’s an unbiased survey of genetic variation. It lets us search for associations without any underlying assumptions like “associated variants must be in coding regions.” One potential disadvantage is that we’ll be looking at 3-4 million sequence variants in every genome.

Classic GWAS approaches rely on SNP arrays, which interrogate (on average) 700,000 to 1 million carefully selected, validated, assayable markers, with call rates usually >99%. With genome-wide sequencing and variant detection, we’ll most likely be able to detect the variants that contribute to disease risk, but we’ll also have to sift through millions of variants that have no effect on it.


In contrast, a candidate gene study or even exome sequencing has the benefit of pre-selecting regions most likely to harbor functional variants. Not only are there fewer variants, but all things being equal they’re more likely to be relevant because they affect proteins.

4.  We can’t predict all variant consequences

Annotation tools such as VEP and ANNOVAR have come a long way towards helping us identify computationally which variants are most likely to be deleterious. However, their annotations are based on our knowledge of the genome and its functional elements (which remains incomplete) and our best guess as to which variations cause which effects.

Outside of the coding regions, we face an even greater challenge. That’s where most human genetic variation resides, including the substantial fraction expected to play regulatory roles in the genome. Thus, understanding the mechanism by which associated variants affect disease risk will be a long and difficult prospect. It will likely cost more time and resources than finding those variants in the first place.

5. There’s always a better informatics tool

The incredible power of next-gen sequencing required a new generation of analysis tools simply to handle the new nature and vast scale of data. We’ve done well to address many of the challenges, but developing these tools takes time. Keeping them relevant is a particular struggle, because sequencing technologies continue to rapidly evolve.

I remember a meeting a few years ago when we were working on Illumina short-read sequencing (36 bp reads, possibly even single-end) and wondering if we could find a way to build 100 bp contigs. I remember thinking, if we can get to 100 bp, we’ll be home free.

The current read length on the Illumina HiSeq X Ten is 150 bp. The MiSeq platform (while admittedly not for whole-genome sequencing) does 250 bp. And now that still seems far too short, especially to identify structural variation and interrogate the complex regions of the human genome (repeats, HLA, etc.).

6. You can spin a story about any gene

The huge investments and advances in the field of genetics over the past 50+ years have helped us build an incredible wealth of knowledge about genes and their relationships to human health. Granted, a large number of genes have no known function. Even so, with known disease associations, expression patterns, sequence similarities, pathway membership, and other sources of data, we have a lot to work with when it comes time to explain how a gene might be involved with a certain disease.

There’s a danger in that, because it gives us enough information to spin a story about any gene: a plausible explanation of how variation in that gene could be involved in the phenotype of interest. Given that fact, we have to admit that databases and the literature may contain false reports. For example, a recent examination by the ClinGen consortium found that hundreds of variants listed as pathogenic in the OMIM database are now annotated as benign or of uncertain significance by clinical laboratories.

With great power comes great responsibility, and at this moment in genomics there is no greater power than large scale whole genome sequencing.

Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, Ledbetter DH, Maglott DR, Martin CL, Nussbaum RL, Plon SE, Ramos EM, Sherry ST, Watson MS, & ClinGen (2015). ClinGen — the Clinical Genome Resource. The New England Journal of Medicine, 372 (23), 2235-42 PMID: 26014595