Whole-genome sequencing holds enormous potential to improve the diagnosis and treatment of human diseases. Although this approach is the only way to capture the complete spectrum of genetic variation, its adoption in clinical settings has been slow compared to more targeted strategies (e.g. panel and exome sequencing). Cost is usually cited as the main contributing factor, but compared to routine clinical genetic testing, WGS is actually inexpensive. Let’s be honest: the challenge of detecting and interpreting variants outside the exome is another consideration.
Before WGS is adopted as a routine clinical tool, we will need to demonstrate its diagnostic yield for patients with suspected-but-undiagnosed genetic disorders in a medical setting. A recent paper from across the pond offers a promising start. Jenny C. Taylor et al. report preliminary results from the WGS500 program, which aims to sequence the genomes of 500 patients with diverse genetic disorders referred by medical specialists.
So far, they’ve sequenced 217 individuals (156 probands plus some family members) to ~30x haploid coverage. About 21% of cases ended up with a confirmed genetic diagnosis; this goes up to 34% for Mendelian disorders and 57% for family trios. Their report highlights some of the factors influencing success, and offers some important guidelines for other groups hoping to adopt clinical WGS.
1. Joint variant calling improves accuracy
The researchers used a two-step variant calling strategy: first, identify genetic variants in all samples individually, and then perform joint consensus calling across all individuals at every variable site. We use this strategy ourselves, because it has some important benefits:
- Recovering variants that were “missed” in certain individuals due to coverage or allele representation
- Reducing the rate of Mendelian inconsistencies among family members
- Removing the vast majority of artifactual de novo mutation calls
Specifically, the researchers found that joint calling brought the number of de novo coding mutation calls in trios down from ~32.1 per child to the more realistic ~2.1 per child.
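To make the two-step idea concrete, here is a minimal Python sketch of joint genotyping at the union of variable sites, followed by a naive trio de novo check. The genotype rule, depth threshold, and trio data are all invented for illustration; this is not the WGS500 pipeline.

```python
# Toy two-step joint calling, assuming per-sample allele depths
# (ref_reads, alt_reads) are available at every site of interest.

def genotype(ref_reads, alt_reads, min_depth=8):
    """Naive genotype call from allele depths at one site."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "./."                      # too little data to call
    alt_frac = alt_reads / depth
    if alt_frac < 0.15:
        return "0/0"
    if alt_frac > 0.85:
        return "1/1"
    return "0/1"

def joint_calls(allele_depths_by_sample):
    # Step 1: single-sample calling yields each sample's variable sites.
    union_sites = set()
    for depths in allele_depths_by_sample.values():
        for site, (ref, alt) in depths.items():
            if genotype(ref, alt) in ("0/1", "1/1"):
                union_sites.add(site)
    # Step 2: re-genotype EVERY sample at the UNION of those sites, so a
    # variant "missed" in one sample still gets an explicit genotype there.
    return {
        sample: {site: genotype(*depths.get(site, (0, 0)))
                 for site in sorted(union_sites)}
        for sample, depths in allele_depths_by_sample.items()
    }

def candidate_de_novos(calls, child, mom, dad):
    """Keep child het calls only where BOTH parents are confidently 0/0."""
    return [site for site, gt in calls[child].items()
            if gt == "0/1"
            and calls[mom][site] == "0/0"
            and calls[dad][site] == "0/0"]

trio = {
    "child": {"chr1:1000": (10, 9), "chr2:2000": (12, 11)},
    "mom":   {"chr1:1000": (20, 0), "chr2:2000": (5, 2)},   # marginal alt evidence
    "dad":   {"chr1:1000": (18, 0), "chr2:2000": (15, 0)},
}
calls = joint_calls(trio)
print(candidate_de_novos(calls, "child", "mom", "dad"))  # ['chr1:1000']
```

In the toy trio above, single-sample calling would never have surfaced the mother’s marginal alt evidence at the second site; joint genotyping makes it explicit, and the spurious de novo call disappears.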
2. Filtering with variant databases is important
Private and rare variants represent one of the biggest challenges in human genetics. Every individual harbors a few hundred thousand variants that have never been seen before. This is problematic for clinical genome sequencing, since we expect that most of the pathogenic mutations causing rare genetic disorders are also quite rare. These blend in, for lack of a better term, with the many rare-but-neutral variants in each genome.
The catalogues of human genetic variation that are generated by sequencing approaches (1000 Genomes, ESP, etc.) can help, since most of the individuals enrolled in those studies do not have severe genetic disorders. Thus, for severe and highly penetrant genetic disorders at least, these large catalogues help us identify and remove variants that have been seen before in presumably-healthy individuals.
You might ask, why don’t we just use dbSNP? It has all of the variants, right? The problem is that dbSNP is too inclusive: it contains variants from sources like OMIM, which generally are not found in healthy individuals. It also contains a number of somatic mutations from tumor genomes that were submitted before the COSMIC repository existed. In other words, you’d have to use extreme care when filtering against dbSNP so that true pathogenic variants aren’t accidentally removed.
Another important strategy described in this paper is the use of internal data (from unaffected control samples) to filter sets of candidate causal variants. This is advantageous because the sequencing technology and variant calling pipeline are the same. In this study, the authors found that the vast majority of rare/novel variants that passed external filters could be discounted using data from other WGS500 samples.
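As a rough illustration of this two-layer filtering (external catalogues first, internal same-pipeline controls second), here is a toy Python sketch. The variant IDs, frequencies, and 0.1% cutoff are placeholders, not values from the paper.

```python
# Two-layer rare-variant filter: external population catalogues, then
# internal control genomes sequenced and called on the same pipeline.

EXTERNAL_AF = {          # variant -> allele frequency in 1000 Genomes/ESP-style data
    "chr7:117559590_G>A": 0.004,
    "chr7:117559593_C>T": 0.0002,
}
INTERNAL_CONTROLS = {    # variants also seen in other (unaffected) in-house samples
    "chr7:117559593_C>T",
}

def passes_filters(variant, max_af=0.001):
    af = EXTERNAL_AF.get(variant, 0.0)   # unseen variants count as novel
    if af > max_af:
        return False                     # too common in presumably-healthy catalogues
    if variant in INTERNAL_CONTROLS:
        return False                     # likely platform/pipeline artifact
    return True

candidates = ["chr7:117559590_G>A", "chr7:117559593_C>T", "chrX:12345_A>G"]
print([v for v in candidates if passes_filters(v)])   # ['chrX:12345_A>G']
```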
3. Leverage multiple sources of annotation
Variant annotation — that is, predicting the likely functional impact of a sequence variant — remains an imperfect art. For any given variant, the annotation can change depending on the transcript database (NCBI or Ensembl), software tool (e.g. VEP versus ANNOVAR), or prioritization strategy. This inconsistency is probably worst for loss-of-function variants, which are precisely the ones that interest us most in clinical sequencing.
In this study, for example, there was only 44% agreement on loss-of-function variant annotations between NCBI and Ensembl transcript sets. VEP and ANNOVAR only agreed on 66% of LOF annotations even when using the same transcript database. The most common discrepancies in this category were splicing variants, which (in my opinion) are better identified by VEP than ANNOVAR.
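To show what measuring this kind of concordance looks like in practice, here is a small Python sketch comparing LOF calls from two annotation sources. The consequence terms and variant assignments are made up for illustration.

```python
# Toy LOF concordance check between two annotation sources.

LOF = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}

vep_calls = {
    "var1": "stop_gained",
    "var2": "splice_donor",
    "var3": "missense",
}
annovar_calls = {
    "var1": "stop_gained",
    "var2": "intronic",       # the classic splicing disagreement
    "var3": "missense",
}

# Variants that EITHER source calls LOF
lof_union = {v for v in vep_calls
             if vep_calls[v] in LOF or annovar_calls.get(v) in LOF}
# Agreement = both sources reach the same LOF/non-LOF verdict
agree = {v for v in lof_union
         if (vep_calls[v] in LOF) == (annovar_calls.get(v) in LOF)}
print(f"LOF concordance: {len(agree)}/{len(lof_union)}")   # 1/2
```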
4. Genetic evidence over biological plausibility
To identify candidate disease-causing variants, the authors used a combination of predicted functional impact, frequency in the population, and transmission within a family. If that sounds familiar, it’s because these are three of the four variant prioritization strategies described in the MendelScan paper. The authors here also leveraged “statistical evidence of association” when multiple independent cases for the same disorder were available, which is a fancy way of saying that they looked for recurrently mutated genes.
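For the sake of illustration, here is what combining those evidence types into a simple prioritization score might look like in Python. The weights, thresholds, and gene names are entirely hypothetical; neither this paper nor MendelScan uses this exact scheme.

```python
# Hypothetical prioritization score over the evidence types named above:
# functional impact, population frequency, segregation, and recurrence.

def priority_score(v):
    score = 0
    if v["impact"] in ("stop_gained", "frameshift", "splice"):
        score += 2                           # predicted loss of function
    elif v["impact"] == "missense":
        score += 1
    if v["pop_af"] < 0.001:
        score += 2                           # rare in population catalogues
    if v["segregates_with_disease"]:
        score += 2                           # transmission within the family
    score += min(v["recurrent_cases"], 3)    # independent cases, capped
    return score

variants = [
    {"gene": "GENE_A", "impact": "missense", "pop_af": 0.0,
     "segregates_with_disease": True, "recurrent_cases": 2},
    {"gene": "GENE_B", "impact": "stop_gained", "pop_af": 0.01,
     "segregates_with_disease": False, "recurrent_cases": 0},
]
for v in sorted(variants, key=priority_score, reverse=True):
    print(v["gene"], priority_score(v))      # GENE_A 7, GENE_B 2
```

Note how a rare, segregating, recurrent missense outranks a common stop-gain here: the genetic evidence, not the scariest-sounding annotation, carries the score.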
It’s always tempting, in studies like these, to eyeball a list of candidate causal variants and pick out the ones that seem most biologically plausible. We all love looking at the table of variants and pointing out our “favorite” genes. The authors did some work to demonstrate why that’s a dangerous game to play.
For example, they compiled a list of 83 genes linked to X-linked mental retardation (XLMR). Some 30 of the 109 male cases with this phenotype (28%) carried at least one novel missense variant at a conserved residue in one of those genes. Yet only two of those variants were ultimately deemed pathogenic.
Interestingly, the authors found that as the strength of gene candidacy increased, the number of putative pathogenic variants actually decreased. It needs to be said, however, that patients with easily-obtained genetic diagnoses probably didn’t make it into this program. Granted, WGS did uncover a small number of genetic testing “misses” — four cases (2.5% of the cohort) were negative for a clinical genetic test but actually had a causal variant in the tested gene — but we should keep in mind that clear pathogenic mutations in known disease genes are almost certainly underrepresented in this study.
5. WGS reveals candidate pathogenic regulatory variants
One of the big selling points of WGS is that it can detect large-scale and/or complex variation, as well as variants in noncoding regulatory regions. The challenge, of course, is that such variants are often difficult to detect (structural variants) or to prove causal (regulatory variants). In this study, the authors leveraged the discovery power of WGS to identify candidate regulatory variants for two conditions.
One was a complex rearrangement in a patient with X-linked hypoparathyroidism, involving a deletion on the X chromosome and insertion of ~50 kb of sequence from chromosome 2. This occurred about 80 kb downstream of SOX3, a strong candidate gene for the condition. It’s the perfect example of a variant that would never be detected by exome or targeted sequencing.
The other candidate pathogenic regulatory variant was a single base change at a conserved position in the 5′ UTR of EPO, the gene encoding erythropoietin, an essential factor for red blood cell production. Whole-genome sequencing revealed the variant in two unrelated families with erythrocytosis, and in both it segregated with the disease. This is a particularly compelling finding, since increased levels of erythropoietin drive higher red cell mass, the hallmark of erythrocytosis.
Findings like these, however anecdotal they may seem, add to the growing body of evidence that WGS (and not more targeted approaches) is the way to go for clinical sequencing.
6. Secondary incidental findings are rare
Another argument against whole-genome (or even whole-exome) sequencing is the concern about incidental findings: variants unrelated to the referring diagnosis that nevertheless represent important medical information that should be returned to the patient. At the moment, the American College of Medical Genetics takes a very narrow view of the types of incidental findings that are returnable.
In other words, most cases undergoing clinical WGS won’t have a secondary finding under the current guidelines.
In support of this notion, the authors of this study identified 32 variants in 18 genes on the “ACMG list” of 56 genes, but a detailed literature review and curation removed all but six. And the evidence supporting most of those six as pathogenic is clinically weak. The strongest incidental finding (in my opinion) was a BRCA2 nonsense mutation; the rest had conflicting reports in ClinVar or were observed at appreciable frequencies in public databases.
So that’s 1 out of 156 cases with a bona fide incidental finding. Also known as a very small minority.
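Here is a toy Python sketch of that curation funnel: restrict to ACMG-list genes, then drop variants with conflicting ClinVar assertions or appreciable population frequency. The gene names, variant records, and frequency cutoff are invented for illustration.

```python
# Toy incidental-finding curation funnel.

ACMG_GENES = {"BRCA1", "BRCA2", "MYH7"}     # stand-in for the 56-gene list

findings = [
    {"gene": "BRCA2", "type": "nonsense", "clinvar": "pathogenic",  "pop_af": 0.0},
    {"gene": "MYH7",  "type": "missense", "clinvar": "conflicting", "pop_af": 0.0001},
    {"gene": "BRCA1", "type": "missense", "clinvar": "pathogenic",  "pop_af": 0.02},
]

def reportable(v, max_af=0.005):
    return (v["gene"] in ACMG_GENES
            and v["clinvar"] == "pathogenic"   # unambiguous assertions only
            and v["pop_af"] <= max_af)         # too common = unlikely pathogenic

print([v["gene"] for v in findings if reportable(v)])   # ['BRCA2']
```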
7. Collaboration is required
The authors make what I think is a very useful point in their discussion:
The identification of pathogenic variants, the exclusion of potential candidate variants and the identification of incidental findings relied on close collaboration between analysts, scientists knowledgeable about the disease and genes, and clinicians with expertise in the specific disorders.
In other words, a multi-disciplinary team with different branches of expertise (genetics, bioinformatics, clinical care, etc.) will almost certainly be required to achieve the full diagnostic potential of clinical genome sequencing.
References
Taylor JC, Martin HC, Lise S, Broxholme J, et al. (2015). Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nature Genetics 47(7): 717–726. PMID: 25985138