Return of Results from Next-gen Sequencing


Image credit: CDC Blogs

The rapid adoption of next-gen exome and genome sequencing for clinical use (i.e. with patient DNA) raises some difficult questions about the return of results to patients and their families. In contrast to traditional genetic testing, which usually checks for variants in specific genes, high-throughput sequencing has the potential to reveal a number of secondary findings, i.e., genetic variants with medical relevance but not related to the condition that merited the test.

Two articles in the current issue of AJHG delve into the sticky issue of incidental genetic findings. Holly Tabor et al analyzed de-identified exome data from 6,517 individuals obtained from NHLBI’s Exome Sequencing Project (ESP). They examined the burden of pathogenic variants in three sets of biomedically important genes:

  1. Genes underlying 31 Mendelian conditions, most of which are inborn errors of metabolism, recommended for newborn screening (NBS, n=39)
  2. Genes associated with the risk of age-related macular degeneration, a complex disease and the most prevalent form of vision loss (ARMD, n=17)
  3. Genes known to influence drug response, i.e. replicated pharmacogenetics hits from PharmGKB (PGx, n=14).

Variant GERP scores (Tabor et al, AJHG 2014)

Looking only at SNVs called by GATK, the authors identified 10,879 variants affecting the 70 disease genes across the full ESP cohort. Unsurprisingly, filtering this set to include only variants that had a high call rate and were listed in OMIM and HGMD for the correct phenotype reduced it by over 90%, to around 400 total variants.
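To make the filtering step concrete, here is a minimal sketch in Python. The `Variant` fields, the 0.95 call-rate cutoff, and the placeholder gene labels are hypothetical stand-ins for illustration; the post only says that variants needed a high call rate and a listing in OMIM and HGMD for the correct phenotype.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str          # placeholder gene label, not from the paper
    call_rate: float   # fraction of samples with a genotype call
    in_omim: bool      # listed in OMIM for the expected phenotype
    in_hgmd: bool      # listed in HGMD for the expected phenotype


MIN_CALL_RATE = 0.95   # assumed threshold; the actual cutoff is not given here


def passes_filter(v: Variant) -> bool:
    """Keep a variant only if it is well called and listed in both databases."""
    return v.call_rate >= MIN_CALL_RATE and v.in_omim and v.in_hgmd


variants = [
    Variant("GENE_A", 0.99, True, True),    # kept
    Variant("GENE_A", 0.80, True, True),    # dropped: low call rate
    Variant("GENE_B", 0.99, False, True),   # dropped: not in OMIM
]
kept = [v for v in variants if passes_filter(v)]
print(len(kept), "of", len(variants), "toy variants kept")

# Size of the reduction quoted in the post: 10,879 variants down to ~400.
reduction = 1 - 400 / 10_879
print(f"reduction: {reduction:.0%}")
```

With the numbers quoted above, the reduction works out to roughly 96%, consistent with "over 90%".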

Included versus Excluded Variants

Next, the authors evaluated some of the characteristics of variants that made it through to their final set, versus variants that they’d excluded. Because pathogenic mutations should be under strong purifying selection, one would expect them to be extremely rare and to occur at positions with high conservation across evolution.

The lovely violin plots at right show the mean GERP scores (a measure of conservation) for included versus excluded variants in the newborn screening (NBS), age-related macular degeneration (ARMD), and pharmacogenetics (PGx) genes examined. As the authors hoped, GERP scores were significantly higher for included versus excluded variants, particularly for the severe recessive disease genes screened for in newborns. Included ARMD and PGx variants also had higher GERP scores, but with a wider spread.

A comparison of the Polyphen-2 scores, which offer computational estimates of how damaging amino acid substitutions will be, also showed significant differences, with included variants in the NBS predicted to be far more deleterious than the excluded variants. The effect was again consistent but less striking in the ARMD/PGx sets.

Together, these patterns are consistent with the idea that mutations underlying severe, highly penetrant phenotypes (i.e. the NBS set) are more deleterious — and thus under stronger natural selection — than variants associated with complex phenotypes like ARMD and PGx.

Rate of Incidental Findings

Carrier burden of recessive alleles (Tabor et al, AJHG 2014)

Having established that their final set of ~400 variants was properly vetted, the authors set out to establish the burden of pathogenic mutations that might be found in any individual’s exome. The majority of included variants were rare, with MAF<0.5%.

The carrier burden in the NBS set was surprisingly high (0.57 per exome), with 45% of individuals carrying at least one allele and 11% carrying at least two alleles. If the ARMD and PGx variants were also considered, each individual carried 15.3 risk alleles on average.
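For readers who want to see how summary figures like these are derived, here is a small sketch. The per-individual allele counts are made-up toy data, not ESP results, and `burden_summary` is a hypothetical helper name.

```python
def burden_summary(allele_counts):
    """Summarize per-individual counts of pathogenic alleles."""
    n = len(allele_counts)
    return {
        "mean_per_exome": sum(allele_counts) / n,
        "frac_at_least_one": sum(c >= 1 for c in allele_counts) / n,
        "frac_at_least_two": sum(c >= 2 for c in allele_counts) / n,
    }


# Toy data: number of pathogenic alleles carried by each of 10 individuals.
toy_counts = [0, 0, 1, 0, 2, 1, 0, 0, 1, 0]
summary = burden_summary(toy_counts)
print(summary)
```

The same three statistics (mean alleles per exome, fraction carrying at least one, fraction carrying at least two) are what the post reports for the NBS gene set.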

These findings challenge the assumption that secondary findings (actionable results) and incidental findings (potential clinical utility) uncovered by exome or genome sequencing are rare. Indeed, a research highlight on the paper from Nature Genetics noted that the study demonstrates the “striking prevalence of actionable incidental or secondary results, including ones of direct clinical usefulness, which might be obtained in patient sequencing.”

In my next post, I’ll tackle the medical community’s opinions about sharing secondary findings, based on a recent survey of 900 genetics professionals.

Tabor HK, Auer PL, Jamal SM, Chong JX, Yu JH, Gordon AS, Graubert TA, O’Donnell CJ, Rich SS, Nickerson DA, NHLBI Exome Sequencing Project, & Bamshad MJ (2014). Pathogenic variants for mendelian and complex traits in exomes of 6,517 European and african americans: implications for the return of incidental results. American Journal of Human Genetics, 95 (2), 183-93 PMID: 25087612

Sequencing Finnish Population Isolates (SISu)

If you compare any individual’s genome to the human reference sequence, you’ll find around 3 million differences. Most of these (95%) are already known and have been catalogued in databases like dbSNP. Many are common, shared by 5% or more of human populations. They may still have biomedical relevance, of course; genome-wide association studies (GWAS) of common genetic variation have found thousands of genetic loci associated with disease susceptibility and other complex traits.

But there are still huge numbers of rare (MAF<0.5%) and low-frequency (MAF<5%) genetic variants. Their contribution to human health is harder to understand, particularly because such variants:

  • Are usually not included on high-density SNP arrays
  • Occur in few individuals, and thus require large cohorts
  • Have low individual power for genetic association
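The frequency classes used throughout these posts can be written down as a tiny function. This is a sketch using the cutoffs quoted in the text (rare below 0.5%, low-frequency from 0.5% up to 5%, common at 5% or more); how the exact boundary values are assigned is my assumption.

```python
def frequency_class(maf: float) -> str:
    """Classify a variant by minor allele frequency (MAF, as a fraction)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie between 0 and 0.5")
    if maf < 0.005:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"


for maf in (0.001, 0.02, 0.30):
    print(maf, frequency_class(maf))
```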

One way to address the challenges of rare variants is to study them in founder populations in which such variants are more common. Ashkenazi Jews and Amish families, for example, have undergone population bottlenecks: a limited number of founders gave rise to the current populations.

This breeding isolation, whether cultural or geographic in nature, increases the frequency of some variants that are otherwise quite rare in broad populations. And if those variants underlie a genetic disorder, the risk of the disease is increased. Ashkenazi Jews, for example, have increased risk of many uncommon genetic disorders.

Finland has a unique population history — a bottleneck followed by geographic isolation — the result of which is a Finnish “disease heritage”: a high incidence of 40+ Mendelian disorders. Dozens of rare Mendelian disease genes were mapped in Finns, and that knowledge is valuable for understanding disease biology. What about rare variants underlying common, complex disease? Here the Finns have an important resource: nationalized health records with decades of follow-up data.

Sequencing Initiative Suomi (SISu)

The Sequencing Initiative Suomi (SISu) aims to combine the unique population structure, the health records, and the substantial Finnish interest in genetics. The first study from SISu, just out in PLoS Genetics, compares the exomes of 3,000 Finns to an equal number of non-Finnish Europeans (NFEs). They found:

  • A depletion of “singletons” (variants only seen in one individual) in Finns: 3.7 times fewer singletons than NFEs
  • An excess of low-frequency variants (MAF 0.5-5%) in Finns relative to NFEs
  • Similar patterns of common variants between Finns and NFEs

Finnish loss-of-function variants (Lim et al, PLoS Genetics 2014)

All of these are consistent with the expected bottleneck effect on Finnish populations. When variants were stratified by annotation (i.e. their predicted effect on genes), Finns had a higher proportion of likely-deleterious missense variants and more severe loss-of-function (LoF, or protein-truncating) variants. The average Finn had 0.160 homozygous LoF variants, whereas the average NFE had 0.095.

To determine if some of these enriched LoF variants have phenotypic effects, the authors genotyped 83 of them in 36,262 individuals from three large Finnish cohorts. Using the deep phenotype data — quantitative traits like blood pressure, lipids, etc. — they found 5 significant associations.

LPA level (Lim et al, PLoS Genet 2014)

One of these was an association between splice site variants in LPA, the gene encoding lipoprotein(a), and decreased levels of circulating lipoprotein(a). As it happens, circulating lipoprotein(a) is a risk factor for coronary heart disease. The medical records showed that LPA splice variants are protective against cardiovascular disease.

This is only a proof-of-principle study, the tip of the SISu iceberg. Yet it shows the value of sequencing Finnish populations to identify rare variants contributing to complex diseases. Undoubtedly, as large-scale sequencing of Finnish cohorts continues at places like WashU and the Broad Institute, we’ll have even more power to identify genes relevant for common diseases.




Population Whole-genome Sequencing: Dutch Edition

Genome of the Netherlands (GoNL Consortium, Nat. Gen. 2014)

The last time I checked, the database of human genetic variation (dbSNP) contained over 50 million unique sequence variants. And yet, as anyone who analyzes exome or whole-genome sequencing data can tell you, every individual harbors a significant number of variants (usually around 5% of single nucleotide variants, or SNVs) that dbSNP has never seen.

These “private” or rare variants undoubtedly contribute to important phenotypes, such as disease susceptibility. Non-SNV variants, like indels and structural variants, are also under-represented in public databases. The only way to fully elucidate the genetic basis of a trait is to consider all of these types of variants, and the only way to find them is by large-scale sequencing.

In this month’s issue of Nature Genetics, the Genome of the Netherlands (GoNL) Consortium reports the whole-genome sequencing of 250 Dutch families from 5 biobanks across the Netherlands. The families comprised mostly parent-child trios (n=231), along with some family quartets with monozygotic (n=11) or dizygotic (n=8) twins. All told, 769 individuals had their genomes sequenced to ~13x depth.
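The cohort numbers can be checked with a few lines of arithmetic. The family counts come from the text; counting each monozygotic twin pair as a single independent offspring is my assumption, made because it reproduces the figure of 258 independent offspring quoted later in the post.

```python
TRIOS = 231         # two parents + one child
MZ_QUARTETS = 11    # two parents + monozygotic twins
DZ_QUARTETS = 8     # two parents + dizygotic twins

families = TRIOS + MZ_QUARTETS + DZ_QUARTETS
individuals = TRIOS * 3 + (MZ_QUARTETS + DZ_QUARTETS) * 4
parents = families * 2

# Assumption: monozygotic twins share a genome, so each pair counts once.
independent_offspring = TRIOS + MZ_QUARTETS + DZ_QUARTETS * 2

print(families, individuals, parents, independent_offspring)
```

That gives 250 families, 769 sequenced individuals, 500 unrelated parents, and 258 independent offspring.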

Variant Calling

Granted, this is a modest coverage depth, considering that sequencing to 30x or 40x might be considered standard. To help address this, the authors performed joint sample calling with GATK. That, along with a combination of 10 indel/SV calling tools, yielded the following:

  • 20.4 million biallelic SNVs
  • 1.2 million biallelic indels of 1-20 bp
  • 27,500 larger deletions (>20 bp)

Here’s a quick tour of the highlights in each variant class.

Single Nucleotide Variants (SNVs)

Dutch genome SNPs and dbSNP (GoNL Consortium, Nat. Gen. 2014)

Half of the 20.4 million SNVs discovered in this study were rare, with MAF < 0.5%. The others were not quite evenly divided between low-frequency SNVs (4.0 million with MAF 0.5-5%) and common SNVs (6.2 million with MAF >5%). Altogether, there were around 7.6 million SNVs that were novel to dbSNP 137. Most of those (75%) were singletons, meaning that they were observed in just one individual. If we consider only the 500 unrelated individuals sequenced (the parents), that’s about 15,200 novel variants contributed per sequenced genome.
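Here is the back-of-the-envelope arithmetic behind those figures, as a quick sketch. All inputs are the rounded numbers quoted in this post, so the outputs are approximations rather than values from the paper.

```python
TOTAL_SNVS = 20_400_000
LOW_FREQ_SNVS = 4_000_000     # MAF 0.5-5%
COMMON_SNVS = 6_200_000       # MAF > 5%
NOVEL_SNVS = 7_600_000        # not in dbSNP 137
UNRELATED_PARENTS = 500

rare_snvs = TOTAL_SNVS - LOW_FREQ_SNVS - COMMON_SNVS
rare_fraction = rare_snvs / TOTAL_SNVS

novel_singletons = NOVEL_SNVS * 75 // 100            # 75% of novel SNVs
novel_per_genome = NOVEL_SNVS // UNRELATED_PARENTS   # per unrelated parent

print(rare_snvs, rare_fraction, novel_singletons, novel_per_genome)
```

The rare class comes to 10.2 million SNVs (half of the total), about 5.7 million of the novel SNVs are singletons, and the novel variants average out to 15,200 per unrelated genome.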

Among the ~2 million singletons uncovered in the European panel of a different project (1,000 Genomes), 16.5% were observed in GoNL. The authors therefore expect that a “substantial number” of singleton variants reported by these projects will be seen again as larger European cohorts are sequenced. Even so, that’s a lot of “private” variation. Remember, too, that these cohorts are from northwest Europe, arguably one of the best-characterized ancestry groups thus far.

Indels and Structural Variants

Compared to SNV calling, the detection of indels and larger SVs remains a considerable challenge. Anyone working in NGS informatics can tell you that. The authors have put forth a good effort in this arena by combining the results of 10 different variant callers: GATK UnifiedGenotyper, Pindel, 1-2-3SV, Breakdancer, DWAC, CNVnator, FACADE, MATE-CLEVER, GenomeSTRiP and SOAPdenovo. These are all different algorithms, but they boil down to five approaches for uncovering indels and SVs:

  1. Gapped read alignments to the reference (e.g. GATK)
  2. Split reads, an approach pioneered by Pindel
  3. Paired-end read distance/orientation (e.g. BreakDancer)
  4. Overall read depth changes (e.g. CNVnator)
  5. De novo assembly of SV breakpoints.

Dutch genome SV calls (GoNL Consortium, Nat. Gen 2014)

Some of the tools use one approach, while others employ multiple approaches. No single indel/SV caller has emerged as vastly superior to all others, so combining the results from a suite of different tools seems like a good strategy. The authors divided variants into three size categories (1-20 bp, 20-100 bp, and >100 bp) and kept any SV detected by at least two orthogonal tools. Their validation rate (138/144, or 96.5%) for randomly-chosen SVs of at least 20 bp is impressive.
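The "at least two orthogonal tools" rule can be sketched as follows. This is not the consortium's actual merging procedure: the 50% reciprocal-overlap criterion, the greedy clustering, and the reading of "orthogonal" as "based on different detection approaches" are all assumptions on my part, and the coordinates are invented.

```python
from typing import List, NamedTuple


class Call(NamedTuple):
    tool: str
    approach: str   # e.g. "split-read", "read-pair", "read-depth"
    chrom: str
    start: int
    end: int


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two fractions of each call covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def consensus(calls: List[Call], min_overlap: float = 0.5,
              min_approaches: int = 2) -> List[List[Call]]:
    """Greedily cluster overlapping calls; keep clusters backed by
    at least `min_approaches` distinct detection approaches."""
    clusters: List[List[Call]] = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [c for c in clusters
            if len({x.approach for x in c}) >= min_approaches]


calls = [
    Call("Pindel", "split-read", "1", 1000, 1300),
    Call("BreakDancer", "read-pair", "1", 1010, 1320),
    Call("CNVnator", "read-depth", "2", 5000, 11000),   # no second opinion
]
kept = consensus(calls)
print(len(kept), "consensus SV(s)")
```

In this toy input, the deletion on chromosome 1 is supported by two different approaches and survives, while the lone read-depth call on chromosome 2 is dropped.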

The size distribution of consensus calls showed peaks at +/- 4 bp (microsatellite instability), ~300 bp (SINEs), and ~6 kbp (LINEs). Not remarked upon in the manuscript is the largest peak right near zero, since 1-2 bp indels are by far the most common. While 54.4% of short indels (<20 bp) were already in dbSNP, virtually all of the mid-size deletions (30-500 bp) were not (98.4%). Thus, this study helps fill an important gap in the catalogue of human sequence variation.

Functional Variation

LOF variants in Dutch genomes: rare, low-frequency, and common variant distributions (GoNL, Nat. Gen 2014)

Because these families were not obtained “on the basis of phenotype or disease,” their patterns of genetic variation provide a useful model for apparently healthy individuals.

Rare Loss-of-Function Variants

Among rare variants identified in this study, the authors observed an excess of nonsense SNVs and frameshift indels, consistent with the expectation that damaging variants would be under strong purifying selection.

A similar excess-of-rare-events was evident for larger deletions that removed the first exon or >50% of the coding sequence of a gene. The effect was even stronger when considering only genes in the OMIM database, reflecting strong purifying selection against structural changes in key genes.

On average, each individual in GoNL had about 60 nonsense or splice-site SNVs. Most of these, however, were common in the cohort (MAF>5%, and thus unlikely to be deleterious), which illustrates the need for cautious interpretation of apparent loss-of-function (LOF) variants. Looking at rare variants, and using synonymous SNVs as a baseline, the authors estimate that each individual might have 4-5 rare loss-of-function SNVs.

Compound Heterozygous Events

Individuals that were compound-heterozygous (i.e. one variant on each parental haplotype) for rare loss-of-function SNVs/indels/SVs were extremely rare. The authors found just 3 such instances across the cohort (an average of 0.01 events per individual). Such events are thus of considerable interest for disease studies.

Compound heterozygosity for common LOF variants should have been far more prevalent, because these are less likely to be truly deleterious. Indeed it was; there were about 3 compound heterozygous events of common LOF variants per individual. Interestingly, the 1,917 such events observed across the entire cohort were confined to 11 genes (C11orf40, DEFB126, GSTT2, HTR3D, KRTAP4-8, MS4A14, OR13C2, SIGLEC12, TRY6, VWDE, and WNK1) which all seem to have high mutation tolerance.

HGMD False Positives

The human gene mutation database (HGMD) is a commercial repository of “disease causing” variation in humans. When the authors of this study annotated variants with HGMD information, each individual harbored about 20 such variants. In other words, HGMD annotation would suggest that a large number of GoNL individuals have diseases with profound physical (or even lethal) consequences. Whoops.

It’s possible that the HGMD variants simply cause disease in non-Dutch populations, or have low penetrance. An alternative possibility is that HGMD has a lot of false positives. Among the 1,093 HGMD variants in GoNL, almost a third had MAF>1%, which is much higher than the frequency of the diseases they’re reported to cause.
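One way to formalize "too common to cause this disease" is a rough plausibility ceiling on allele frequency. The sketch below is my own simplification, not the authors' method: it assumes Hardy-Weinberg proportions, full penetrance, and a single causal allele per disease, and the function names and example numbers are hypothetical.

```python
import math


def max_plausible_maf(prevalence: float, inheritance: str) -> float:
    """Rough upper bound on causal allele frequency q for a disease
    with the given population prevalence (as a fraction)."""
    if inheritance == "recessive":
        # Affected individuals are homozygous: q**2 <= prevalence.
        return math.sqrt(prevalence)
    if inheritance == "dominant":
        # Carriers are affected: roughly 2*q <= prevalence for small q.
        return prevalence / 2
    raise ValueError("inheritance must be 'recessive' or 'dominant'")


def looks_too_common(maf: float, prevalence: float, inheritance: str) -> bool:
    """Flag a 'disease-causing' variant that is more frequent than the
    disease prevalence could plausibly allow."""
    return maf > max_plausible_maf(prevalence, inheritance)


# Hypothetical recessive disorder affecting 1 in 10,000 people.
print(looks_too_common(0.03, 1e-4, "recessive"))    # variant at 3% MAF
print(looks_too_common(0.001, 1e-4, "recessive"))   # variant at 0.1% MAF
```

Under these assumptions, a variant at 3% frequency is hard to reconcile with a recessive disorder that affects 1 in 10,000 people, while one at 0.1% is not flagged. Low penetrance, one of the alternatives mentioned above, would loosen the bound.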

De Novo Mutations

De novo mutation calling performance (GoNL, Nat. Gen 2014)

One of the most fascinating aspects of this study was the exploration of de novo mutations (variants present in a child but absent from both parents). These events are extremely rare (occurring at a rate of around 1 in 100 million bases), and identifying them absolutely requires sequencing at least three genomes: an individual and both biological parents.
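To see why both parents are required, consider a biallelic site where each genotype is coded as the number of alternate alleles (0, 1, or 2). The sketch below checks whether a child's genotype can be explained by inheritance; the function names are hypothetical, and real callers work with genotype likelihoods rather than hard calls.

```python
def transmissible(genotype: int) -> set:
    """Alternate-allele counts (0 or 1) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_violation(child: int, mother: int, father: int) -> bool:
    """True if no combination of one maternal and one paternal allele
    produces the child's genotype."""
    possible = {m + f
                for m in transmissible(mother)
                for f in transmissible(father)}
    return child not in possible


def is_de_novo_candidate(child: int, mother: int, father: int) -> bool:
    """The classic de novo pattern: child heterozygous, both parents
    homozygous for the reference allele."""
    return child == 1 and mother == 0 and father == 0


print(is_mendelian_violation(1, 0, 0))   # child het, parents hom-ref
print(is_mendelian_violation(1, 0, 1))   # inherited from the father
```

Every de novo candidate is a Mendelian violation, but not the reverse: a violation can just as easily come from a genotyping error in any of the three genomes, which is why the raw counts are so inflated.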

Even then, they’re very difficult to find: across the 258 independent offspring in GoNL there were 4.5 million apparent Mendelian violations. The authors applied a method (PhaseByTransmission) to refine this to 29,162 candidate autosomal de novo mutations. That’s about 113 per offspring, far too many.

So the authors attempted to independently validate over 1,000 candidate de novo mutations by orthogonal sequencing, and found that around 50% were false positives. Some independent Complete Genomics data for 19 parents and 1 child revealed another 1,137 events that were false positives. From these 2,270 observations, the authors developed a random forest classifier to predict whether a candidate mutation would be truly de novo or not, based on a number of different properties. This is something that many other groups (including ours) have done for somatic mutation calling in cancer genomes.

The classifier in this study, which had an estimated accuracy of 92%, relied primarily on factors related to the sequencing depth and read counts, which happens to be the basis for mutation detection with VarScan 2. When applied to the GoNL dataset, the classifier nominated 11,020 high confidence de novo mutations (roughly 42.7 per offspring, with a range of 18 to 74). That’s a bit higher than it should be, but reasonable for downstream analysis.
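A quick check of the numbers in this section, using only figures quoted in the post. Treating the 2,270 training observations as the sum of the orthogonally tested candidates and the 1,137 Complete Genomics false positives is my reading of the text, not something stated outright.

```python
INDEPENDENT_OFFSPRING = 258
HIGH_CONFIDENCE_DE_NOVO = 11_020
TRAINING_OBSERVATIONS = 2_270
CG_FALSE_POSITIVES = 1_137

per_offspring = HIGH_CONFIDENCE_DE_NOVO / INDEPENDENT_OFFSPRING

# Assumed breakdown of the classifier's training set.
orthogonally_tested = TRAINING_OBSERVATIONS - CG_FALSE_POSITIVES

print(round(per_offspring, 1), orthogonally_tested)
```

That is about 42.7 high-confidence de novo mutations per offspring, and 1,133 orthogonally tested candidates, which fits the "over 1,000" mentioned above.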

Paternal Age and de novo Mutation Rate

Paternal age and mutation rate (GoNL Consortium, Nat. Gen 2014)

The authors observed a significant correlation between the father’s age at conception and the number of de novo mutations in the child. This is the third study to report such a trend, and the largest sample size yet. Although the ages of mother and father are highly correlated, the effect of parental age on de novo mutation rate was primarily due to paternal influence.

The authors estimate that each additional year of paternal age caused a 2.5% increase in the number of de novo mutations in the child. Under their model, about 75% of de novo mutations come from the father, and 25% from the mother. Phase analysis using read pairs (a complex process I won’t go into) revealed that 76% of de novo mutations were indeed on the paternal haplotype. So you can thank your dad for 3/4 of your de novo mutations.

De Novo Indels and SVs

The authors attempted to find de novo indels and structural variants. It didn’t go well.

In Summary

The authors have employed moderate-coverage whole genome sequencing to build a resource of 1,000 haplotypes for a small, densely-populated country in northwestern Europe. They added 7.6 million SNVs to dbSNP, and also characterized a large number of new indels and SVs. Many more studies which apply genome sequencing to large population cohorts will be necessary to fill out the catalogue of human genetic variation.

New Challenges of Next-Gen Sequencing

I first started MassGenomics in the early days of next-gen sequencing, when Illumina was called “Solexa” and came in fragment-end, 35-bp reads. Even so, the unprecedented throughput of NGS and the nature of the sequencing technology brought a whole host of difficulties to overcome, notably:

  • Bioinformatics algorithms developed for capillary-based sequencing didn’t scale.
  • Sequencing reads were shorter and more error-prone.
  • The instruments were expensive, limiting access to the technology
  • Most of the genetics/genomics/clinical community had no experience with NGS

All of these are essentially solved problems: new bioinformatics tools and algorithms were developed, the reads became longer and more accurate, benchtop sequencers and sequencing-service-providers hit the market, and NGS was widely adopted by the research community. Mission accomplished!

Yet these victories were short-lived, because we find ourselves facing new challenges. Harder challenges. Here are a few of them.

1. Data storage

You’ve probably seen the plot of Moore’s Law compared to sequencing throughput. In short, the cost of DNA sequencing has plummeted much faster than the cost of disk storage and CPU. A run on the Illumina HiSeq2000 provides enough capacity for about 48 human exomes. Even if you don’t keep the images, each exome requires about 10 gigabytes of disk space to store the bases, qualities, and alignments in compressed (BAM) format. At three runs a month, each instrument is generating about 1.4 terabytes of data files per month. It adds up quickly.
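The storage estimate is easy to reproduce. The per-run and per-exome figures are the ones quoted above; the yearly projection and the use of decimal terabytes (1 TB = 1,000 GB) are my additions.

```python
EXOMES_PER_RUN = 48
GB_PER_EXOME = 10       # bases, qualities, and alignments in BAM format
RUNS_PER_MONTH = 3

gb_per_month = EXOMES_PER_RUN * GB_PER_EXOME * RUNS_PER_MONTH
tb_per_month = gb_per_month / 1000        # decimal terabytes
tb_per_year = gb_per_month * 12 / 1000    # one instrument, one year

print(f"{tb_per_month:.2f} TB per month, {tb_per_year:.2f} TB per year")
```

A single instrument comes to about 1.4 TB per month, or roughly 17 TB per year, before any downstream analysis files are written.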

Analysis of sequencing data — variant calling, annotation, expression analysis, genetic analysis — also requires disk space. Most non-BGI research budgets are finite, so investigators must choose between (1) deleting data, (2) spending money, or (3) holding up data production/analysis. None of those sound very appealing, do they?

2. Achieving Statistical Significance

NGS is no longer an exploratory tool, and descriptive studies reporting a dozen or a couple hundred genomes/exomes are harder and harder to publish. This is particularly true for common diseases, in which large numbers of samples are typically required to achieve statistical significance. The number 10,000 has been discussed as an appropriate number. Even if that many samples could be found, the cost of sequencing so many is substantial. If you had an Illumina X Ten system and could do whole genomes for $1,000 each (that only covers reagents, by the way), it’s still ten million dollars. That’s probably over budget for most groups, so they’ll have to take another tack:

  • Sequencing fewer samples, which will make the work harder to fund/publish
  • Combining some sequencing with follow-up genotyping, which limits the discovery power
  • Collaborating with other labs/consortia, whose sample populations, phenotypes, or study designs may vary

How many of your project planning meetings have ended with someone saying, “Well, maybe we’ll get lucky”?

3. Finding Samples

Getting access to large sample cohorts is another challenge. As I’ve written previously, given the widespread availability of exome and genome sequencing, samples are the new commodity. High-quality DNA samples from informative sources (tumor tissue, diabetes patients, families with rare disorders, even healthy members of minority populations) are increasingly valuable. Why should an investigator collaborate with you, when they might send the samples off for sequencing on their own?

Sequencing samples with public funds (i.e. NIH grants) adds another layer of difficulty: all sequencing data must be submitted to public repositories. This means that the volunteer must have given informed consent not just for study but for data sharing. Local IRBs even need to sign off. The net result is that many of the samples that come to us for sequencing don’t meet the criteria, and must be returned.

4. Privacy

Even if you have an outstanding, comprehensive informed consent document, it might be difficult to get volunteers to sign it. There’s a growing public concern about the privacy of genetic information. As Yaniv Erlich demonstrated by hacking the identities of CEPH sample contributors, genetic profiles obtained from SNP arrays, exome, or genome sequencing can be used to identify individual people. They also contain some very private details, like ancestry and disease risk alleles, that might be exploited, made public, or used for discrimination.

How long is it before genetic profiling replaces Google-stalking as a screening tool for job candidates or romantic interests? Thanks for coming in, Mr. Johnson. All we need now is your Facebook password and a cheek swab.

5. Functional Validation of Genomic Findings

Numerous research groups have demonstrated the immense discovery power of NGS. The mere fact that dbSNP — the NCBI database of human sequencing variation — has swelled to more than 50 million distinct variants tells us something about what pervasive genome sequencing capabilities might uncover. And yet, the variants implicated in sequencing-based studies of human disease are increasingly difficult to “sell” to peer reviewers on genetic information alone. Our inability to predict the phenotypic impact of genetic variants lurks beneath the veneer of genetic discoveries like a shark following a deep-sea trawler.

Referees of most high-impact journals want to see some form of functional validation of genomic discoveries. That’s a daunting challenge for many of us accustomed to the rapid turnaround, high-throughput nature of NGS. Most functional validation experiments are slow and laborious by comparison.

6. Translation of NGS to the Clinic

We all know that NGS is destined for the clinic. Targeted sequencing panels are already in routine use at many cancer centers; in time, this will likely become exome/genome sequencing. Possibly transcriptome (RNA-Seq) and methylome (Methyl-Seq) as well. Undiagnosed inherited diseases, and rare genetic disorders whose genetic cause is unknown, are two other common-sense applications. There are many hurdles to overcome in order to apply a new technology to patient care. CLIA/CAP certification is a complex, expensive, and time-consuming process.

The reporting is more difficult, too. Unlike the research setting in which most NGS results have arisen, a clinical setting requires very high confidence in order to report anything back to the patient or treating physician. This is a good thing, since patient care decisions might be made based on genomic findings. Yet it means that we have a considerable amount of work ahead to ensure that genomic discoveries are followed up, replicated, and otherwise vetted to the point where they can be of clinical use.
