Variant Annotation in Coding Regions

Variant annotation (McCarthy et al, Genome Medicine 2014)

The analysis of NGS data comes with many challenges — data management, read alignment, variant calling, etc. — that the bioinformatics community has tackled with some success. Today I want to discuss another critical component of analysis that remains an unsolved problem: annotation of genetic variants. This process, in which we try to predict the likely functional impact of individual sequence changes, is crucial for downstream analysis. Virtually every type of genetic study — family studies of rare disorders, case-control studies, population genetics surveys — relies on annotation to identify the variants that are most likely to influence a phenotype.

A paper currently in pre-print at Genome Medicine reports that the choice of transcript set and annotation software has a large effect on variant annotation. I’ll talk about some of their findings as part of a wider discussion of the variant annotation problem.

Annotation Challenges and Complexities

Even in the protein-coding portions of the genome, where we know the most about gene structure and function, predicting the impact of a single base change is not always straightforward. Here are a few of the reasons why:

  • Multiple isoforms. The ENCODE consortium’s extensive RNA sequencing revealed that the average protein-coding gene has something like 5 different isoforms with different transcription starts/stops or exon combinations. It’s very difficult to predict which of these will be active in the cell type, and at the time point, of interest for a given disease.
  • Overlapping genes. Even if you could handle the isoforms, there are still going to be variants that affect two or more different genes. The two genes might share an exon, or it could be one gene’s exon and another’s promoter. This one-to-many relationship of variants to genes can be problematic in the many downstream pipelines that expect exactly one annotation per variant.
  • Competing annotation databases. There are at least three widely-used annotation databases (ENSEMBL, RefSeq, and UCSC) that provide a set of human transcripts for annotation purposes. Their minimum evidence requirements and curation procedures differ, so the transcript sets they provide are not the same. RefSeq release 57 (REFSEQ)  has 105,258 human transcripts, while ENSEMBL 69 (EMBL) has nearly twice that (208,677).
  • Ranking procedures. Even a variant in a single gene with one isoform can have multiple annotations: it could be a synonymous change that’s also in a splice site, or a nonsynonymous variant that disrupts the stop codon. Which annotation should be reported? “All of them” is too easy an answer. At some point, downstream analysis may require users to make a choice.

Comparing Annotation Databases and Software Tools

In the paper I mentioned, McCarthy et al took 80 million genetic variants (SNPs and small indels) obtained from whole-genome sequencing of 274 individuals. We don’t know much about their ancestry, but the cohort includes 80 patients with immune disease, 151 from families with Mendelian disorders (mostly trios), and 45 from cancer studies (germline DNA only). The authors compared variant annotations from two different tools (ANNOVAR and VEP) using the REFSEQ or EMBL transcript databases.

I’ve discussed this paper with a number of colleagues, and we share some concerns about how the comparison was conducted. Even so, I think that the work highlights some of the important differences between these tools and databases. If I distill it down to what I consider the highlights:

  • ANNOVAR annotation of 80 million variants using either REFSEQ or EMBL transcripts returned matching annotations about 84% of the time.
  • However, for variants considered “loss of function” (LOF: missense, nonsense, nonstop, frameshift, splice site), the concordance was only 44%.
  • Much of the disagreement can be attributed to EMBL having twice as many transcripts: it yielded more exonic annotations, and it also annotated many variants as UTR or noncoding RNA that REFSEQ considered noncoding.
  • VEP and ANNOVAR software tools did not always agree, even when using the same transcript set. VEP seems to provide better annotation of variants in and around splice sites.
  • There are also differences in reporting between the tools: ANNOVAR reports the most-damaging annotation for a variant, whereas VEP tends to report all annotations. This forced the authors to apply a ranking system to VEP results in order to make comparisons, and that likely caused some mismatches as well.
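
To make results from the two tools comparable, you essentially have to collapse multiple annotations per variant into a single most-severe call. Here is a minimal sketch of what that collapsing step might look like; the severity ordering below is my own illustrative assumption, not the ranking used by the authors or by either tool:

SEVERITY_ORDER = [
    "nonsense",      # stop gained
    "frameshift",
    "splice_site",
    "nonstop",       # stop lost
    "missense",
    "synonymous",
    "utr",
    "intronic",
    "intergenic",
]
RANK = {term: i for i, term in enumerate(SEVERITY_ORDER)}

def most_severe(annotations):
    """Return the highest-impact annotation; unrecognized terms sort last."""
    return min(annotations, key=lambda a: RANK.get(a, len(SEVERITY_ORDER)))

print(most_severe(["synonymous", "splice_site"]))    # splice_site
print(most_severe(["utr", "missense", "intronic"]))  # missense

Any such ranking involves judgment calls (is a splice-site change worse than a missense?), which is exactly why two tools can disagree even when given the same transcript set.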

Is There A Right Answer?

It needs to be said that comparative studies like these are extremely difficult to undertake. It’s easy to point out the flaws, but we should still applaud the authors, who took on a major effort to help us better understand how annotations can differ. Variant annotation is much like variant detection, in that the quality of the results depends on:

  1. The software tools (e.g. VEP vs. ANNOVAR, VarScan vs. GATK) and their underlying algorithms
  2. The quality of the input data (e.g. read alignments for variant calling, transcript sets for annotation).

It pains me to say this, but there are limits to what we can do computationally, and we’ll almost certainly need experimental data to determine the right answer. For variant detection, that might mean validating variant calls on an orthogonal platform. For variant annotation, that might mean RNA-seq data or proteomics approaches. This is a hard problem to solve, and coding regions are the parts of the genome we probably know best.

Imagine what it will take to accurately annotate variants in regulatory and noncoding regions of the genome.

References
Davis J McCarthy, Peter Humburg, Alexander Kanapin, Manuel A Rivas, Kyle Gaulton, The WGS500 Consortium, Jean-Baptiste Cazier and Peter Donnelly (2014). Choice of transcripts and software has a large effect on variant annotation. Genome Medicine, 6(26). doi:10.1186/gm543

8 Realities of the Sequencing GWAS

For several years, the genome-wide association study (GWAS) has served as the flagship discovery tool for genetic research, especially in the arena of common diseases. The wide availability and low cost of high-density SNP arrays made it possible to genotype 500,000 or so informative SNPs in thousands of samples. These studies spurred development of tools and pipelines for managing large-scale GWAS, and thus far they’ve revealed hundreds of new genetic associations.

As we all know, the cost of DNA sequencing has plummeted. Now it’s possible to do targeted, exome, or even whole-genome sequencing in cohorts large enough to power GWAS analyses. While we can leverage many of the same tools and approaches developed for SNP array-based GWAS, the sequencing data comes with some very important differences.

1. There Will Be Missingness

SNP arrays are wonderful tools that typically deliver call rates (the fraction of genotypes reported) of 98-99%. Sequencing, in contrast, varies from sample to sample in read depth and genotype quality at many, many positions.

Targeted sequencing, including exome sequencing, tends to have a variable “shoulder effect” in which the read depth obtained forms a bell curve centered on each targeted region.

Thus, you get lower coverage outside of your target. Sometimes it’s enough to call variants, sometimes not. Polymorphic sites in such regions might only be called in 20% of samples, and this phenomenon is usually exacerbated by batch effects.
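
If you want to see how bad this is in your own callset, per-site call rate is trivial to compute. Here is a toy sketch; the genotype strings follow VCF conventions ("./." marks a missing call), and the example numbers are made up:

def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype at one site."""
    called = sum(1 for gt in genotypes if gt not in ("./.", "."))
    return called / len(genotypes)

site = ["0/1", "./.", "0/0", "1/1", "./."]
print(f"call rate: {call_rate(site):.0%}")  # 60%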

2. You Must Backfill Your VCF

The variant calls from next-gen sequencing are typically exchanged in VCF file format. Most variant detection tools output directly into that format, and most downstream annotation/analysis tools will read it. Importantly, variant callers typically don’t report positions that are wild-type. Otherwise the files would be huge.

It’s quite easy to take VCF files from many different samples and merge them together. VCFtools will do that for you. Here’s the problem: unless the VCFs have information about every position (including non-variant ones), any variant called in sample 1 but not sample 2 will get the “missing data” genotype for sample 2 in the merged VCF file.

To illustrate with a very basic example, consider two samples, each with three heterozygous SNPs:

VCF for sample 1:
CHROM	POSN	REF	ALT	...	SAMPLE1
chr1	2004	A	G	...	0/1
chr1	5006	C	T	...	0/1
chr1	8848	T	A	...	0/1

VCF for sample 2:
CHROM	POSN	REF	ALT	...	SAMPLE2
chr1	2004	A	G	...	0/1
chr1	4823	G	A	...	0/1
chr1	8848	T	A	...	0/1

Merged VCF:
CHROM	POSN	REF	ALT	...	SAMPLE1	SAMPLE2
chr1	2004	A	G	...	0/1	0/1
chr1	4823	G	A	...	./.	0/1
chr1	5006	C	T	...	0/1	./.
chr1	8848	T	A	...	0/1	0/1

Now you can see the issue. We don’t know if site 4823 in sample 1 was wild-type or simply not callable in the sequencing data. Same deal for position 5006 in sample 2. And, because of reality #1, you can’t simply guess here. You truly need to go back to the BAM file for each sample and make a consensus genotype call for each missing genotype. Our center has a pipeline for doing this in every cross-sample VCF we produce, but I’m guessing that many sequencing service providers do not.
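
Here’s a rough sketch of the idea behind backfilling. For each missing (“./.”) genotype, you go back to that sample’s BAM and ask whether the position had enough coverage to be called confidently wild-type. The pysam dependency, the depth cutoff, and the file names are illustrative assumptions; a real pipeline would make a proper consensus genotype call rather than a simple depth check:

import pysam

MIN_DEPTH = 8  # assumed minimum read depth to call a site homozygous reference

def backfill_genotype(bam_path, chrom, pos):
    """Genotype a site that was absent from this sample's single-sample VCF."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        depth = bam.count(chrom, pos - 1, pos)  # reads overlapping the site (VCF pos is 1-based)
    if depth >= MIN_DEPTH:
        return "0/0"   # well covered with no variant called: treat as wild-type
    return "./."       # not enough evidence either way: leave it missing

# Hypothetical usage for the merged VCF above:
# backfill_genotype("sample1.bam", "chr1", 4823)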

3. Batch Effects Are Guaranteed

Next-gen sequencing is a rapidly evolving technology. Sure, one platform dominates the market, but it also comes in many different forms: GAIIx, HiSeq2000, HiSeq2500, MiSeq, HiSeq X Ten.

Even with one instrument, the reagent kits, software version, and run quality will affect the resulting sequencing data. Because a well-powered GWAS will require thousands of samples, the sequencing will probably take long enough for something to change. There are other sources of variability, too:

  • Library protocol (insert size, selection method, enzyme choice)
  • Capture reagents (manufacturer, version, probe type)
  • Sequencing software version
  • Aligner version and parameters

If I had to pick one thing to keep the same, I’d go with the reagent used for exome sequencing. There is no universally-accepted definition of the exome, and the hybridization capture technologies offered by leading vendors (Agilent, Nimblegen, Illumina) have key differences.

4. Expect Many Rare Variants

The genetic variants interrogated by most high-density SNP arrays are exceptional in a number of ways:

  1. Most are common with allele frequencies >1% in human populations.
  2. Most are found in multiple populations (European, African, Asian)
  3. Most are not in coding regions (the exome-chips address this somewhat).

Rare variant enrichment (Casals et al, PLoS Genetics 2013)

In other words, the SNPs chosen for inclusion on commercial arrays are the best of the best: high-frequency SNPs that represent, or “tag”, the variation for a much larger region. They must be informative, since a typical array tests about 600,000 SNPs, and a typical genome harbors over 3 million.

In contrast, sequencing will reveal variants across the entire allele frequency spectrum. Most of the variants won’t even be on a commercial SNP array, and 5% or so won’t even be in dbSNP. Thus, the allele frequency spectrum is dramatically different in a sequencing study. This will affect the analysis.

5. Expect Many Ugly Variants

Another requirement for inclusion on a high-density SNP array is assayability. The genotyping technologies are best suited to bi-allelic variants in unique (non-repetitive) regions of the genome. Highly variable sequences (i.e. high density of SNPs or other variants) can alter the assay, so they’re usually avoided.

It is not fair to say that sequencing is completely unbiased by comparison, because probe selection (for targeted/exome studies) must consider properties like uniqueness and GC content. Read mapping, too, introduces bias. Even so, sequencing picks up a lot of the “ugly” variants that SNP arrays so meticulously avoid: low-complexity regions, indels, triallelic SNPs, tandem repeats, etc.

A multi-sample VCF can get pretty ugly. You might have a 2-base insertion right near a 3-base deletion with a SNP between them. You might see all four nucleotides at one SNP.
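
If you’ve never dealt with these records, here is what a multi-allelic site looks like when parsed naively. The record below is made up, but the genotype-index convention is the one defined by the VCF format:

# A made-up multi-allelic record: a SNP position where three alternate alleles were observed.
line = "chr1\t9921\t.\tC\tA,G,T\t80\tPASS\t.\tGT\t1/2\t0/3\t0/0"
fields = line.split("\t")
ref, alts = fields[3], fields[4].split(",")
print(alts)  # ['A', 'G', 'T']

# Genotype indexes refer to REF (0) and the ALT list (1..n), so "1/2" means A/G.
for gt in fields[9:]:
    alleles = [([ref] + alts)[int(i)] for i in gt.split("/")]
    print(gt, "->", "/".join(alleles))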

6. Not All VCFs Are Created Equal

Even VCFs which have been properly backfilled may differ slightly in form without breaking the format’s rules. That’s because the VCF specification allows some flexibility in the names of INFO and genotype fields, and what they contain. A minimalist VCF might contain genotypes with quality scores, whereas a fuller VCF could have 10-12 fields (genotype, score, depth, read counts, probabilities, filter status, strand representation) for every single genotype.

It will be important to find the right balance of keeping enough information while maintaining reasonable file sizes. Also, some VCFs will include genotypes that have failed a filter or QC check, and that failure may only be noted in the per-genotype FT field, not in the site-level FILTER column. If your analyst or conversion tool isn’t aware of this, you could be pulling in a lot of bad calls.
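
As a concrete illustration, here is a minimal sketch of respecting the per-genotype FT field when reading a multi-sample VCF line. The parsing follows the spec’s FORMAT/genotype column layout; the example record is made up:

def passing_genotypes(vcf_line):
    """Yield (sample_index, genotype) for genotypes whose FT field is PASS or absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")          # the FORMAT column defines field order
    for i, sample in enumerate(fields[9:]):
        values = dict(zip(fmt_keys, sample.split(":")))
        ft = values.get("FT", "PASS")
        if ft in ("PASS", "."):
            yield i, values.get("GT", "./.")

line = "chr1\t2004\t.\tA\tG\t50\tPASS\t.\tGT:FT\t0/1:PASS\t0/1:LowDP"
print(list(passing_genotypes(line)))  # [(0, '0/1')] -- the LowDP genotype is dropped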

7. Variant QC Is An Art Form

The genotype table from a GWAS is a beautiful thing. It has 600,000 or so rows, one column for every sample, and 99% of those cells have a good, clean genotype. With PLINK or any number of tools, it’s easy to zip through the table and remove the variants with missingness or Hardy-Weinberg problems.

Hardy-Weinberg equilibrium

With sequencing data, one can and should still use these metrics. But if you removed every variant that’s out of Hardy-Weinberg equilibrium, you probably wouldn’t have many left. Genotyping accuracy in NGS is a different ball game than on SNP arrays. Sequencing-based genotypes are usually quite accurate, but the accuracy isn’t constant: it’s governed by sequencing depth and other factors. The careful heterozygote/homozygote balance prescribed by nature can shift quite easily.
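
For intuition, here is a back-of-the-envelope Hardy-Weinberg check from observed genotype counts using a simple chi-square statistic. Real pipelines generally use an exact test (as PLINK does), so treat this strictly as an illustration of the observed-versus-expected logic, with made-up counts:

def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts to HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of the A allele
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Low read depth can drop one allele of a true heterozygote, shifting the balance:
print(round(hwe_chisq(492, 416, 92), 1))   # small value, consistent with HWE
print(round(hwe_chisq(540, 320, 140), 1))  # large value, heterozygote deficit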

There are other confounding factors, too. We recently examined ten randomly-selected Hardy-Weinberg outliers from a sequencing dataset. Three proved to be dinucleotide polymorphisms (DNPs), another couple were in close proximity to common indels (causing artifactual calls), a few were in highly repetitive sequences, and at least one seemed perfectly sound. By “sound” I mean that when I pulled up the variant calls in IGV, they looked good.

Deciding where to draw the line and include or exclude variants is not a quick and easy decision.

8. Massive Computational Burdens

This last reality isn’t surprising at all: the computational burden is huge. Because sequencing studies uncover most variants within their targets, and those variants tend to be rare, the genotype table (variants in rows, samples in columns) even for exome studies is huge.

The hard disk required for data storage, and the resources (memory, CPU) required for data manipulation/analysis, dwarf what’s required for SNP array data. That’s why some groups, such as our colleagues at Baylor’s Human Genome Sequencing Center, are turning to cloud computing.

The real bummer about the already-considerable computational load is this: if a sample is added to the study, it will undoubtedly contribute some new variants, which means that the entire genotype table must be re-calculated. For example, an exome study with 1,000 samples might have 60,000 variants. That’s already 60 million cells. When you add sample 1001, it might have 500 variants not seen in other samples. Backfilling those genotypes across the existing samples adds half a million operations, just to add one sample.
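
A toy calculation makes the snowball effect obvious. The 500-novel-variants-per-sample figure is the illustrative number from the paragraph above, not a measured constant:

def added_backfill_ops(existing_samples, novel_variants_per_sample=500):
    """Genotype cells to backfill when one more sample joins the table."""
    return novel_variants_per_sample * existing_samples

total = sum(added_backfill_ops(n) for n in range(1000, 1010))
print(total)  # ~5 million extra genotype calls just to add ten samples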

You can imagine how this begins to snowball. Finalizing the sample set before running analysis has never been more important.

On the Bright Side

These caveats of the sequencing GWAS, while important, should not detract from the advantages over SNP array-based experiments. Sequencing studies enable the discovery, characterization, and association of many forms of sequence variation — SNPs, DNPs, indels, etc. — in a single experiment. They capture known as well as unknown variants.

Sequencing also produces an archive that can be revisited and re-analyzed in the future. That’s why submitting BAM files and good clinical data to public repositories — like dbGaP — is so important. Single analyses and meta-analyses of sequencing GWAS may ultimately help us understand the contribution of all forms of genetic variation (common, rare, SNPs, indels) to important human traits.

Variant Prioritization in Rare Mendelian Disorders

Few areas of biomedical research have benefited more from next-gen sequencing than studies of rare inherited diseases. Rapid, inexpensive exome sequencing in individuals with rare, presumably-monogenic diseases has been hugely successful over the past few years. There’s been a lot of discussion in the NGS community about the analysis burden of the large-scale whole-genome sequencing that will be possible with Illumina HiSeqX Ten systems, but even exome sequencing analysis brings considerable challenges.

In the March 2014 issue of the American Journal of Human Genetics, we present a software package called MendelScan to aid the analysis of exome data in rare Mendelian disorders.

Exome Sequencing Challenges

Every individual harbors thousands of coding variants, 5-10% of which are not in public databases such as dbSNP. A study led by my friend Daniel MacArthur found that, even after correcting for annotation errors and other artifacts, the genome of a healthy individual contains ~100 loss-of-function coding variants. With that much background variation, the causal mutation doesn’t always stand out, and that is just one of the reasons that exome sequencing of Mendelian disorders can fail.

And they do fail. The current solve rate for incoming cases at NIH Mendelian Centers remains at around 25%. Dominant disease pedigrees remain more difficult to solve than recessive ones.

Mendelian Disorder: Retinitis Pigmentosa

Retinitis pigmentosa (RP) offers a wonderful example of challenging Mendelian disorders. It’s a “genetically heterogeneous” disease, which is a fancy way of saying that many different mutations in dozens of different genes can cause dominant, recessive, or X-linked disease. The disease affects around 1 in 3,500 individuals in the U.S., and it’s incurable.

No matter the genetic cause, the progression of RP is remarkably uniform. Basically, it’s a disease of rod photoreceptors — the light-sensing kind, not the color-sensing kind — whose slow, inexorable attrition usually causes night blindness (usually apparent by adolescence) and a sustained narrowing of the visual field (tunnel vision). Most RP patients will be legally blind by the age of 40.

Mutations in about 18 different genes can cause dominant RP, which is the form that we’re studying. Routine genetic testing of common disease-causing mutations explains about 50% of incoming cases right off the bat. We’re interested in the cases that come back negative from these screens, the ones that may have rare or as-yet-unknown causal mutations.

Exome Sequencing in 24 Families

We did exome sequencing for 24 families that lacked common disease-causing mutations. The typical family had a proband, the affected parent (because it’s dominant), the unaffected parent, and a distant affected relative. Overall we did 2-7 affected and 0-2 unaffected samples per family, for a total of 91 samples. On average, in each family, we identified ~30,000 single nucleotide variants (SNVs) and 600 insertions/deletions (indels) in coding regions. That’s a lot to sort through when you’ve got 24 families.

Variant Prioritization Strategy

Based on our knowledge of dominant RP, and an analysis of 762 disease-causing mutations downloaded from HGMD, we expected that most disease-causing mutations would exhibit some key characteristics:

  1. Segregation. In dominant pedigrees with full penetrance, all affected individuals should carry the causal mutation, and none of the unaffected individuals should.
  2. Rareness. All of the mutations known to cause dominant RP are quite rare. In the HGMD set, 68% of mutations were absent from dbSNP 137, and another 21% were present only because they were pulled in from OMIM and other mutation databases.
  3. Protein impact. We expect that most (but not all) causal mutations will have an obvious impact on the gene product. When classified by current VEP annotation, most of the known mutations were predicted to alter protein sequence (66%), reading frame (13.5%), splicing (4.3%), or length (6.8%).
  4. Retinal expression. Genes in which mutations cause retinal disease tend to be highly expressed in the retina. According to recent human retina RNA-seq data, about 97% of genes in RetNet (a retinal disease gene database) are in the top 50% of all genes when ranked by retinal expression.

You’ll note that there are exceptions to every rule above. That’s why we were uncomfortable with simply ruling out variants that don’t segregate perfectly or ones that appear to be synonymous. Instead, we developed a scoring algorithm to prioritize variants based on segregation, rareness, annotation, and retinal expression.
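
To give a flavor of what score-based prioritization looks like, here is a simplified sketch. The weights, component scores, and field names are made-up assumptions for illustration, not MendelScan’s actual algorithm; the point is that every criterion contributes to a ranking rather than vetoing a variant outright:

IMPACT_SCORE = {"frameshift": 1.0, "nonsense": 1.0, "splice_site": 0.9,
                "missense": 0.7, "synonymous": 0.2, "intronic": 0.1}

def prioritize(variant):
    """Combine four evidence scores (each in [0, 1]) into one ranking score."""
    segregation = variant["affected_carriers"] / variant["affected_total"]
    rarity = 1.0 - min(variant["pop_allele_freq"], 0.05) / 0.05
    impact = IMPACT_SCORE.get(variant["annotation"], 0.5)
    expression = variant["retina_expression_percentile"] / 100.0
    return segregation * 0.4 + rarity * 0.3 + impact * 0.2 + expression * 0.1

candidate = {"affected_carriers": 4, "affected_total": 4, "pop_allele_freq": 0.0,
             "annotation": "missense", "retina_expression_percentile": 95}
print(round(prioritize(candidate), 3))  # higher scores rank higher

Because every criterion contributes rather than vetoes, a causal variant that violates one assumption still tends to float near the top of the list.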

So how well does it work? In our exome dataset, 8 of 24 families harbored a likely-pathogenic mutation in a known RP gene. When we sorted the variants in those families by prioritization score, the causal mutation almost never ranked lower than 13th, out of 20,000+ variants. The one exception was a family that turned out to have an error in the pedigree. The causal mutation there ranked #439 out of 26,666 SNVs, so it was still in the top 2%.

This robust performance — even in the face of incorrect assumptions — is why we prefer to prioritize rather than filter-and-remove candidate variants.

Mapping Dominant Disease Genes

MendelScan exome pedigree (Koboldt et al, AJHG 2014)

Even though our scoring algorithm seemed to be working well, we still had hundreds or thousands of variants to sift through in some families. And while those pedigrees often weren’t large enough for traditional linkage analysis, we asked whether the dense information provided by exome sequencing could help nominate or exclude regions based on segregation.

Disease-causing variants usually don’t occur in isolation. They’re part of a haplotype that segregates within a family pedigree (example at right). For dominant disease, all affected individuals have one haplotype in common, the one that hosts the causal variant (denoted in black, on the left):

MendelScan RHRO and SIBD mapping

Rare Heterozygote Rule Out

That haplotype (orange) might also host other variants that aren’t disease-causing, but might still be picked up by exome sequencing. Some may be quite rare, and because they’re physically linked to the causal mutation, they’ll be heterozygous in affected individuals. There also should be no homozygous differences between pairs of affecteds (red) because all affecteds share at least one haplotype. So a cluster of shared rare (heterozygous) variants, and an absence of homozygous differences, helps us map haplotypes shared by affecteds in the pedigree. We call this rare heterozygote rule out (RHRO).
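
In code, the RHRO idea boils down to a windowed scan for shared rare heterozygotes with no opposite-homozygote pairs among affecteds. The sketch below uses made-up window sizes, frequency cutoffs, and data structures as illustrative assumptions, not MendelScan’s implementation:

def rhro_windows(variants, affecteds, window=1_000_000, max_freq=0.01):
    """variants: list of dicts with 'pos', 'pop_freq', and per-sample genotype strings."""
    flagged = []
    if not variants:
        return flagged
    start = 0
    while start <= variants[-1]["pos"]:
        in_win = [v for v in variants if start <= v["pos"] < start + window]
        shared_rare_hets = sum(
            1 for v in in_win
            if v["pop_freq"] <= max_freq and all(v[s] == "0/1" for s in affecteds)
        )
        hom_differences = sum(
            1 for v in in_win
            if {"0/0", "1/1"} <= {v[s] for s in affecteds}  # two affecteds are opposite homozygotes
        )
        if shared_rare_hets >= 2 and hom_differences == 0:
            flagged.append((start, start + window))
        start += window
    return flagged

toy = [
    {"pos": 120_000, "pop_freq": 0.001, "A1": "0/1", "A2": "0/1"},
    {"pos": 450_000, "pop_freq": 0.004, "A1": "0/1", "A2": "0/1"},
]
print(rhro_windows(toy, ["A1", "A2"]))  # [(0, 1000000)]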

Shared IBD Analysis

Another approach takes the same principles but uses identity by descent (IBD). Since the haplotype shared by affecteds was inherited from a common ancestor, we can also search for regions that are IBD between most or all pairs of affecteds. We call this shared IBD (SIBD) analysis, and like RHRO, its discriminatory power grows with the number of affecteds in a pedigree and the genetic distance between them.

We applied these approaches to families with 3+ sequenced affecteds. When you put both mapping methods together, you get something like this:

MendelScan RP gene mapping (Koboldt et al, AJHG 2014)

Some of these families had orthogonal information — traditional linkage peaks or an identified pathogenic mutation — that told us where the disease-causing variant resided (blue line, above), so we could test the performance of these approaches. Our mapping methods did well: they recapitulated known linkage regions and/or captured the region of the causal mutation. And they did so with far fewer affected individuals than were used for the linkage analysis.

Our approaches also identified new candidate regions. These might be eliminated by adding more affected individuals, or they might reflect linkage that was missed by traditional approaches.

Improving Exome Analysis for Rare Disorders

So the MendelScan tool is out and freely available. We would love to have your feedback and suggestions for it! The current JAR release is v1.2.1. Let’s go forth and conquer some Mendelian disorders.

References

Koboldt DC, Larson DE, Sullivan LS, Bowne SJ, Steinberg KM, Churchill JD, Buhr AC, Nutter N, Pierce EA, Blanton SH, Weinstock GM, Wilson RK, & Daiger SP (2014). Exome-Based Mapping and Variant Prioritization for Inherited Mendelian Disorders. American Journal of Human Genetics. PMID: 24560519

Genetic Privacy and Right to Know in the Whole-Genome Era

Genetic privacy in the genome era (Credit: theglobeandmail)

Yesterday I attended a lively panel discussion on the topic of genetic privacy and patients’ rights to their genetic information, hosted by WashU’s Brown School of Social Work. Two panelists in particular — Laura Jean Bierut (Washington University) and Lainie Friedman Ross (University of Chicago) — offered strongly opposing views for and against the patient’s right to know.

You can listen to them debating on St. Louis Public Radio. They and the other panelists each presented for about 10 minutes, before the panel opened up to questions from the audience.

The ensuing discussions highlighted the complexity of the ethical and social issues surrounding genetic information, and made one point quite clear.

Even the experts don’t agree.

ACMG Recommendations for Incidental Findings

Last year, the American College of Medical Genetics published their recommendations concerning the return of “incidental findings” to patients who undergo whole-genome or exome sequencing in clinical settings. Specifically, the ACMG recommended that laboratories conducting clinical sequencing should seek and report mutations of specified classes in 56 genes.

This evaluation “should be performed for all clinical germline (constitutional) exome and genome sequencing, including the ‘normal’ of tumor-normal subtractive analyses in all subjects, irrespective of age, but excluding fetal samples.”

Perhaps most importantly, the ACMG recommends that mutations in these genes should be reported “regardless of the indication for which the clinical sequencing was ordered.” In other words, clinical sequencing may uncover secondary predispositions that are unrelated to the reason that you underwent sequencing in the first place.

Gene Results to be Returned

The ACMG is still taking a conservative view; they’ve largely chosen genes associated with highly penetrant, monogenic disorders. Here are some highlights:

  • Hereditary breast and ovarian cancer (BRCA1, BRCA2)
  • Li-Fraumeni syndrome (TP53)
  • Peutz-Jeghers syndrome (STK11)
  • Lynch syndrome (MLH1, MSH2, MSH6, PMS2)
  • Familial adenomatous polyposis (APC)
  • Colorectal adenocarcinoma (MUTYH)
  • Von Hippel-Lindau syndrome (VHL)
  • Multiple endocrine neoplasia type 1 (MEN1)
  • Neurofibromatosis type 2 (NF2)
  • Marfan syndrome (FBN1)
  • Familial hypercholesterolemia (LDLR, APOB, PCSK9)
  • Malignant hyperthermia susceptibility (RYR1, CACNA1S)

With the exception of BRCA1/BRCA2, these are mostly genes associated with life-threatening diseases that could occur at almost any age. Thus, clinical intervention may improve the patient’s quality of life and overall health.

Limiting Access to Genetic Information

Dr. Ross and others in the panel raised some good arguments in favor of limiting access to genetic information like the secondary findings mentioned above. Some of the more compelling ones:

  1. The right not to know. Many patients and family members are interested in getting information. But not all. Some prefer not to know about their genetic susceptibilities, especially the damning ones: the diseases without a cure.
  2. Genetic uncertainty. Some mutations in some genes have good predictive power. Many do not. There can and almost certainly will be cases of false alarm, and cases of inappropriate reassurance.
  3. Health care system burden. If everyone had access to their 23andMe “risk profile” there would be a surge in demand for all kinds of tests, as people freak out about what diseases they might have. Most of these tests would be unnecessary, and create a significant burden on the health care system.
  4. Lack of counseling. There simply are not enough clinicians and genetic counselors to meet our capacity for delivering sequenced genomes. Some worry about people getting this kind of information without the support of these qualified professionals.

Genetic Right to Know

There was an equal balance of arguments put forth for the right of individuals to their own genetic information. Full disclosure: I lean slightly toward this line of thinking, but you should know that already. Dr. Bierut argued that information wants to get out. She, and other participants in the discussion, raised some good points:

  1. Lifesaving information. There is a real possibility that genetic findings — even secondary ones — could reveal an as-yet-unseen condition for which medical intervention is possible. As a parent, I feel that if I could get such information about my children, I’d want it.
  2. Right to know. It could be argued that people have a fundamental right to information about their own genome’s content, the good and the bad. Or, looking at it another way, it does not seem the place of clinicians or genetics labs to withhold information that they uncover.
  3. Human resilience. Let’s agree that a lot of the secondary findings, especially for the ACMG’s list, are bad news. Does that have a long-term effect? There was a study of people with a family history of Alzheimer’s who decided to learn their APOE status (two copies of the APOE*4 allele increase disease risk about 12-fold). The people who got bad news did show some stress and depression, yes. But within 5 years they’d all returned to baseline.
  4. The role of clinicians. It’s the physician’s job to tell his patients things that they may not want to hear — that they’re overweight, need to cut out sodium, should be sleeping 8 hours a night. Genetic information, too, might fall into this category.

Predictive Power of Genetics

For most diseases, an individual’s risk arises from a combination of genes, environment, age, and other factors. Even for genes whose role in highly penetrant monogenic disorders has been well established, sequencing will reveal many variants of unknown significance. Our current ability to predict pathogenicity of such variants is not very good. What does that mean for a patient with two variants in a recessive disease gene?

In a high-profile examination of a sequenced genome (publicly available), the individual was homozygous for likely-pathogenic variants in two disease genes, both associated with devastating, crippling diseases. It seems like that individual would be severely affected. You know whose genome it was? Dr. James Watson, the Nobel laureate and co-discoverer of DNA’s structure. This is simply a vignette, but it does illustrate something that people in our field tend to accept: there’s a limit to the predictive power of genetics.

Consumer Genetic Testing

One item that many of the clinicians did seem to agree on was the FDA’s crackdown on the “health risk profile” previously available for 23andMe’s personal genetic testing service. Apparently the results being returned didn’t meet the ACMG’s proposed standards for analytical validity, clinical validity, and clinical utility. Many of those in the room didn’t like the idea of direct consumer access to genetic information, though Dr. Bierut was favorable, saying “Information wants to be free.”

Of course, I had my personal genetic testing done by 23andMe, and I’m in favor of that kind of access. But then again, I’m rather biased. What are your thoughts on genetic right to know?