The Updated Catalogue of Retinal Disease Genes

As you might guess, I’m keenly interested in the genetics of retinal diseases like retinitis pigmentosa and macular degeneration. It’s therefore a thrill when there’s an update to RetNet — the database of genes and loci causing retinal disease — that includes one of our recent discoveries.

For the last few years, we’ve been working with Steve Daiger, Sara Bowne, and Lori Sullivan at the University of Texas, Houston to find new genes for retinitis pigmentosa (RP), a retinal degenerative disorder affecting about 1 in 5,000 individuals in the United States. The disease usually manifests in childhood or adolescence with night blindness, followed by progressive loss of peripheral vision and eventually central vision.

RP is a Mendelian disorder (i.e. caused by mutations passed from one or both parents to a child) but is incredibly heterogeneous: it can be inherited in dominant, recessive, or X-linked fashion. About 20 genes have been linked to the dominant form, and if you screen them (e.g. with a capture panel) in a newly diagnosed patient, you find the causal mutation about 50-75% of the time. Steve’s group has spent the last 20 years building a sample cohort of families in the remaining 25-50%.

As part of our collaboration, we sequenced the exomes of several individuals from a large dominant RP pedigree. It was so large that we actually treated it as two distinct families, because we thought there were two genes. But our variant analysis of the exome data revealed that there was one variant that was present in every affected, absent from every unaffected, and as-yet-unknown to dbSNP. A promising lead, but there were two issues:

  1. The variant was homozygous in one of the affected individuals, which is generally unexpected for rare dominant Mendelian disorders.
  2. The variant’s gene was hexokinase 1 (HK1), which catalyzes phosphorylation of glucose to glucose-6-phosphate and has no obvious connection to retina function.
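The segregation filter described above (present in every affected, absent from every unaffected, not yet in dbSNP) can be sketched in a few lines. This is only an illustration, not the pipeline we actually ran: the function name, the genotype dictionary layout, and the sample and variant IDs are all hypothetical.

```python
from typing import Dict, List, Set


def segregating_variants(
    genotypes: Dict[str, Dict[str, int]],
    affected: Set[str],
    unaffected: Set[str],
    known_variants: Set[str],
) -> List[str]:
    """Return variants carried by every affected sample, by no unaffected
    sample, and absent from a set of known variants (e.g. dbSNP).

    genotypes maps variant ID -> {sample ID -> alt allele count (0, 1 or 2)}.
    A missing call is treated as 0 copies, which is a simplifying assumption.
    """
    hits = []
    for variant, calls in genotypes.items():
        if variant in known_variants:
            continue
        in_all_affected = all(calls.get(s, 0) > 0 for s in affected)
        in_no_unaffected = all(calls.get(s, 0) == 0 for s in unaffected)
        if in_all_affected and in_no_unaffected:
            hits.append(variant)
    return hits


# Toy pedigree with made-up names. Note that aff2 is homozygous (2 copies)
# for var_A, which still passes a "carried by every affected" filter.
genotypes = {
    "var_A": {"aff1": 1, "aff2": 2, "aff3": 1, "unaff1": 0, "unaff2": 0},
    "var_B": {"aff1": 1, "aff2": 0, "aff3": 1, "unaff1": 0, "unaff2": 0},
    "var_C": {"aff1": 1, "aff2": 1, "aff3": 1, "unaff1": 1, "unaff2": 0},
    "var_D": {"aff1": 1, "aff2": 1, "aff3": 1, "unaff1": 0, "unaff2": 0},
}
candidates = segregating_variants(
    genotypes,
    affected={"aff1", "aff2", "aff3"},
    unaffected={"unaff1", "unaff2"},
    known_variants={"var_D"},
)
print(candidates)  # ['var_A']
```

In a real pedigree you would also want to allow for incomplete penetrance and genotyping errors, which a strict all-or-nothing filter like this one does not.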

The gene was highly expressed in the retina, which is consistent with many known RP genes. The final piece of evidence came from laborious screening of HK1’s exons in hundreds of families from the Daiger cohort. That turned up a second family with the exact same disease-causing mutation. Our publication of HK1 last year established it as a new disease gene for dominant RP and suggests a new pathway (glycolysis) that may be involved in retinal disease.

Growth of Known Retinal Disease Genes

Here’s the latest content of RetNet, with numbers compared to the last release (end of 2014).

  • 278 total retinal-disease genes have been mapped (up from 261).
  • 238 have been identified at a DNA level (up from 221).

At least 25% of RetNet genes are associated with complex developmental and/or cerebellar diseases that include incidental retinal findings.  One reason for their inclusion is that many also have mutations with ocular findings only.  However, panel screening of these genes is likely to detect mutations with severe non-ocular consequences.

The First Noncoding-RNA Retinal Disease Gene

This release of RetNet includes the first entry of a non-coding RNA gene associated with retinal disease. Conte et al applied linkage mapping and exome sequencing to a five-generation British family with dominant retinal degeneration and bilateral iris coloboma (“holes in the iris”). They identified a variant in the seed region of MIR204, a micro-RNA gene at chr9q21.12, which segregated with disease.

Subsequent experimental work demonstrated that MIR204 plays a role in ocular development and that the variant allele severely altered its targeting abilities.  Very cool stuff.

The Award for Disease Diversity Goes to…

The PRPH2 gene, which encodes peripherin (a protein in rod photoreceptor outer segments), was cloned in 1990. Over the last 25 years, mutations in that gene have been linked to:

  • Dominant retinitis pigmentosa (accounts for 5% of cases)
  • Dominant macular dystrophy
  • Dominant cone-rod dystrophy
  • Dominant central areolar choroidal dystrophy
  • Dominant adult vitelliform macular dystrophy
  • Recessive Leber congenital amaurosis

It’s also been linked to a super-rare digenic form of retinal disease: heterozygous mutations in PRPH2 and another gene (ROM1) in the same individual can cause retinitis pigmentosa.

Other RetNet Highlights

Here are some of the other recent findings that made the latest release of RetNet, all of which highlight the complexity of retinal disease genetics.

Syndromic Retinal Disease

Often, retinal disease manifests as one of several symptoms in a rare genetic syndrome. For example:

  • HGSNAT (8p11.21).  Recessive HGSNAT mutations cause non-syndromic RP but other mutations cause Sanfilippo syndrome, a mucopolysaccharidosis with central nervous system degeneration and retinal dystrophy.  The protein, lysosomal N-acetyltransferase, acetylates heparin and heparan sulfate. 
  • IFT172 (2p33.3).  Recessive mutations in IFT172 cause a range of disorders including non-syndromic RP, and Bardet-Biedl, Jeune or Mainzer-Saladino syndromes.  The protein is involved in intraflagellar transport and, as with the other IFT proteins, is a cause of variable ciliopathies.
  • LAMA1 (18p11.31-p11.23).  Mutations in LAMA1 cause recessive Poretti-Boltshauser syndrome with variable developmental abnormalities of the brain and retina.  The protein is a laminin, one of a family of proteins with critical roles in embryogenesis.
  • NR2F1 (5q15).  Mutations in NR2F1 cause dominant optic atrophy with intellectual disability and developmental delay, also known as Bosch-Boonstra optic atrophy.  The protein is a nuclear receptor involved in optic nerve and cerebellar development. 
  • PNPLA6 (19p13.2).  Recessive mutations in PNPLA6 cause variable disorders, such as Boucher-Neuhauser, Oliver-McFarlane or Gordon Holmes syndromes, involving spinocerebellar ataxia, hypogonadism and chorioretinal dystrophy.  The protein is involved in phosphatidylcholine metabolism. 

 

Genes Linked to Retinal Disease

Other genes updated in this release of RetNet are linked primarily to retinal disease, rather than a constellation of symptoms. For example:

  • DHX38 (16q22.2).  A homozygous missense mutation in DHX38 causes recessive RP and macular coloboma in a consanguineous family.  The protein is a pre-mRNA splicing helicase.
  • DRAM2 (1p13.3).  DRAM2 mutations in several families cause recessive, adult-onset retinal dystrophy with early macular involvement.  DRAM2 codes for a transmembrane protein that initiates autophagy and has a role in photoreceptor disc recycling.
  • KIZ (20p11.23).  Mutations in KIZ cause recessive rod-cone dystrophy and may account for 1% of recessive RP patients in some populations.  The protein is centrosome-associated, as are other ciliopathy proteins.
  • RDH11 (14q24.1).  RDH11 mutations cause recessive RP with developmental abnormalities in an Italian-American family.  The protein plays a role in oxidizing 11-cis-retinol to 11-cis-retinal in the visual cycle. 
  • TTLL5 (14q24.3).  Mutations in TTLL5 cause recessive cone and cone-rod dystrophies.  The protein is a tubulin glutamylase found in photoreceptor cilia and sperm flagella. 

The diversity of phenotypes, pathways, and gene functions associated with retinal disease continues to astonish me. As usual, we’ve made remarkable progress but there’s more work to do.

6 Realities of Genomic Research

The rise of next-generation sequencing has worked wonders for the field of genetics and genomics. It’s also generated a considerable amount of hype about the power of genome sequencing, particularly the possibility of individualized medicine based on genetic information. The rapid advances in technology — most recently, the Illumina X Ten system — have made heretofore impossible large-scale whole-genome sequencing studies feasible. I’ve already written about some of the possible applications of inexpensive genome sequencing.

I’m as excited about this as anyone (with the possible exception of Illumina). Even so, we should keep in mind that not everything is unicorns and rainbows when it comes to genomic research. Here are some observations I’ve made about sequencing-empowered genomic research over the past few years.

1. There is never enough power

“Power” is a term that’s being discussed more and more as we plan large-scale sequencing studies of common disease. In essence, it answers the question, “What fraction of the associated variants can we detect with this study design, given the number of samples, inheritance pattern, penetrance, etc.?” Several years ago, when ambitious genome-wide association studies (GWAS) became feasible, there was a hope that much of the heritability of common disease could be attributed to common variants with minor allele frequencies of, say, 5% or more.

If that were true, it would have been very good news, because:

  1. We could test such variants in large cohorts using high-density SNP arrays, which are inexpensive.
  2. Our power to detect associations would be high because many samples in each cohort would carry the variants.
  3. Associated common variants would “explain” susceptibility in more individuals, narrowing the scope of follow-up.

GWAS efforts have revealed thousands of replicated genetic associations. However, it’s clear that a significant proportion of common disease risk comes from rare variants, which might be specific to an individual, family, or population. To achieve power to detect association for these rare variants, you need massive sample sizes (10,000 or more). You also need to use sequencing, since many of these will not be on SNP arrays (even exome-chip); some might have never been seen before.

Despite the falling costs of sequencing, cohorts of that size require a considerable investment.
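To get a feel for why rare variants demand such large cohorts, here is a back-of-the-envelope sketch. It is not a formal power calculation: it only counts how many carriers you would expect to see, assuming Hardy-Weinberg proportions. The function names and the example allele frequencies (5% and 0.1%) are my own choices for illustration.

```python
def carrier_probability(maf: float) -> float:
    """Probability that one individual carries at least one copy of an allele
    with frequency maf, assuming Hardy-Weinberg proportions."""
    return 1.0 - (1.0 - maf) ** 2


def expected_carriers(n_samples: int, maf: float) -> float:
    """Expected number of carriers in a cohort of n_samples individuals."""
    return n_samples * carrier_probability(maf)


# Compare a common variant (5%) with a rare one (0.1%) at two cohort sizes.
for maf in (0.05, 0.001):
    for n_samples in (1_000, 10_000):
        carriers = expected_carriers(n_samples, maf)
        print(f"MAF {maf:.3f}, n = {n_samples:>6,}: ~{carriers:.1f} carriers")
```

Under these assumptions, a 5% variant is carried by roughly 975 people in a cohort of 10,000, while a 0.1% variant is carried by about 20. With only 1,000 samples, you would expect about two carriers of the rare variant, which is not much to build an association on.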

2. There will be errors, both human and technical

If the power calculations call for sequencing 10,000 samples, you’d better pad that number in the production queue. Some samples will fail due to technical reasons, such as library failure or contamination. Others may fall victim to human or machine errors. We can address some failures (such as a sample swap) with computational approaches, but others will mean that a sample gets excluded.

The challenge of a 10,000-sample study is that, even with very low error/failure rates, the number of samples that must ultimately be excluded from the study might be a little shocking.
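The padding arithmetic is simple enough to sketch. The failure rates below are hypothetical placeholders, not measured rates from any production pipeline, and the function names are mine.

```python
import math


def samples_to_queue(target: int, failure_rate: float) -> int:
    """Smallest number of samples to queue so that the expected number of
    passing samples is at least target, given an overall failure rate."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return math.ceil(target / (1.0 - failure_rate))


def expected_exclusions(queued: int, failure_rate: float) -> float:
    """Expected number of queued samples that end up excluded."""
    return queued * failure_rate


# Hypothetical overall failure rates of 1%, 2% and 5%.
for rate in (0.01, 0.02, 0.05):
    queued = samples_to_queue(10_000, rate)
    lost = expected_exclusions(queued, rate)
    print(f"failure rate {rate:.0%}: queue {queued:,} samples, expect ~{lost:.0f} excluded")
```

Even at a 2% failure rate, which sounds low, you would need to queue 10,205 samples and expect to lose about 200 of them.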

3. Signal to noise problems increase

One of the greatest advantages of whole genome sequencing is that it’s an unbiased survey of genetic variation. It lets us search for associations without any underlying assumptions like “associated variants must be in coding regions.” One potential disadvantage is that we’ll be looking at 3-4 million sequence variants in every genome.

Classic GWAS approaches rely on SNP arrays, which interrogate (on average) 700,000 to 1 million carefully selected, validated, assayable markers. The call rates on those platforms are usually >99%. Now we’re talking about genome-wide sequencing and variant detection. It means we’ll most likely be able to detect variants that contribute to disease risk, but we’ll also have to examine millions of variants that have no effect on it.

 

In contrast, a candidate gene study or even exome sequencing has the benefit of pre-selecting regions most likely to harbor functional variants. Not only are there fewer variants, but all things being equal they’re more likely to be relevant because they affect proteins.
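One way to put a number on the signal-to-noise problem is the multiple-testing burden. The sketch below uses a plain Bonferroni correction, which is my choice of illustration rather than a recommendation for how any particular study should set its threshold. The variant counts come from the figures above (about 1 million array markers, about 4 million variants per genome); across a whole cohort, the number of distinct variants tested would be larger still.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate
    at alpha across n_tests independent tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Number of tests taken from the figures quoted in the text.
scenarios = {
    "SNP array (~1 million markers)": 1_000_000,
    "Whole genome (~4 million variants)": 4_000_000,
}
for label, n_tests in scenarios.items():
    print(f"{label}: p < {bonferroni_threshold(n_tests):.2e}")
```

The array threshold works out to 5e-8, and quadrupling the number of tests pushes it down to 1.25e-8. Every extra variant examined raises the bar for the ones that really matter.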

4.  We can’t predict all variant consequences

Annotation tools such as VEP and ANNOVAR have come a long way towards helping us identify computationally which variants are most likely to be deleterious. However, their annotations are based on our knowledge of the genome and its functional elements (which remains incomplete) and our best guess as to which variations cause which effects.

Outside of the coding regions, we face an even greater challenge. That’s where most human genetic variation resides, including the substantial fraction expected to play regulatory roles in the genome. Thus, understanding the mechanism by which associated variants affect disease risk will be a long and difficult prospect. It will likely cost more time and resources than finding those variants in the first place.

5. There’s always a better informatics tool

The incredible power of next-gen sequencing required a new generation of analysis tools simply to handle the new nature and vast scale of data. We’ve done well to address many of the challenges, but developing these tools takes time. Keeping them relevant is a particular struggle, because sequencing technologies continue to rapidly evolve.

I remember a meeting a few years ago when we were working on Illumina short-read sequencing (36 bp reads, possibly even single-end) and wondering if we could find a way to build 100 bp contigs. I remember thinking, if we can get to 100 bp, we’ll be home free.

The current read length on Illumina X Ten is 150 bp. The MiSeq platform (while admittedly not for whole-genome sequencing) does 250 bp. And now that still seems far too short, especially to identify structural variation and interrogate the complex regions of the human genome (repeats, HLA, etc.).

6. You can spin a story about any gene

The huge investments and advances in the field of genetics over the past 50+ years have helped us build an incredible wealth of knowledge about genes and their relationships to human health. Granted, a large number of genes have no known function. Even so, with known disease associations, expression patterns, sequence similarities, pathway membership, and other sources of data, we have a lot to work with when it comes time to explain how a gene might be involved with a certain disease.

There’s a danger in that, because it gives us enough information to spin a story about any gene, that is, to build a plausible explanation of how variation in that gene could be involved in the phenotype of interest. Given that fact, we have to admit that databases and the literature may contain false reports. For example, a recent examination by the ClinGen consortium found that hundreds of variants listed as pathogenic in the OMIM database are now being annotated as benign or of uncertain significance by clinical laboratories.

With great power comes great responsibility, and at this moment in genomics there is no greater power than large scale whole genome sequencing.

References
Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, Ledbetter DH, Maglott DR, Martin CL, Nussbaum RL, Plon SE, Ramos EM, Sherry ST, Watson MS, & ClinGen (2015). ClinGen–the Clinical Genome Resource. The New England journal of medicine, 372 (23), 2235-42 PMID: 26014595

Clinical Sequencing Data Sharing Is Essential

The past few decades have seen rapid advances in our knowledge of genetic diseases, which affect an estimated 25 million Americans. These advances can be quantified in things like the growth of dbSNP (which now contains about 90 million validated genetic variants) and the number of Mendelian disorders understood at the genetic level (over 5,000).

Some of the factors that have contributed to this progress include:

  • Big science. Ambitious, grant-supported, international efforts like the Human Genome Project, the HapMap Project, and the Cancer Genome Atlas yielded the public resources that form the foundation of modern human genetics research. Thank you, taxpayers.
  • Technology development. Revolutionary advances in genome interrogation technologies (high density SNP arrays, whole-genome sequencing, etc.) have made large-scale genetic studies feasible, both technically and financially.
  • Study participants. It’s important to remember that most (if not all) human genetics studies could not have happened without the patients and families who volunteered their samples, often with the knowledge that they’d get nothing in return.

The Unsolved Problem of Inherited Disease

Few areas have benefited as much from these advances as the study of rare genetic diseases. Exome sequencing has enabled the rapid genetic diagnosis of many patients, and the discovery of hundreds of new Mendelian disease genes. Yet even well-powered Mendelian disease studies can fail for a variety of reasons. There’s also a considerable gray area between success and failure: the implication of an unknown gene, or one that has never been associated with disease.

One particular challenge is that Mendelian diseases are rare by definition, and the variants definitively shown to cause them are rarer still. As a result, many variants detected in clinical sequencing projects end up with the label variant of unknown significance, or VUS. Even when given a classification, some variants are interpreted differently by different clinical laboratories.

As discussed in a report in the New England Journal of Medicine this week, another thing that has hampered our ability to discover and annotate clinically relevant genetic variation is the “silo effect,” in which research groups (both commercial and academic) maintain private databases of clinical sequencing results. A great example of this is Myriad Genetics, a company that’s probably sitting on the largest database of BRCA1/2 mutations in the world.

The problem, of course, is that not all of the clinical datasets for a given disease or gene end up in the same silo. Thus, researchers in group A might have a promising new disease gene that researchers in group B have also identified in a different family. If those datasets were shared, rather than kept isolated, these groups could cross-validate with one another and the research community as a whole would benefit.

Data Sharing in ClinVar

The NIH’s Clinical Genome Resource program (ClinGen) hopes to address some of these issues by developing community resources to improve our understanding of genomic variation and its use in clinical care. The cornerstone of this effort is ClinVar, a database of variants annotated with clinical data.

ClinVar Contributors

Over 300 different submitters have contributed to ClinVar thus far. Those submitters comprise research groups, clinical laboratories, locus-specific databases, and aggregate databases (like OMIM). Here’s a plot of the variants submitted for some of the major (or interesting) contributors:

Selected ClinVar Submitters (adapted from Table 2, Rehm et al, NEJM 2015)

The largest submitter by far is OMIM, which has contributed over 25,000 variants to ClinVar. It’s encouraging to see two of the leading genetic testing providers (GeneDx and Ambry Genetics) making substantial contributions. Among academic centers, the University of Chicago and Emory University are the clear leaders.

As of May 2015, ClinVar contained 172,055 variant submissions across 22,864 genes. More than 118,000 unique variants have clinical annotations, though 21% of those are “variant of unknown significance.” Nevertheless, this rapidly-growing resource illustrates the power of sharing clinical variant annotations in a centralized manner.

Discordant Clinical Annotations

Notably, 12,895 variants have clinical annotations (pathogenic, unknown, or benign) from at least two different laboratories and 17% of the time, those annotations did not agree. For example, at least 220 of the “pathogenic” variants pulled in from OMIM (the largest contributing database) are classified by clinical laboratories as either benign or unknown significance.
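For the curious, here is a minimal sketch of how that kind of discordance could be tallied from a list of submissions. It is not ClinVar’s actual method or data model: the record layout, the three-way classification labels, and the laboratory and variant names are all simplified assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (variant ID, laboratory, classification)
Submission = Tuple[str, str, str]


def discordance_counts(submissions: Iterable[Submission]) -> Tuple[int, int]:
    """Return (variants classified by at least two laboratories,
    how many of those have more than one distinct classification)."""
    labs_by_variant: Dict[str, Set[str]] = defaultdict(set)
    calls_by_variant: Dict[str, Set[str]] = defaultdict(set)
    for variant, lab, classification in submissions:
        labs_by_variant[variant].add(lab)
        calls_by_variant[variant].add(classification)

    multi_lab = [v for v, labs in labs_by_variant.items() if len(labs) >= 2]
    discordant = [v for v in multi_lab if len(calls_by_variant[v]) > 1]
    return len(multi_lab), len(discordant)


# Toy data with made-up variant and laboratory names.
submissions = [
    ("v1", "labA", "pathogenic"), ("v1", "labB", "pathogenic"),  # concordant
    ("v2", "labA", "pathogenic"), ("v2", "labB", "benign"),      # discordant
    ("v3", "labA", "unknown"),                                   # single lab
    ("v4", "labB", "benign"), ("v4", "labC", "benign"),
    ("v4", "labA", "unknown"),                                   # discordant
]
n_multi, n_discordant = discordance_counts(submissions)
print(n_multi, n_discordant)  # 3 2
```

Applied to the figures above, 17% of 12,895 works out to roughly 2,200 variants on which at least two laboratories disagree.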

It is clear that the guidelines for variant interpretation differ between laboratories, and need to be standardized. Even so, adopting standards and making the effort to share clinical variant findings and annotations (along with the relevant phenotype data) is critical to the success of rare disease research. ClinVar seems to be taking us in the right direction.

References

Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, Ledbetter DH, Maglott DR, Martin CL, Nussbaum RL, Plon SE, Ramos EM, Sherry ST, Watson MS, & ClinGen (2015). ClinGen–the Clinical Genome Resource. The New England journal of medicine, 372 (23), 2235-42 PMID: 26014595

Mary-Claire King on Inherited Breast/Ovarian Cancer

It is a rare but delightful opportunity to learn about something from an acknowledged world expert. Such was the case last month when I heard Mary-Claire King give the Stanley J. Korsmeyer Memorial lecture, hands-down one of the best talks I’ve ever heard. She was a wonderful public speaker: funny, charming, and straight-shooting.

Her topic, of course, was inherited breast and ovarian cancer. If you don’t know the story already, Dr. King wrote a wonderful perspective in Science about her role in the discovery of the BRCA1 gene and the race to clone it in the early 1990s. Fascinatingly, she walked us through some of the pedigrees from early-onset breast cancer families described in the 1990 linkage study by her group.

The women in those families got breast cancer very young (20s or 30s) and usually died from it. Male obligate carriers were generally unaffected. Even for a highly penetrant mutation like BRCA1, there were exceptions, like the carrier who lived to 81 without ever getting cancer.

Of the seven early-onset breast cancer families, six harbored mutations in BRCA1 and one had a mutation in BRCA2. The 1990 paper was the culmination of 17 years of work and mapped the BRCA1 locus to chromosome 17.

Mapping the BRCA1 Region (Hall et al, Science 1990)

The existence of a gene for predisposition to breast cancer triggered enormous interest in big labs in government, universities, and the private sector. It was the birth of cancer genetics.

BRCA1, DNA Repair, and Chemotherapy

At the time of its discovery, we knew nothing about the function of the BRCA1 gene. Subsequent genetics studies would reveal that it works as a tumor suppressor in a two-hit model of inherited cancer: the disease develops only after carriers of one loss-of-function mutation (generally a nonsense change or frameshift indel) lose the other copy to somatic mutation in a vulnerable cell type.

Normally, BRCA1 forms a heterodimer with BARD1, which stabilizes the BRCA1/BARD1/Fanconi complex. That complex repairs double-stranded DNA breaks via the homologous repair pathway. Mutations in several DNA repair genes — TP53, PALB2, CHEK2, BARD1, BRIP1, ATM, RAD51C, and RAD51D — are also known to predispose to breast and ovarian cancer.

Although BRCA1/2 carriers suffer a significantly higher risk of breast and ovarian cancer, they also tend to respond better to chemotherapy. This is not terribly surprising, because the loss of homologous DNA repair capability diminishes the ability of cancer cells to recover from DNA damage. Yet there’s also a different mechanism for DNA repair, non-homologous end joining (NHEJ), that does not involve BRCA1/2.

The bad news is that this may enable tumor cells to resist chemotherapy. The good news is that we have a class of drugs, PARP inhibitors, that block the NHEJ pathway. The first clinical trial of PARP inhibitors in BRCA1/2 null cancer patients “crashed,” according to Dr. King, because the compound being used didn’t actually inhibit PARP. New clinical trials are under way. Hopefully, they’ll demonstrate that PARP inhibitors make BRCA1/2 null patients more responsive to chemotherapy, which will make genetic testing even more critical.

Genetics and Epidemiology of Familial Breast Cancer

The epidemiology of breast cancer is fairly well known. By rough approximation, 1 in 8 women will get breast cancer at some point in their lifetimes, and 10-20% of patients will turn out to carry an inherited mutation in a known predisposition gene. Like many cancers, risk of breast/ovarian cancer is highly age-dependent. BRCA1/2 carriers not only have a higher lifetime risk of disease, but also have a considerably higher age-dependent risk; some might even be diagnosed with disease in their 20s or 30s.

There is also a widely accepted trend related to breast cancer incidence that’s been apparent for decades: more women are getting it, and seemingly at younger ages. Indeed, Dr. King showed some results from two large epidemiological studies of breast cancer showing that the incidence curves (incidence by age, classified by carrier/non-carrier status) are quite striking if you segregate the women into two groups: those born before 1940, and those born after 1940.

There are lots of theories for why this might be, including some I might call conspiracy theories (e.g. radiation exposure, or hormones in milk). Yet Dr. King offered an explanation that I find both simple and convincing. We know that certain factors increase a woman’s risk of breast cancer, such as the age of first menstruation (earlier = higher risk) and the age at which she has her first child (later = higher risk).

In 1950, a woman typically began menstruating at 15 and bore her first child at 21. Today, menstruation often begins sooner (say age 11, due to some complicated factors like better nutrition) and the first child often comes later (age 30, because women often pursue higher education and/or careers).

Nutrition and education/independence, of course, are good things. However, the side effect is that the window of time between menstruation and first child went from ~6 years in 1950 to ~19 years today. And during that window, a woman’s breast tissues are bathed in estrogen. It makes for some super-healthy cells that don’t die easily, even if they suffer mutations. That longer window simply increases the odds that a second “hit” will occur in the gene for which a woman already carries a loss-of-function mutation.

In support of this idea, if researchers adjust for the length of that time window, the year-of-birth effect totally goes away. I think that’s some fascinating stuff.

Genetic Structure of BRCA1/2

Interestingly, although the two most famous breast cancer susceptibility genes (BRCA1 and BRCA2) share no sequence similarity, they have a similar (and distinctive) genomic structure: many small exons and a large central exon. The central exon encodes a big portion of the protein and is surprisingly robust to amino acid substitutions, which is why most missense mutations in BRCA1 and BRCA2 are non-pathogenic.

BRCA1 and BRCA2 (Fackenthal & Olopade, Nat. Rev. Cancer, 2007)

Yet because these genes are so large, mutation databases have catalogued thousands of individual rare mutations that look deleterious. This is why a genotyping-based genetic test, like the one that was a cash cow for Myriad Genetics until recently, was never going to work in the long term. Now, with targeted sequencing, we have the capability to detect all types of mutation (substitutions, indels, even large SVs) affecting BRCA1/2 and other susceptibility genes.

From Gene Discovery to Population Screening

As the cost of sequencing-based genetic testing continues to drop, we’re in a position to screen the entire female population for cancer susceptibility genes. The World Health Organization has offered guidelines for when genetic testing should be performed. In essence, four criteria must be met.

  1. The disease must be an important health problem.
  2. Risk of disease for patients testing “positive” should be high.
  3. The mutations responsible for conferring risk must be identifiable.
  4. Effective interventions must exist.

Dr. King makes a pretty compelling argument that familial breast/ovarian cancer meets these requirements. #1 and #2 are well-established. #3 is true if you know your stuff: for a while, companies like Myriad leaned heavily on the “Variant of Unknown Significance” classification when they encountered a new variant, to the point that 88% of results were reported as such. Yet an expert team, like the one at UW, can classify all but a small fraction (<2%) of variants as either pathogenic or non-pathogenic. The PARP inhibitor clinical trials should give us the answer for #4.

There are, of course, other considerations, like the cost of testing, the burden of genetic counseling, the age at which testing should be performed (Dr. King suggests 30), etc. Yet these are hurdles that can be overcome. Hurdles that must be overcome, if we’re to use our growing knowledge of disease genetics to improve the state of human health.

 

References
Hall JM, Lee MK, Newman B, Morrow JE, Anderson LA, Huey B, & King MC (1990). Linkage of early-onset familial breast cancer to chromosome 17q21. Science (New York, N.Y.), 250 (4988), 1684-9 PMID: 2270482
King MC (2014). “The race” to clone BRCA1. Science (New York, N.Y.), 343 (6178), 1462-5 PMID: 24675952