New Insights into Long Noncoding RNAs

March 3, 2017 by Dan Koboldt

As long-time readers of MassGenomics probably know, I’m fascinated by studies that interrogate functional elements of the human genome. The ENCODE Project is perhaps the most visible consortium effort in the United States, employing a variety of high-throughput genomic technologies such as RNA sequencing (expression), DNase I sequencing (open chromatin), and ChIP-seq (DNA-protein interactions).

However, other groups have made major contributions recently, notably the RIKEN-led FANTOM consortium. In the past few years, FANTOM researchers have applied Cap Analysis Gene Expression (CAGE), a genomic technology developed at RIKEN, to numerous mammalian cell and tissue types. CAGE isolates the 5′ end of long RNA molecules, providing high-resolution mapping of the transcription start sites and core promoters of genes. The current project iteration, FANTOM5, has mapped enhancers and transcription start sites in hundreds of primary human cell types.

This week in Nature, they’ve published another genomic annotation: an atlas of long non-coding RNAs with accurate 5′ ends. By integrating 1,829 CAGE profiles from the FANTOM project with transcript models from a variety of sources (GENCODE, Human BodyMap, ENCODE, and miTranscriptome), the authors constructed a “CAGE-associated transcriptome” assembly cleverly branded FANTOM CAT.

Atlas of Long Noncoding RNA Genes

The 27,919 long noncoding RNA genes in FANTOM CAT represent the most comprehensive catalogue of human lncRNAs so far. The authors next sought to classify lncRNA genes according to genomic and epigenomic context. I found the way this was presented in the paper to be very confusing, so I’ve broken it down differently here.

First, the authors assigned lncRNAs to an epigenomic class based on the overlap between their strongest transcription start site (TSS) with DNase I hypersensitive regulatory regions defined by the RoadMap Epigenomics Consortium:

Adapted from Hon et al, Nature 2017.

Dyadic here means a DHS that looks like both a promoter and an enhancer. So that’s the epigenomic context. The authors also assigned lncRNAs to a genomic context based on their location and transcription relative to messenger RNAs:

Adapted from Hon et al, Nature 2017

Divergent here means that a lncRNA seems to use the same TSS as the mRNA, but is transcribed in the other direction. Using these epigenomic and genomic classifications, the authors defined some lncRNA categories for deeper analysis:

Divergent p-lncRNAs (n=5,827) have promoter-like epigenomic signatures and use the same TSS as a known messenger RNA, but are transcribed in the other direction.
Intergenic p-lncRNAs (n=1,725) also have promoter-like epigenomic signatures but do NOT use the same TSS as a known messenger RNA.
e-lncRNAs (n=9,339) have enhancer-like epigenomic signatures (genomic context is ignored). These are the orange slice from the top pie chart.

There were also 10,543 lncRNAs that had dyadic or undefined epigenomic signatures, or that had a promoter epigenomic signature but were antisense or intronic. These “other” lncRNAs are essentially set aside; in most of their analyses, the authors compare/contrast the three lncRNA categories described above with traditional mRNAs.

Credit: Hon et al, Nature 2017
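
To make the category rules above concrete, here is a minimal sketch of how the two labels (epigenomic class and genomic context) combine into the analysis categories. The function name and the string labels are my own illustrative assumptions, not anything taken from the paper's code.

```python
from collections import Counter


def classify_lncrna(epigenomic_class: str, genomic_class: str) -> str:
    """Combine the two labels into one of the analysis categories.

    epigenomic_class: "promoter", "enhancer", "dyadic", or "undefined"
    genomic_class:    "divergent", "intergenic", "antisense", or "intronic"
    These string labels are hypothetical, chosen for illustration only.
    """
    if epigenomic_class == "enhancer":
        # Genomic context is ignored for enhancer-like lncRNAs.
        return "e-lncRNA"
    if epigenomic_class == "promoter":
        if genomic_class == "divergent":
            return "divergent p-lncRNA"
        if genomic_class == "intergenic":
            return "intergenic p-lncRNA"
    # Dyadic or undefined signatures, or promoter-like but antisense/intronic.
    return "other"


# A handful of made-up genes, just to exercise each branch.
examples = [
    ("promoter", "divergent"),
    ("promoter", "intergenic"),
    ("enhancer", "intronic"),
    ("promoter", "antisense"),
    ("dyadic", "intergenic"),
]
counts = Counter(classify_lncrna(epi, gen) for epi, gen in examples)
print(counts)
```

The order of the checks matters: the enhancer test comes first because that category ignores genomic context entirely.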

Conservation of lncRNAs

One way to assess the functional relevance of a genomic region (or set of regions) is to assess the extent of evolutionary conservation across species. In this study, the authors examined whether the transcription initiation region (TIR) and exonic sequences were conserved for each class of lncRNA gene, using mRNAs as a comparator. This type of analysis was motivated by something that I did not know: at some lncRNA loci, the mere act of transcription is functionally relevant, but the actual sequence of the transcript is not.

The image to the right (adapted from Figure 1c) shows the proportion of transcription initiation regions (Y-axis) and exonic regions (X-axis) that overlap GERP-predicted conserved sequences for mRNAs (red), divergent p-lncRNAs (purple), intergenic p-lncRNAs (blue), and enhancer-lncRNA (green). Classic mRNAs, as expected, show conservation at both the TIR and the exonic portions.

In general, exonic regions from all three lncRNA categories were less conserved by comparison. Divergent p-lncRNAs showed high conservation for the TIR (75%) and a reasonable amount for the exon (54%), but remember that their “exonic” portion, by definition, is immediately upstream of an mRNA core promoter. In intergenic regions, 42% of promoter-like lncRNAs and 36% of enhancer-like lncRNAs did not overlap conserved elements.

Interestingly, those non-conserved intergenic TIRs were significantly enriched for retrotransposons, suggesting that retrotransposon activity contributes to the “birth” of new transcription activity in noncoding regions.

lncRNA Expression Patterns

Next, the authors examined the expression patterns of lncRNAs across a variety of primary cell types. The expression levels of each category were relatively consistent across cell types. Around 45% of divergent p-lncRNAs (purple) and 31% of intergenic p-lncRNAs (blue) were expressed in each primary cell type, but the proportion of enhancer-like lncRNAs expressed (12%, green) was much lower:

Credit: Hon et al, Nature 2017

Enhancer-lncRNAs were also much more likely to be cell-type-specific, which is consistent with previous studies of enhancer activity. On average, 5,666 lncRNA genes were expressed in each cell type, though the range was fairly wide (3,000-10,000). I notice that many of the higher-activity cell types in the bottom panel are immune cells (basophils, NK cells, etc), which makes sense.

Association with Human Traits

The authors cross-referenced lncRNAs from FANTOM CAT with established GWAS loci, finding that 40.7% of lncRNA genes were associated with at least one trait. Unsupervised clustering of cell-type-specific lncRNAs and trait associations showed that related cell types and traits tend to clump together in biologically plausible ways: for example, lncRNAs enriched in nervous system tissues tended to be associated with neuropathy and behavior traits, and the odds ratios of lncRNA genes were comparable to those of mRNA genes.

The authors identified groups of mRNAs and lncRNAs that are active in the same cell types and associated with the same traits, i.e. “significant cell type-trait pairs.” Some 5,490 FANTOM CAT genes were involved in such pairs:

Credit: Hon et al 2017

Most of those were protein-coding genes (mRNAs), but a significant proportion, around 36%, were lncRNA genes. Before we react too much to the distribution among gene categories, let’s remember that the relative sizes differ. If I compare the number of genes associated with cell-trait pairs to the total number in FANTOM CAT, here’s what it looks like:

Based on Hon et al, Nature 2017

As we might expect, protein coding genes had the highest proportion of associations (21.7%), and they were about twice as likely as divergent/intergenic p-lncRNAs and thrice as likely as enhancer-like lncRNAs to be involved in a cell type-trait association. Even so, given the massive disparity in research emphasis, I find it compelling that considerable numbers of lncRNAs are implicated in human traits.

It suggests that all of those noncoding genetic associations are not random, but indicative of the intricate regulatory genetic networks underlying complex human traits. Furthermore, it highlights the importance of divergent p-lncRNAs, which utilize the same TSS as protein-coding mRNAs but are divergently transcribed. Given their close proximity and the tendency of investigators to assign a GWAS hit to the nearest protein-coding gene, I wonder how often a genetic signal from a p-lncRNA is erroneously assigned to the mRNA instead.

Whole Genome Sequence Analysis of Complex Traits

January 27, 2017 by Dan Koboldt

As you probably know, I’m a fan of exome sequencing, particularly for studies of rare inherited disorders. While not a perfect (or comprehensive) assay, exome sequencing offers an efficient screen of the regions most likely to harbor disease-causing mutations. Ironically, another reason people like exome sequencing is its limited scope: it essentially doesn’t interrogate the regions (noncoding) and variant types (SVs) that are more difficult to interpret.

Well, the party’s over.

Several large scale whole-genome sequencing studies of human disease are hurtling forward in the U.S., the U.K., and other countries. We’ve put it off as long as we can, but now we’re faced with the daunting task of identifying and interpreting biologically-relevant variants outside of protein-coding exons.

There’s good reason to do so, by the way, especially for studies of common disease in which regulatory activity is likely to play an important role. Many, if not most, of the genomic loci associated with human traits lie in noncoding regions. According to datasets generated by projects like ENCODE and FANTOM5, noncoding regions also exhibit an astonishing amount of biochemical activity suggestive of diverse functions.

Sometimes with a daunting analysis task, it’s hard to know where to start. Fortunately, there’s a nice paper in the upcoming issue of AJHG that provides some practical guidance. Alanna C. Morrison et al present a series of integrated steps for whole-genome analysis and apply them to study 10 heart- and blood-related traits in 1,860 African Americans.

The authors perform aggregation tests (sometimes called burden tests) to test sets of rare variants for association. Such tests require that one define a unit by which to group variants. With exome sequencing, this was commonly done on a per-gene or per-exon basis. In this study, with WGS available, the authors aggregated variants:

In a sliding 4-kb window across the entire genome
In the first introns of protein-coding genes, which are known to harbor regulatory elements
In pre-defined regulatory domain motifs (promoters, enhancers, and UTRs) near genes.
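
The first grouping strategy is easy to sketch in code. This is a minimal illustration of binning variant positions into sliding 4-kb windows, not the authors' pipeline. The 2-kb step size, the function names, and the toy positions are all assumptions on my part; the post only states the window size.

```python
from bisect import bisect_left

WINDOW_SIZE = 4000  # 4-kb window, as described above
STEP_SIZE = 2000    # ASSUMPTION: the step between windows is not stated here


def sliding_windows(chrom_length, window=WINDOW_SIZE, step=STEP_SIZE):
    """Yield half-open (start, end) intervals that tile [0, chrom_length)."""
    start = 0
    while start < chrom_length:
        yield start, min(start + window, chrom_length)
        start += step


def group_variants(positions, chrom_length, window=WINDOW_SIZE, step=STEP_SIZE):
    """Map each non-empty window to the sorted variant positions inside it."""
    ordered = sorted(positions)
    groups = {}
    for start, end in sliding_windows(chrom_length, window, step):
        lo = bisect_left(ordered, start)  # first position >= start
        hi = bisect_left(ordered, end)    # first position >= end
        if lo < hi:
            groups[(start, end)] = ordered[lo:hi]
    return groups


# Toy example: four rare variants on a hypothetical 10-kb contig.
groups = group_variants([500, 2500, 4000, 9000], chrom_length=10_000)
for window, variants in groups.items():
    print(window, variants)
```

Because the windows overlap, a single variant can land in more than one group. Each group would then be handed to an aggregation test as one unit.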

This is, in modern-day terms, a modestly powered study of complex quantitative traits. Even so, the authors found several significant associations, some near known loci for these traits, and others in potentially novel regions. The results of their complementary methods are illustrated in Figure 3:

Morrison et al, Am. J Hum Genet, 2017

You’re looking at the association of Lp(a) levels in the well-known LPA gene locus. The sliding window approaches nicely captured association over the LPA gene region, but also in other nearby regions, some of which were also supported by the first-intron or regulatory-motif analyses. It’s a complicated picture, to be sure, but it suggests some specific areas in which noncoding variants are exerting a cis-regulatory effect on the LPA gene. Very cool stuff.

What’s especially nice about this paper is that it provides relatively straightforward methodology for tackling the daunting analysis task I talked about above:

Obtain WGS data for a well-phenotyped cohort
Define some common-sense strategies for aggregating (grouping) rare variants
Apply aggregation tests on a genome-wide basis
Replicate significant findings in an independent cohort (in this case, about 2,000 European-Americans)

This study demonstrated both the feasibility and the justification for interrogating noncoding regions for association with medically important traits. Imagine how much we’ll be able to discover as we get our hands on massive WGS cohorts, and extend these principles to other regions and regulatory motifs of the genome.

Exome or Whole-genome Sequencing for Mendelian Disorders

January 12, 2017 by Dan Koboldt

Exome sequencing has undeniably transformed the study of rare inherited disorders, enabling the rapid identification of hundreds of new disease genes in the past few years and spurring the adoption of clinical exome sequencing as a frontline diagnostic tool. That’s great news. Hooray for the exome!

Is it a fantastic discovery tool? Absolutely. But it’s not a magic bullet.

The less-publicized outcome of widespread exome sequencing is that the “hit rate” (the proportion of sequenced cases for which a likely genetic cause is found) has largely remained the same. For most studies, it’s in the neighborhood of 40-60%. Higher success rates have been reported, but these usually involve cherry-picking cases or the inclusion of patients who’d not undergone any prior molecular testing.

The bottom line is that a significant fraction of rare disease cases fail to achieve a genetic diagnosis by exome sequencing. When this happens, it’s tempting to consider whole-genome sequencing as the logical next step. Yet it’s hard to know how often that will help. A new study in the American Journal of Human Genetics has begun to answer that question.

Keren J. Carss et al performed exome sequencing, genome sequencing, or both on 722 patients with inherited retinal disease. IRDs offer a number of advantages for studies like this due to exceptional phenotypic, genetic, and allelic heterogeneity. There are more than 250 known genes associated with IRDs, and they can be inherited in every possible mode. Dominant, recessive, X-linked, and even mitochondrial inheritance have been documented.

Cohort Phenotype Composition

The 722 cases described here were recruited under the NIHR BioResource Rare Diseases research study. That’s in the United Kingdom, by the way. So are most of the study’s authors, a fact made plain by the UK epidemiology figures and the reference to Genomics England as an example of a large-scale genome sequencing initiative in the Introduction. The phenotype composition generally reflects the prevalence of inherited retinal diseases:

311 had retinitis pigmentosa (RP), characterized by night blindness and progressive rod photoreceptor loss.
101 had “retinal dystrophy”, a broader term that could mean RP or other retinal degenerations.
53 had cone-rod dystrophy, which affects cone photoreceptors, causing loss of color and perceptive vision.
45 had Stargardt disease, the most common form of inherited juvenile macular degeneration.
37 had macular dystrophy, a broader term for diseases affecting the macula, the central portion of the retina.
37 had Usher syndrome, a condition characterized by vision loss (RP) and hearing loss.

Genome and Exome Sequencing

Too often, I see a paper with “Whole-genome sequencing” in the title in which a handful of samples actually obtained WGS, and the rest got targeted sequencing. Although I understand the economics of such a design, it feels like a bait-and-switch. This study did not disappoint: 650 of 722 cases underwent WGS, with the remaining 72 getting exome only. The average depth for WGS was 37x, which is standard, but the average exome depth (43x) is a little low. That’ll be relevant in a minute.

The bioinformatic analysis and interpretation strategies look solid. The authors searched for high-quality rare coding variants in a curated set of 224 retinal disease genes. Candidate causal variants were reviewed in IGV, and assessed in the context of databases (like HGMD), segregation, and how well the clinical phenotype matched the phenotype associated with the gene.

Pathogenic Variants Detected

They identified a likely causal variant in 404 individuals (56%). That’s slightly on the high end of realistic success rates, but as the authors admit, 152 individuals in the cohort had had no prior genetic testing (and 63% of them were solved). The hit rate also varied by phenotype. RP, the largest phenotype group, saw a hit rate of 54% which is right where we expect it to be.

The success rate varied widely for other phenotypes, ranging from 29% in cone-rod dystrophy to 84% in Usher syndrome (the latter is not terribly surprising, since variant interpretation is arguably the easiest for rare, recessive conditions).

Solve Rates Varied by Ancestry

A particularly intriguing observation was that diagnostic success varied by individual ancestry. The success rate was considerably lower for individuals of African ancestry (30%) compared to individuals of European (57%) or South Asian ancestry (53%). I admired how the authors remarked:

Higher genetic diversity in African populations, combined with underrepresentation of non-European populations in control datasets, result in an excess of rare and apparently rare variation in these individuals, rendering variant interpretation more challenging.

Another intriguing ancestry tidbit was that 66% of pathogenic variants in South Asian cases were homozygous, compared to 18% of pathogenic variants in European-ancestry cases. The authors argue that this is likely due to greater consanguinity in South Asian populations, which may also explain why their hit rate was comparable to that of European-ancestry individuals despite underrepresentation in control databases.

Exome and Genome Performance

Some 117 individuals underwent exome sequencing first, and in 59 of those (50%), a likely causal variant was uncovered in this first pass. Next, the authors selected 45 of the 58 exome-negative individuals for whole genome sequencing.

Reason Variant Was Missed

Of these, 14 cases, or 31%, achieved a genetic diagnosis after WGS. But take note of the reasons those variants were missed. Three of them had no probe (and thus no coverage) in the NimbleGen v3 exome kit, and three were large indels/deletions missed by exome sequencing. Another three were called in the exome but flagged as LQ (low quality), likely due to poor coverage or representation of both alleles. In these 9/45 cases (20%), WGS did succeed where exome failed.

Yet the remaining 5 variants were called in the exome, but not considered causal until WGS eliminated all other possibilities. Should these go in the win column for WGS? I’m not sure. If they aren’t, then the true discovery rate for WGS in exome-negative cases in this study is 20%.
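
For anyone who wants to check my arithmetic, here is a small sketch that recomputes the rates quoted in this section from the counts given above. The variable names and dictionary keys are just my shorthand.

```python
def percent(part, whole):
    """Return part/whole as a percentage, rounded to the nearest whole number."""
    return round(100 * part / whole)


exome_first = 117       # individuals who had exome sequencing first
solved_by_exome = 59    # likely causal variant found in the first pass
exome_negative = exome_first - solved_by_exome
sent_to_wgs = 45        # exome-negative individuals selected for WGS

# Reasons the causal variant was missed by exome, with counts from the text.
missed_by_exome = {
    "no probe in exome kit": 3,
    "large indel/deletion": 3,
    "called but flagged low quality": 3,
    "called but not considered causal": 5,
}
solved_after_wgs = sum(missed_by_exome.values())
# Stricter count: drop variants the exome had already called cleanly.
strict_wgs_wins = solved_after_wgs - missed_by_exome["called but not considered causal"]

print("Exome hit rate:", percent(solved_by_exome, exome_first), "%")
print("Exome-negative cases:", exome_negative)
print("Solved after WGS:", percent(solved_after_wgs, sent_to_wgs), "%")
print("Strict WGS-only wins:", percent(strict_wgs_wins, sent_to_wgs), "%")
```

The gap between the last two figures is the whole question: whether the five variants that the exome had already called should count toward the WGS total.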

Whole-genome Advantages

Although the numbers are modest, whole-genome sequencing undoubtedly enabled the researchers to uncover more pathogenic variants in these cases. A wonderful example is offered in Figure 1:

Carss et al, AJHG 2017

In this case, a patient with recessive RP had one pathogenic variant (a missense change in EYS) detected by exome sequencing. However, it took whole-genome sequencing to identify the second disease allele, a heterozygous ~55 kb deletion spanning at least three other exons in the gene.

Pathogenic Noncoding Variants

The authors also leveraged our current knowledge of gene-phenotype relationships, and the comprehensive nature of WGS data, to identify three pathogenic noncoding variants. All of these were deep intronic variants that likely affect splicing, and were found in patients whose phenotypes corresponded to defects of a specific gene:

In 16 individuals with Stargardt disease, caused by recessive-acting variants in the ABCA4 gene, the authors identified a rare intronic variant. It was homozygous in two cases, compound-heterozygous with a coding variant in nine, and the only ABCA4 variant in the remaining 5 (which are classified as “partially solved” cases).
In a patient with Usher syndrome, the authors identified a known pathogenic noncoding variant (intronic) that causes the retention of a pseudo-exon.
In two unrelated males with choroideremia, an X-linked disorder caused by mutations in the CHM gene, the authors uncovered a novel deep intronic variant that creates a cryptic splice site, causing retention of a 224-bp cryptic exon.

For all three regulatory variants, the investigators knew where to look because the patient’s phenotype strongly pointed to a known gene. This is a clever strategy for beginning to tease out regulatory variation in Mendelian disorders, and may help open the door for even more discoveries.

In Summary

This was a well-written paper that showcased some of the advantages of whole-genome sequencing over exome sequencing for uncovering the genetic basis of rare diseases. I hope (and expect) we’ll see more studies like it as WGS becomes ever more practical to apply as a frontline diagnostic tool.

This Year in Genomics and Next-Gen Sequencing

December 30, 2016 by Dan Koboldt

It’s been an interesting year for the field of genomics as next-gen sequencing technologies continue to emerge, evolve, and in some cases, fade to obscurity. This year also brought some big career and life events for yours truly. Let’s look at some of the highlights.

Common Complex Disease Genomics

In January, I had the thrill of announcing that the McDonnell Genome Institute had won a 4-year, $60 million grant for large-scale sequencing in common complex disease. Helping write that grant (which we actually did in 2015) consumed several months of my life. The previous few funding rounds for large-scale sequencing had essentially been competitive renewals, which were no less onerous to write, but gave existing large-scale centers a home field advantage.

The CCDG program, in contrast, was a new initiative. It also marked a significant shift in the direction of NHGRI’s flagship genomics program to focus only on common complex diseases. Cancer genomics and model organism sequencing were not to be included (according to the RFA), and these happened to be two of our center’s biggest strengths. I honestly feel that we came in with a competitive disadvantage, so winning that $60 million grant was a big deal.

The Opportunity Cost of Factory-Scale Sequencing

Of course, it wasn’t all champagne and roses for us, or for the CCDG program as a whole. Although $15 million per year seems like a lot of money, it was a significant reduction in our operating budget from the large-scale program. That made things tight, which is common in publicly-funded science (and never fun). Also, 80% of the budget goes toward data production, which leaves very little for everything else, such as analysis.

This budget distribution added to a disconcerting trend in public genomics funding in which agencies use large-scale centers to produce data, but balk at providing the funds to analyze those data. As I’ve written before, the vital component known as bioinformatics analysis is not free. In fact, as the cost of sequencing continues to plummet, the cost of analysis and interpretation is rising.

I hope that in 2017 and beyond, more funding agencies acknowledge the importance of funding data analysis, not just data generation.

Complex Disease Logistics

The four grantees of the CCDG program had proposed a diversity of projects across a rather wide spectrum of complex diseases. Due to the large sample numbers required for statistical power and the limited budget, we couldn’t sequence all of them. Instead, representatives from the CCDG centers and program officers came together to select a few key projects.

Most of the studies that I helped design and propose ended up on the cutting room floor for various reasons (some sensible, others not). For example, age-related macular degeneration — which affects 10 million Americans and helped establish the value of rare variant association studies for common disease — was cut because it’s not fatal.

Then we were off and running for CCDG and another large-scale sequencing effort funded under the Gabriella Miller Kids First (GMKF) program. At least, we wanted to be. The voracious HiSeq X Ten sequencers waited with open maws for DNA samples (and dollars) that were slow to arrive. It turns out that the logistics of getting 20,000 samples from various bio-repositories around the world to a single center were as complex as the diseases we wanted to study.

We had plenty of other samples in-house and ready to go — samples we’d proposed in our CCDG application — but we weren’t “authorized” to sequence them, which was very frustrating. Worse, much of our cancer genomics funding disappeared: NCI took over the Cancer Genome Atlas project and quietly awarded 99% of the sequencing contracts to one institution.

The unfortunate result was wasted time, unused sequencing capacity, and science left on the table. The factory wasn’t running, morale was low, and an even bigger bombshell was about to drop.

The End of an Era

As CCDG finally ramped up in late spring, I learned that my center directors — Rick Wilson and Elaine Mardis — were leaving WashU to establish a new genomics institute at Nationwide Children’s Hospital and The Ohio State University in Columbus, Ohio. It was a wonderful opportunity for them (and a huge win for NCH/OSU) but a time of stress and uncertainty for the McDonnell Genome Institute.

I got a double dose of that, because our fearless leaders asked me to move with them.

It was a big ask. My wife and I both grew up in St. Louis (which counts for something, when you live there) and had put down some serious roots. I’d been at WashU for 13 years. Honestly, I didn’t think I was likely to move. Then I came for a visit to Columbus, and that changed.

The Start of a New Era

I really liked the people I met at NCH, and could see a place for myself in the new institute. There are some outstanding opportunities both for research and clinical genomics. NCH and Columbus are in growth mode, which was a refreshing thing to see. The Midwest lifestyle was very similar to what we had in St. Louis. The work sounded pretty exciting, too: use genomics to help save children’s lives. It was hard to deny the importance of that.

Also, if I’m being honest, I felt like I was in a bit of a rut for the reasons outlined above. The CCDG program is undoubtedly important, and I want to see it succeed, but sequencing 10,000 genomes to find the odds ratio 1.1 variant just didn’t excite me as it did many others. This was disheartening, because I felt somewhat on the fringe of a grant that I’d fought very hard for.

There were other personal and community things going on in St. Louis, too, that contributed to my feeling of malcontent. Starting something entirely new, in a new place, had a certain appeal.

In the end, I received two very generous offers: one to stay, and one to go. It was not an easy decision. But I brought my family to Columbus to look around before our summer vacation, and they loved it. That was the tipping point.

Fast forward a couple of crazy months, and we’d left everything (and almost everyone) we knew to start a new adventure in Ohio, with me as a principal investigator (NCH) and assistant professor of pediatrics (OSU). It’s certainly a new milestone for my career, and I’m pretty excited about it.

Old Friends at ASHG

In October, I attended the American Society of Human Genetics meeting in lovely Vancouver, Canada. I had the privilege of moderating a wonderful panel on gene discovery, genetic counseling, and clinical care for inherited retinal diseases. And I was on television! Well, ASHG television, but that counts for something.

I saw a number of old friends and colleagues there, but not nearly everyone I wanted to. The trip went by so quickly. There were (as usual) too many great presentations, but too little time. I will say that the location was easily one of my favorites for this meeting, despite the fact that the American Society of Human Genetics meeting happened in Canada. I’d go there again, which is more than I can say for Baltimore.

Turbulent NGS Technologies

Some fellow NGS bloggers have covered some important developments in the field of next-gen sequencing. Over at CoreGenomics, James has a nice piece on whether or not the world has too many HiSeq X Tens, and points out the fact that sales of the instrument have cooled, and most of the installations are not operating at full capacity.

This is not terribly surprising, given the incredible throughput of that instrument, the minimum buy-in, and the company’s legal restrictions on its applications. There are other factors at play, too: as I’ve mentioned before, in a world with cheap sequencing, samples are the most precious commodity.

Over at Omics! Omics!, there’s a nice piece by Keith Robison on Roche’s breakup with PacBio, complete with speculation as to the underlying reasons and a look at the competitive field. Keith is probably my favorite writer in the ever-dwindling genomics blog space. If you aren’t reading his blog, you should.

Wishing for a Productive New Year

Thank you for sticking around MassGenomics through this rocky year. I hope to get back to more routine blog posts in 2017. Until then, happy new year!
