New Challenges of Next-Gen Sequencing

I first started MassGenomics in the early days of next-gen sequencing, when Illumina was called “Solexa” and came in fragment-end, 35-bp reads. Even so, the unprecedented throughput of NGS and the nature of the sequencing technology brought a whole host of difficulties to overcome, notably:

  • Bioinformatics algorithms developed for capillary-based sequencing didn’t scale.
  • Sequencing reads were shorter and more error-prone.
  • The instruments were expensive, limiting access to the technology.
  • Most of the genetics/genomics/clinical community had no experience with NGS.

All of these are essentially solved problems: new bioinformatics tools and algorithms were developed, the reads became longer and more accurate, benchtop sequencers and sequencing-service-providers hit the market, and NGS was widely adopted by the research community. Mission accomplished!

Yet these victories were short-lived, because we find ourselves facing new challenges. Harder challenges. Here are a few of them.

1. Data storage

You’ve probably seen the plot of Moore’s Law compared to sequencing throughput. In short, the cost of DNA sequencing has plummeted much faster than the cost of disk storage and CPU. A run on the Illumina HiSeq2000 provides enough capacity for about 48 human exomes. Even if you don’t keep the images, each exome requires about 10 gigabytes of disk space to store the bases, qualities, and alignments in compressed (BAM) format. At three runs a month, each instrument generates about 1.4 terabytes of data files every month. It adds up quickly.
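To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted above (48 exomes per run, about 10 GB per exome, three runs a month) and assumes 1 TB = 1,000 GB; the function and constant names are purely illustrative.

```python
# Back-of-the-envelope storage estimate from the figures quoted in the text.
EXOMES_PER_RUN = 48   # approximate exome capacity of one HiSeq 2000 run
GB_PER_EXOME = 10     # compressed bases, qualities, and alignments (BAM)
RUNS_PER_MONTH = 3    # runs per instrument per month


def monthly_storage_gb(exomes_per_run=EXOMES_PER_RUN,
                       gb_per_exome=GB_PER_EXOME,
                       runs_per_month=RUNS_PER_MONTH):
    """Return gigabytes of primary data files one instrument produces per month."""
    return exomes_per_run * gb_per_exome * runs_per_month


monthly_gb = monthly_storage_gb()
# Assumption: decimal units, 1 TB = 1,000 GB.
print(f"{monthly_gb} GB per month (~{monthly_gb / 1000:.2f} TB)")
print(f"~{monthly_gb * 12 / 1000:.1f} TB per instrument per year")
```

Note that this counts only the primary data files; the downstream analysis files discussed next come on top of it.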

Analysis of sequencing data — variant calling, annotation, expression analysis, genetic analysis — also requires disk space. Most non-BGI research budgets are finite, so investigators must choose between (1) deleting data, (2) spending money, or (3) holding up data production/analysis. None of those sound very appealing, do they?

2. Achieving Statistical Significance

NGS is no longer an exploratory tool, and descriptive studies reporting a dozen or a couple hundred genomes/exomes are harder and harder to publish. This is particularly true for common diseases, in which large numbers of samples are typically required to achieve statistical significance. The number 10,000 has been discussed as an appropriate number. Even if that many samples could be found, the cost of sequencing so many is substantial. If you had an Illumina X Ten system and could do whole genomes for $1,000 each (that only covers reagents, by the way), it’s still ten million dollars. That’s probably over budget for most groups, so they’ll have to take another tack:

  • Sequencing fewer samples, which will make the work harder to fund/publish
  • Combining some sequencing with follow-up genotyping, which limits the discovery power
  • Collaborating with other labs/consortia, whose sample populations, phenotypes, or study designs may vary

How many of your project planning meetings have ended with someone saying, “Well, maybe we’ll get lucky”?

3. Finding Samples

Getting access to large sample cohorts is another challenge. As I’ve written previously, given the widespread availability of exome and genome sequencing, samples are the new commodity. High-quality DNA samples from informative sources (tumor tissue, diabetes patients, families with rare disorders, even healthy members of minority populations) are increasingly valuable. Why should an investigator collaborate with you, when they might send the samples off for sequencing on their own?

Sequencing samples with public funds (e.g., NIH grants) adds another layer of difficulty: all sequencing data must be submitted to public repositories. This means that the volunteer must have given informed consent not just for the study but also for data sharing. Local IRBs need to sign off as well. The net result is that many of the samples that come to us for sequencing don’t meet the criteria, and must be returned.

4. Privacy

Even if you have an outstanding, comprehensive informed consent document, it might be difficult to get volunteers to sign it. There’s a growing public concern about the privacy of genetic information. As Yaniv Erlich demonstrated by hacking the identities of CEPH sample contributors, genetic profiles obtained from SNP arrays, exome, or genome sequencing can be used to identify individual people. They also contain some very private details (like ancestry and disease risk alleles) that might be exploited, made public, or used for discrimination.

How long is it before genetic profiling replaces Google-stalking as a screening tool for job candidates or romantic interests? Thanks for coming in, Mr. Johnson. All we need now is your Facebook password and a cheek swab.

5. Functional Validation of Genomic Findings

Numerous research groups have demonstrated the immense discovery power of NGS. The mere fact that dbSNP (the NCBI database of human sequence variation) has swelled to more than 50 million distinct variants tells us something about what pervasive genome sequencing capabilities might uncover. And yet, the variants implicated in sequencing-based studies of human disease are increasingly difficult to “sell” to peer reviewers on genetic information alone. Our inability to predict the phenotypic impact of genetic variants lurks beneath the veneer of genetic discoveries like a shark following a deep-sea trawler.

Referees of most high-impact journals want to see some form of functional validation of genomic discoveries. That’s a daunting challenge for many of us accustomed to the rapid turnaround, high-throughput nature of NGS. Most functional validation experiments are slow and laborious by comparison.

6. Translation of NGS to the Clinic

We all know that NGS is destined for the clinic. Targeted sequencing panels are already in routine use at many cancer centers; in time, this will likely become exome/genome sequencing. Possibly transcriptome (RNA-Seq) and methylome (Methyl-Seq) as well. Undiagnosed inherited diseases, and rare genetic disorders whose genetic cause is unknown, are two other common-sense applications. There are many hurdles to overcome in order to apply a new technology to patient care. CLIA/CAP certification is a complex, expensive, and time-consuming process.

The reporting is more difficult, too. Unlike the research setting in which most NGS results have arisen, a clinical setting requires very high confidence in order to report anything back to the patient or treating physician. This is a good thing, since patient care decisions might be made based on genomic findings. Yet it means that we have a considerable amount of work ahead to ensure that genomic discoveries are followed up, replicated, and otherwise vetted to the point where they can be of clinical use.


Genetic Studies in Isolated Populations: Greenland

Many large-scale genetic studies are conducted in broad/homogenous population cohorts, like the panels chosen for the 1,000 Genomes Project. Yet there are advantages to studying smaller and more isolated founder populations, as exhibited in a study in this month’s Nature. Moltke et al report a genetic association study in the Greenlandic population, a small and historically isolated population (57,000 people). Of particular interest: type II diabetes (T2D), the prevalence of which has skyrocketed over the past 25 years.

Granted, T2D is a complex disease with many contributing factors. From a genetics perspective, the small and historically isolated population offers some advantages. The limited genetic diversity means that linkage disequilibrium is extended, which boosts the power of genetic association studies. Further, deleterious variants can reach higher frequencies due to the founder effect (genetic drift in a small population across many generations). This phenomenon has already been documented in the Finnish population, whose unique population history has contributed to the increased prevalence of at least 40 genetic disorders.

Part I: The Genome-Wide Association Study

Figure: GWAS Manhattan plot (Moltke et al, Nature 2014)

In the current study, the authors began with a GWAS in ~2,700 samples collected from 12 locations in Greenland. They did the genotyping on the Illumina Metabochip, a customized SNP array with 200K variants marking regions implicated in metabolic, cardiovascular, and anthropometric traits. Notice how the authors are stacking odds in their favor? A founder population, with an increased prevalence of T2D, and a custom genotyping array to study the relevant traits.

The authors employed a linear-mixed model to search for association of variant alleles with T2D status. You can see the hit on chromosome 13. That’s rs7330796, a variant in intron 11 of the TBC1D4 gene that was selected for inclusion on the Metabochip because it was in the top 5,000 candidate SNPs for association with waist-to-hip ratio.

Part II: Exome Sequencing and Fine Mapping

Figure: GWAS fine mapping in T2D (Moltke et al, Nature 2014)

Even though this was a specialized SNP chip, it was unlikely that rs7330796 was the causal variant. It’s intronic, and not in a large LD block. So the authors performed exome sequencing on nine trios, and identified four coding variants in high LD with rs7330796. One of these was a nonsense variant, introducing an early stop codon (p.Arg684Ter) in the longer isoform of TBC1D4.

When genotyped in the main cohort, p.R684X was strongly associated with 2-h plasma glucose levels. When conditioning on rs7330796 alleles, it was also associated with 2-h serum insulin levels.

Functional Assessments and Follow-up

Figure: Functional impact of T2D variant (Moltke et al, Nature 2014)

The fascinating thing about this nonsense variant was that it had an almost Mendelian effect on plasma glucose levels. Heterozygous carriers had a slight elevation, but subjects homozygous for the alternate allele had significantly elevated plasma glucose. Strikingly, the frequency of T2D in homozygous carriers was also about 5x higher.

Thus, the p.R684X variant confers increased risk of a subset of diabetes that features deterioration of glucose homoeostasis. And with an odds ratio of 10.3, the variant exhibits an effect size much higher than any other variants associated with T2D to date.

Interestingly, though p.R684X has a minor allele frequency (MAF) of ~17% in Greenlanders, the variant was essentially absent from other sequenced populations (the 1,000 Genomes panels, 6,500 ESP exomes, and 2,000 Danish exomes), except for one Japanese sample from 1,000 Genomes. However, the Greenlandic population is an admixed population with European and Inuit heritage. Sure enough, when the authors looked at the Inuit population, they found the variant at a MAF of 23%. Thus, though this variant is not unique to Greenland, it’s far more common there than in other populations.
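To get a feel for what those allele frequencies mean in terms of people, here is a small Python sketch that converts a MAF into expected genotype frequencies. It assumes Hardy-Weinberg equilibrium, which is my simplification and not something taken from the paper (an admixed population may well deviate from it), so treat the output as a rough illustration only. The function name is hypothetical.

```python
def hardy_weinberg(maf):
    """Expected genotype frequencies for a biallelic variant.

    Assumes Hardy-Weinberg equilibrium (a simplifying assumption;
    admixed populations can deviate from it).
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be between 0 and 1")
    ref = 1.0 - maf
    return {
        "hom_ref": ref * ref,   # two copies of the reference allele
        "het": 2 * ref * maf,   # one copy of the variant allele
        "hom_alt": maf * maf,   # two copies of the variant allele
    }


# MAFs quoted in the text: ~17% in Greenlanders, 23% in the Inuit population.
for label, maf in [("Greenlandic cohort", 0.17), ("Inuit", 0.23)]:
    freqs = hardy_weinberg(maf)
    print(f"{label}: het {freqs['het']:.1%}, hom_alt {freqs['hom_alt']:.1%}")
```

Under that assumption, a MAF of 17% corresponds to roughly 3% of individuals being homozygous for the variant, the group in which the large effect on plasma glucose was observed.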

One criticism of classic GWAS approaches is that they typically identify genetic markers associated with certain phenotypes, not the causal variants themselves. This study went beyond that with exome sequencing and follow-up genotyping. The beauty is that TBC1D4 makes a lot of sense as a diabetes gene. It acts as a mediator of insulin-stimulated, Akt-induced glucose uptake by regulating the GLUT4 transporter. Knockout mice have decreased basal plasma glucose levels, and are resistant to insulin-stimulated glucose uptake in muscle and adipose tissue.

TBC1D4, Glucose, and Diabetes

As I mentioned, TBC1D4 has two different isoforms: a full-length protein, and a shorter isoform that’s missing exons 11 and 12. The p.R684X variant is in exon 11, so it likely only impacts the full-length isoform. Interestingly, that isoform is only expressed in skeletal muscle. Therefore it doesn’t affect TBC1D4 signaling in B-cells, the liver, or adipose tissue. In short, this variant’s disruption of the full-length protein causes higher plasma glucose levels, insulin resistance, and thereby confers risk of type II diabetes. It’s a sensible, straightforward story.

The work demonstrates that isolated founder populations are an important resource for mapping physiologically-relevant genetic variation associated with complex disease. 


The Future of Cancer Genomics

As you have probably noticed, there’s been a major shift in the focus of next-gen sequencing over the past couple of years. First it was all about new genomes, new techniques, and discovery. Now it’s all about translation. We are entering a new era in next-gen sequencing, one in which NGS technologies will not only be used for discovery, but will be integrated into clinical care.

A review in the latest issue of Human Molecular Genetics discusses NGS-enabled cancer genomics from the clinician’s point of view. In it, the authors highlight recent findings from large-scale cancer genomics efforts — such as the Cancer Genome Atlas — and offer their perspectives on the significant challenge facing us: translating the knowledge from such massive “oncogenomic” datasets to the clinic.

Large-scale Tumor Genomics Studies

Ambitious efforts by the Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have provided, in the last few years, comprehensive molecular profiles of the most common cancer types. Some of the key findings included:

For Glioblastoma (GBM)

TCGA’s first integrative analysis, a study of 91 tumors, revealed:

  • Frequent mutations affecting TP53 (37% of tumors) and NF1 (14%)
  • A subset of tumors with epigenetic abnormalities (MGMT promoter methylation) and hyper-mutation.
  • Gene expression-based definition of four subtypes: proneural, neural, classical, and mesenchymal

For Ovarian Cancer

TCGA’s study on 489 patients with serous ovarian cancer reported:

  • Nearly all tumors had mutation and/or deletion of tumor suppressor TP53.
  • Nine further genes, including NF1, BRCA1/2, RB1, and CDK12, were significantly mutated
  • Ovarian carcinomas also harbored extensive copy number alteration and promoter methylation

For Colorectal Cancer

A study of 276 colorectal carcinomas found that among these tumors:

  • 16% were hypermutated: 3/4 with microsatellite instability, hypermethylation and MLH1 silencing; 1/4 with mutations in DNA repair genes.
  • Frequent alteration of the WNT, MAPK, PI3K, TGF-β, and TP53 pathways
  • 24 significantly mutated genes, including known (APC, TP53, SMAD4, PIK3CA, and KRAS) as well as novel (ARID1A, SOX9, and FAM123B).

Lung Squamous Cell Carcinoma

TCGA also profiled 178 lung SqCCs, which exhibited:

  • Very high mutation rates (~360 coding mutations, ~165 rearrangements, and ~323 CNAs per tumor)
  • TP53 mutations in almost every tumor
  • Frequent alteration of CDKN2A/RB1 (72%), PI3K/AKT (47%), and squamous differentiation (44%) pathways
  • 7% of cases with EGFR amplification resulting in sensitivity to erlotinib and gefitinib

Breast Carcinoma

The largest sequencing study from TCGA to date included a comprehensive molecular profile of 510 breast tumors. Some of the highlights:

  • Significantly mutated genes included classical ones (PIK3CA, PTEN, AKT1, TP53, etc.) as well as novel ones (TBX3, RUNX1, CBFB, etc.).
  • Key differences in mutation patterns between luminal A, luminal B, Her2-enriched, and Basal-like subtypes
  • Frequent mutation of TP53 and PIK3CA genes (29%)

Endometrial Cancer

TCGA’s extensive characterization of 373 endometrial carcinomas revealed that:

  • Uterine serous tumors and 25% of high-grade endometrioid tumors had many CNAs and frequent TP53 mutation, but low DNA methylation changes and progesterone/estrogen receptor expression.
  • Most endometrioid tumors had few copy number alterations or TP53 mutations, but frequent mutations in PTEN, CTNNB1, PIK3CA, ARID1A and KRAS and novel SMG ARID5B.
  • A subset of endometrioid tumors had a markedly increased transversion mutation frequency and hotspot mutations in POLE.
  • Tumors fell into one of four groups: ultramutated, microsatellite instability hypermutated, copy-number low, and copy-number high

Global Oncogenomics Findings

A systematic analysis of 3,281 tumors from 12 cancer types (by Kandoth et al) offered a global picture of the genomics of common human cancers. Many tumor types had mutations in chromatin remodeling genes (MLL2, MLL4, or the ARID gene family). TP53 was the most commonly mutated gene overall. Mutations in that gene and six others (BAP1, DNMT3A, HGF, KDM5C, FBXW7, and BRCA2) were significantly associated with poor survival. Large alterations (CNAs, SVs) clearly have an important role in tumor biology, and gene/miRNA expression profiling allows stratification of tumors into subtypes, often ones that correlate with clinical outcomes. Even within one tumor type, the mutational profiles suggested that few driver genes were shared across subtypes.

The broader conclusion from these and from so-called pan-can studies is that cancer represents a wide variety of diseases originating from different organs. Clustering genomic data across organs will therefore allow a biology-driven approach, focusing more on key genes and cellular pathways and less on simple tumor morphology.

Clinical Translation of Cancer Genomics

The real question, now that we’ve made considerable progress, is how to make use of that information in the clinic. Many institutions have launched personalized oncology programs which consider tumor mutation and/or gene expression profiling. Early reports suggest that 30-70% of cases will harbor mutations that are “actionable” for targeted therapy or patient stratification. The poster child for this might be the identification of BRAF as a driver gene in melanoma, which led to the use of BRAF inhibitors in melanomas that harbor the V600E mutation. It’s a wonderful story, but the simple fact is that most targeted therapies didn’t emerge from large-scale genomics studies, but from a deep understanding of specific pathways involved in defined tumor types.

Further, the successful identification of (and targeted therapy against) a driver mutation in one tumor type does not guarantee it will work in another type. Other factors, such as tissue specificity, genetic environment, and tumor micro-environment, must be considered as well.

In many current clinical trials, gene expression and mutation data are being concomitantly assessed for insight into patient stratification and therapeutic response. These sorts of trials are necessary to close the gap between new knowledge from large-scale cancer genomics and its application in the clinic. The feedback loop needs to work both ways: clinical trial results should inform future oncogenomics studies as well.

It’s clear that we will need both creativity and cross-discipline expertise to carry the mission forward from here. Specifically, we’ll need:

  • Continued efforts to develop large, high-resolution, clinical-genomics data sets
  • Better and earlier access to drugs
  • Cross-discipline expertise in cancer, genomics, and informatics (“onco-bioinformaticians”)
  • Integration of genomic data into clinical tumor board discussions

Beating cancer is an important but incredibly difficult mission, and it won’t be solved by one scientific discipline alone. Collaborative efforts by cross-discipline teams are going to be necessary. Let’s get going.

Promoters and Enhancers in the Human Genome

A recent issue of Nature featured two articles from the FANTOM5 project, an effort to systematically study gene expression and regulation by performing cap analysis of gene expression (CAGE) across a diversity of cell types. FANTOM5 has generated single molecule CAGE profiles for 573 primary human cell samples, each sequenced to a median depth of ~4 million mapped tags per sample. CAGE is essentially a technology to isolate short sequence tags from capped RNAs. The approach was pioneered at RIKEN and is useful for determining transcription start sites (i.e., the regions of the genome corresponding to the capped 5′ end of RNAs).

Inferring Transcription Start Sites

CAGE tags are short, and thus can be directly sequenced using current NGS technologies. When mapped to the genome, they tell us not only about the level of expression for various genes, but also which transcription start sites were used. This is an advantage over RNA-seq and microarray-based expression profiling, which can’t always distinguish between multiple promoters of the same gene.

Figure: CAGE technology overview (RIKEN)

Also, CAGE data can be used to identify in-vivo transcribed enhancers, which are remote elements that increase transcription of a gene, independent of their distance or orientation to the gene’s promoter.

CAGE Advantages & Dataset

Because FANTOM5 generated all of the data on a single molecule sequencing platform (Helicos), there’s also no PCR or cloning bias to worry about. Every read represents the 5′ end of a unique RNA molecule. Thus, the FANTOM5 dataset nicely complements the open chromatin, ChIP-seq, and RNA-seq data generated by the ENCODE project.

The complete FANTOM5 dataset includes CAGE profiles for:

  • 573 human primary cell samples (~3 donors for most cell types)
  • 250 different cancer cell lines
  • 152 human post-mortem tissues
  • 128 mouse primary cell samples
  • 271 mouse developmental tissue samples

All samples were annotated using structured ontologies, and made available in the FANTOM5 online resource along with a genome browser for exploring the relationship between CAGE tag distribution and expression profiles.

Gene Promoters

The authors developed a method called decomposition-based peak identification (DPI) to cluster CAGE tags and identify the peaks, which should represent transcription start sites (TSSs). Sample and genome-wide, DPI identified 3.5 million peaks in the human genome and 2.1 million peaks in the mouse genome. Next, they applied tag evidence thresholds to define “permissive” and “robust” datasets, with the latter more stringent set used for most of the analysis.

Matching DPI peaks to the 5′ end of known genes within 500 bp revealed that 91% of human protein-coding genes had a TSS supported by robust CAGE peaks. This is pretty impressive coverage for an expression dataset, since not all genes will be “turned on” in every sample. To me, it suggests that the broad panel of primary cell samples captures most genes that it should, including the differentially expressed ones. Almost all peaks (96%) were observed in at least 2 samples, but most were present in fewer than half of the samples, suggesting a lot of cell/tissue/sample specificity.
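The matching step can be pictured with a short Python sketch. This is not the authors’ pipeline, just one simple way to ask “does this gene have a peak within 500 bp of its annotated 5′ end?” I am assuming that a peak must sit on the same chromosome and strand as the gene; the gene names, coordinates, and function name below are all made up for illustration.

```python
from collections import defaultdict


def fraction_genes_supported(genes, peaks, window=500):
    """Return (fraction, names) of genes with a peak within `window` bp of the TSS.

    genes: list of (name, chrom, strand, tss_position)
    peaks: list of (chrom, strand, peak_position)
    Assumption: a peak only counts if it is on the same chromosome and strand.
    """
    # Index peak positions by (chromosome, strand) for quick lookup.
    peaks_by_key = defaultdict(list)
    for chrom, strand, pos in peaks:
        peaks_by_key[(chrom, strand)].append(pos)

    supported = [
        name
        for name, chrom, strand, tss in genes
        if any(abs(pos - tss) <= window
               for pos in peaks_by_key.get((chrom, strand), []))
    ]
    fraction = len(supported) / len(genes) if genes else 0.0
    return fraction, supported


# Hypothetical annotations and peaks, purely for illustration.
genes = [
    ("GENE_A", "chr1", "+", 10_000),
    ("GENE_B", "chr1", "-", 50_000),
    ("GENE_C", "chr2", "+", 7_500),
]
peaks = [
    ("chr1", "+", 10_120),  # 120 bp from GENE_A, same strand
    ("chr1", "+", 50_010),  # near GENE_B but on the opposite strand
    ("chr2", "+", 9_000),   # 1,500 bp from GENE_C, outside the window
]

fraction, names = fraction_genes_supported(genes, peaks)
print(f"{fraction:.0%} of genes supported: {names}")
```

A linear scan like this is fine for a toy example; with millions of peaks you would want sorted positions and a binary search, or an interval-based tool.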

Nucleosomes and Preferred TSSs

The FANTOM5 dataset supports the previous observation that gene promoters can be classified as either “broad” or “sharp” types, i.e. genes with many or few TSSs. However, for genes with broad promoters (many TSSs), the CAGE dataset provided enough depth to identify which TSS seems to be preferred. Using this dominant TSS, the authors searched for phased dinucleotides associated with nucleosome location (AA/AT/TA/TT). There was a striking pattern of such motifs repeating at intervals of about 10.5 bp downstream of the dominant TSS, suggesting that the nucleosomes have something to do with TSS preference in broad promoters.

Evolutionary Conservation of Gene Promoters

Figure: TSS conservation by cell type (FANTOM Cons., Nature 2014)

About 38% of human TSSs overlapped previously-defined sequences that are evolutionarily conserved in mammals. TSSs for protein-coding genes were more conserved than those of noncoding RNAs. TSSs of housekeeping (essential) genes showed the highest conservation. Interestingly, TSSs that were specific to one cell or tissue type were more likely to change through evolution:

  • TSSs in fibroblasts, chondrocytes, and pre-adipocytes were among the most conserved
  • TSSs in T-cells, macrophages, dendritic cells, whole blood, and endothelial cells were the least conserved

You might notice something about the least conserved TSSs: they’re all related to immunity, suggesting that cells/tissues of the immune system are rapidly evolving.

Other Promoter Observations

There’s too much good stuff in this paper to cover it all, but here are some of the highlights:

  • Only 54% of human TSSs and 61% of mouse TSSs that were in human-mouse conserved regions had peaks in the other species.
  • The authors obtained promoter-level expression profiles for 1,665 of 1,762 human transcription factors (94%).
  • A typical primary cell had ~8,752 TSSs expressed at 3 or more copies per cell, including 430 corresponding to known transcription factors.
  • ChIP-seq data show that cell-type-specific promoters are enriched for cell-type-specific transcription factors.
  • A small number of highly-abundant RNAs (HBB, SMR3B, STATH, PRB4, CLPS, HTN3, SERPINA1, CTRB2, CPB1, CPA1, and MALAT1) accounted for 20% or more of the reads in some libraries.

Comprehensive Enhancer Discovery

Figure: Enhancers, bidirectional capped CAGE (Andersson et al, Nature 2014)

This CAGE dataset also contained many TSSs not associated with known protein-coding genes. In a companion paper, Andersson et al show that balanced bi-directional CAGE peaks are a signal of active enhancers. The forward- and reverse-oriented peaks were usually separated by around 180 base pairs, and corresponded to nucleosome boundaries.
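One simple way to put a number on “balanced” is a directionality score, (F - R) / (F + R), where F and R are the tag counts on the forward and reverse strands around a candidate site. The Python sketch below illustrates that idea. The cutoffs and the example counts are hypothetical choices of mine, and I’m not claiming this reproduces the exact procedure used in the paper.

```python
def directionality(forward_tags, reverse_tags):
    """Return (F - R) / (F + R).

    +1 means all tags are on the forward strand, -1 all on the reverse
    strand, and 0 means perfectly balanced bidirectional transcription.
    """
    total = forward_tags + reverse_tags
    if total == 0:
        raise ValueError("no tags on either strand")
    return (forward_tags - reverse_tags) / total


def looks_balanced(forward_tags, reverse_tags, max_abs_score=0.8, min_total=2):
    """Flag a site as a candidate when both strands contribute.

    max_abs_score and min_total are hypothetical cutoffs for illustration.
    """
    if forward_tags + reverse_tags < min_total:
        return False
    return abs(directionality(forward_tags, reverse_tags)) < max_abs_score


# Made-up tag counts: (forward, reverse)
for site, (fwd, rev) in {"site_1": (40, 35), "site_2": (120, 3)}.items():
    score = directionality(fwd, rev)
    print(f"{site}: D = {score:+.2f}, balanced = {looks_balanced(fwd, rev)}")
```

In this toy example the first site has similar signal in both directions, the pattern described for enhancers, while the second is strongly one-directional, which is more like what you’d expect at a typical mRNA promoter.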

The authors identified 43,011 candidate enhancers across 808 human CAGE libraries. These candidates were depleted for CpG islands and repeats, and 95% of the RNAs from them were short, unspliced transcripts (median 346 nucleotides long) very unlike typical mRNAs. Another difference from mRNAs: both sense and antisense transcripts from enhancers were sensitive to degradation by the exosome complex. In contrast, only antisense mRNAs are degraded by the exosome. TSSs of mRNAs and enhancers seemed to have similar RNA-PolII initiation elements, but motif analysis suggests that enhancers are more similar to non-CpG island promoters.

This is fascinating stuff, and a great companion paper to the promoter atlas paper. We need more studies like these (and ENCODE) to fully understand the function of the human genome.


FANTOM Consortium and the RIKEN PMI and CLST (DGT) (2014). A promoter-level mammalian expression atlas. Nature, 507 (7493), 462-70 PMID: 24670764

Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, Ntini E, Arner E, Valen E, Li K, Schwarzfischer L, Glatz D, Raithel J, Lilje B, Rapin N, Bagger FO, Jørgensen M, Andersen PR, Bertin N, Rackham O, Burroughs AM, Baillie JK, Ishizu Y, Shimizu Y, Furuhata E, Maeda S, Negishi Y, Mungall CJ, Meehan TF, Lassmann T, Itoh M, Kawaji H, Kondo N, Kawai J, Lennartsson A, Daub CO, Heutink P, Hume DA, Jensen TH, Suzuki H, Hayashizaki Y, Müller F, FANTOM Consortium, Forrest AR, Carninci P, Rehli M, & Sandelin A (2014). An atlas of active enhancers across human cell types and tissues. Nature, 507 (7493), 455-61 PMID: 24670763