A New Era for MassGenomics

When I started MassGenomics in 2008, next-generation sequencing was in its infancy. We’d sequenced AML1 — the first cancer genome — with two nascent platforms: Illumina/Solexa (32-bp reads) and 454 FLX (450-bp reads). Already, we had a glimpse of the bioinformatics challenges that these technologies brought forth.

Sequencing for Common Disease

It’s astonishing how far the field has come in just eight years. Factory-scale sequencing now makes it practically and economically feasible to sequence tens of thousands of (whole human) genomes in a single year. Washington University and other large-scale sequencing institutions are currently applying it to ambitious studies of cardiovascular, autoimmune, and neurological conditions. By studying tens of thousands of genomes, it should be possible to comprehensively define the genetic architecture underlying each of these common complex diseases.

Rare Disease and Clinical Applications

Yet there are other important applications of next-gen sequencing, such as:

  • The identification of genes underlying rare inherited disorders
  • Molecular diagnosis and characterization of undiagnosed diseases
  • Utilization of genomic information to improve clinical care

The distribution model for these applications is different from the factory-scale sequencing operation required for common disease. The democratization of NGS has empowered hundreds of smaller labs to carry out such research, and enabled rapid clinical sequencing at the point of care. That’s where the rubber hits the road, and it’s also where I want to be.

A New Position: Nationwide Children’s Hospital

Thus, after 13 years at Washington University, I’ve accepted a position as Principal Investigator at Nationwide Children’s Hospital. If the name of that institution sounds familiar, it’s because they’ve recruited Rick Wilson and Elaine Mardis to establish a new Institute for Genomic Medicine (IGM). Under their leadership, I’ll help build up the research program for the genetic basis of rare pediatric disorders.

Future Directions

So, what does this mean for MassGenomics? The blog will continue, hopefully at a greater frequency, and with a new emphasis on pediatric and clinical genomics. I should state for the record that the blog does not represent the views of Nationwide Children’s Hospital or the Ohio State University (where I’m now an assistant professor).

The McDonnell Institute at Washington University will continue on, by the way. The talented faculty and staff have already begun work on the Centers for Common Disease Genomics (CCDG) program, while University leadership has initiated a search for a new director. They have capacity to spare, so if you’re looking for high-quality exome or genome sequencing (human or non-human), please reach out to Bob Fulton.

So that’s my news, and I hope to have more to share in the weeks to come.

 

The Genetic Architecture of Complex Disease

[Figure: Genetics of complex disease (Fuchsberger et al, Nature 2016)]

It’s no secret that while genome-wide association studies (GWAS) have implicated thousands of genetic loci in human phenotypes, the variants uncovered collectively explain only a fraction of the observed variance between individuals. The reasons for this “missing heritability” are a subject of vigorous debate in the scientific community. One possible explanation is that rare (low-frequency) variants — which are poorly represented on the arrays used for GWAS — underlie a substantial proportion of the variability.

This idea is intuitive: in theory, large-effect variants would be kept at low frequency by natural selection, a pattern that’s well established for mutations that cause rare single-gene disorders. It also makes a strong argument for large-scale sequencing for common complex disease, which is the purpose of the NHGRI’s flagship CCDG program. The problem, of course, is that we can’t really understand the contribution of low-frequency variants to human disease without actually performing such an experiment.

A new study in this week’s issue of Nature represents one of the first and highest-profile attempts to do so for a common disease. Type II diabetes (T2D) affects 29 million people in the United States (according to the CDC), which is about 9.3% of the entire population. It also has a strong genetic component, and has thus been a priority GWAS target for over a decade. So far, GWAS efforts have reported 80 robust associations, largely involving common (MAF>5%) variants that have very small effects on disease risk.

In the current study, Christian Fuchsberger and his 300+ co-authors used a combination of genome sequencing, exome sequencing, genotyping, and imputation to examine the genetic architecture of type II diabetes. This report is the fruit of a years-long collaboration between two consortium efforts: GoT2D, which applied whole-genome sequencing to individuals of European ancestry, and T2D-GENES, which performed exome sequencing in multi-ethnic cohorts. Here’s a summary of the data generated:

Genome-wide Data (European ancestry)             Cases     Controls
Low-coverage (5x) whole genome sequencing        1,326        1,331
Genotype imputation in 13 other cohorts         11,645       32,769
Total                                           12,971       34,100

Exome-centric Data (5 ancestry groups)           Cases     Controls
Deep (82x) exome sequencing                      6,504        6,436
SNP array genotyping (2.5 million sites)        28,305       51,549
Total                                           34,809       57,985

Genome Sequencing Coverage Matters

I think it’s important to point out the nuance of whole-genome sequencing coverage. Generally, we target 30x coverage for the whole genome of a germline (i.e. non-tumor) sample, which provides excellent power for variant detection. Some groups have touted 20x as a possible minimum threshold, and I’m comfortable with that.

But low-coverage (5x) whole genome sequencing is a whole different animal. WGS coverage is like a bell curve: while many positions will have 5x coverage, some will have 1-3x and some will have 7-10x. Even for this group of authors, which includes some of the top experts on NGS variant calling, this presents a significant challenge for variant detection.

Simply put, at 5x coverage, a number of rare and/or hard-to-call variants (e.g. SVs) will be missed.
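To put rough numbers on that spread, here is a minimal sketch that treats per-base depth as Poisson-distributed around the mean. That is an idealization on my part (real coverage is more dispersed because of GC bias and mappability), and the function names and the "at least two alternate reads" rule are illustrative assumptions, not the method used in the paper.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k reads at a position, given the mean depth."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_at_or_below(max_depth, mean):
    """Expected fraction of positions with depth <= max_depth."""
    return sum(poisson_pmf(k, mean) for k in range(max_depth + 1))


def prob_het_alt_seen(mean, min_alt_reads=2):
    """Chance that a heterozygous site shows at least `min_alt_reads` reads
    carrying the alternate allele. Assumes each read samples either allele
    with probability 0.5, so alternate-allele depth is Poisson(mean / 2)."""
    alt_mean = mean / 2
    return 1 - sum(poisson_pmf(k, alt_mean) for k in range(min_alt_reads))


for depth in (5, 20, 30):
    print(
        f"{depth}x: {fraction_at_or_below(3, depth):.1%} of positions at <=3x, "
        f"{prob_het_alt_seen(depth):.1%} of het sites with >=2 alt reads"
    )
```

Under this simple model, roughly a quarter of positions in a 5x genome end up at 3x or lower, and a sizable share of heterozygous sites never show two supporting reads in a single sample. At 20-30x both problems essentially vanish.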

Useful WGS Metrics

In spite of my concerns, I’m a sucker for summary metrics in large-scale WGS datasets. Here are some highlights from low-coverage WGS of 2,657 European-ancestry individuals:

  • 26.7 million variants were detected, genotyped, and phased, including 1.5 million small indels and 8,876 large (>100 bp) deletions.
  • Individuals harbored an average of 3.30 million genetic variants, including 271,245 indels and 669 deletions.
  • 420,473 common SNVs and 2.4 million low-frequency SNVs were poorly tagged by genotype arrays (r-squared < 0.30), and thus haven’t been interrogated by any T2D GWAS to date.
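For readers less familiar with the tagging metric in that last bullet, here is a rough sketch of how r-squared between two variants can be estimated from genotype dosages (0, 1, or 2 copies of the alternate allele). Using the squared Pearson correlation of dosages is a common approximation to haplotype-based LD, the function names are mine, and in practice a variant counts as poorly tagged only if its best r-squared against every array SNV falls under the threshold.

```python
def ld_r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between two variants' genotype dosages."""
    if len(dosages_a) != len(dosages_b) or len(dosages_a) < 2:
        raise ValueError("need two equal-length dosage lists with >= 2 samples")
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r-squared is undefined for a monomorphic variant")
    return cov * cov / (var_a * var_b)


def is_poorly_tagged(r2, threshold=0.30):
    """Apply the r-squared < 0.30 cutoff quoted in the summary above."""
    return r2 < threshold


# Toy genotypes for eight hypothetical individuals (not data from the study).
array_snv = [0, 1, 2, 0, 1, 0, 2, 1]
sequenced_variant = [0, 0, 1, 0, 0, 0, 1, 0]
r2 = ld_r_squared(array_snv, sequenced_variant)
print(f"r-squared = {r2:.2f}, poorly tagged: {is_poorly_tagged(r2)}")
```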

Genome-wide Single-variant Associations

The primary association analysis uncovered 126 variants at 4 loci that were associated with T2D, three of which were known. EML4 was novel, but when the authors imputed sequencing variants into a much larger sample collection (44,414 individuals from 17 other studies), the association didn’t hold up. Another novel signal (CENPW) did appear, and this was replicated in an independent cohort.

[Figure: Associations with T2D (Fuchsberger et al, Nature 2016)]

In summary, the meta-analysis of sequencing and imputed data examined 26.7 million variants in over 47,000 individuals of European ancestry. That’s a massive association study with extremely high resolution, yet it recapitulated only 13/80 loci (16%) known to be “robustly associated” with T2D, and uncovered only one new locus. I find that a bit discouraging, and I’m sure the authors did, too.

Coding Variation in Type II Diabetes

The analysis of exome data fared little better, I’m afraid. The authors combined exome sequencing data from 10,437 individuals representing five ancestry groups (European, South Asian, East Asian, Hispanic, and African American) with equivalent data from the WGS study for a joint dataset comprising 12,940 individuals. They identified:

  • 3.04 million variants overall, of which 1.19 million were protein-altering
  • ~9,243 synonymous, 7,636 missense, and 250 protein-truncating variants per individual

Single-variant testing yielded only one significant result (PAX4 p.Arg192His, a.k.a. rs2233580), which was observed only in East Asian individuals. Gene-level aggregation testing yielded no exome-wide significant findings. Limiting the analysis to 634 genes in known associated loci uncovered an association (FES in South Asians, driven by a single likely-causal variant) that met the more forgiving threshold for significance.

To increase power, the authors integrated SNP genotypes from 2.5 million sites in about 79,000 additional cases and controls (all European ancestry) obtained using a custom Illumina SNP chip. Integrating these with the exome data yielded an exome-centric dataset of more than 90,000 individuals. Some 18 variants at 13 loci exceeded genome-wide significance, but all were common (MAF>5%), and only one (MTMR3) was outside of known GWAS loci.

No Evidence for Synthetic Association

Back in 2010, Goldstein and colleagues proposed the concept of “synthetic association” — the idea that common GWAS signals may be due to individually rare causal variants which cluster on certain common haplotypes. The thinking was that sequencing in GWAS regions might therefore reveal all of these causal variants. This would offer an intriguing explanation for the fact that most lead GWAS hits lie outside of coding regions. It might be possible that nearby rare causal variants were in LD with the tag SNP, and these (not the tag SNP) exerted the causal effect on disease risk.

The authors tested this hypothesis in T2D using the WGS dataset for 2,657 individuals, which they describe as having “near-complete ascertainment of genetic variation.” They took the 10 T2D GWAS loci with the strongest support in their study, and looked for low-frequency missense variants within 2.5 million base pairs of the common index SNV. None of the loci showed supporting evidence of “synthetic association,” and 8/10 were convincingly inconsistent with the proposed phenomenon.

Thus, while synthetic association might well underlie common GWAS signals for other phenotypes, it does not appear to do so for T2D.

The Contribution of Rare and Common Variants

To model the disease architecture of T2D, the authors conducted an elegant experiment. They simulated three possible models which had seemed plausible prior to large-scale sequencing, and computed the number of associated low-frequency and rare variants that would be uncovered with their study design.

[Figure: Simulated models of T2D genetics (Fuchsberger et al, Nature 2016)]

In the first two models, low-frequency variants explain a significant proportion of the heritability, and over a hundred of them should have been uncovered at the more forgiving significance threshold. In a third model, where rare variants make a minority contribution, they’d uncover only a few dozen.

[Figure: Actual results for T2D (Fuchsberger et al, Nature 2016)]

Next, the authors compared these outcomes to their actual results. Only 23 low-frequency and rare variants achieved significance, which is nowhere close to the first two models (the ones that suggest a major role). It’s most similar to the common polygenic model of disease for T2D, suggesting that this study supports a minor role for rare and low-frequency variants.

In Summary

Overall, I found this to be a comprehensive and extremely well-written paper of the caliber we’d expect to see in Nature. It represents years of work by more than 300 contributing authors, and probably the first study of many to come. While the number of new discoveries may be a tad disappointing, the authors have uncovered novel loci and secondary signals. They’ve also done a great deal to shed light on the genetic architecture of this common complex disease, particularly as far as coding variants are concerned.

 

We will need, and I hope to see, many efforts like this to understand the genetic architecture of other diseases and important human traits.

References
Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, Rivas MA, Perry JR, Sim X, Blackwell TW, Robertson NR, Rayner NW, Cingolani P, Locke AE, Tajes JF, Highland HM, Dupuis J, Chines PS, Lindgren CM, Hartl C, Jackson AU, Chen H, Huyghe JR, van de Bunt M, Pearson RD, Kumar A, Müller-Nurasyid M, Grarup N, Stringham HM, Gamazon ER, Lee J, Chen Y, Scott RA, Below JE, Chen P, Huang J, Go MJ, Stitzel ML, Pasko D, Parker SC, Varga TV, Green T, Beer NL, Day-Williams AG, Ferreira T, Fingerlin T, Horikoshi M, Hu C, Huh I, Ikram MK, Kim BJ, Kim Y, Kim YJ, Kwon MS, Lee J, Lee S, Lin KH, Maxwell TJ, Nagai Y, Wang X, Welch RP, Yoon J, Zhang W, Barzilai N, Voight BF, Han BG, Jenkinson CP, Kuulasmaa T, Kuusisto J, Manning A, Ng MC, Palmer ND, Balkau B, Stančáková A, Abboud HE, Boeing H, Giedraitis V, Prabhakaran D, Gottesman O, Scott J, Carey J, Kwan P, Grant G, Smith JD, Neale BM, Purcell S, Butterworth AS, Howson JM, Lee HM, Lu Y, Kwak SH, Zhao W, Danesh J, Lam VK, Park KS, Saleheen D, So WY, Tam CH, Afzal U, Aguilar D, Arya R, Aung T, Chan E, Navarro C, Cheng CY, Palli D, Correa A, Curran JE, Rybin D, Farook VS, Fowler SP, Freedman BI, Griswold M, Hale DE, Hicks PJ, Khor CC, Kumar S, Lehne B, Thuillier D, Lim WY, Liu J, van der Schouw YT, Loh M, Musani SK, Puppala S, Scott WR, Yengo L, Tan ST, Taylor HA Jr, Thameem F, Wilson G, Wong TY, Njølstad PR, Levy JC, Mangino M, Bonnycastle LL, Schwarzmayr T, Fadista J, Surdulescu GL, Herder C, Groves CJ, Wieland T, Bork-Jensen J, Brandslund I, Christensen C, Koistinen HA, Doney AS, Kinnunen L, Esko T, Farmer AJ, Hakaste L, Hodgkiss D, Kravic J, Lyssenko V, Hollensted M, Jørgensen ME, Jørgensen T, Ladenvall C, Justesen JM, Käräjämäki A, Kriebel J, Rathmann W, Lannfelt L, Lauritzen T, Narisu N, Linneberg A, Melander O, Milani L, Neville M, Orho-Melander M, Qi L, Qi Q, Roden M, Rolandsson O, Swift A, Rosengren AH, Stirrups K, Wood AR, Mihailov E, Blancher C, Carneiro MO, 
Maguire J, Poplin R, Shakir K, Fennell T, DePristo M, Hrabé de Angelis M, Deloukas P, Gjesing AP, Jun G, Nilsson P, Murphy J, Onofrio R, Thorand B, Hansen T, Meisinger C, Hu FB, Isomaa B, Karpe F, Liang L, Peters A, Huth C, O’Rahilly SP, Palmer CN, Pedersen O, Rauramaa R, Tuomilehto J, Salomaa V, Watanabe RM, Syvänen AC, Bergman RN, Bharadwaj D, Bottinger EP, Cho YS, Chandak GR, Chan JC, Chia KS, Daly MJ, Ebrahim SB, Langenberg C, Elliott P, Jablonski KA, Lehman DM, Jia W, Ma RC, Pollin TI, Sandhu M, Tandon N, Froguel P, Barroso I, Teo YY, Zeggini E, Loos RJ, Small KS, Ried JS, DeFronzo RA, Grallert H, Glaser B, Metspalu A, Wareham NJ, Walker M, Banks E, Gieger C, Ingelsson E, Im HK, Illig T, Franks PW, Buck G, Trakalo J, Buck D, Prokopenko I, Mägi R, Lind L, Farjoun Y, Owen KR, Gloyn AL, Strauch K, Tuomi T, Kooner JS, Lee JY, Park T, Donnelly P, Morris AD, Hattersley AT, Bowden DW, Collins FS, Atzmon G, Chambers JC, Spector TD, Laakso M, Strom TM, Bell GI, Blangero J, Duggirala R, Tai ES, McVean G, Hanis CL, Wilson JG, Seielstad M, Frayling TM, Meigs JB, Cox NJ, Sladek R, Lander ES, Gabriel S, Burtt NP, Mohlke KL, Meitinger T, Groop L, Abecasis G, Florez JC, Scott LJ, Morris AP, Kang HM, Boehnke M, Altshuler D, & McCarthy MI (2016). The genetic architecture of type 2 diabetes. Nature PMID: 27398621

Transitions and Excuses

My sincere apologies to the dedicated MassGenomics readers who’ve noticed the recent decline in new posts here. It’s a busy and tumultuous time for our institute.

Leadership Transition at MGI

For those who missed the announcement earlier this month: our center’s director Rick Wilson and co-director Elaine Mardis announced that they’re leaving to establish a new Institute for Genomic Medicine at Nationwide Children’s Hospital / Ohio State University in Columbus, Ohio.

We are still figuring out the transition plan, but the Washington University School of Medicine remains very committed to supporting our center and the people who work here. In other words, the McDonnell Genome Institute will continue on.

Large-scale Sequencing Opportunities

In the meantime, we are in the midst of large-scale sequencing efforts for the Centers for Common Disease Genomics (CCDG), Alzheimer’s Disease Sequencing Project (ADSP), and Gabriella Miller Kids First (GMKF) initiatives. These are all ambitious projects in which I’m intimately involved, which means they consume a lot of my time. On the bright side, they keep me at the forefront of genomics where I can continue to be useful to you.

Important note for fellow scientists: Even with our current commitments, the HiSeq X Ten remains a hungry beast, so please get in touch if you’re looking for low-cost genome sequencing. With the X Ten and other instruments, we can provide custom-targeted, exome, whole genome, and/or transcriptome sequencing for humans and model organisms.

More Science Fiction

Last but not least, some personal news that may help explain why I’ve had less free time to write on MassGenomics. As you know, Harper Voyager (an imprint of HarperCollins) published my debut novel earlier this year. I’m thrilled to announce that I’ve accepted an offer from my publisher for two more books, effectively making The Rogue Retrieval into a trilogy.

[Image: Harper Voyager two-book deal]

All of you have been enormously supportive of my science fiction writing as well as my science writing, and I hope that will continue.

Once the dust settles from this transition period, I should be posting on a more regular schedule. So please stick around!

The Real Cost of Sequencing

The real cost of sequencing is as hard to pin down as a sumo wrestler. Working in a large-scale sequencing laboratory offers an interesting perspective on the duality of the so-called “cost per genome.” On one hand, we see certain equipment manufacturers and many people in the media tossing around claims that sequencing a genome now costs under $1,000. On the other, we write grant budgets and estimates based on actual costs, which include things like sample assessment, variant calling, and data storage. With these incorporated, the cost per genome is not that low, even for large projects.

I came across a wonderful opinion piece at Genome Biology, in which the authors discuss the evolution of sequencing and computing technologies over the past 60 years. Admittedly, I found it a bit daunting at first, because theories of computation and “conceptual frameworks” don’t excite me. Once I pushed past the organizing principle stuff, however, I found it contained some shrewd perspectives on the current state and near future of genomics.

Big Data: Large Scale Sequencing

[Figure: Big Data Sequencing. Credit: Muir et al, Genome Biology, 2016]

The rise of next-gen sequencing factors significantly in the big data paradigm for genomics. Rather than trot out the sequencing cost versus Moore’s law figure, the authors provided some compelling illustrations of the dramatic increase in the pace and quantity of sequencing. The most striking of these was a pie chart of the sequence data contributed by large-scale projects.

The Cancer Genome Atlas (TCGA) dwarfs everyone else, with 2,300 terabases of sequencing data. This is ten times the amount generated by the 1000 Genomes Project, and 30 times the amount in the Alzheimer’s Disease Sequencing Project (ADSP).

Costs and Economies of Scale

A key concept highlighted by the authors is the interplay between fixed and variable costs. The sequencing technologies utilized for the Human Genome Project had considerable up-front costs (i.e. instrument purchase) and somewhat fixed per-sample costs. In contrast, next-generation sequencing has a high up-front cost, but a reduced per-sample cost as volume increases. In other words, the more genomes we produce, the less they cost. True, this economy of scale has an upper limit, but the current throughput of an Illumina X Ten system — 18,000 human whole genomes per year — provides enormous capacity.
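The economy of scale is easy to see with a toy calculation: spread a one-time fixed cost over the number of genomes produced, then add the per-genome variable cost. The dollar figures below are placeholders I made up for illustration, not real instrument or reagent prices, and the function name is my own.

```python
def cost_per_genome(fixed_cost, variable_cost, n_genomes):
    """Average cost per genome when a one-time fixed cost is spread over n genomes."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return fixed_cost / n_genomes + variable_cost


# Placeholder figures for illustration only (not real quotes or list prices).
FIXED_COST = 10_000_000  # hypothetical instrument purchase, in dollars
VARIABLE_COST = 800      # hypothetical reagents and labor per genome, in dollars

for n in (1_000, 5_000, 18_000):  # 18,000 is the yearly X Ten throughput cited above
    each = cost_per_genome(FIXED_COST, VARIABLE_COST, n)
    print(f"{n:>6,} genomes: ${each:,.0f} each")
```

The average cost keeps falling as volume grows, but it can never drop below the variable cost, which is one way to think about the upper limit on this economy of scale.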

Interestingly, the opposite paradigm-shift is taking place in the computing industry. Until recently, the model for computing mirrored NGS: large up-front cost of buying the servers, but lower variable costs for running them. In some ways, this erected a barrier for smaller labs hoping to tackle complex problems, because they might not be able to afford enough computing equipment to handle the workload. Yet cloud computing and computing-as-a-service platforms have largely removed the need for that up-front investment. Anyone can buy as much computing power as they need on the Amazon or Google clouds. Although the variable cost (per CPU hour) is higher than that of a large data center, there’s no large fixed cost at the front end. As the authors put it:

This new regime, in which costs scale with the amount of computational processing time, places a premium on driving down the average cost by developing efficient algorithms for data processing.

As a bioinformatician, I think this is a good thing, because it forces us to improve our software tools and pipelines to become as efficient as possible.
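Here is a back-of-the-envelope sketch of that trade-off. It compares owning hardware (a large fixed cost plus a low hourly rate) with renting cloud capacity (no fixed cost, higher hourly rate) and finds the usage level where the two break even. All rates are hypothetical placeholders, not actual Amazon or Google pricing, and the function names are mine.

```python
def owned_cost(cpu_hours, fixed_cost, owned_rate):
    """Total cost of running on purchased servers."""
    return fixed_cost + owned_rate * cpu_hours


def cloud_cost(cpu_hours, cloud_rate):
    """Total cost of renting the same CPU hours, with no up-front purchase."""
    return cloud_rate * cpu_hours


def break_even_cpu_hours(fixed_cost, owned_rate, cloud_rate):
    """CPU hours at which owning and renting cost the same."""
    if cloud_rate <= owned_rate:
        raise ValueError("cloud never costs more; there is no break-even point")
    return fixed_cost / (cloud_rate - owned_rate)


# Hypothetical placeholder numbers, in dollars.
FIXED = 500_000     # server purchase
OWNED_RATE = 0.02   # per CPU hour on owned hardware
CLOUD_RATE = 0.05   # per CPU hour in the cloud

hours = break_even_cpu_hours(FIXED, OWNED_RATE, CLOUD_RATE)
print(f"Break-even at about {hours:,.0f} CPU hours")

# In the pay-as-you-go regime, a pipeline that needs half the CPU time
# costs half as much, which is the premium on efficient algorithms.
workload = 2_000_000
print(f"Cloud bill, original pipeline: ${cloud_cost(workload, CLOUD_RATE):,.0f}")
print(f"Cloud bill, 2x faster pipeline: ${cloud_cost(workload / 2, CLOUD_RATE):,.0f}")
```

Below the break-even point, a lab comes out ahead by renting. That is also the range where every CPU hour saved by better software shows up directly on the bill.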

Although cloud computing offers tremendous appeal, it faces some challenges for widespread adoption in our field. Most sequencing takes place in academic settings, where equipment purchases are often exempt from indirect fees (because the university can write off depreciation). Also, many investigators don’t have to pay for the basic utilities required to run computing equipment (e.g. electricity and cooling). These factors encourage us to stick with the traditional computing model, rather than shifting to cloud computing, which will be subject to indirect costs.

Breaking Down the Cost of Sequencing

[Figure: Sequencing cost breakdown (Muir et al, Genome Biology, 2016)]

We tend to measure the cost of sequencing as bases per dollar, or more recently, X dollars per genome. Both funding agencies and sequencing customers like to ask how much an exome or a genome costs. This single-price figure has some disadvantages:

  1. It’s not always clear what that dollar figure includes. Is it purely the sequencing run cost, or does it account for non-free things like sample assessment, handling, and bioinformatics analysis? Notice how they’re not included in the figure at right.
  2. It obscures the true cost breakdown of a sequencing project into its constituent parts, which complicates cost estimates and makes it harder to adapt to changes like the shift to cloud computing.
  3. It can lead to unrealistic expectations. People hear about this $1,000 genome, so they come to us for a whole-genome sequencing quote, and get upset when (1) it’s not that low, and (2) we have to add other costs, like sample handling, to the estimate.

Unrealistic expectations are a source of constant frustration for us. When we provide estimates for a sequencing project, we include analysis time as a recommended (but often not required) line item. Of course, no one wants to pay for analysis — they just want the sequencing. Sometimes this is just fine — we provide sequencing for a number of collaborators who are capable at NGS analysis. Other times, the customer later asks “How do I open this BAM file to see my variants?”

Sorry, but high-quality variant calls require analysis, and as I’ve written before, bioinformatics analysis is not free.

One thing that concerns me about the current state of federal funding (for sequencing) in the United States is that large-scale projects emphasize data production, not data analysis. The RFA for NHGRI’s large-scale sequencing program (CCDG) mandated that 80% of the budget go to data production. Yet as the authors of this opinion piece correctly point out:

As bioinformatics becomes increasingly important in the generation of biological insight from sequencing data, the long-term storage and analysis of sequencing data will represent a larger fraction of project cost.

I couldn’t agree more.

 

References

Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, & Gerstein M (2016). The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology, 17 (1). PMID: 27009100