The Genetic Architecture of Complex Disease

Genetics of complex disease (Fuchsberger et al, Nature 2016)

It’s no secret that while genome-wide association studies (GWAS) have implicated thousands of genetic loci in human phenotypes, the variants uncovered collectively explain only a fraction of the observed variance between individuals. The reasons for this “missing heritability” are a subject of vigorous debate in the scientific community. One possible explanation is that rare (low-frequency) variants — which are poorly represented on the arrays used for GWAS — underlie a substantial proportion of the variability.

This idea is intuitive: in theory, large-effect variants would be kept at low frequency by natural selection, a pattern that’s well established for mutations that cause rare single-gene disorders. It also makes a strong argument for large-scale sequencing in common complex disease, which is the purpose of the NHGRI’s flagship CCDG (Centers for Common Disease Genomics) program. The problem, of course, is that we can’t really understand the contribution of low-frequency variants to human disease without actually performing such an experiment.

A new study in this week’s issue of Nature represents one of the first and highest-profile attempts to do so for a common disease. Type II diabetes (T2D) affects 29 million people in the United States (according to the CDC), which is about 9.3% of the entire population. It also has a strong genetic component, and has thus been a priority GWAS target for over a decade. So far, GWAS efforts have reported 80 robust associations, largely involving common (MAF>5%) variants that have very small effects on disease risk.

In the current study, Christian Fuchsberger and his 300+ co-authors used a combination of genome sequencing, exome sequencing, genotyping, and imputation to examine the genetic architecture of type II diabetes. This report is the fruit of a years-long collaboration between two consortium efforts: GoT2D, which applied whole-genome sequencing to individuals of European ancestry, and T2D-GENES, which performed exome sequencing in multi-ethnic cohorts. Here’s a summary of the data generated:

Genome-wide Data (European ancestry)
  • Low-coverage (5x) whole-genome sequencing: 1,326 cases / 1,331 controls
  • Genotype imputation in 13 other cohorts: 11,645 cases / 32,769 controls
  • Total: 12,971 cases / 34,100 controls

Exome-centric Data (5 ancestry groups)
  • Deep (82x) exome sequencing: 6,504 cases / 6,436 controls
  • SNP array genotyping (2.5 million sites): 28,305 cases / 51,549 controls
  • Total: 34,809 cases / 57,985 controls

Genome Sequencing Coverage Matters

I think it’s important to point out the nuance of whole-genome sequencing coverage. Generally, we target 30x coverage for whole-genome sequencing of a germline (i.e. non-tumor) sample, which provides excellent power for variant detection. Some groups have touted 20x as a possible minimum threshold, and I’m comfortable with that.

But low-coverage (5x) whole genome sequencing is a whole different animal. WGS coverage isn’t uniform; it’s a distribution around the average: while many positions will have ~5x coverage, some will have 1-3x and others 7-10x. Even for this group of authors, which includes some of the top experts on NGS variant calling, that presents a significant challenge for variant detection.

Simply put, at 5x coverage, a number of rare and/or hard-to-call variants (e.g. SVs) will be missed.
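The shape of that coverage distribution is easy to sketch. Assuming read depth at each position is roughly Poisson-distributed around the mean (a simplifying assumption; real WGS coverage is somewhat overdispersed), a few lines of Python show what fraction of the genome falls below a depth usable for confident variant calling. The 4-read floor here is illustrative, not a threshold from the paper:

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(depth <= k) for Poisson-distributed coverage with mean lam."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

# Fraction of positions with depth < 4 reads (a rough floor for
# confidently calling a heterozygous variant)
frac_low_5x = poisson_cdf(3, 5.0)    # at 5x mean coverage
frac_low_30x = poisson_cdf(3, 30.0)  # at 30x mean coverage

print(f"5x WGS:  {frac_low_5x:.1%} of positions below 4 reads")
print(f"30x WGS: {frac_low_30x:.2e} of positions below 4 reads")
```

Under this toy model, roughly a quarter of the genome sits below 4 reads when the average is 5x, while at 30x the fraction is negligible. That is the crux of why rare-variant detection suffers at low coverage.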

Useful WGS Metrics

In spite of my concerns, I’m a sucker for summary metrics in large-scale WGS datasets. Here are some highlights from low-coverage WGS of 2,657 European-ancestry individuals:

  • 26.7 million variants were detected, genotyped, and phased, including 1.5 million small indels and 8,876 large (>100 bp) deletions.
  • Individuals harbored an average of 3.30 million genetic variants, including 271,245 indels and 669 large deletions.
  • 420,473 common SNVs and 2.4 million low-frequency SNVs were poorly tagged by genotype arrays (r-squared < 0.30), and thus haven’t been interrogated by any T2D GWAS to date.
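For readers unfamiliar with tagging, the r-squared in that last bullet is the squared correlation between alleles at two sites across phased haplotypes. A minimal sketch with made-up haplotypes (not data from the paper) shows why a rare variant is hard to tag with a common array SNP:

```python
def ld_r2(hap_a, hap_b):
    """Squared LD correlation between two biallelic sites, given
    phased haplotypes coded 0/1 (one entry per haplotype)."""
    n = len(hap_a)
    p_a = sum(hap_a) / n   # allele frequency at site A
    p_b = sum(hap_b) / n   # allele frequency at site B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # joint frequency
    d = p_ab - p_a * p_b   # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# A common variant riding on the same haplotypes as the tag SNP: perfectly tagged
print(ld_r2([1, 1, 0, 0], [1, 1, 0, 0]))        # 1.0

# A rare variant (1/8 haplotypes) on a common-allele background: poorly tagged
print(ld_r2([1, 0, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 0, 0, 0, 0]))          # ~0.14, below the 0.30 cutoff
```

The rare variant sits on only one of the four haplotypes carrying the tag allele, so the tag SNP captures little of its signal; no array design can fix that.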

Genome-wide Single-variant Associations

The primary association analysis uncovered 126 T2D-associated variants at four loci, three of which were already known. EML4 was novel, but when the authors imputed sequencing variants into a much larger sample collection (44,414 individuals from 17 other studies), the association didn’t hold up. Another novel signal (CENPW) did appear, and this one was replicated in an independent cohort.

Associations with T2D (Fuchsberger et al, Nature 2016)

In summary, the meta-analysis of sequencing and imputed data examined 26.7 million variants in over 47,000 individuals of European ancestry. That’s a massive association study with extremely high resolution, yet it recapitulated only 13/80 loci (16%) known to be “robustly associated” with T2D, and uncovered only one new locus. I find that a bit discouraging, and I’m sure the authors did, too.

Coding Variation in Type II Diabetes

The analysis of exome data fared little better, I’m afraid. The authors combined exome sequencing data from 10,437 individuals representing five ancestry groups (European, South Asian, East Asian, Hispanic, and African American) with equivalent data from the WGS study for a joint dataset comprising 12,940 individuals. They identified:

  • 3.04 million variants overall, of which 1.19 million were protein-altering
  • ~9,243 synonymous, 7,636 missense, and 250 protein-truncating variants per individual

Single-variant testing yielded only one significant result (PAX4 p.Arg192His, a.k.a. rs2233580), which was observed only in East Asian individuals. Gene-level aggregation testing yielded no exome-wide significant findings. Limiting the analysis to 634 genes in known associated loci uncovered an association (FES in South Asians, driven by a single likely-causal variant) that met the more forgiving significance threshold.

To increase power, the authors integrated SNP genotypes from 2.5 million sites in about 79,000 additional cases and controls (all European ancestry) obtained using a custom Illumina SNP chip. Integrating these with the exome data yielded an exome-centric dataset of more than 90,000 individuals. Some 18 variants at 13 loci exceeded genome-wide significance, but all were common (MAF>5%), and only one (MTMR3) was outside of known GWAS loci.

No Evidence for Synthetic Association

Back in 2010, Goldstein and colleagues proposed the concept of “synthetic association”: the idea that common GWAS signals may be driven by individually rare causal variants that cluster on certain common haplotypes. This would offer an intriguing explanation for the fact that most lead GWAS hits lie outside of coding regions: nearby rare causal variants in LD with the tag SNP, rather than the tag SNP itself, could exert the causal effect on disease risk. If so, sequencing in GWAS regions should reveal those causal variants.

The authors tested this hypothesis in T2D using the WGS dataset of 2,657 individuals, which they describe as having “near-complete ascertainment of genetic variation.” They took the 10 T2D GWAS loci with the strongest support in their study and looked for low-frequency missense variants within 2.5 million base pairs of the common index SNV. None of the loci showed supporting evidence of synthetic association, and 8/10 were convincingly inconsistent with the proposed phenomenon.

Thus, while synthetic association might well underlie common GWAS signals for other phenotypes, it does not appear to do so for T2D.

The Contribution of Rare and Common Variants

To model the disease architecture of T2D, the authors conducted an elegant experiment. They simulated three possible models which had seemed plausible prior to large-scale sequencing, and computed the number of associated low-frequency and rare variants that would be uncovered with their study design.

Simulated models of T2D genetics (Fuchsberger et al, Nature 2016)

In the first two models, low-frequency variants explain a significant proportion of the heritability, and over a hundred of them should have been uncovered at the more forgiving significance threshold. In a third model, where rare variants make a minority contribution, they’d uncover only a few dozen.

Actual results for T2D (Fuchsberger et al, Nature 2016)

Next, the authors compared these predictions to their actual results. Only 23 low-frequency and rare variants achieved significance, nowhere close to the first two models (the ones suggesting a major role). The observed result best matches the common polygenic model, meaning this study supports only a minor role for rare and low-frequency variants in T2D.

In Summary

Overall, I found this to be a comprehensive and extremely well-written paper of the caliber we’d expect to see in Nature. It represents years of work by more than 300 contributing authors, and it’s probably the first of many such studies to come. While the number of new discoveries may be a tad disappointing, the authors have uncovered novel loci and secondary signals. They’ve also done a great deal to shed light on the genetic architecture of this common complex disease, particularly as far as coding variants are concerned.


We will need, and I hope to see, many efforts like this to understand the genetic architecture of other diseases and important human traits.

Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, Rivas MA, Perry JR, Sim X, Blackwell TW, Robertson NR, Rayner NW, Cingolani P, Locke AE, Tajes JF, Highland HM, Dupuis J, Chines PS, Lindgren CM, Hartl C, Jackson AU, Chen H, Huyghe JR, van de Bunt M, Pearson RD, Kumar A, Müller-Nurasyid M, Grarup N, Stringham HM, Gamazon ER, Lee J, Chen Y, Scott RA, Below JE, Chen P, Huang J, Go MJ, Stitzel ML, Pasko D, Parker SC, Varga TV, Green T, Beer NL, Day-Williams AG, Ferreira T, Fingerlin T, Horikoshi M, Hu C, Huh I, Ikram MK, Kim BJ, Kim Y, Kim YJ, Kwon MS, Lee J, Lee S, Lin KH, Maxwell TJ, Nagai Y, Wang X, Welch RP, Yoon J, Zhang W, Barzilai N, Voight BF, Han BG, Jenkinson CP, Kuulasmaa T, Kuusisto J, Manning A, Ng MC, Palmer ND, Balkau B, Stančáková A, Abboud HE, Boeing H, Giedraitis V, Prabhakaran D, Gottesman O, Scott J, Carey J, Kwan P, Grant G, Smith JD, Neale BM, Purcell S, Butterworth AS, Howson JM, Lee HM, Lu Y, Kwak SH, Zhao W, Danesh J, Lam VK, Park KS, Saleheen D, So WY, Tam CH, Afzal U, Aguilar D, Arya R, Aung T, Chan E, Navarro C, Cheng CY, Palli D, Correa A, Curran JE, Rybin D, Farook VS, Fowler SP, Freedman BI, Griswold M, Hale DE, Hicks PJ, Khor CC, Kumar S, Lehne B, Thuillier D, Lim WY, Liu J, van der Schouw YT, Loh M, Musani SK, Puppala S, Scott WR, Yengo L, Tan ST, Taylor HA Jr, Thameem F, Wilson G, Wong TY, Njølstad PR, Levy JC, Mangino M, Bonnycastle LL, Schwarzmayr T, Fadista J, Surdulescu GL, Herder C, Groves CJ, Wieland T, Bork-Jensen J, Brandslund I, Christensen C, Koistinen HA, Doney AS, Kinnunen L, Esko T, Farmer AJ, Hakaste L, Hodgkiss D, Kravic J, Lyssenko V, Hollensted M, Jørgensen ME, Jørgensen T, Ladenvall C, Justesen JM, Käräjämäki A, Kriebel J, Rathmann W, Lannfelt L, Lauritzen T, Narisu N, Linneberg A, Melander O, Milani L, Neville M, Orho-Melander M, Qi L, Qi Q, Roden M, Rolandsson O, Swift A, Rosengren AH, Stirrups K, Wood AR, Mihailov E, Blancher C, Carneiro MO, 
Maguire J, Poplin R, Shakir K, Fennell T, DePristo M, Hrabé de Angelis M, Deloukas P, Gjesing AP, Jun G, Nilsson P, Murphy J, Onofrio R, Thorand B, Hansen T, Meisinger C, Hu FB, Isomaa B, Karpe F, Liang L, Peters A, Huth C, O’Rahilly SP, Palmer CN, Pedersen O, Rauramaa R, Tuomilehto J, Salomaa V, Watanabe RM, Syvänen AC, Bergman RN, Bharadwaj D, Bottinger EP, Cho YS, Chandak GR, Chan JC, Chia KS, Daly MJ, Ebrahim SB, Langenberg C, Elliott P, Jablonski KA, Lehman DM, Jia W, Ma RC, Pollin TI, Sandhu M, Tandon N, Froguel P, Barroso I, Teo YY, Zeggini E, Loos RJ, Small KS, Ried JS, DeFronzo RA, Grallert H, Glaser B, Metspalu A, Wareham NJ, Walker M, Banks E, Gieger C, Ingelsson E, Im HK, Illig T, Franks PW, Buck G, Trakalo J, Buck D, Prokopenko I, Mägi R, Lind L, Farjoun Y, Owen KR, Gloyn AL, Strauch K, Tuomi T, Kooner JS, Lee JY, Park T, Donnelly P, Morris AD, Hattersley AT, Bowden DW, Collins FS, Atzmon G, Chambers JC, Spector TD, Laakso M, Strom TM, Bell GI, Blangero J, Duggirala R, Tai ES, McVean G, Hanis CL, Wilson JG, Seielstad M, Frayling TM, Meigs JB, Cox NJ, Sladek R, Lander ES, Gabriel S, Burtt NP, Mohlke KL, Meitinger T, Groop L, Abecasis G, Florez JC, Scott LJ, Morris AP, Kang HM, Boehnke M, Altshuler D, & McCarthy MI (2016). The genetic architecture of type 2 diabetes. Nature PMID: 27398621

Transitions and Excuses

My sincere apologies to the dedicated MassGenomics readers who’ve noticed the recent decline in new posts here. It’s a busy and tumultuous time for our institute.

Leadership Transition at MGI

For those who missed the announcement earlier this month: our center’s director Rick Wilson and co-director Elaine Mardis announced that they’re leaving to establish a new Institute for Genomic Medicine at Nationwide Children’s Hospital / Ohio State University in Columbus, Ohio.

We are still figuring out the transition plan, but the Washington University School of Medicine remains very committed to supporting our center and the people who work here. In other words, the McDonnell Genome Institute will continue on.

Large-scale Sequencing Opportunities

In the meantime, we are in the midst of large-scale sequencing efforts for the Centers for Common Disease Genomics (CCDG), Alzheimer’s Disease Sequencing Project (ADSP), and Gabriella Miller Kids First (GMKF) initiatives. These are all ambitious projects in which I’m intimately involved, which means they consume a lot of my time. On the bright side, they keep me at the forefront of genomics, where I can continue to be useful to you.

Important note for fellow scientists: Even with our current commitments, the HiSeq X Ten remains a hungry beast, so please get in touch if you’re looking for low-cost genome sequencing. With the X Ten and other instruments, we can provide custom-targeted, exome, whole genome, and/or transcriptome sequencing for humans and model organisms.

More Science Fiction

Last but not least, some personal news that may help explain why I’ve had less free time to write on MassGenomics. As you know, Harper Voyager (an imprint of HarperCollins) published my debut novel earlier this year. I’m thrilled to announce that I’ve accepted an offer from my publisher for two more books, effectively making The Rogue Retrieval into a trilogy.


All of you have been enormously supportive of my science fiction writing as well as my science writing, and I hope that will continue.

Once the dust settles from this transition period, I should be posting on a more regular schedule. So please stick around!

The Real Cost of Sequencing

The real cost of sequencing is as hard to pin down as a sumo wrestler. Working in a large-scale sequencing laboratory offers an interesting perspective on the duality of the so-called “cost per genome.” On one hand, we see certain equipment manufacturers and many people in the media tossing around claims that sequencing a genome now costs under $1,000. On the other, we write grant budgets and estimates based on actual costs, which include things like sample assessment, variant calling, and data storage. With these incorporated, the cost per genome is not that low, even for large projects.

I came across a wonderful opinion piece at Genome Biology, in which the authors discuss the evolution of sequencing and computing technologies over the past 60 years. Admittedly, I found it a bit daunting at first, because theories of computation and “conceptual frameworks” don’t excite me. Once I pushed past the organizing principle stuff, however, I found it contained some shrewd perspectives on the current state and near future of genomics.

Big Data: Large Scale Sequencing


Credit: Muir et al, Genome Biology, 2016

The rise of next-gen sequencing factors significantly in the big data paradigm for genomics. Rather than trot out the sequencing cost versus Moore’s law figure, the authors provided some compelling illustrations of the dramatic increase in the pace and quantity of sequencing. The most striking of these was a pie chart of the sequence data contributed by large-scale projects.

The Cancer Genome Atlas (TCGA) dwarfs everyone else, with 2,300 terabases of sequencing data. This is ten times the amount generated by the 1000 Genomes Project, and 30 times the amount in the Alzheimer’s Disease Sequencing Project (ADSP).

Costs and Economies of Scale

A key concept highlighted by the authors is the interplay between fixed and variable costs. The sequencing technologies used for the Human Genome Project had considerable up-front costs (i.e. instrument purchase) and relatively fixed per-sample costs. In contrast, next-generation sequencing has a high up-front cost but a per-sample cost that falls as volume increases. In other words, the more genomes we produce, the less each one costs. True, this economy of scale has an upper limit, but the current throughput of an Illumina X Ten system (18,000 human whole genomes per year) provides enormous capacity.
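That interplay is easy to make concrete. With hypothetical numbers (a $1M instrument amortized across the genomes it produces, plus $800 in per-genome consumables and labor; neither figure is from the article), the per-genome cost falls toward the variable cost as volume grows:

```python
INSTRUMENT_COST = 1_000_000   # hypothetical up-front (fixed) cost, USD
PER_GENOME_COST = 800         # hypothetical variable cost per genome, USD

def cost_per_genome(n_genomes):
    """Average cost per genome once the fixed cost is amortized."""
    return INSTRUMENT_COST / n_genomes + PER_GENOME_COST

for n in (100, 1_000, 18_000):
    print(f"{n:>6} genomes/year: ${cost_per_genome(n):,.0f} per genome")
```

At low volume the fixed cost dominates; at X Ten-scale throughput (18,000 genomes per year), the per-genome cost approaches the variable cost, which is exactly the economy of scale the authors describe.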

Interestingly, the opposite paradigm-shift is taking place in the computing industry. Until recently, the model for computing mirrored NGS: large up-front cost of buying the servers, but lower variable costs for running them. In some ways, this erected a barrier for smaller labs hoping to tackle complex problems, because they might not be able to afford enough computing equipment to handle the workload. Yet cloud computing and computing-as-a-service platforms have largely removed the need for that up-front investment. Anyone can buy as much computing power as they need on the Amazon or Google clouds. Although the variable cost (per CPU hour) is higher than that of a large data center, there’s no large fixed cost at the front end. As the authors put it:

This new regime, in which costs scale with the amount of computational processing time, places a premium on driving down the average cost by developing efficient algorithms for data processing.

As a bioinformatician, I think this is a good thing, because it forces us to improve our software tools and pipelines to become as efficient as possible.

Although cloud computing offers tremendous appeal, it faces some challenges for widespread adoption in our field. Most sequencing takes place in academic settings, where equipment purchases are often exempt from indirect fees (because the university can write off depreciation). Also, many investigators don’t have to pay for the basic utilities required to run computing equipment (e.g. electricity and cooling). These factors encourage us to stick with the traditional computing model rather than shifting to cloud computing, which would be subject to indirect costs.

Breaking Down the Cost of Sequencing


Muir et al, Genome Biology, 2016

We tend to measure the cost of sequencing as bases per dollar, or more recently, X dollars per genome. Both funding agencies and sequencing customers like to ask how much an exome or a genome costs. This single-price figure has some disadvantages:

  1. It’s not always clear what that dollar figure includes. Is it purely the sequencing run cost, or does it account for non-free things like sample assessment, handling, and bioinformatics analysis? Notice how they’re not included in the figure at right.
  2. It obscures the true cost breakdown of a sequencing project into its constituent parts, which complicates cost estimates and makes it harder to adapt to changes like the shift to cloud computing.
  3. It can lead to unrealistic expectations. People hear about this $1,000 genome, so they come to us for a whole-genome sequencing quote, and get upset when (1) it’s not that low, and (2) we have to add other costs, like sample handling, to the estimate.

Unrealistic expectations are a source of constant frustration for us. When we provide estimates for a sequencing project, we include analysis time as a recommended (but often not required) line item. Of course, no one wants to pay for analysis — they just want the sequencing. Sometimes this is just fine — we provide sequencing for a number of collaborators who are capable at NGS analysis. Other times, the customer later asks “How do I open this BAM file to see my variants?”

Sorry, but high-quality variant calls require analysis, and as I’ve written before, bioinformatics analysis is not free.

One thing that concerns me about the current state of federal funding (for sequencing) in the United States is that large-scale projects emphasize data production, not data analysis. The RFA for NHGRI’s large-scale sequencing program (CCDG) mandated that 80% of the budget go to data production. Yet as the authors of this opinion piece correctly point out:

As bioinformatics becomes increasingly important in the generation of biological insight from sequencing data, the long-term storage and analysis of sequencing data will represent a larger fraction of project cost.

I couldn’t agree more.



Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, & Gerstein M (2016). The real cost of sequencing: scaling computation to keep pace with data generation. Genome biology, 17 (1) PMID: 27009100

Why Do We Need Sequencing When There’s Exome Chip?

In my last post, I reviewed a genome-wide association study highlighting the importance of rare genetic variants in complex disease, specifically age-related macular degeneration (AMD). Notably, that GWAS was conducted using a custom high-throughput SNP array with classic GWAS variants (tag SNPs), a catalog of known protein-altering variants (exome chip) and several custom sets based on prior studies of AMD:

  1. All variants correlated with replicated GWAS hits for AMD
  2. Protein-altering variants within 500 kb of 22 “index SNPs” uncovered in targeted sequencing of GWAS loci
  3. Virtually all known variants in ABCA4 (in which recessive-acting mutations cause Stargardt disease), independent of consequence
  4. Predicted cysteine-altering substitutions in TIMP3, because the known cysteine mutations cause an AMD-like phenotype.

Altogether, the authors examined 440,000 unique variants in more than 43,000 samples (cases & controls). The genotyped markers accounted for 47% of variability in advanced AMD risk. Some of the associated variants were super rare (MAF<1%), suggesting that genotyping studies like this are well-powered to detect associations even at allele frequencies below one percent. Which leads some researchers to ask a difficult question:

Why Do We Need Sequencing?

Despite the plummeting costs afforded by newer instruments, sequencing studies remain far more expensive than genotyping studies: exome sequencing costs 3-5x more, and whole-genome sequencing costs 15-20x more. For genetic studies of common complex disease, many researchers now consider 10,000 samples the absolute minimum. A cohort like the one in the AMD study (44,000 samples) probably costs $2.2 million to genotype, compared with sequencing costs of $9.7 million (exome) to $45 million (whole genome).
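The per-sample arithmetic behind those figures is straightforward; a quick sketch using the dollar totals quoted above recovers the cost multiples:

```python
SAMPLES = 44_000
genotyping_total = 2_200_000    # custom SNP array
exome_total = 9_700_000         # exome sequencing
wgs_total = 45_000_000          # whole-genome sequencing

for label, total in [("genotyping", genotyping_total),
                     ("exome", exome_total),
                     ("whole genome", wgs_total)]:
    per_sample = total / SAMPLES
    multiple = total / genotyping_total
    print(f"{label:>12}: ${per_sample:,.0f}/sample ({multiple:.1f}x genotyping)")
```

Genotyping works out to about $50/sample, with exome around 4.4x and whole genome around 20x that price, consistent with the ranges above.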

Sure, you could do fewer samples, but then you lose the power for detecting association in the lowest-frequency variant classes.

Despite these economic challenges, I believe there are several strong arguments for sequencing rather than genotyping.

1. Rare and private variants

No matter how comprehensive a SNP chip might be, the design still relies on known variant positions. The incredible growth of databases like dbSNP, fueled by large-scale discovery sequencing studies, certainly provides millions of variants to choose from. Yet the fact remains that 2-5% of the genetic variants in an individual genome are novel with respect to public databases. These are generally super-rare variants, which is precisely why they haven’t been catalogued yet. They might be private to a family (inherited) or even to an individual (de novo). SNP arrays will always miss this class of variants, because their positions and alleles aren’t known before the experiment.

2. Large-scale variation

The argument for SNP arrays also conveniently ignores larger genomic variants: insertions, deletions, duplications, inversions, and other rearrangements. While far less prevalent than SNPs, structural variants (SVs) can affect more bases in an individual’s genome simply because of their size. They are generally not amenable to high-throughput array designs because of that size and the imprecise nature of their boundaries. Although common SVs may be tagged reasonably well by SNP markers, rare SVs will not be.

A significant proportion of SVs affect known protein-coding genes by altering coding sequence or gene dose. Although SV detection by short-read sequencing is by no means a solved problem, this class of variation is missed entirely by a SNP chip design.

3. Regulatory sequences and functional elements

A fundamental weakness of exome chip designs (and exome sequencing, for that matter) is the emphasis on known genes. Undoubtedly, many (if not most) of the variants underlying complex phenotypes are located outside of the 1.5% of our genome that codes for proteins. We expect that common variants in such elements will be well interrogated by classic GWAS approaches, but rare variants will not. And we don’t yet know enough about the regulatory regions of the genome to select variants for a custom array.

4. Aggregation Tests

Aggregation tests (sometimes called burden tests) were developed to identify genetic associations driven by variants that are individually too rare to reach statistical significance. The theory and approaches of aggregation testing are too broad a topic to cover here, but the general concept is this: by grouping individual rare variants into a biological unit (most often a gene or exon), it’s possible to test the super-genotype (i.e. “has a rare variant in gene A”) for association. Short of examining hundreds of thousands of samples, this is the only way to identify very rare trait-associated variants.

Although they rely on certain assumptions, such as the ability to define which variants truly affect gene function, aggregation tests have another thing going for them. The collective association of multiple independent variants in one locus strongly suggests that the locus itself is the functional element responsible for the association. In other words, while common associated variants (tag SNPs) tell us the region of the genome where variation seems to affect disease risk, aggregation tests identify the potential drug target.
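The collapsing idea is simple enough to sketch. Assuming we’ve already decided which variants count as “qualifying” (the hard part in practice), a minimal carrier-based burden test reduces to a 2x2 table; here a one-sided hypergeometric tail stands in for the more sophisticated tests (SKAT, burden regression) actually used in studies like this:

```python
from math import comb

def burden_test(case_genos, control_genos):
    """One-sided test for enrichment of rare-variant carriers in cases.

    Each genotype list holds per-individual counts of qualifying rare
    alleles in a gene; anyone with >= 1 qualifying allele is a carrier.
    Returns (case carriers, control carriers, hypergeometric p-value).
    """
    case_carriers = sum(1 for g in case_genos if g > 0)
    ctrl_carriers = sum(1 for g in control_genos if g > 0)
    n_cases = len(case_genos)
    n_total = len(case_genos) + len(control_genos)
    k_carriers = case_carriers + ctrl_carriers
    # P(X >= case_carriers) if carriers were distributed at random
    p = sum(comb(k_carriers, k) * comb(n_total - k_carriers, n_cases - k)
            for k in range(case_carriers, min(k_carriers, n_cases) + 1)
            ) / comb(n_total, n_cases)
    return case_carriers, ctrl_carriers, p

# Toy gene: 3 of 4 cases carry a qualifying rare allele, 0 of 4 controls do
cases, ctrls, p = burden_test([1, 1, 2, 0], [0, 0, 0, 0])
print(cases, ctrls, round(p, 4))   # 3 0 0.0714
```

A real analysis would adjust for covariates and allow protective effects, but the grouping logic (collapse to a super-genotype, then test it) is the same.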

You Don’t Know What You’re Missing

All of this points to the heart of the problem with genotyping over sequencing studies: many of the most interesting classes of variation are overlooked. These may or may not play a role in disease risk, but you don’t know because the GWAS didn’t interrogate them. This is especially worrisome for diseases in which the genetic architecture of risk remains poorly understood. A well-powered GWAS that produces no hits might mean that genetic variation has little influence on the phenotype. But it might also mean that rare or non-SNP variants govern the genetic basis of disease.

We won’t know until we survey them all, and genome sequencing remains the only way to do that.