Transitions and Excuses

My sincere apologies to the dedicated MassGenomics readers who’ve noticed the recent decline in new posts here. It’s a busy and tumultuous time for our institute.

Leadership Transition at MGI

For those who missed the announcement earlier this month: our center’s director Rick Wilson and co-director Elaine Mardis announced that they’re leaving to establish a new Institute for Genomic Medicine at Nationwide Children’s Hospital / Ohio State University in Columbus, Ohio.

We are still figuring out the transition plan, but the Washington University School of Medicine remains very committed to supporting our center and the people who work here. In other words, the McDonnell Genome Institute will continue on.

Large-scale Sequencing Opportunities

In the meantime, we are in the midst of large-scale sequencing efforts for the Centers for Common Disease Genomics (CCDG), Alzheimer’s Disease Sequencing Project (ADSP), and Gabriella Miller Kids First (GMKF) initiatives. These are all ambitious projects in which I’m intimately involved, which means they consume a lot of my time. On the bright side, they keep me at the forefront of genomics where I can continue to be useful to you.

Important note for fellow scientists: Even with our current commitments, the HiSeq X Ten remains a hungry beast, so please get in touch if you’re looking for low-cost genome sequencing. With the X Ten and other instruments, we can provide custom-targeted, exome, whole genome, and/or transcriptome sequencing for humans and model organisms.

More Science Fiction

Last but not least, some personal news that may help explain why I’ve had less free time to write on MassGenomics. As you know, Harper Voyager (an imprint of HarperCollins) published my debut novel earlier this year. I’m thrilled to announce that I’ve accepted an offer from my publisher for two more books, effectively making The Rogue Retrieval into a trilogy.


All of you have been enormously supportive of my science fiction writing as well as my science writing, and I hope that will continue.

Once the dust settles from this transition period, I should be posting at a more regular schedule. So please stick around!

The Real Cost of Sequencing

The real cost of sequencing is as hard to pin down as a sumo wrestler. Working in a large-scale sequencing laboratory offers an interesting perspective on the duality of the so-called “cost per genome.” On one hand, we see certain equipment manufacturers and many people in the media tossing around claims that sequencing a genome now costs under $1,000. On the other, we write grant budgets and estimates based on actual costs, which include things like sample assessment, variant calling, and data storage. With these incorporated, the cost per genome is not that low, even for large projects.

I came across a wonderful opinion piece in Genome Biology, in which the authors discuss the evolution of sequencing and computing technologies over the past 60 years. Admittedly, I found it a bit daunting at first, because theories of computation and “conceptual frameworks” don’t excite me. Once I pushed past the organizing principle stuff, however, I found it contained some shrewd perspectives on the current state and near future of genomics.

Big Data: Large Scale Sequencing

Figure: Big data sequencing (credit: Muir et al, Genome Biology, 2016)

The rise of next-gen sequencing factors significantly in the big data paradigm for genomics. Rather than trot out the sequencing cost versus Moore’s law figure, the authors provided some compelling illustrations of the dramatic increase in the pace and quantity of sequencing. The most striking of these was a pie chart of the sequence data contributed by large-scale projects.

The Cancer Genome Atlas (TCGA) dwarfs everyone else, with 2,300 terabases of sequencing data. This is ten times the amount generated by the 1000 Genomes Project, and 30 times the amount in the Alzheimer’s Disease Sequencing Project (ADSP).

Costs and Economies of Scale

A key concept highlighted by the authors is the interplay between fixed and variable costs. The sequencing technologies utilized for the Human Genome Project had considerable up-front costs (i.e. instrument purchase) and somewhat fixed per-sample costs. In contrast, next-generation sequencing has a high up-front cost, but a reduced per-sample cost as volume increases. In other words, the more genomes we produce, the less they cost. True, this economy of scale has an upper limit, but the current throughput of an Illumina X Ten system — 18,000 human whole genomes per year — provides enormous capacity.
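To make the economy of scale concrete, here is a minimal sketch of the average cost per genome when a fixed up-front cost is spread over more and more samples. The dollar amounts are hypothetical placeholders chosen for illustration, not real instrument or reagent prices, and the function name is my own.

```python
def cost_per_genome(fixed_cost, variable_cost, n_genomes):
    """Average cost per genome: the fixed cost amortized over n genomes,
    plus the per-sample (variable) cost."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return fixed_cost / n_genomes + variable_cost


# Hypothetical placeholder figures, not real prices.
FIXED = 10_000_000   # up-front instrument purchase
VARIABLE = 800       # reagents and labor per genome

for n in (1_000, 5_000, 18_000):
    avg = cost_per_genome(FIXED, VARIABLE, n)
    print(f"{n:>6} genomes: ${avg:,.0f} each")
```

The average falls toward the variable cost as volume grows but never drops below it, which is the upper limit on the economy of scale mentioned above.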

Interestingly, the opposite paradigm shift is taking place in the computing industry. Until recently, the model for computing mirrored NGS: large up-front cost of buying the servers, but lower variable costs for running them. In some ways, this erected a barrier for smaller labs hoping to tackle complex problems, because they might not be able to afford enough computing equipment to handle the workload. Yet cloud computing and computing-as-a-service platforms have largely removed the need for that up-front investment. Anyone can buy as much computing power as they need on the Amazon or Google clouds. Although the variable cost (per CPU hour) is higher than that of a large data center, there’s no large fixed cost at the front end. As the authors put it:

This new regime, in which costs scale with the amount of computational processing time, places a premium on driving down the average cost by developing efficient algorithms for data processing.

As a bioinformatician, I think this is a good thing, because it forces us to improve our software tools and pipelines to become as efficient as possible.
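The trade-off between the two computing models can be sketched in a few lines. The rates below are hypothetical placeholders (not actual Amazon or Google pricing), and the function names are mine. The point is only that owned hardware wins above some break-even number of CPU hours, while in the cloud every CPU hour saved by a more efficient algorithm is money saved.

```python
def owned_cost(cpu_hours, fixed_cost, hourly_cost):
    """Total cost of owned servers: up-front purchase plus running costs."""
    return fixed_cost + hourly_cost * cpu_hours


def cloud_cost(cpu_hours, hourly_rate):
    """Total cost in the cloud: scales directly with CPU hours used."""
    return hourly_rate * cpu_hours


def break_even_hours(fixed_cost, owned_hourly, cloud_hourly):
    """CPU hours above which owning is cheaper than renting.
    Returns None if the cloud rate is not higher than the owned rate."""
    if cloud_hourly <= owned_hourly:
        return None
    return fixed_cost / (cloud_hourly - owned_hourly)


# Hypothetical placeholder rates, for illustration only.
FIXED, OWNED_RATE, CLOUD_RATE = 200_000, 0.01, 0.05

print(f"Break-even: {break_even_hours(FIXED, OWNED_RATE, CLOUD_RATE):,.0f} CPU hours")

# A pipeline that needs half the CPU hours costs half as much in the cloud.
for hours in (1_000_000, 500_000):
    print(f"{hours:>9,} CPU hours in the cloud: ${cloud_cost(hours, CLOUD_RATE):,.0f}")
```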

Although cloud computing offers tremendous appeal, it faces some challenges for widespread adoption in our field. Most sequencing takes place in academic settings, where equipment purchases are often exempt from indirect fees (because the university can write off depreciation). Also, many investigators don’t have to pay for the basic utilities required to run computing equipment (e.g. electricity and cooling). These factors encourage us to stick with the traditional computing model, rather than shifting to cloud computing, which would be subject to indirect costs.

Breaking Down the Cost of Sequencing

Figure: Sequencing cost breakdown (Muir et al, Genome Biology, 2016)

We tend to measure the cost of sequencing as bases per dollar, or more recently, X dollars per genome. Both funding agencies and sequencing customers like to ask how much an exome or a genome costs. This single-price figure has some disadvantages:

  1. It’s not always clear what that dollar figure includes. Is it purely the sequencing run cost, or does it account for non-free things like sample assessment, handling, and bioinformatics analysis? Notice that these are not included in the figure above.
  2. It obscures the true cost breakdown of a sequencing project into its constituent parts, which complicates cost estimates and makes it harder to adapt to changes like the shift to cloud computing.
  3. It can lead to unrealistic expectations. People hear about this $1,000 genome, so they come to us for a whole-genome sequencing quote, and get upset when (1) it’s not that low, and (2) we have to add other costs, like sample handling, to the estimate.

Unrealistic expectations are a source of constant frustration for us. When we provide estimates for a sequencing project, we include analysis time as a recommended (but often not required) line item. Of course, no one wants to pay for analysis; they just want the sequencing. Sometimes this is just fine, because we provide sequencing for a number of collaborators who are adept at NGS analysis. Other times, the customer later asks “How do I open this BAM file to see my variants?”

Sorry, but high-quality variant calls require analysis, and as I’ve written before, bioinformatics analysis is not free.

One thing that concerns me about the current state of federal funding (for sequencing) in the United States is that large-scale projects emphasize data production, not data analysis. The RFA for NHGRI’s large-scale sequencing program (CCDG) mandated that 80% of the budget go to data production. Yet as the authors of this opinion piece correctly point out:

As bioinformatics becomes increasingly important in the generation of biological insight from sequencing data, the long-term storage and analysis of sequencing data will represent a larger fraction of project cost.

I couldn’t agree more.



Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, & Gerstein M (2016). The real cost of sequencing: scaling computation to keep pace with data generation. Genome biology, 17 (1) PMID: 27009100

Why Do We Need Sequencing When There’s Exome Chip?

In my last post, I reviewed a genome-wide association study highlighting the importance of rare genetic variants in complex disease, specifically age-related macular degeneration (AMD). Notably, that GWAS was conducted using a custom high-throughput SNP array with classic GWAS variants (tag SNPs), a catalog of known protein-altering variants (exome chip) and several custom sets based on prior studies of AMD:

  1. All variants correlated with replicated GWAS hits for AMD
  2. Protein-altering variants within 500 kb of 22 “index SNPs” uncovered in targeted sequencing of GWAS loci
  3. Virtually all known variants in ABCA4 (in which recessive-acting mutations cause Stargardt disease), independent of consequence
  4. Predicted cysteine-altering substitutions in TIMP3, because the known cysteine mutations cause an AMD-like phenotype.

Altogether, the authors examined 440,000 unique variants in more than 43,000 samples (cases & controls). The genotyped markers accounted for 47% of variability in advanced AMD risk. Some of the associated variants were super rare (MAF<1%), suggesting that genotyping studies like this are well-powered to detect associations even at allele frequencies below one percent. Which leads some researchers to ask a difficult question:

Why Do We Need Sequencing?

Despite the plummeting costs afforded by newer instruments, sequencing studies remain far more expensive than genotyping studies: exome sequencing costs 3-5x more, and whole genome sequencing costs 15-20x more. For genetic studies of common complex disease, many researchers now consider 10,000 samples the absolute minimum. A cohort like the one in the AMD study (44,000 samples) probably costs $2.2 million to genotype, compared to sequencing costs of $9.7 million (exome) to $45 million (whole genome).
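For anyone who wants to check the arithmetic, the totals quoted above can be turned into per-sample costs and fold differences with a short script. Only the sample count and the three dollar totals come from the text; the variable names are my own.

```python
N_SAMPLES = 44_000

# Total project costs quoted above, in dollars.
TOTAL_COST = {
    "genotyping": 2_200_000,
    "exome": 9_700_000,
    "whole genome": 45_000_000,
}

# Cost per sample for each assay.
per_sample = {assay: total / N_SAMPLES for assay, total in TOTAL_COST.items()}

# Fold difference relative to genotyping.
fold = {assay: cost / per_sample["genotyping"] for assay, cost in per_sample.items()}

for assay in TOTAL_COST:
    print(f"{assay:>12}: ${per_sample[assay]:,.0f} per sample, "
          f"{fold[assay]:.1f}x genotyping")
```

Genotyping works out to $50 per sample. The exome total implies a little over 4x that, and the whole-genome total roughly 20x, at the top of the range quoted above.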

Sure, you could do fewer samples, but then you lose the power for detecting association in the lowest-frequency variant classes.

Despite these economic challenges, I believe there are several strong arguments for sequencing rather than genotyping.

1. Rare and private variants

No matter how comprehensive a SNP chip might be, the design still relies on known variant positions. The incredible growth of databases like dbSNP, fueled by large-scale discovery sequencing studies, certainly provides millions of variants to choose from. Yet the fact remains that 2-5% of genetic variants in an individual genome are novel with respect to public databases. These are generally super-rare variants, which is why they haven’t been found. They might be private to just a family (inherited) or even an individual (de novo). SNP arrays will always miss this class of variants, because their positions and alleles aren’t known before the experiment.

2. Large-scale variation

The argument for SNP arrays also conveniently ignores larger genomic variants: insertions, deletions, duplications, inversions, and other rearrangements. While far less prevalent than SNPs, structural variants (SVs) can affect more bases in an individual’s genome simply because of their size. They are generally not amenable to high-throughput array designs, both because of that size and because their boundaries are imprecise. Although common SVs may be tagged reasonably well by SNP markers, rare SVs will not be.

A significant proportion of SVs affect known protein-coding genes by altering coding sequence or gene dose. Although SV detection by short-read sequencing is by no means a solved problem, this class of variation is missed entirely by a SNP chip design.

3. Regulatory sequences and functional elements

A fundamental weakness of exome chip designs (and exome sequencing, for that matter) is the emphasis on known genes. Undoubtedly, many (if not most) of the variants underlying complex phenotypes are located outside of the 1.5% of our genome that codes for proteins. We expect that common variants in such elements will be well-interrogated by classic GWAS approaches, but rare variants will not. And we don’t yet know enough about the regulatory regions of the genome to select variants for a custom array.

4. Aggregation Tests

Aggregation tests (sometimes called burden tests) were developed to help identify genetic association driven by variants that are individually too rare to reach statistical significance. The theory and approaches of aggregation testing are too broad a topic to cover here, but the general concept is this: by grouping individual rare variants into a biological unit (most often a gene or exon), it’s possible to test the super-genotype (i.e. “has a rare variant in gene A”) for association. Without examining hundreds of thousands of samples, this is the only way to identify very rare trait-associated variants.
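The collapsing idea can be sketched with a toy example. This is a deliberately simple version that assumes a carrier/non-carrier indicator per gene and a one-sided Fisher's exact test. Real studies typically use more sophisticated tests that adjust for covariates. The sample IDs, gene names, and variant IDs below are all made up.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]], where
    a = case carriers, b = case non-carriers,
    c = control carriers, d = control non-carriers.
    Returns P(case carriers >= a) under the hypergeometric null."""
    n_cases, n_controls, n_carriers = a + b, c + d, a + c
    total = comb(n_cases + n_controls, n_carriers)
    upper = min(n_cases, n_carriers)
    tail = sum(comb(n_cases, k) * comb(n_controls, n_carriers - k)
               for k in range(a, upper + 1))
    return tail / total


def burden_test(rare_calls, is_case, gene):
    """Collapse rare variants into a per-sample 'has a rare variant in gene'
    indicator, then test that super-genotype for association."""
    carrier = {s: any(g == gene for g, _ in calls)
               for s, calls in rare_calls.items()}
    a = sum(1 for s in carrier if carrier[s] and is_case[s])
    b = sum(1 for s in carrier if not carrier[s] and is_case[s])
    c = sum(1 for s in carrier if carrier[s] and not is_case[s])
    d = sum(1 for s in carrier if not carrier[s] and not is_case[s])
    return (a, b, c, d), fisher_one_sided(a, b, c, d)


# Toy data (hypothetical): sample -> set of (gene, variant_id) rare variants.
rare_calls = {
    "case1": {("GENE_A", "v1")}, "case2": {("GENE_A", "v2")},
    "case3": {("GENE_A", "v3")}, "case4": {("GENE_A", "v4"), ("GENE_B", "v9")},
    "case5": set(), "case6": {("GENE_B", "v8")},
    "ctrl1": {("GENE_A", "v5")}, "ctrl2": set(), "ctrl3": {("GENE_B", "v9")},
    "ctrl4": set(), "ctrl5": set(), "ctrl6": set(),
}
is_case = {s: s.startswith("case") for s in rare_calls}

table, p_value = burden_test(rare_calls, is_case, "GENE_A")
print(table, round(p_value, 4))
```

Each GENE_A variant in this toy set appears in a single sample, so no single-variant test could say anything about it. Only the collapsed carrier status gives the test something to work with.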

Although they rely on certain assumptions, such as the ability to define which variants truly affect gene function, aggregation tests have another thing going for them. The collective association of multiple independent variants in one locus strongly suggests that it’s the functional element responsible for that association. In other words, while common associated variants (tag SNPs) tell us the region of the genome where variation seems to affect a disease, aggregation tests identify the potential drug target.

You Don’t Know What You’re Missing

All of this points to the heart of the problem with genotyping over sequencing studies: many of the most interesting classes of variation are overlooked. These may or may not play a role in disease risk, but you don’t know because the GWAS didn’t interrogate them. This is especially worrisome for diseases in which the genetic architecture of risk remains poorly understood. A well-powered GWAS that produces no hits might mean that genetic variation has little influence on the phenotype. But it might also mean that rare or non-SNP variants govern the genetic basis of disease.

We won’t know until we survey them all, and genome sequencing remains the only way to do that.


Rare Variant Studies of Common Disease

Not so long ago, there was a hope in the research community that common genetic variation, i.e. variants present at minor allele frequencies >5% in human populations, might explain most or all of the heritability of common complex disease. That would have been convenient, because such variants can be genotyped with precise, inexpensive, high-density SNP arrays in tens of thousands of samples.

Sadly, the human genome doesn’t play that way.

Genome-wide association studies have implicated hundreds (if not thousands) of new loci in common complex disease. Yet most of the identified variants had a very small effect on risk, and they collectively explained only a fraction of disease heritability. One possible explanation was that rare variants, which are largely untested by high-density SNP arrays, might account for some of that missing heritability. Yet large-scale sequencing studies of common complex disease have not been financially viable until very recently.

As we forge ahead with the Alzheimer’s Disease Sequencing Project, TOPMed, CCDG, and other projects, it’s promising to see results like those in the common/rare variant association study recently published by the International AMD Genomics Consortium.

Age-related Macular Degeneration: A Common Disease

Age-related macular degeneration (AMD) is a leading cause of blindness, affecting about 10 million patients worldwide. It’s a progressive disease whose biological underpinnings are still not well understood, and therapeutic options are limited. Like most age-related diseases, this is a complex phenotype with numerous risk factors, but there’s clearly a substantial inherited component at play.

As of last year, GWAS efforts had uncovered 21 loci in which genetic variation affects disease risk. Translating these into biological insights (or better yet, therapeutic targets) has been challenging.

Massive GWAS: Common and Rare Variants

The International AMD Genomics Consortium (IAMDGC) brought together 16,000 AMD cases and 17,000 controls from 26 different studies, and genotyped them using a customized set of variants:

  • Common variants used for classic genome-wide association studies
  • Low-frequency coding variants, i.e. “exome chip”
  • Protein-altering variants detected by previous AMD gene sequencing studies

Altogether, the authors directly genotyped about 450,000 variants (160,000 of which were protein-altering). After imputation, they were able to analyze 12 million variants overall. Single-variant association testing revealed 34 susceptibility loci for AMD:

Figure: AMD GWAS loci (Figure 1a of IAMDGC, Nature Genet, 2015)

The 52 independently associated variants at these loci raise the number of known genetic loci for AMD from 21 to 34. The vast majority of them (42/52) are common, with MAF >1% and relatively small effects on risk. The odds ratio (OR), which measures the relative increase or decrease in risk conferred by such variants, ranges from 1.1 to 2.9.
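For readers less familiar with odds ratios, here is a minimal sketch of how one is computed from allele counts in cases and controls. The counts are invented for illustration and do not come from the study. The reported estimates also come from fitted models, not a raw 2x2 table like this one.

```python
def odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds ratio for carrying the alternate allele in cases vs. controls,
    computed from a 2x2 table of allele counts."""
    if case_ref == 0 or control_alt == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (case_alt * control_ref) / (case_ref * control_alt)


# Hypothetical allele counts, for illustration only.
or_value = odds_ratio(case_alt=300, case_ref=700, control_alt=200, control_ref=800)
print(f"OR = {or_value:.2f}")
```

An OR above 1 means the allele is enriched in cases (a risk allele), and an OR below 1 means it is depleted (a protective allele).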

The Role of Rare Variants

Yet the authors also observed 7 significantly associated rare variants (MAF<1%) with odds ratios of 1.1-47.6. All seven were located in or near complement genes (that’s “complement” as in the innate immune system complex), which had been implicated in AMD by sequencing studies over the past couple of years. Four genes also exhibited a significant burden of rare damaging variants, suggesting a functional link to disease risk.

Notably, three of those four burden signals were due to variants with frequency <0.1%, suggesting that trait-associated variants with clear functional consequences might be even rarer than we’d guessed. The corollary, of course, is that sample sizes will need to be much larger to detect them with any kind of power.
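A rough way to see why sample sizes must grow is to count how many carriers of a variant you would expect to observe at a given allele frequency. The sketch below assumes Hardy-Weinberg equilibrium and uses the expected carrier count as a crude stand-in for power. It is not a formal power calculation, and the function names are my own.

```python
from math import ceil


def carrier_probability(maf):
    """Probability that a person carries at least one copy of the allele,
    assuming Hardy-Weinberg equilibrium."""
    return 1.0 - (1.0 - maf) ** 2


def expected_carriers(n_samples, maf):
    """Expected number of carriers among n_samples individuals."""
    return n_samples * carrier_probability(maf)


def samples_needed(maf, min_carriers):
    """Smallest sample size expected to yield at least min_carriers carriers."""
    return ceil(min_carriers / carrier_probability(maf))


# 16,000 is the approximate number of AMD cases in the study.
for maf in (0.01, 0.001, 0.0001):
    print(f"MAF {maf:.2%}: about {expected_carriers(16_000, maf):,.0f} carriers "
          f"in 16,000 samples; {samples_needed(maf, 100):,} samples "
          f"for 100 expected carriers")
```

Each tenfold drop in allele frequency requires roughly ten times as many samples to observe the same number of carriers.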

Shared Genetics for Mendelian and Complex Disease

One of the rare variant burden genes, TIMP3, was previously associated with Sorsby’s fundus dystrophy, a rare disease similar to AMD but with earlier onset and Mendelian inheritance. The Mendelian disease variants occur largely in exon 5, but the IAMDGC’s study uncovered a number of rare variants of the same class (nonsynonymous changes disrupting cysteine residues) in other exons in AMD cases.

Carriers of such alleles also had a burden of other AMD-associated variants, suggesting that TIMP3 variation contributes to disease risk in conjunction with other variants. It’s a cool example of variation in the same gene giving rise to monogenic and complex disorders with similar clinical presentations.

Outlook for Common Disease Genomics

I like this study because it demonstrates the importance of looking at both common and rare variants, in a large number of samples, to more comprehensively interrogate the genome for complex disease loci. It sets the stage for large-scale sequencing of complex disease. We have the tools and we have the sample collections. Now, we just need the funding.

Fritsche LG, Igl W, Bailey JN, Grassmann F, Sengupta S, Bragg-Gresham JL, Burdon KP, Hebbring SJ, Wen C, Gorski M, Kim IK, Cho D, Zack D, Souied E, Scholl HP, Bala E, Lee KE, Hunter DJ, Sardell RJ, Mitchell P, Merriam JE, Cipriani V, Hoffman JD, Schick T, Lechanteur YT, Guymer RH, Johnson MP, Jiang Y, Stanton CM, Buitendijk GH, Zhan X, Kwong AM, Boleda A, Brooks M, Gieser L, Ratnapriya R, Branham KE, Foerster JR, Heckenlively JR, Othman MI, Vote BJ, Liang HH, Souzeau E, McAllister IL, Isaacs T, Hall J, Lake S, Mackey DA, Constable IJ, Craig JE, Kitchner TE, Yang Z, Su Z, Luo H, Chen D, Ouyang H, Flagg K, Lin D, Mao G, Ferreyra H, Stark K, von Strachwitz CN, Wolf A, Brandl C, Rudolph G, Olden M, Morrison MA, Morgan DJ, Schu M, Ahn J, Silvestri G, Tsironi EE, Park KH, Farrer LA, Orlin A, Brucker A, Li M, Curcio CA, Mohand-Saïd S, Sahel JA, Audo I, Benchaboune M, Cree AJ, Rennie CA, Goverdhan SV, Grunin M, Hagbi-Levi S, Campochiaro P, Katsanis N, Holz FG, Blond F, Blanché H, Deleuze JF, Igo RP Jr, Truitt B, Peachey NS, Meuer SM, Myers CE, Moore EL, Klein R, Hauser MA, Postel EA, Courtenay MD, Schwartz SG, Kovach JL, Scott WK, Liew G, Tan AG, Gopinath B, Merriam JC, Smith RT, Khan JC, Shahid H, Moore AT, McGrath JA, Laux R, Brantley MA Jr, Agarwal A, Ersoy L, Caramoy A, Langmann T, Saksens NT, de Jong EK, Hoyng CB, Cain MS, Richardson AJ, Martin TM, Blangero J, Weeks DE, Dhillon B, van Duijn CM, Doheny KF, Romm J, Klaver CC, Hayward C, Gorin MB, Klein ML, Baird PN, den Hollander AI, Fauser S, Yates JR, Allikmets R, Wang JJ, Schaumberg DA, Klein BE, Hagstrom SA, Chowers I, Lotery AJ, Léveillard T, Zhang K, Brilliant MH, Hewitt AW, Swaroop A, Chew EY, Pericak-Vance MA, DeAngelis M, Stambolian D, Haines JL, Iyengar SK, Weber BH, Abecasis GR, & Heid IM (2016). A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nature genetics, 48 (2), 134-43 PMID: 26691988