Illumina’s New HiSeq X Instruments

Ah, Illumina. You have to admire their marketing savvy. Last year at around this time, they announced the HiSeq X Ten system, a “factory installation” for human whole genome sequencing (only) at an incredible scale: 18,000 genomes per year, at a cost of $1,000 each (for consumables). It’s still the sequencing-by-synthesis technology employed by previous Illumina instruments, but with a new “patterned” flowcell that spaces the clusters more evenly, for a more efficient sequencing yield.

The cost, of course, was considerable: $10 million for 10 instruments, the required minimum. The operating costs are also considerable, not just for reagents (up to $1.8 million per year) but secondary things like disk, compute, and even internet bandwidth. It’s a massive amount of data to store, analyze, and submit.

There’s Always A Catch

Importantly, the HiSeq X systems can only be applied to human whole genome sequencing. Human means no plants, animals, or model organisms. Whole genome means no targeted/exome sequencing, no RNA-seq. All of those applications will have to go on other platforms.

According to their CEO, Illumina sold 18 HiSeq X Ten systems last year. That’s an impressive number, and even more than they expected. The total capacity exceeds 400,000 genomes per year (some installations got more than 10 instruments). Filling all of that capacity remains a major challenge, because human samples are precious. They require informed consent for sequencing, institutional review board (IRB) approval, and privacy protections. Human samples are the new commodity.

The HiSeq X Five

Now there’s a second option: as announced earlier this month, Illumina will be selling HiSeq X Five systems (5 instruments) for $6 million each. The lower buy-in likely means that even more groups can adopt the HiSeq X technology. They’ll have the same restrictions as the HiSeq X Ten, but half of the capacity (9,000 genomes per year). That’s still a considerable number of whole genomes, probably more than the research community has sequenced in the past five years.

The per-genome cost will be about $1,400. That’s 40% higher than on the X Ten, but I think it’s around half of what it would cost on the HiSeq2500.

HiSeq3000 and HiSeq4000

There’s also a new generation of HiSeq instruments, succeeding the HiSeq2500, that will become available later this year. The HiSeq3000 will run a single patterned flowcell for 750 Gbp per run. The HiSeq4000 will run two patterned flowcells, for twice that capacity.

Eventually, these will supplant current HiSeq2500 instruments. I expect they’ll be busy, too, running the exomes, the targeted sequencing, the RNA-seq and bisulfite sequencing.

The Promise of Human WGS

But back to the HiSeq X systems. Personally, I don’t like the idea of a single company dominating the market, and essentially attempting to dictate how human genetics research should be conducted. At the same time, I can’t argue with the direction we’re headed. We had high hopes for SNP arrays and GWAS, but as I discussed in my previous post, sequencing at large scale is required to uncover the full scope of genetic variation underlying complex phenotypes.

And let’s face it, exome sequencing lets us conveniently avoid some of the most challenging aspects of human genomics, like detecting structural variants (SVs) such as complex rearrangements and interpreting noncoding regulatory variants. Both are undoubtedly important to human disease, but more difficult to study. Yet the only way we’ll make progress is to study them in large cohorts numbering thousands of samples. Now, at least, we have the tools to do that.

Common disease genomics by large-scale sequencing

Understanding the genetic basis of common disease is an important goal for human genetics research. Nothing that we do is easy — the ~25% success rate of exome sequencing in monogenic (Mendelian) disorders is proof enough of that — but the challenges of complex disease genetics are considerable.

Cardiovascular and metabolic diseases in particular arise from a complex array of factors beyond genetics, such as age, diet, and lifestyle. We also expect that most of the genetic variants conferring risk will have small effect sizes, which makes their identification all the more difficult.

Common Variation: the GWAS

We do have some powerful tools. Over the last decade, researchers have leveraged high-density SNP array genotyping — which is relatively cheap, high-throughput, and captures the majority of common genetic variation in human populations — to conduct massive genome-wide association studies (GWAS) of common disease.

This approach has yielded thousands of genetic associations, implicating certain loci in the risk for certain diseases.

Rare Variation: Sequencing Required

Yet the variants identified (and genes implicated) explain only a fraction of the genetic component of these diseases, and SNP arrays generally don’t interrogate rare variation, i.e. variants with a frequency of <1% in the population. The only way to get at these is by sequencing, and the rapid evolution of next-generation sequencing technologies has begun to make that feasible.

A new study in Nature describes such an effort: a search for rare variants associated with risk for myocardial infarction (MI), or in layman’s terms, heart attack. It not only yielded some key discoveries, but showcased some of the challenges and expectations we should have in mind when undertaking large-scale sequencing studies of common disease.

NHLBI’s Exome Sequencing Project

A few years ago, the National Heart, Lung, and Blood Institute of the NIH did something very wise: they funded a large-scale exome sequencing project (referred to by many as “the ESP”) comprising several thousand samples from a number of cohorts. As one of the earliest widely-available exome sequencing datasets at this scale, the NHLBI-ESP quickly became an important resource for the human genetics community.

At the most basic level, it tells us the approximate frequencies of hundreds of thousands of coding variants in European and African populations. Unlike the 1000 Genomes Project, however, the ESP collected deep phenotyping data, enabling genetic studies of many complex phenotypes.

First Pass: Association and Burden of Rare Variants

Discovery phase: case/control selection (R. Do et al, Nature 2015)

Ron Do and his 90+ co-authors designed a discovery study for the extreme phenotype of early-onset MI. Across 11 studies in the ESP, they identified 1,088 individuals who’d had a heart attack at an early age (<50 for men, <60 for women). As a control group, they selected 978 individuals who were at least a decade older than that but had had no heart attack. And with the exome data already in hand, they could search for rare variation associated with the phenotype (early-onset MI) in different ways:

  1. Individual variant associations. Among low-frequency (MAF 1-5%) coding variants, no single variant was significantly associated with the phenotype.
  2. Gene-based associations. Rather than considering individual variants, the authors looked at the “burden” of rare variants at the gene level. For each gene, the authors compared the fraction of samples with at least one rare (MAF<1%) coding variant between cases and controls. No genes had significant associations.

Importantly, gene-based association tests (also called “burden tests”) can be performed in a variety of ways. What frequency threshold should be used? What distinguishes a benign variant from a damaging one? The authors set a MAF ceiling of 1% and considered three sets of variants:

  • Nonsynonymous. All missense, splice site, nonsense, and frameshift variants.
  • Deleterious. The nonsynonymous set above, minus missense variants predicted to be benign by PolyPhen-2.
  • Disruptive. Nonsense, splice site, and frameshift variants only.

These were reasonable choices, comparable to what we or other groups do in this kind of study. Still, there were no significant results, so it was on to phase 2.
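To make the mechanics concrete, here is a minimal sketch of a gene-level collapsing burden test in Python, using Fisher’s exact test on carrier counts. The input format is a simplification I’ve assumed for illustration; the authors’ actual analysis used a more sophisticated statistical framework.

    import numpy as np
    from scipy.stats import fisher_exact

    def collapsing_burden_test(carrier, is_case):
        """Compare the fraction of carriers (>=1 qualifying rare variant in
        the gene) between cases and controls via a 2x2 Fisher's exact test."""
        carrier = np.asarray(carrier, dtype=bool)
        is_case = np.asarray(is_case, dtype=bool)
        table = [
            [int(np.sum(carrier & is_case)), int(np.sum(~carrier & is_case))],
            [int(np.sum(carrier & ~is_case)), int(np.sum(~carrier & ~is_case))],
        ]
        odds_ratio, p_value = fisher_exact(table)
        return odds_ratio, p_value

    # Hypothetical example: 65 of 1,088 cases vs. 39 of 978 controls carry
    # a qualifying rare variant in some gene.
    carrier = np.array([True] * 65 + [False] * 1023 + [True] * 39 + [False] * 939)
    is_case = np.array([True] * 1088 + [False] * 978)
    print(collapsing_burden_test(carrier, is_case))

In a real exome-wide analysis, a test like this runs once per gene per variant set, so the significance threshold must be corrected for tens of thousands of tests.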

Genotype Imputation and Exome Chip

PCA analysis (R. Do et al, Nature 2015)

It’s very possible that there are individual variants and genes associated with the phenotype, but the authors didn’t examine enough samples to find them (by their own calculations, the power for a study of this size was about 0.2 even in a best-case scenario).

So they pursued a few strategies to increase the sample numbers substantially. Across the 11 cohorts there were over 25,000 early-onset MI cases (and an even larger number of suitable controls) but these samples only had SNP array data, and the vast majority of markers on SNP arrays are non-coding.

Low freq. variant follow-up (R. Do et al, Nature 2015)

So the authors undertook a major effort to impute (statistically predict) the genotypes of 400,000 coding variants, based on the SNP array data and a reference panel of samples that had both SNP array and exome data. This was a herculean effort that merited only two sentences in the main text (space in a Nature “letter” is severely restricted) because it yielded no finding: no significant association, even with imputed genotypes for 28,068 cases and 36,064 controls.
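For intuition about how imputation works, here is a deliberately simplified nearest-neighbor sketch: a sample typed only on the array is matched against reference samples that have both array and exome data, and the untyped coding genotype is predicted from the closest matches. This is a toy illustration of the concept only; the actual study used established haplotype-based imputation software.

    import numpy as np

    def impute_dosage(sample_typed, ref_typed, ref_untyped, k=10):
        """Predict the dosage (0-2) of an untyped coding variant for one sample.

        sample_typed: genotypes (0/1/2) at the array SNPs for this sample.
        ref_typed:    (n_ref, n_snps) reference genotypes at the same SNPs.
        ref_untyped:  (n_ref,) reference genotypes at the untyped variant.
        """
        # Distance to each reference sample across the typed (array) SNPs.
        dists = np.abs(np.asarray(ref_typed) - np.asarray(sample_typed)).sum(axis=1)
        # Average the untyped genotype over the k most similar references.
        nearest = np.argsort(dists)[:k]
        return float(np.asarray(ref_untyped)[nearest].mean())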

The authors also performed high-throughput genotyping with the so-called “exome chip,” which interrogates ~250,000 known coding variants, in about 15,000 samples. At the time, the cost of running that many exome chips likely exceeded $1 million. Yet there were no significant associations, so that, too, got two sentences in the main text.

Targeted Resequencing Follow-up

Rare variant sequencing

Sequencing follow up (R. Do et al, Nature 2015)

The authors needed sequencing data, but they also needed more samples. These things not being free, they chose six of the most promising genes (based on not-entirely-disclosed biological/statistical evidence) for targeted resequencing in about 1,000 more samples. Once that was done, and the analysis performed yet again, one of those genes (APOA5) looked promising. So the authors sequenced just that gene in three additional studies, using a mix of PCR-based 3730 (capillary) sequencing and multiplexed long-range PCR libraries on a MiSeq instrument.

Finally, after sequencing the exons of APOA5 in 6,721 cases and 6,711 controls, the authors had an association that reached genome-wide significance: p = 5 × 10⁻⁷ with a burden test of all nonsynonymous variants (the significance threshold was 8 × 10⁻⁷).

More Exome Sequencing Yields Most Obvious Gene Ever

The fourth and final follow-up strategy was simply to do more exome sequencing: ~7,700 additional individuals, bringing the total to 9,793 samples (4,703 cases and 5,090 controls). After applying a variety of burden test strategies, the authors found exactly one gene with significant evidence of association: LDLR, which encodes the low-density lipoprotein receptor. It’s been known for many years that mutations in LDLR cause autosomal-dominant familial hypercholesterolemia, and high LDL cholesterol is one of the top risk factors for MI, so this is both a biologically plausible and completely unsurprising hit.

About 6% of cases carry a nonsynonymous variant in LDLR, compared to 4% of controls, so the odds ratio is about 1.5. This is a classic GWAS result, isn’t it? A very obvious candidate gene achieves statistical significance, and the odds ratio is very low.

Interestingly, however, when the authors applied more stringent criteria for qualifying variants, the effect became more dramatic (see the quick check after this list):

  • Deleterious variants (i.e. the nonsynonymous set minus PolyPhen-2’s benign missense calls) were found in 3.1% of cases and 1.3% of controls, yielding the best p-value (1 × 10⁻¹¹) and an odds ratio of 2.4.
  • Strictly-deleterious missense variants (requiring 5/5 prediction programs to call a missense variant deleterious) were found in 1.9% of cases and 0.45% of controls, yielding a slightly higher p-value of 3 × 10⁻¹¹ but an odds ratio of 4.2.
  • Disruptive variants had the highest odds ratio (13.0), but with a much higher p-value (9 × 10⁻⁵), and affected just 0.5% of cases. These are basically familial hypercholesterolemia carriers.
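Those odds ratios follow directly from the quoted carrier fractions; here is a quick back-of-the-envelope check in Python (small discrepancies from the published values reflect rounding of the quoted percentages):

    def odds_ratio(case_frac, control_frac):
        """Odds ratio computed from carrier fractions in cases and controls."""
        return (case_frac / (1 - case_frac)) / (control_frac / (1 - control_frac))

    print(round(odds_ratio(0.060, 0.040), 2))    # nonsynonymous: ~1.53
    print(round(odds_ratio(0.031, 0.013), 2))    # deleterious: ~2.43
    print(round(odds_ratio(0.019, 0.0045), 2))   # strictly deleterious: ~4.28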

Conclusions

At first glance, one might wonder how this came to be a Nature paper, because there were no truly novel findings. LDLR was already well known, and APOA5, which encodes an apolipoprotein that regulates plasma triglyceride levels, was already a strong candidate gene for MI. In fact, two other genes related to APOA5 function had already been associated with plasma TG levels and early-onset MI, and the gene resides in a known locus for plasma TG levels identified by classic GWAS.

True, that region had extensive LD and it wasn’t clear which of the few genes there were involved in the phenotype. And technically, APOA5 had not yet been established as a bona-fide gene for early-onset MI. This is the final nail in the coffin, but look what it took to get here: exome sequencing, followed by targeted sequencing, and then even more targeted sequencing. It’s glossed over in the paper, but every step in the authors’ pursuit of APOA5 required timely, careful analysis of the genetic evidence.

In the last part of their letter, the authors discuss some of their power calculations for large-scale genetic studies of this nature. They sought to answer that pivotal question, “How many samples do we have to sequence to find something?”

Because of the challenge of distinguishing benign from deleterious alleles, and the extreme rarity of the latter, well-powered studies of complex disease will require sequencing thousands of cases. Here are the authors’ power calculations for a gene harboring a median number of nonsynonymous variants:

Power to detect gene with median # of variants (R. Do et al, Nature 2015)

  • In a best-case scenario — a gene harboring large numbers of nonsynonymous variants, each conferring the same direction of effect — we’re talking 7,500 samples to achieve >90% power.
  • In a more likely scenario (i.e. the power calculations above) — a gene harboring a median number of nonsynonymous variants — it’s 10,000 or more samples (see the simulation sketch below).
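For a rough sense of where numbers like these come from, here is a simulation-based power estimate for a collapsing burden test. The parameters and the exome-wide significance threshold (0.05 over ~20,000 genes) are illustrative assumptions, not the authors’ exact model:

    import numpy as np
    from scipy.stats import fisher_exact

    def burden_power(n_cases, n_controls, control_freq, target_or,
                     alpha=2.5e-6, n_sim=500, seed=0):
        """Fraction of simulated studies in which a gene with the given
        control carrier frequency and odds ratio reaches significance."""
        rng = np.random.default_rng(seed)
        # Convert the control carrier frequency and odds ratio into a case frequency.
        odds = control_freq / (1 - control_freq) * target_or
        case_freq = odds / (1 + odds)
        hits = 0
        for _ in range(n_sim):
            a = rng.binomial(n_cases, case_freq)        # case carriers
            c = rng.binomial(n_controls, control_freq)  # control carriers
            _, p = fisher_exact([[a, n_cases - a], [c, n_controls - c]])
            hits += p < alpha
        return hits / n_sim

    # e.g. an LDLR-like signal (1.3% control carriers, OR 2.4) at two study sizes:
    print(burden_power(1000, 1000, 0.013, 2.4))   # discovery-phase scale
    print(burden_power(5000, 5000, 0.013, 2.4))   # follow-up scale

At the discovery scale, the expected counts are tiny (roughly 31 case carriers vs. 13 control carriers), so significance at α = 2.5 × 10⁻⁶ is essentially never reached; at 5,000 vs. 5,000, power is high. That is exactly the pattern the authors describe.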

Generating, managing, and analyzing exome or genome sequencing data for these sample numbers is a massive undertaking. Undoubtedly, this will be the mission for us and other large-scale sequencing centers for years to come.

References
Do R, Stitziel NO, Won H, Jørgensen AB, Duga S, et al. (2014). Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature. PMID: 25487149

A New Name and Era for The Genome Institute

The Genome Institute at WashU has gone by a few different names over the years… previously we were “The Genome Center,” and before that, during the Human Genome Project, we were “The Genome Sequencing Center.” Now the name is changing again, and for a good reason. As announced today in the WashU Record, The Genome Institute has received a $25 million endowment from James and Elizabeth McDonnell. This gift will support our research on the genetic basis of cancer, diabetes, Alzheimer’s, and other diseases, with the goal of improving diagnosis and treatment. With it comes a new name: The Elizabeth H. and James S. McDonnell Genome Institute.

Less formally but more practically speaking, we’ll probably be called the McDonnell Genome Institute, or MGI.

The McDonnell Family

James and Elizabeth McDonnell

I happened to grow up in the St. Louis area, where the name “McDonnell” is legendary. James S. McDonnell (our new patron’s father) founded McDonnell Aircraft in 1939. Later it became the McDonnell Douglas Corporation, a household name in St. Louis. James S. McDonnell III began his career in 1963 as an aerodynamics engineer at his father’s firm, became vice president of the company in 1973, and retired in 1991. He remained a director for the company until its merger with Boeing in 1997.

The McDonnell family has given generously to many organizations for decades. Their legacy in the field of genetics began in the 1960s, with gifts that established the McDonnell Medical Sciences building on our campus and endowed the James S. McDonnell Professorship of Genetics. In 1975, another gift from the McDonnell family established the Department of Genetics, one of the first in the country.

Genetic and Genomic Research

Our center was founded in 1993 and played a key role in the Human Genome Project (contributing 25% of its sequence data). We’re one of three NIH-funded large-scale sequencing centers in the United States. Some of our key research areas that will benefit from this gift include:

Cancer

We published the first complete genome of a tumor in 2008, and have applied next-gen sequencing technologies to breast cancer, glioblastoma, ovarian cancer, leukemia, and other cancers. Much of that work is through The Cancer Genome Atlas (TCGA), a national effort to characterize the genomes of common cancer types. Recently, we partnered with the Siteman Cancer Center to establish the Genomics Tumor Board, which aims to apply rapid-turnaround sequencing to guide diagnosis and treatment of cancer patients.

Common diseases

In the last few years we’ve also devoted considerable effort to understanding the genetic basis of common diseases. Last year, for example, we identified a rare coding variant in complement component 3 (C3) associated with age-related macular degeneration, the leading cause of vision loss in older adults. For the past two years, we’ve been part of the Alzheimer’s Disease Sequencing Project (ADSP), a $50 million effort to identify new genomic variants contributing to late-onset Alzheimer’s disease. We’re also working with Nelson Freimer (UCLA), Mike Boehnke (UM), Aarno Palotie (MGH), and other collaborators to sequence the exomes of 10,000 individuals from Finnish populations in a search for new variants linked to cardiovascular and metabolic traits.

Pediatric Disease

We’ve also tackled a number of diseases that affect children. A few years ago, we teamed up with St. Jude Children’s Research Hospital for the Pediatric Cancer Genome Project (PCGP), which has sequenced the genomes of more than 600 pediatric cancer patients. A particular focus of that effort is neuroblastoma, a childhood cancer of the nervous system that took the life of the McDonnells’ 2-year-old daughter, Peggy. We’re also working with F. Sessions Cole (St. Louis Children’s Hospital) on the genomics of rare pediatric diseases. I’m thrilled that we’ve partnered with them and the Department of Pathology to start a Genomics Pediatric Board, which will apply panel, exome, and whole genome sequencing to children afflicted with severe genetic diseases and their family members.

New Collaborations

MGI has benefited immensely from early adoption of new sequencing technologies. With the installation of the new Illumina HiSeq X Ten sequencing system, we’ll have the capability to sequence whole genomes at unprecedented speed and relatively low cost. We’re looking for new collaborators! If you have a (human) cohort and are interested in whole genome sequencing, MGI would like to collaborate. Please contact me (dkoboldt [at] genome.wustl.edu) and I’ll put you in touch with the right people here.

6 Applications for Whole Genome Sequencing

As 2014 draws to a close, I can’t help but speculate about the face of next-gen sequencing, genetics, and genomics in 2015. Illumina announced its plans for the HiSeq X Ten “factory installation” sequencing system way back in January. It has taken some time for the early adopters of this new technology to get it up and running, but it seems reasonable to expect that several Illumina X Ten systems will be operational in 2015. Each one of those has the capacity to sequence 18,000 human genomes per year. As I wrote recently, the transition to large-scale whole genome sequencing will bring many challenges.

Let’s set the difficulties aside for now and ask a more interesting question. What kind of scientific endeavors could we undertake with this new capability? Here are a few ideas.

1. Newborn and Pediatric Disease

Newborn intensive care units and children’s hospitals see many patients with severe, sometimes fatal diseases that have a genetic basis. Some of these are known genetic disorders, correctly diagnosed and confirmed by clinical genetic testing. A considerable number, however, resemble known diseases but affect patients with negative genetic test results. Numerous pilot programs, like the NIH’s Undiagnosed Diseases Network, are applying exome sequencing to cases like these. On average, exome sequencing uncovers a pathogenic mutation in 25-30% of cases.

Whole-genome sequencing is the natural next step: it can survey exonic regions that are poorly captured, and be used to detect structural variants. Now, with the X Ten system, whole genome sequencing might be the logical first step. It has a faster turnaround time, no hybridization required, and it surveys everything from single nucleotide variants to large deletions. Ideally, sequencing would be performed on the patient, both parents, and a sibling (if available).

2. Drug Trials and Pharmacogenomics

One of the great promises of genomic research is personalized medicine: tailoring disease treatments to an individual’s genetic makeup. Getting there will require studying the genetic variation underlying disease prognosis and pharmaceutical response. Many such pharmacogenomics projects are under way, though most are employing SNP arrays or targeted sequencing. Whole genome sequencing would better empower these efforts, since it would capture a much broader scope of variation that might contribute to the response.

WGS might even provide a useful front-end tool for clinical trials, where it could be used to stratify patients based on their likely response to the drug being studied.

3. Regulatory variation and eQTLs

One of the many payoffs of the International HapMap Project was that it characterized genetic variation in lymphoblastoid cell lines that could be ordered from Coriell for subsequent experiments. With all of the SNP genotypes in hand, researchers could assess gene expression — initially with microarrays, and later with RNA-seq — and then correlate it with genetic variation. These types of studies yielded thousands of expression quantitative trait loci (eQTLs), along with insights into how genetic variation influences transcription.
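At its core, an eQTL test is just a regression of gene expression on genotype. Here is a minimal single-variant sketch (real studies add covariates such as genotype principal components and expression factors, plus multiple-testing correction across millions of variant-gene pairs):

    import numpy as np
    from scipy.stats import linregress

    def eqtl_test(dosages, expression):
        """Regress one gene's normalized expression on genotype dosage (0/1/2)."""
        fit = linregress(dosages, expression)
        return fit.slope, fit.pvalue

    # Simulated example: a variant with a modest additive effect on expression.
    rng = np.random.default_rng(1)
    dosages = rng.integers(0, 3, size=200)
    expression = 0.4 * dosages + rng.normal(size=200)
    print(eqtl_test(dosages, expression))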

Imagine a state-of-the-art study involving RNA-seq and WGS from the same tissue sample (the RNA-seq would have to be done on another platform, like the HiSeq2000, since the X Ten can only be used for WGS). Studies from the ENCODE Project Consortium and other groups have revealed just how pervasively transcribed the genome appears to be. Undoubtedly there is sequence variation that influences gene expression but isn’t well captured by SNP arrays.

4. Rare Tumor Types

Large-scale cancer sequencing efforts such as TCGA and ICGC have catalogued somatic mutations in a variety of common cancer types. Most of these projects had both an exome sequencing and a whole-genome sequencing component, but due to the cost, the majority of cases got exome sequencing. Even so, these studies have been incredibly useful for identifying recurrently mutated genes and pathways.

Notably, however, these efforts have primarily targeted common cancer types. There are many good reasons for this, but with low-cost whole genome sequencing, I think we can explore the whole genomes of rare tumor types as well. With TCGA, ICGC, and other datasets as a framework for comparison, we can undoubtedly learn a great deal about the somatic changes underlying rare tumor types. Doing so could not only help the patients affected, but also give insights into what must be truly distinctive biology.

WGS is the right tool to study these kinds of tumors precisely because we know less about them: it will capture the full spectrum of mutations, from single-base changes to large chromosomal rearrangements, in a single experiment. Then again, we’ve always been proponents of WGS for cancer, so this suggestion shouldn’t surprise anyone.

5. Clan Genomics: Family Disease Pedigrees

This may sound similar to application #1 (newborn/pediatric sequencing), but it’s a different kind of study that taps into a unique resource: multiplex pedigrees from families affected by genetic disorders. Family-based studies seemed to fall out of fashion a bit with the rise of the case-control study, but they’re making a huge comeback now for a variety of reasons. Obviously, there’s considerable power to detect disease-contributing variants in a family where alleles segregate with the phenotype (rather than in unrelated individuals).

Also, WGS remains too expensive for case-control studies at the scale required to pick up low-effect and/or rare associated variants. With a large family pedigree, you can do linkage analysis, but you usually still need sequencing to pinpoint the causal mutation. WGS is attractive here because it enables you to look at noncoding and structural variants in linkage regions, rather than taking a gene-centric approach (a trivial illustration follows below). This is absolutely necessary: just ask any gene hunter to tell you about that huge linkage peak they have in a region without any annotated genes. There are countless examples.
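As a trivial illustration of that gene-agnostic approach: once a linkage interval is defined, WGS variant calls can be filtered by coordinates alone, with no requirement that they fall in an annotated gene. The record format here is a hypothetical simplification:

    def variants_in_region(variants, chrom, start, end):
        """Keep every variant call inside a linkage interval, coding or not.
        Each variant is assumed to be a (chrom, pos, ref, alt) tuple."""
        return [v for v in variants if v[0] == chrom and start <= v[1] <= end]

    # e.g. variants_in_region(calls, "chr7", 35_000_000, 42_000_000)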

6. Large Cohorts with Extensive Phenotyping

Samples from large, well-phenotyped cohorts have always been in high demand for genetic studies. Many of them have been surveyed with SNP arrays and more recently exome sequencing. Over time, many cohorts grow both in the number of participants and the amount of phenotype data collected. Large-scale, longitudinal studies of complex traits are essential for pinpointing the underlying genetics.

Even with the HiSeq X Ten, WGS remains too costly to be applied to every member of a 10,000-sample cohort. Yet a pilot study of 200, 500, or 1,000 samples may be feasible, and may uncover results that can be replicated in the larger cohort. If it were up to me, I’d select the subset of samples with the most extensive phenotype data — biomarkers, clinical measurements, RNA-seq, health records, etc. Deep phenotyping combined with WGS seems like a very powerful combination indeed.

How Would You Apply WGS?

I’ve offered a few suggestions here, but there are undoubtedly other applications of WGS that should be considered in light of the new X Ten system. What kinds of studies would you apply it to? Please leave me a comment and let me know. By the way, one of those Illumina HiSeq X Ten installations is here at WashU.

So if you have a cohort and are looking for whole-genome sequencing, we should talk.