A New Name and Era for The Genome Institute

The Genome Institute at WashU has gone by a few different names over the years… previously we were “The Genome Center” and before that, during the Human Genome Project, we were “The Genome Sequencing Center.” Now the name is changing again, but for a good reason. As announced today in the WashU Record, The Genome Institute has received a $25 million endowment from James and Elizabeth McDonnell. This gift will support our research on the genetic basis of cancer, diabetes, Alzheimer’s, and other diseases with the goal of improving diagnosis and treatment. With it comes a new name: The Elizabeth H. and James S. McDonnell Genome Institute.

Less formally but more practically speaking, we’ll probably be called the McDonnell Genome Institute, or MGI.

The McDonnell Family

James and Elizabeth McDonnell

I happened to grow up in the St. Louis area, where the name “McDonnell” is legendary. James S. McDonnell (our new patron’s father) founded McDonnell Aircraft in 1939. Later it became the McDonnell Douglas Corporation, a household name in St. Louis. James S. McDonnell III began his career in 1963 as an aerodynamics engineer at his father’s firm, became vice president of the company in 1973, and retired in 1991. He remained a director for the company until its merger with Boeing in 1997.

The McDonnell family has given generously to many organizations for decades. Their legacy in the field of genetics began in the 60s, with gifts that established the McDonnell Medical Sciences building on our campus and endowed the James S. McDonnell Professorship of Genetics. In 1975, another gift from the McDonnell family established the Department of Genetics, which was one of the first in the country.

Genetic and Genomic Research

Our center was founded in 1993 and played a key role in the Human Genome Project (contributing 25% of sequencing data). We’re one of three NIH-funded large-scale sequencing centers in the United States. Some of our key research areas that will benefit from this gift include:

Cancer

We published the first complete genome of a tumor in 2008 and have applied next-gen sequencing technologies to breast cancer, glioblastoma, ovarian cancer, leukemia, and other cancers. Much of that is through the Cancer Genome Atlas (TCGA), a national effort to characterize the genomes of common cancer types. Recently, we partnered with the Siteman Cancer Center to establish the Genomics Tumor Board, which aims to apply rapid-turnaround sequencing to guide diagnosis and treatment of cancer patients.

Common diseases

In the last few years we’ve also devoted considerable efforts to understanding the genetic basis of common diseases. Last year, for example, we identified a rare coding variant in complement 3 associated with age-related macular degeneration, which is the leading cause of vision loss in adults. For the past two years, we’ve been part of the Alzheimer’s Disease Sequencing Project (ADSP), a $50 million effort to identify new genomic variants contributing to late-onset Alzheimer’s Disease. We’re also working with Nelson Freimer (UCLA), Mike Boehnke (UM), Aarno Palotie (MGH), and other collaborators to sequence the exomes of 10,000 individuals from Finnish populations in a search for new variants linked to cardiovascular and metabolic traits.

Pediatric Disease

We’ve also tackled a number of diseases that affect children. A few years ago, we teamed up with St. Jude Children’s Research Hospital for the Pediatric Cancer Genome Project (PCGP), which has sequenced the genomes of more than 600 pediatric cancer patients. A particular focus of that effort is neuroblastoma, a childhood cancer of immature nerve cells that took the life of the McDonnells’ 2-year-old daughter, Peggy. We’re also working with F. Sessions Cole (St. Louis Children’s Hospital) on the genomics of rare pediatric diseases. I’m thrilled that we’ve partnered with them and the Department of Pathology to start a Genomics Pediatric Board, which will apply panel, exome, and whole genome sequencing to children afflicted with severe genetic diseases and their family members.

New Collaborations

MGI has benefited immensely from early adoption of new sequencing technologies. With the installation of the new Illumina HiSeq X Ten sequencing system, we’ll have the capability to sequence whole genomes at unprecedented speed and relatively low cost. We’re looking for new collaborators! If you have a (human) cohort and are interested in whole genome sequencing, MGI would like to collaborate. Please contact me (dkoboldt [at] genome.wustl.edu) and I’ll put you in touch with the right people here.

6 Applications for Whole Genome Sequencing

As 2014 draws to a close, I can’t help but speculate about the face of next-gen sequencing, genetics, and genomics in 2015. Illumina announced their plans for the HiSeq X Ten “factory installation” sequencing system way back in January. It has taken some time for the early adopters of this new technology to get it up and running. But it seems reasonable to expect that several Illumina X Ten systems will be operational in 2015. Each one of those has the capacity to sequence 18,000 human genomes per year. As I wrote about recently, the transition to large-scale whole genome sequencing will bring many challenges.

Let’s set the difficulties aside for now and ask a more interesting question. What kind of scientific endeavors could we undertake with this new capability? Here are a few ideas.

1. Newborn and Pediatric Disease

Newborn intensive care units and children’s hospitals see many patients with severe, sometimes fatal diseases that have a genetic basis. Some of these are known genetic disorders, correctly diagnosed and confirmed by clinical genetic testing. A considerable number, however, resemble known diseases but affect patients with negative genetic test results. Numerous pilot programs, like the NIH’s Undiagnosed Disease Network, are applying exome sequencing to cases like these. On average, exome sequencing uncovers a pathogenic mutation in 25-30% of cases.

Whole-genome sequencing is the natural next step: it can survey exonic regions that are poorly captured, and be used to detect structural variants. Now, with the X Ten system, whole genome sequencing might be the logical first step. It has a faster turnaround time, requires no hybridization, and surveys everything from single nucleotide variants to large deletions. Ideally, sequencing would be performed on the patient, both parents, and a sibling (if available).

2. Drug Trials and Pharmacogenomics

One of the great promises of genomic research is personalized medicine: tailoring disease treatments to an individual’s genetic makeup. Getting there will require studying the genetic variation underlying disease prognosis and pharmaceutical response. Many such pharmacogenomics projects are under way, though most are employing SNP arrays or targeted sequencing. Whole genome sequencing would better empower these efforts, since it would capture a much broader scope of variation that might contribute to the response.

WGS might even provide a useful front-end tool for clinical trials, where it might be used to stratify patients based on their likely response to the drug being studied.

3. Regulatory variation and eQTLs

One of the many payoffs of the International HapMap Project was that it characterized genetic variation in lymphoblastoid cell lines that could be ordered from Coriell for subsequent experiments. With all of the SNP genotypes in hand, researchers could assess gene expression (initially with microarrays, and later with RNA-seq) and then correlate it with genetic variation. These types of studies yielded thousands of expression quantitative trait loci (eQTLs), along with insights into how genetic variation influences transcription.
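To make the "correlate expression with genetic variation" step concrete, here is a minimal Python sketch of a single eQTL test: an ordinary least-squares fit of expression on genotype dosage (0, 1, or 2 copies of the alternate allele). The function name, the toy numbers, and the choice of simple linear regression are all illustrative assumptions, not the method of any particular study. Real eQTL analyses also adjust for covariates and correct for the huge number of tests.

```python
from statistics import mean

def eqtl_slope_and_r(dosages, expression):
    """Least-squares slope and Pearson r of expression on genotype dosage."""
    if len(dosages) != len(expression) or len(dosages) < 3:
        raise ValueError("need matched lists with at least 3 samples")
    mx, my = mean(dosages), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx              # expression change per alternate allele
    r = sxy / (sxx * syy) ** 0.5   # strength of the linear relationship
    return slope, r

# Hypothetical toy data: six samples, one SNP, one gene
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]
slope, r = eqtl_slope_and_r(dosages, expression)
print(f"slope per allele: {slope:.2f}, r: {r:.3f}")
```

With whole-genome data the same test could in principle be run against every variant near a gene, not just the SNPs that happen to be on an array.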

Imagine a state-of-the-art study involving RNA-Seq and WGS from the same tissue sample (the RNA-seq would have to be done on another platform, like the HiSeq2000, since the X Ten can only be used for WGS). Studies from the ENCODE Project Consortium and other groups have revealed just how pervasively transcribed the genome appears to be. Undoubtedly there is sequence variation that influences gene expression but isn’t well-captured by SNP arrays.

4. Rare Tumor Types

Large-scale cancer sequencing efforts such as TCGA and ICGC have catalogued somatic mutations in a variety of common cancer types. Most of these projects had both an exome sequencing and a whole-genome sequencing component, but due to the cost, the majority of cases got exome sequencing. Even so, these studies have been incredibly useful for identifying recurrently mutated genes and pathways.

Notably, however, these efforts have targeted primarily common cancer types. There are many good reasons for this, but with low-cost whole genome sequencing I think that we can explore the whole genomes of rare tumor types as well. With TCGA, ICGC, and other datasets as a framework for comparison, we can undoubtedly learn a great deal about the somatic changes underlying rare tumor types. That could not only help the patients affected, but also give insights into what must be very unique biology.

WGS is the right tool to study these kinds of tumors because we know less about them: it will capture the full spectrum of mutations, from single base changes to large chromosomal rearrangements, in a single experiment. Then again, we’ve always been proponents of WGS for cancer so this suggestion shouldn’t surprise anyone.

5. Clan Genomics: Family Disease Pedigrees

This may sound similar to application #1 (newborn/pediatric sequencing) but it’s a different kind of study that taps into a unique resource: multiplex pedigrees from families affected by genetic disorders. Family-based studies seemed to fall out of fashion a little bit with the rise of the case-control study, but they’re making a huge comeback now for a variety of reasons. Obviously there’s considerable power to detect variants contributing to disease in a family with segregating alleles (rather than unrelated individuals).

Also, WGS remains too expensive for case-control studies at the scale required to pick up low-effect and/or rare associated variants. With a large family pedigree, you can do linkage analysis but usually still need sequencing to pinpoint the causal mutation. WGS is attractive here, because it enables you to look at noncoding and structural variants in linkage regions, rather than taking a gene-centric approach. This is absolutely necessary: just ask any gene hunter to tell you about that huge linkage peak they have in a region without any annotated genes. There are countless examples.

6. Large Cohorts with Extensive Phenotyping

Samples from large, well-phenotyped cohorts have always been in high demand for genetic studies. Many of them have been surveyed with SNP arrays and more recently exome sequencing. Over time, many cohorts grow both in the number of participants and the amount of phenotype data collected. Large-scale, longitudinal studies of complex traits are essential for pinpointing the underlying genetics.

Even with the HiSeq X Ten, WGS remains too costly to be applied to everyone in a 10,000-sample cohort. Yet a pilot study of 200, 500, or 1,000 samples may be feasible, and may uncover results that can be replicated in the larger cohort. If it were up to me, I’d select the subset of samples with the most extensive phenotype data: biomarkers, clinical measurements, RNA-seq, health records, etc. Deep phenotypes combined with WGS seem like a very powerful combination indeed.
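One simple way to pick such a pilot subset is to rank samples by how many phenotype fields are actually filled in. The Python sketch below does just that. The function name, the field names, and the "count non-missing fields" scoring rule are hypothetical choices for illustration; a real selection would probably weight some phenotypes more heavily than others.

```python
def pick_pilot_samples(phenotypes, pilot_size):
    """Return IDs of the samples with the most non-missing phenotype fields.

    phenotypes maps sample ID -> {phenotype name: value}, with None for missing.
    Ties are broken by sample ID so the result is reproducible.
    """
    def completeness(sample_id):
        return sum(value is not None for value in phenotypes[sample_id].values())

    ranked = sorted(phenotypes, key=lambda s: (-completeness(s), s))
    return ranked[:pilot_size]

# Hypothetical four-sample cohort with made-up phenotype fields
cohort = {
    "S1": {"ldl": 130, "bmi": 27.1, "rna_seq": None},
    "S2": {"ldl": 110, "bmi": 24.3, "rna_seq": "available"},
    "S3": {"ldl": None, "bmi": None, "rna_seq": None},
    "S4": {"ldl": 150, "bmi": None, "rna_seq": "available"},
}
print(pick_pilot_samples(cohort, 2))
```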

How Would You Apply WGS?

I’ve offered a few suggestions here, but there are undoubtedly other applications of WGS that should be considered in the light of the new X Ten system. What kinds of studies would you apply it to? Please leave me a comment and let me know. By the way, one of those Illumina HiSeq X Ten installations is here at WashU.

So if you have a cohort and are looking for whole-genome sequencing, we should talk.


Genome, Evolution, and Domestication of the Cat

Even though most of my posts on MassGenomics concern human genetics and genomics, today I’d like to highlight a milestone in another species, one that many humans care fiercely about. This guy:

domesticated cat

Credit: renekyllingstad on Flickr

Cat lovers, rejoice! This month in the Proceedings of the National Academy of Sciences, Mike Montague, Wes Warren, and colleagues published the first complete reference genome for the domestic cat. Their analyses offer insights into the genetics underlying feline biology, evolution, and most recently, domestication.

The cat recently surpassed the dog as the most popular pet in the world, with a global population size estimated at 600 million. Many people credit ancient Egypt with cat domestication, but there’s archaeological evidence showing that cats and humans lived together 5,000 years ago in China, and ~9,500 years ago in Cyprus. In both cases, the new relationship seemed to arise when people turned to agriculture to feed themselves. It seems obvious what happened: farming and storing grains drew rodents, and rodents drew cats.

The path to domestication for cats differs from that of most other domesticated animals that were selectively bred for food (livestock), herding, hunting, or security (dogs). Most of the 30-40 cat breeds recognized today originated within the last 150 years, and were selected mainly for aesthetic traits rather than functional ones. That’s a fancy way of saying we bred cats to be pretty, not to be useful.

The Cat Genome

The 19 chromosomes in the reference cat genome (18 autosomes and an X chromosome) span 2.35 billion base pairs. The genome contains about 19,500 protein-coding genes and 1,850 non-coding RNAs, numbers that are very similar to those of the dog. The authors first looked at the ~10,000 genes that had orthologs (matches in another species) in the tiger, dog, cow, and human genomes.

Genes Under Positive Selection

They searched in particular for genes under positive natural selection, and put those findings in context with what we know about cats relative to other carnivores.

Cats have the broadest hearing range among carnivores. There are at least six genes that look to be under positive selection in cats that are associated with hearing capacity; we know this because mutations in these genes cause nonsyndromic recessive hearing loss or deafness. At least 20 genes under positive selection in cats are associated with vision-related pathways, which fits with the importance of visual acuity for these natural-born hunters.

Felines are “crepuscular” hunters, meaning that they’re most active in the twilight periods before sunrise and after sunset. Thus it was fascinating to see positive selection on genes like CHM and CNGB3, in which mutations can cause retinal diseases featuring night blindness (i.e. choroideremia and retinitis pigmentosa) in humans.

Cats rely less on their sense of smell for hunting than dogs do, which is apparent from the smaller repertoire of olfactory receptor genes in the feline genome. However, the cat ancestor had more genes encoding vomeronasal sensation. The vomeronasal organ provides a sort of auxiliary sense of smell, mainly used to detect pheromones. It’s been suggested that there’s a tradeoff between olfactory and vomeronasal capacity in evolution, and the cat’s genome supports that: sense of smell was traded for pheromone detection, on which cats rely for social communication.

Wildcats and Domestication Genes

What about genes that might be involved in the domestication process? To search for these, the authors combined sequencing data from 22 cats, including both domestic and wildcat breeds. Wildcats (Felis silvestris) are small cats found in Africa, Europe, and parts of Asia. They tend to be larger than domestic cats, with longer legs and more robust bodies. There are numerous subspecies of wildcat, but they generally fall into one of three specialties:

  1. Forest wildcats, like the European wildcat.
  2. Steppe wildcats, whose ancestors migrated to the Middle East and tend to have smaller bodies, longer tails, and lighter fur.
  3. Bay or bush wildcats, which have paler coats and more defined patterns (stripes and spots).

As you might have guessed, house cats are thought to have been domesticated from those fancy-looking bay wildcats, probably an African subspecies.

When the authors looked for evidence of selection, they found regions harboring genes like:

  • PCDHA1 and PCDHB4, which play a role in neural connection establishment/maintenance and fear conditioning.
  • GRIA1, a glutamate receptor gene involved in associating stimulus with reward.
  • DCC, encoding the netrin receptor, which is expressed in dopaminergic neurons. Knockouts of this gene in mice produced animals with defects in memory, behavior, and reward.

So it looks like cats chose to domesticate themselves because they noticed that, if they came in and helped out with the rodent control, we would reward them with food. And they stayed because they were afraid that we wouldn’t feed them if they remained in the wild. The last assertion might not be correct based on observations of my neighbor feeding strays, but please, no one tell the cats that.

More on Cats and Domestication

If you want more great stories about the cat genome and domestication, you’ll find good articles in Wired Magazine and the Washington Post. Senior author Wes Warren also appeared on NPR’s Science Friday last week.

Montague MJ, Li G, Gandolfi B, Khan R, Aken BL, Searle SM, Minx P, Hillier LW, Koboldt DC, Davis BW, Driscoll CA, Barr CS, Blackistone K, Quilez J, Lorente-Galdos B, Marques-Bonet T, Alkan C, Thomas GW, Hahn MW, Menotti-Raymond M, O’Brien SJ, Wilson RK, Lyons LA, Murphy WJ, & Warren WC (2014). Comparative analysis of the domestic cat genome reveals genetic signatures underlying feline biology and domestication. Proceedings of the National Academy of Sciences of the United States of America PMID: 25385592

Brace Yourself for Large-Scale Whole Genome Sequencing

The release of the Illumina HiSeq X Ten sequencing system, and its current use restriction (only human, only whole-genome sequencing), are going to cause a major paradigm shift in human genetics studies over the next few years. Until now, we’ve seen relatively few large-scale efforts to apply whole-genome sequencing (WGS) to large numbers of samples. But the capability of a single X Ten installation to sequence ~18,000 genomes per year at a relatively low cost means that, for the first time, it may become easier to apply WGS as the primary discovery tool.

The Illumina HiSeq X Ten

I’ve already written about the realities of the sequencing GWAS to discuss some of the considerations in going from genotyping (SNP arrays) to sequencing (next-gen) for genetic association studies. Unlike genotyping, sequencing enables both discovery and genotyping, with the caveat that you’ll end up with:

  • Many rare variants private to an individual or family
  • Increased missingness in the resulting genotypes
  • More false-positive variants
  • Additional QC challenges

These are simply the realities of going from clean, defined SNP array datasets (>99.1% call rate) to next-gen sequencing data, which depends on alignment, variant calling, and depth/breadth of coverage.
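Missingness is easy to quantify once genotypes are in hand. The Python sketch below computes a per-variant call rate (the fraction of samples with a non-missing genotype) and flags variants that fall under a cutoff. The function names, the toy genotypes, and the 0.95 default threshold are assumptions for illustration only; appropriate QC cutoffs depend on the study.

```python
def call_rate(genotypes):
    """Fraction of genotype calls that are not missing (None)."""
    if not genotypes:
        raise ValueError("no genotypes given")
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)

def flag_low_call_rate(matrix, threshold=0.95):
    """Return IDs of variants whose call rate is below threshold.

    matrix maps variant ID -> list of genotypes across samples,
    coded 0/1/2 with None for a missing call.
    """
    return [vid for vid, gts in matrix.items() if call_rate(gts) < threshold]

# Hypothetical genotypes for two variants across four samples
genotype_matrix = {
    "var1": [0, 1, 2, 0],        # every sample called
    "var2": [0, None, None, 1],  # half the calls missing
}
print(flag_low_call_rate(genotype_matrix))
```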

Data Storage Demon

One of the major practical considerations for whole-genome sequencing data is on the computational requirements side: data processing, storage, and retention. A binary alignment/map (BAM) file — which contains the sequences, base qualities, and alignments to a reference sequence — for a 30x whole genome is about 80-90 gigabytes in size. The BAM files for a modest sample size (1,000) might consume 80 terabytes of disk space. And that disk space is not free. It costs actual dollars to purchase and maintain over time.
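The arithmetic is simple enough to check in a few lines of Python. This sketch uses the per-BAM sizes quoted above and the X Ten's ~18,000 genomes per year; the function name and the decimal convention (1 TB = 1,000 GB) are my own assumptions.

```python
def bam_storage_tb(n_samples, gb_per_bam=80.0):
    """Total BAM storage in terabytes, taking 1 TB = 1,000 GB."""
    if n_samples < 0 or gb_per_bam < 0:
        raise ValueError("sample count and file size must be non-negative")
    return n_samples * gb_per_bam / 1000.0

print(bam_storage_tb(1000))          # 1,000 genomes at 80 GB each -> 80.0
print(bam_storage_tb(18000, 90.0))   # a year of X Ten output at 90 GB each -> 1620.0
```

In other words, a single X Ten running at capacity could generate on the order of 1.4 to 1.6 petabytes of BAM files per year, before any backups or derived files.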

I’m resisting the urge to show you that cost of sequencing / Moore’s law comparison plot here.

Because disk space is both finite and costly, and these files are so huge, at some point researchers will have to choose between getting new data and actually deleting some old data. Kind of like a “one in, one out” policy at a crowded bar. No one likes throwing data away. We NGS analysts shudder at the idea of not being able to go back to the BAM file to run yet another variant caller, or review that interesting variant. At some point we may have to call the sample’s analysis DONE and leave it that way. Because, let’s be honest, 99% of the bases in a BAM file match the reference. It’s the variants that we’re truly interested in.

Data Transfer: Traffic Jams Ahead

Another consideration is the simple act of moving data around. With a $10 million price tag, few research groups will be able to afford an X Ten cluster, but those who can’t will be unable to stay competitive on the cost of WGS. On the other side of the table, the lucky X Ten installation sites will need to find samples. This means that most whole-genome sequencing will take place at a few locations, and the resulting data transferred to the investigators who sent in the samples.

Have you tried to download an 80 gigabyte file lately? The regular internet is just not going to work for this.
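For a rough sense of scale, here is a Python sketch that estimates how long one 80 GB BAM file takes to move at a sustained link speed. The function name, the link speeds, and the efficiency factor are hypothetical; real-world throughput is usually well below the nominal line rate.

```python
def transfer_hours(size_gb, link_mbps, efficiency=1.0):
    """Hours to move size_gb gigabytes over a link of link_mbps megabits/second.

    efficiency is the fraction of the nominal rate actually sustained (0 to 1].
    Uses decimal units: 1 GB = 8,000 megabits.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    megabits = size_gb * 8 * 1000
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600

# Hypothetical sustained link speeds, in megabits per second
for mbps in (20, 100, 1000):
    print(f"{mbps:>5} Mbps: {transfer_hours(80, mbps):.1f} hours per 80 GB BAM")
```

Multiply whatever number you get by a few hundred or a few thousand samples, and it becomes clear why dedicated high-speed networks or shipped disks end up in the conversation.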

You There, with the Samples!

A couple of years ago, I wrote that in a world with widespread genome sequencing capacity, samples are the new commodity. That has never been more true than in the world of the X Ten. The institutions that have them will need to find several thousand samples per year in order to achieve the optimal per-genome cost.

I don’t know too much about the details of consenting samples, but I know that many, many research samples are not consented for whole genome sequencing. Because whole-genome has everything: your Y-haplogroup (for males), your APOE allele, your BRCA1/2 risk variants, etc. There’s no “we will only look at this gene or region” nonsense.

The Awkward Question

Who is going to pay for sequencing all of these samples? Don’t count on the X Ten centers to do it; remember, they had to shell out $10 million just to buy the thing. Even at a reagents/personnel cost of $1,000 per genome, an X Ten running at full capacity will cost $18 million per year. That’s a lot of cash, in an era when research budgets seem to be flat (if not shrinking). So now you need samples and the funds to sequence them.
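Here is that back-of-the-envelope math as a short Python sketch, using the figures above ($10 million instrument, $1,000 per genome, 18,000 genomes per year). The function names and the four-year instrument lifetime used to spread the purchase price are assumptions for illustration, not figures from Illumina or any sequencing center.

```python
def annual_running_cost(genomes_per_year, cost_per_genome):
    """Reagents/personnel cost for one year at the given throughput."""
    return genomes_per_year * cost_per_genome

def all_in_cost_per_genome(instrument_cost, lifetime_years,
                           genomes_per_year, cost_per_genome):
    """Per-genome cost with the purchase price spread evenly over the lifetime."""
    total_genomes = genomes_per_year * lifetime_years
    if total_genomes <= 0:
        raise ValueError("need a positive number of genomes")
    return cost_per_genome + instrument_cost / total_genomes

running = annual_running_cost(18_000, 1_000)
print(f"annual running cost: ${running:,}")

# Assumed four-year instrument lifetime (hypothetical)
all_in = all_in_cost_per_genome(10_000_000, 4, 18_000, 1_000)
print(f"all-in cost per genome: ${all_in:,.0f}")
```

The per-genome figure only holds if the machines run at full capacity, which is exactly why the sample supply matters so much.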

It may actually be more difficult to persuade researchers to make the switch to sequencing, because it will still be five times more expensive than running a SNP array.

The Promise Ahead

I know that this post has had a bit of a negative tone, but I felt it necessary to get people thinking about the challenges ahead. Now, perhaps, we should talk about the promise of large-scale whole genome sequencing. At last, we’ll have sequencing studies that aren’t biased towards coding regions or certain genes. Every sequenced genome will harbor over 3 million sequence variants. We can go after non-SNP variation, too: indels and structural variants are far easier to detect by WGS, though SV calling is still a nascent area of bioinformatics.

The wonderful thing about WGS is that it both enables and forces us to look beyond the obvious (e.g. the nonsynonymous variants in known protein-coding genes). We’re headed into the unknown, the dark matter of the genome, whether we like it or not. And that is a good thing.