Science Fiction: Going Viral

The rapid advance of next-generation sequencing technologies, particularly in the last several years, has almost seemed like something out of a science fiction novel. Think about it: on a HiSeq X Ten instrument, we can sequence a complete human genome in less than a week, at a cost that’s 0.00001% of what it took to fund the Human Genome Project.

It might surprise you to learn that, in addition to my blog posts here and the grant/paper writing I do for my job, I dabble in science fiction writing as well. If you think that scientific publication/success is hard (10% acceptance rate for top-tier journals, or 8% NIH funding level), you should look into the fiction side of publishing sometime.

The acceptance rate for most professional science fiction magazines (for short fiction) is generally below 1%. The pay is usually $0.05-$0.10 per word, meaning that a 4,000 word story might bring $200-400 in the (unlikely) event that you get it professionally published. The odds of landing a literary agent — which is required, if you want to have your novel shopped to most traditional publishing houses — are about 1 in 1,000.

A few months ago, Third Flatiron Publishing (which does quarterly science fiction anthologies) announced that their Spring 2015 anthology would be themed around world-altering events. As it happened, I’d written a science fiction story that seemed like it might fit — it was about a couple of researchers working in a dusty lab who stumble upon a universal cure for cancer (you remember I said science fiction, right?), and their struggle to make it available to the world.

The Time It Happened

I’m thrilled to say that the editors at Third Flatiron liked my story enough to choose it for their anthology The Time It Happened, which just came out and is available on Amazon in both Kindle and paperback versions. They’ve also bought audio rights, and intend to create a free podcast of my story (as well as a couple of others) sometime in the near future.

Since you readers enjoyed the non-fiction I write for MassGenomics, hopefully you’ll enjoy this as well.


Targeted Sequencing of GWAS Loci for Cleft Lip

In the last decade, genome-wide association studies (GWAS) enabled by cheap, high-throughput SNP genotyping have identified thousands of loci that influence disease susceptibility, quantitative traits, and other complex phenotypes. The genetic markers on high-density SNP arrays are carefully chosen to capture (or “tag”) most common haplotypes in human populations. Common SNPs tend to be more informative in this regard, and most of these fall outside the exons of protein-coding genes.

cleft lip association study

Credit: Leslie et al, AJHG 2015

This efficiency is both the strength and the weakness of SNP arrays: the markers are well-suited to represent variation across the human genome, but they’re unlikely to be causal variants themselves. In essence, the loci uncovered by GWAS are signposts that tell us where to look for functional variation that influences a trait of interest.

Following up GWAS hits with sequencing and functional validation will ultimately be required to understand the mechanism of disease. A paper online at The American Journal of Human Genetics offers an informative example of how that plays out.

Cleft Lip/Palate as a Complex Trait

Non-syndromic cleft lip with or without cleft palate (NSCL/P) affects about 1 in 700 live births, and represents a global health problem (particularly in the developing world). Multiple genetic and environmental risk factors give rise to a complex etiology for this trait. One candidate gene (IRF6) was known to harbor common variants associated with NSCL/P, and large-scale GWAS efforts have yielded 12 additional loci reproducibly associated with it.

To further investigate the genetic architecture of this phenotype, a group of researchers from several institutions (including the Genome Institute at WashU) sequenced those 13 regions in over 4,000 individuals. At the study design stage, the researchers made two key decisions that undoubtedly contributed to their success:

  1. A case-parent trio design. Most of the samples chosen came in the form of an affected child and two unaffected parents. This structure makes it possible to examine not just the presence or absence of alleles, but whether or not they’re transmitted to the affected child. It also permitted a search for de novo mutations that might contribute to susceptibility.
  2. A wide target region for each locus, including both coding and non-coding regions. The latter type are increasingly important as we delve into complex traits in which a significant fraction (if not a majority) of causal variants will be regulatory rather than coding in nature.

De novo Mutations

Just as one can’t truly identify somatic mutations in a tumor tissue without a matched normal, it’s nearly impossible to distinguish de novo mutations in a patient without sequencing both of his or her parents. This is the only way, people. Filtering against dbSNP is not going to get you there.

The thing about de novo mutations is that they’re exquisitely rare — according to estimates of the de novo mutation rate, any given individual should have around 34 mutations genome-wide. Since the target space for this study represented 0.19% of the genome, that’s a long shot. Then again, we’re talking about a lot of trios, and they’re selected for a trait that’s been linked to this target space.

My back-of-the-envelope calculations based on the amount of target space (6.3 Mbp), the de novo mutation rate (1e-08), and the number of trios sequenced here (1,409) suggest that we’d expect ~89 de novo mutations. The authors came up with 123, which is a little high. We do have an enriched population, but calling de novos is very likely to yield some false positives.

They were able to design assays for 82 mutations and confirmed 66 (80%) by Sanger sequencing. That’s a good validation rate, and it suggests that about 98 of the predicted mutations would hold up (pretty close to my estimate).
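
The two back-of-the-envelope numbers above are easy to reproduce. The short sketch below uses only figures quoted in this post; the variable names are mine, and the calculation assumes (as my estimate does) a flat per-base mutation rate applied once across the target space of each trio.

```python
# Back-of-the-envelope check of the de novo numbers quoted in the post.
TARGET_BP = 6.3e6       # target space per individual (6.3 Mbp)
MUTATION_RATE = 1e-8    # de novo mutations per base per generation
N_TRIOS = 1409          # trios sequenced

# Expected de novo mutations across all trios, assuming a flat rate.
expected = TARGET_BP * MUTATION_RATE * N_TRIOS

CALLED = 123            # de novo calls reported by the authors
ASSAYED = 82            # calls for which a Sanger assay could be designed
CONFIRMED = 66          # calls confirmed by Sanger sequencing

validation_rate = CONFIRMED / ASSAYED
# Extrapolate the validation rate to every call, assayed or not.
projected_true = CALLED * validation_rate

print(f"expected de novo mutations: {expected:.0f}")
print(f"validation rate: {validation_rate:.1%}")
print(f"projected true calls: {projected_true:.0f}")
```

With the unrounded validation rate the projection comes out at about 99; rounding the rate to 80% first gives about 98, which is the figure I quoted.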

Only 3 of 66 confirmed de novo mutations (4.5%) altered protein sequence. The majority (95%) were noncoding, though 11 of these mapped to a predicted regulatory element.

Common Variant Associations

To identify common functional variants, the authors used an allelic transmission disequilibrium test (TDT), which determines if an allele is transmitted more (or less) often than we’d expect by chance. All but one of the GWAS regions (PAX7) showed evidence of association with p-values less than 10⁻⁵. In general, the results supported the GWAS findings: the variants yielding the lowest p-value were either the lead GWAS SNP or were in perfect LD with the lead GWAS SNP.
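
For readers who haven’t met the TDT before, here is a minimal sketch of the classic form of the test. It counts how often heterozygous parents transmit a given allele versus the other allele, and compares the two counts with a McNemar-style chi-square statistic. This is the textbook version, not necessarily the exact implementation the authors used, and the counts in the example are made up.

```python
import math

def tdt(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Classic TDT for one allele.

    transmitted / untransmitted: how many times heterozygous parents
    did or did not pass the allele to the affected child.
    Returns (chi-square statistic with 1 df, p-value).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, for illustration only.
chi2, p = tdt(transmitted=260, untransmitted=180)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

Under the null hypothesis, a heterozygous parent passes on either allele half the time, so a large imbalance between the two counts is evidence of association (and linkage) with the trait.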

A conditional analysis revealed only one locus (ARHGAP29) with evidence for secondary independent signals, suggesting more than one common functional variant.

Rare Variant Associations

The common variants explained only a fraction of the heritability for disease. Yet these GWAS regions were also logical candidates for rare variation that contributes to disease. The challenge with rare variants is that one needs thousands of samples to even see them, much less establish statistically significant association.

To address this, we often collapse individual rare variants to the gene, regulatory element, or genomic interval in which they occur to get the power up. These so-called burden tests can boost the power for detection, but that’s dependent on one’s ability to predict which variants are truly functional. In this study, neither gene-based nor regulatory element-based burden tests yielded significant associations.
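
The collapsing step is simple to illustrate. The sketch below is a generic case/control toy, not the trio-based method used in the paper: the gene names, sample IDs, allele frequencies, and the 1% cutoff for “rare” are all hypothetical. Each sample is counted once per gene if it carries any rare variant there.

```python
from collections import defaultdict

MAF_THRESHOLD = 0.01  # assumed cutoff for calling a variant "rare"

# Hypothetical toy data: (variant id, gene, minor allele frequency, carriers)
variants = [
    ("v1", "GENE_A", 0.002, {"case1", "case2"}),
    ("v2", "GENE_A", 0.004, {"case3", "ctrl1"}),
    ("v3", "GENE_A", 0.200, {"case1", "ctrl1", "ctrl2"}),  # common, skipped
    ("v4", "GENE_B", 0.001, {"ctrl3"}),
]
cases = {"case1", "case2", "case3", "case4"}
controls = {"ctrl1", "ctrl2", "ctrl3", "ctrl4"}

def collapse_by_gene(variants, maf_threshold=MAF_THRESHOLD):
    """Merge the carriers of all rare variants in each gene into one set."""
    carriers = defaultdict(set)
    for _variant_id, gene, maf, samples in variants:
        if maf < maf_threshold:
            carriers[gene] |= samples
    return carriers

def burden_table(carriers, cases, controls):
    """Per gene: (number of case carriers, number of control carriers)."""
    return {
        gene: (len(samples & cases), len(samples & controls))
        for gene, samples in carriers.items()
    }

table = burden_table(collapse_by_gene(variants), cases, controls)
print(table)
```

The per-gene carrier counts can then go into whatever association test suits the design. The catch mentioned above shows up in the filter: every neutral variant that passes it dilutes the signal.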

A ScanTrio analysis of genomic intervals — after experimenting with different window sizes and overlaps — yielded signals for 2 of the 13 regions (NOG and NTN1).

Functional Validation

The challenge of genetic association studies — especially ones for complex phenotypes — is confirming statistical evidence of association with a functional assay. Sequencing and genotyping methods continually become faster and cheaper. With functional validation, the only paradigm shift is that more and more journals want to see it before they publish a genetic study.

PAX7 de novo Mutation

One of the few de novo coding mutations was predicted to disrupt the DNA-binding domain of PAX7. The authors designed an electrophoretic mobility shift assay (EMSA) to examine how the missense substitution affected PAX7’s ability to bind a target regulatory sequence. They also used quantitative reporter assays in HeLa cells, with co-transfection of a plasmid containing either wild-type or mutant PAX7.

cleft lip pax7 mutation

PAX7 functional validation (Leslie et al, AJHG 2015)


Both experiments showed that the wild-type allele had greater DNA-binding capacity, and drove higher expression of the reporter gene.

FGFR2 de novo Mutation

One of the noncoding de novo mutations was 254 kilobases downstream of FGFR2, in a noncoding region that looks (according to chromatin marks) like a neural crest enhancer. FGFR2 is known to play a role in craniofacial development, and rare variants in it had been reported in cases of NSCL/P. Here, the authors leveraged a zebrafish model system to examine the role of that enhancer during development. In transient transgenic reporter studies of zebrafish embryos, the +254kb element holds up: the wild-type allele had enhancer activity in 41/82 embryos (50%), whereas the mutant allele had enhancer activity in 3/83 embryos (3.6%).
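
To get a feel for how lopsided those embryo counts are, one can run them through a two-sided Fisher’s exact test. That choice of test is my assumption for illustration (I’m not claiming it is the statistic reported in the paper), and the implementation below is a small standard-library sketch rather than a vetted statistics routine.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Small tolerance so ties with the observed table are included.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

# Counts from the post: embryos with / without enhancer activity.
wt_active, wt_total = 41, 82
mut_active, mut_total = 3, 83

p_value = fisher_exact_two_sided(
    wt_active, wt_total - wt_active,
    mut_active, mut_total - mut_active,
)
print(f"wild-type: {wt_active / wt_total:.1%} active")
print(f"mutant:    {mut_active / mut_total:.1%} active")
print(f"Fisher's exact p = {p_value:.1e}")
```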

FGFR enhancer validation

FGFR +254kb enhancer (Leslie et al, AJHG 2015)

This, in my opinion, is one of the most compelling parts of this study: in vivo functional validation of a single base change in a noncoding enhancer that’s hundreds of thousands of bases away from the gene it regulates.

Common Variant at 17q22 (NOG)

Multiple SNPs reached genome-wide significance in the 17q22 region. The greatest significance was detected at rs227727, about 105 kb downstream of the NOG transcriptional start site. This variant was in complete LD with the lead SNP from the prior GWAS. This was interesting because NOG encodes a BMP antagonist that’s expressed primarily in the epithelium during palatal development.

Tandem enhancers in cleft lip

Tandem enhancer disruption (Leslie et al, AJHG 2015)

The authors confirmed NOG expression in the palatal epithelium in mouse embryos. They also noted that rs227727 mapped to one of two enhancers in the region, +105kb (the other being +87kb). The variant allele disrupts predicted binding sites for two transcription factors (MEF2C and CDX2) and creates possible binding sites for at least two others.

Interestingly, the zebrafish assay did not show epithelial enhancer activity for the +105kb element by itself. However, a tandem construct with both enhancers (+87kb and +105kb) lit things up. The effect was at least additive, and constructs containing the risk allele of rs227727 showed significantly decreased enhancer activity.

Beyond GWAS for Complex Disease

What I like about this study is that it examined GWAS and candidate gene regions carefully, with functional validation built in. It’s so easy to take a GWAS hit, look for the nearest gene, and spin a story about how variation in that gene affects the phenotype of interest. Here, the authors have done the difficult and time-consuming work of (1) exhaustive sequencing to identify the possible functional variants, and (2) in vivo functional assays to prove that the implicated variants have a phenotypic effect. That’s a lot of work to pin down the genetic architecture and disease mechanism for a handful of disease loci.

The high-throughput nature of genotyping (and increasingly, sequencing) and the discovery power of large cohorts are going to yield promising new findings. With them comes a strong temptation to take the association hits and run with them. Write up some voodoo in the discussion about the gene and its role, get the paper out, and move on. The problem is that statistical genetic evidence is not enough. You don’t know that your lead SNP is functional, or that the nearest neighboring gene provides the mechanism of phenotypic effect.

More studies like this one, with well-planned study designs and compelling functional assays, will be required as we continue to unravel the complex fabric of human genetics.


Leslie, E., Taub, M., Liu, H., Steinberg, K., Koboldt, D., Zhang, Q., Carlson, J., Hetmanski, J., Wang, H., Larson, D., Fulton, R., Kousa, Y., Fakhouri, W., Naji, A., Ruczinski, I., Begum, F., Parker, M., Busch, T., Standley, J., Rigdon, J., Hecht, J., Scott, A., Wehby, G., Christensen, K., Czeizel, A., Deleyiannis, F., Schutte, B., Wilson, R., Cornell, R., Lidral, A., Weinstock, G., Beaty, T., Marazita, M., & Murray, J. (2015). Identification of Functional Variants for Cleft Lip with or without Cleft Palate in or near PAX7, FGFR2, and NOG by Targeted Sequencing of GWAS Loci. The American Journal of Human Genetics. DOI: 10.1016/j.ajhg.2015.01.004

The Value of the Cohort: 23andMe’s Research Portal

23andMe has been an interesting company to watch over the last five years. For a variety of reasons, they remain the most visible direct-to-consumer (DTC) genetic testing company, and have also become Illumina’s single biggest customer for high-density SNP arrays. As I’ve written about before, I underwent the 23andMe genetic testing service just months before the FDA’s cease-and-desist letter on the medical/health reporting aspects of that service. So I’ve been able to see things from the consumer side of it as well.

An article this month in the MIT Technology Review examines 23andMe’s new formula for business success: building up and selling access to their ever-growing database of willing research participants. This is not a new direction for the company, but is garnering more attention after they signed a deal with Genentech under which the pharma giant will pay up to $60 million for access to ~3,000 Parkinson’s disease patients in 23andMe’s database. That’s about $20,000 per sample, and a major coup for a company still reeling from the FDA crackdown.

23andMe’s Genetic Database

The company is branding this as their Research Portal Platform and the allure is fairly obvious:

  • They have banked and genotyped samples from 800,000+ paying customers
  • The database continues to grow, especially from customers outside the U.S. who can still get the “full” service
  • So far, about 600,000 customers have agreed (“consented”) to participate in research studies
  • 23andMe continues to collect phenotype data via customer outreach

In other words, 23andMe has a catalogue of 600,000 samples that are (1) already genotyped, (2) broadly consented for research, and (3) easy to recontact as needed. It’s the kind of cohort that genetics researchers are currently salivating over, especially in the era of large-scale sequencing studies.

Suffice it to say that the company stands to make a lot more money from this than from their $99 genetic testing kit.

Sample Consent for Research

23andMe Health History

23andMe: Hey, tell us everything

I will tell you this: when it comes to consenting its customers, 23andMe sure knows how to sell it. The text for the “Basic Research Consent” is as follows:

Giving consent means that your Genetic & Self-Reported Information may be used in an aggregated form, stripped of identifying registration information (such as name, email, address), by 23andMe for peer-reviewed scientific research. To learn more about how 23andMe safeguards your privacy, read our Privacy Statement.

You can change your mind at any time by changing your settings below. Contact us with any questions.

It’s interesting to note that the basic consent is for use by 23andMe for peer-reviewed scientific research. The Genentech deal would appear to fall outside both restrictions, since the data will be used by a third party and seems unlikely to undergo peer review. Of course, it’s possible (likely, even) that 23andMe drew up a different consent for Parkinson’s patients.

Genotype and Phenotype Data

23andMe health questions

23andMe Pops A Question

The high-density SNP array used for 23andMe genotyping is a custom design provided by Illumina. If memory serves, it’s the OmniExpress (700K) chip plus a few hundred thousand markers of interest examined by 23andMe. The company presented a poster at ASHG with a list of some of the types of data that would be available through their research portal:

  • Genotypes
  • Conditions/Diagnoses
  • Medication Usage
  • Response to Medication
  • Family History of Disease
  • Health Behaviors
  • Personality Traits
  • Environmental Exposures
  • Geographic Location

Importantly, nearly all of the phenotype information is freely offered up by 23andMe customers. The company collects it primarily through surveys. There’s a big health history survey when you sign up for the service, and then there are little follow-ups. Like the one at the right, found in the sidebar today when I logged in. It’s inviting and casual… sort of a “Hey, while you’re here, have you ever had….”

On the plus side, it’s a very non-invasive way of collecting information. The customer (me) is logged in and poking around already. With a little planning, 23andMe can ask countless questions like these and add them to a user’s profile. On the down side, this is completely self-reported. We’re not in a doctor’s office here; there’s no requirement for truth. So while I’m more likely to be answering questions like these, I might very easily (1) make a mistake, because I’m not a physician, or (2) lie just because I feel like it. I paid 23andMe, which gives me a little sense of entitlement.

Admittedly, we are willing participants. 23andMe makes no secret of its hopes to use my DNA for research purposes, and I have no problem with that. Then again, I have a better understanding of what it means than most of their customers.

Features of the 23andMe Cohort

The FDA brought the hammer down on 23andMe’s doling out of health-related findings to its customers, but as hinted at in the MIT Technology Review article, the company obviously has a second agenda. They’re assembling one of the most valuable human genetics research cohorts in the world. Some quick highlights of the 600k+ consented participants:

  • Ancestry: 77% European, 10% Latino, 5% African American, 4% Asian, 2% South Asian, 2% Other
  • Gender: 52% male, 48% female.
  • Cancer: 33,000 cases, comprising breast (6,000), prostate (5,000), colorectal (1,700) and other cancers. Of these, 5,000 have undergone chemotherapy. The cohort also has 405,000 “confirmed” controls, i.e. people who indicated they’ve never had cancer.
  • 120,000 APOE e4 allele carriers (the risk allele for Alzheimer’s)
  • 10,000 Parkinson’s patients
  • 10,000 patients with autoimmune diseases (rheumatoid arthritis, IBD, lupus, etc.).

Perhaps the most important consideration is that these individuals can be easily re-contacted. That means 23andMe can keep building phenotype data and recruiting candidates for genetic studies. They’ve banked saliva from every customer, and could presumably try to get blood or other tissue from agreeable participants.

It may not be the most ethnically diverse, carefully stratified, or rigorously phenotyped cohort. But at 600,000 individuals, it’s certainly one of the largest. We in the genetics community will be watching with great interest.


Illumina’s New HiSeq X Instruments

Ah, Illumina. You have to admire their marketing savvy. Last year at around this time, they announced the HiSeq X Ten system, a “factory installation” for human whole genome sequencing (only) at an incredible scale: 18,000 genomes per year, at a cost of $1,000 each (for consumables). It’s still the sequencing-by-synthesis technology employed by previous Illumina instruments, but with a new “patterned” flowcell that spaces the clusters more evenly, for a more efficient sequencing yield.

The cost, of course, was considerable: $10 million for 10 instruments, the required minimum. The operating costs are also considerable, not just for reagents (up to $18 million per year at full capacity, going by 18,000 genomes at $1,000 each) but secondary things like disk, compute, and even internet bandwidth. It’s a massive amount of data to store, analyze, and submit.

There’s Always A Catch

Importantly, the HiSeq X systems can only be applied to human whole genome sequencing. Human means no plants, animals, or model organisms. Whole genome means no targeted/exome sequencing, no RNA-seq. All of those applications will have to go on other platforms.

According to their CEO, Illumina sold 18 HiSeq X Ten systems last year. That’s an impressive number, and even more than they expected. The total capacity exceeds 400,000 genomes per year (some installations got more than 10 instruments). Filling all of that capacity was (and remains) a major challenge, because human samples are precious. They require informed consent for sequencing, IRB review, privacy protections. Human samples are the new commodity.

The HiSeq X Five

Illumina hiseq x system

The HiSeq X Five

Now there’s a second option: as announced earlier this month, Illumina will be selling HiSeq X Five systems (5 instruments) for $6 million each. The lower buy-in likely means that even more groups can adopt the HiSeq X technology. They’ll have the same restrictions as the HiSeq X Ten, but half of the capacity (9,000 genomes per year). That’s still a considerable number of whole genomes. Probably more than have been sequenced by the research community in the past five years.

The per-genome cost will be higher, too: $1,400 per sample. That’s 40% higher than the cost on the X Ten, but I think it’s around half of what it would cost on the HiSeq2500.
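
For anyone who wants to check the capacity and cost arithmetic in this post, here is a quick sketch. All inputs are figures quoted above; the one assumption I’m adding is that throughput scales linearly with the number of instruments.

```python
import math

# Figures quoted in this post.
X_TEN_INSTRUMENTS = 10
X_TEN_GENOMES_PER_YEAR = 18_000
X_TEN_COST_PER_GENOME = 1_000    # USD
X_FIVE_INSTRUMENTS = 5
X_FIVE_COST_PER_GENOME = 1_400   # USD
SYSTEMS_SOLD = 18
QUOTED_TOTAL_CAPACITY = 400_000  # genomes per year

# Assumption: capacity scales linearly with instrument count.
per_instrument = X_TEN_GENOMES_PER_YEAR / X_TEN_INSTRUMENTS
x_five_capacity = per_instrument * X_FIVE_INSTRUMENTS

# Capacity if every system sold were a minimum 10-instrument install.
base_capacity = SYSTEMS_SOLD * X_TEN_GENOMES_PER_YEAR
# Instruments needed to reach the quoted total capacity.
min_instruments = math.ceil(QUOTED_TOTAL_CAPACITY / per_instrument)

premium = (X_FIVE_COST_PER_GENOME - X_TEN_COST_PER_GENOME) / X_TEN_COST_PER_GENOME

print(f"genomes per instrument per year: {per_instrument:,.0f}")
print(f"X Five capacity: {x_five_capacity:,.0f} genomes/year")
print(f"18 minimum installs: {base_capacity:,} genomes/year")
print(f"instruments implied by 400,000/year: {min_instruments}")
print(f"X Five per-genome premium: {premium:.0%}")
```

Eighteen minimum-size installations would fall short of 400,000 genomes per year, which squares with the note that some sites bought more than ten instruments.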

HiSeq3000 and HiSeq4000

There’s also a new generation of HiSeq instruments, succeeding the HiSeq2500, that will become available later this year. The HiSeq3000 will run a single patterned flowcell for 750 Gbp per run. The HiSeq4000 will run two patterned flowcells, for twice that capacity.

Eventually, these will supplant current HiSeq2500 instruments. I expect they’ll be busy, too, running the exomes, the targeted sequencing, the RNA-seq and bisulfite sequencing.

The Promise of Human WGS

But back to the HiSeq X systems. Personally, I don’t like the idea of a single company dominating the market, and essentially attempting to dictate how human genetics research should be conducted. At the same time, I can’t argue with the direction we’re headed. We had high hopes for SNP arrays and GWAS, but as I discussed in my previous post, sequencing at large scale is required to uncover the full scope of genetic variation underlying complex phenotypes.

And let’s face it, exome sequencing lets us conveniently avoid some of the most challenging aspects of human genomics, like detecting complex rearrangements (SVs) and interpreting noncoding regulatory variants. Both are undoubtedly important to human disease, but more difficult to study. Yet the only way we’ll make progress is to study them in large cohorts numbering thousands of samples. Now, at least, we have the tools to do that.