Targeted Sequencing of GWAS Loci for Cleft Lip

In the last decade, genome-wide association studies (GWAS) enabled by cheap, high-throughput SNP genotyping have identified thousands of loci that influence disease susceptibility, quantitative traits, and other complex phenotypes. The genetic markers on high-density SNP arrays are carefully chosen to capture (or “tag”) most common haplotypes in human populations. Common SNPs tend to be more informative in this regard, and most of these fall outside the exons of protein-coding genes.

Credit: Leslie et al, AJHG 2015

This efficiency is both the strength and the weakness of SNP arrays: the markers on them are well-suited to represent variation across the human genome, but they’re unlikely to be causal variants themselves. In essence, the loci uncovered by GWAS are signposts that tell us where to look for functional variation that influences a trait of interest.

Following up GWAS hits with sequencing and functional validation will ultimately be required to understand the mechanism of disease. A paper online at The American Journal of Human Genetics offers an informative example of how that plays out.

Cleft Lip/Palate as a Complex Trait

Non-syndromic cleft lip with or without cleft palate (NSCL/P) affects about 1 in 700 live births, and represents a global health problem (particularly in the developing world). Multiple genetic and environmental risk factors give rise to a complex etiology for this trait. One candidate gene (IRF6) was known to harbor common variants associated with NSCL/P, and large-scale GWAS efforts have yielded 12 additional loci reproducibly associated with it.

To further investigate the genetic architecture of this phenotype, a group of researchers from several institutions (including the Genome Institute at WashU) sequenced those 13 regions in over 4,000 individuals. At the study design stage, the researchers made two key decisions that undoubtedly contributed to their success:

  1. A case-parent trio design. Most of the samples chosen came in the form of an affected child and two unaffected parents. This structure makes it possible to examine not just the presence or absence of alleles, but whether or not they’re transmitted to the affected child. It also permitted a search for de novo mutations that might contribute to susceptibility.
  2. A wide target region for each locus, including both coding and non-coding regions. The latter type are increasingly important as we delve into complex traits in which a significant fraction (if not a majority) of causal variants will be regulatory rather than coding in nature.

De novo Mutations

Just as one can’t truly identify somatic mutations in a tumor tissue without a matched normal, it’s nearly impossible to distinguish de novo mutations in a patient without sequencing both of his or her parents. This is the only way, people. Filtering against dbSNP is not going to get you there.
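The trio logic can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the genotype coding (count of alternate alleles), the function names, and the site labels are all hypothetical, and a real caller would also weigh read depth and genotype quality.

```python
# Genotypes are coded as the number of alternate alleles: 0, 1, or 2.
# The data layout and all names here are hypothetical, for illustration only.

def is_candidate_de_novo(child, mother, father):
    """A site is a de novo candidate when the child carries an alternate
    allele that neither parent carries."""
    return child >= 1 and mother == 0 and father == 0


def candidate_de_novos(trio_calls):
    """trio_calls maps a site label to (child, mother, father) genotypes."""
    return [site for site, (child, mother, father) in trio_calls.items()
            if is_candidate_de_novo(child, mother, father)]


calls = {
    "site_A": (1, 0, 0),  # child het, both parents hom-ref: candidate de novo
    "site_B": (1, 1, 0),  # inherited from the mother
    "site_C": (0, 0, 0),  # nobody carries the allele
    "site_D": (1, 0, 1),  # inherited from the father
}
print(candidate_de_novos(calls))
```

Without the parental genotypes, site_A, site_B, and site_D would look identical in the child, which is why a database filter alone can't do this job.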

The thing about de novo mutations is that they’re exquisitely rare — according to estimates of the de novo mutation rate, any given individual should have around 34 mutations genome-wide. Since the target space for this study represented 0.19% of the genome, that’s a long shot. Then again, we’re talking about a lot of trios, and they’re selected for a trait that’s been linked to this target space.

My back-of-the-envelope calculations based on the amount of target space (6.3 Mbp), the de novo mutation rate (1e-08), and the number of trios sequenced here (1,409) suggest that we’d expect ~89 de novo mutations. The authors came up with 123, which is a little high. We do have an enriched population, but calling de novos is very likely to yield some false positives.

They were able to design assays for 82 mutations and confirmed 66 (80%) by Sanger sequencing. That’s a good validation rate, and it suggests that about 98 of the predicted mutations would hold up (pretty close to my estimate).
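For anyone who wants to check the arithmetic, here is the back-of-the-envelope calculation as a short Python sketch. All inputs are the figures quoted above; the variable names are mine.

```python
target_bp = 6.3e6        # sequenced target space in bp, from the text
mutation_rate = 1e-8     # de novo mutations per base per generation, from the text
n_trios = 1409           # trios sequenced, from the text

# Expected number of de novo mutations across all trios in the target space.
expected = target_bp * mutation_rate * n_trios
print(f"expected de novo mutations: {expected:.1f}")

called = 123             # de novo calls reported by the authors
assayed = 82             # calls for which a Sanger assay could be designed
confirmed = 66           # calls confirmed by Sanger sequencing

# Extrapolate the validation rate to the full call set.
validation_rate = confirmed / assayed
projected_true = called * validation_rate
print(f"validation rate: {validation_rate:.1%}")
print(f"projected true calls: {projected_true:.0f}")
```

With the unrounded validation rate the projection comes out nearer 99 than 98; the small difference is just rounding of the 80% figure.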

Only 3 of 66 confirmed de novo mutations (4.5%) altered protein sequence. The majority (95%) were noncoding, though 11 of these mapped to a predicted regulatory element.

Common Variant Associations

To identify common functional variants, the authors used an allelic transmission disequilibrium test (TDT), which determines if an allele is transmitted more (or less) often than we’d expect by chance. All but one of the GWAS regions (PAX7) showed evidence of association with p-values less than 10^-5. In general, the results supported the GWAS findings: the variants yielding the lowest p-value were either the lead GWAS SNP or were in perfect LD with the lead GWAS SNP.
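In its classic form, the TDT boils down to a McNemar-style chi-square on transmissions from heterozygous parents. The sketch below shows that textbook statistic with made-up counts; it is not the authors' exact implementation, and the function name is hypothetical.

```python
from math import erfc, sqrt


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one allele.

    transmitted / untransmitted count how often heterozygous parents did or
    did not pass the allele to an affected child. Returns (chi-square, p).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: 60 transmissions vs. 40 non-transmissions.
chi2, p = tdt(transmitted=60, untransmitted=40)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under the null hypothesis a heterozygous parent passes either allele with equal probability, so only the imbalance between the two counts carries signal.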

A conditional analysis revealed only one locus (ARHGAP29) with evidence for secondary independent signals, suggesting more than one common functional variant.

Rare Variant Associations

The common variants explained only a fraction of the heritability for disease. Yet these GWAS regions were also logical candidates for rare variation that contributes to disease. The challenge with rare variants is that one needs thousands of samples to even see them, much less establish statistically significant association.

To address this, we often collapse individual rare variants to the gene, regulatory element, or genomic interval in which they occur to get the power up. These so-called burden tests can boost the power for detection, but that’s dependent on one’s ability to predict which variants are truly functional. In this study, neither gene-based nor regulatory element-based burden tests yielded significant associations.

A ScanTrio analysis of genomic intervals — after experimenting with different window sizes and overlaps — yielded signals for 2 of the 13 regions (NOG and NTN1).

Functional Validation

The challenge of genetic association studies — especially ones for complex phenotypes — is confirming statistical evidence of association with a functional assay. Sequencing and genotyping methods continually become faster and cheaper. With functional validation, the only paradigm shift is that more and more journals want to see it before they publish a genetic study.

PAX7 de novo Mutation

One of the few de novo coding mutations was predicted to disrupt the DNA-binding domain of PAX7. The authors designed an electrophoretic mobility shift assay (EMSA) to examine how the missense substitution affected PAX7’s ability to bind a target regulatory sequence. They also used quantitative reporter assays in HeLa cells, with co-transfection of a plasmid containing either wild-type or mutant PAX7.

PAX7 functional validation (Leslie et al, AJHG 2015)


Both experiments showed that the wild-type allele had greater DNA-binding capacity, and drove higher expression of the reporter gene.

FGFR2 de novo Mutation

One of the noncoding de novo mutations was 254 kilobases downstream of FGFR2, in a noncoding region that looks (according to chromatin marks) like a neural crest enhancer. FGFR2 is known to play a role in craniofacial development, and rare variants in it had been reported in cases of NSCL/P. Here, the authors leveraged a zebrafish model system to examine the role of that enhancer during development. In transient transgenic reporter studies of zebrafish embryos, the +254kb element holds up: the wild-type allele had enhancer activity in 41/82 embryos (50%), whereas the mutant allele had enhancer activity in 3/83 embryos (3.6%).
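To get a feel for how lopsided those embryo counts are, one option is a two-sided Fisher's exact test on the 2x2 table. The post does not say which test the authors used, so treat this as an assumed, illustrative choice; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x "active" embryos in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


# Embryo counts from the text: active vs. inactive for each allele.
wt_active, wt_total = 41, 82
mut_active, mut_total = 3, 83

p_enh = fisher_exact_two_sided(wt_active, wt_total - wt_active,
                               mut_active, mut_total - mut_active)
print(f"wild-type: {wt_active / wt_total:.1%}, mutant: {mut_active / mut_total:.1%}")
print(f"two-sided Fisher exact p = {p_enh:.2e}")
```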

FGFR +254kb enhancer (Leslie et al, AJHG 2015)

This, in my opinion, is one of the most compelling parts of this study: in vivo functional validation of a single base change in a noncoding enhancer that’s hundreds of thousands of bases away from the gene it regulates.

Common Variant at 17q22 (NOG)

Multiple SNPs reached genome-wide significance in the 17q22 region. The greatest significance was detected at rs227727, about 105 kb downstream of the NOG transcriptional start site. This variant was in complete LD with the lead SNP from the prior GWAS. This was interesting because NOG encodes a BMP antagonist that’s expressed primarily in the epithelium during palatal development.

Tandem enhancer disruption (Leslie et al, AJHG 2015)

The authors confirmed NOG expression in the palatal epithelium in mouse embryos. They also noted that rs227727 mapped to one of two enhancers in the region, +105kb (the other being +87kb). The variant allele disrupts predicted binding sites for two transcription factors (MEF2C and CDX2) and creates possible binding sites for at least two others.

Interestingly, the zebrafish assay did not show epithelial enhancer activity for the +105kb element by itself. However, a tandem construct with both enhancers (+87kb and +105kb) lit things up. The effect was at least additive, and constructs containing the risk allele of rs227727 showed significantly decreased enhancer activity.

Beyond GWAS for Complex Disease

What I like about this study is that it followed up GWAS and candidate gene regions with careful investigations that included functional validation. It’s so easy to take a GWAS hit, look for the nearest gene, and spin a story about how variation in that gene affects the phenotype of interest. Here, the authors have done the difficult and time-consuming work of (1) exhaustive sequencing to identify the possible functional variants, and (2) in vivo functional assays to prove that the implicated variants have a phenotypic effect. That’s a lot of work to pin down the genetic architecture and disease mechanism for a handful of disease loci.

The high-throughput nature of genotyping (and increasingly, sequencing) and the discovery power of large cohorts are going to yield promising new findings. With them comes a strong temptation to take the association hits and run with them. Write up some voodoo in the discussion about the gene and its role, get the paper out, and move on. The problem is that statistical genetic evidence is not enough. You don’t know that your lead SNP is functional, or that the nearest neighboring gene provides the mechanism of phenotypic effect.

More studies like this one, with well-planned study designs and compelling functional assays, will be required as we continue to unravel the complex fabric of human genetics.


Leslie, E., Taub, M., Liu, H., Steinberg, K., Koboldt, D., Zhang, Q., Carlson, J., Hetmanski, J., Wang, H., Larson, D., Fulton, R., Kousa, Y., Fakhouri, W., Naji, A., Ruczinski, I., Begum, F., Parker, M., Busch, T., Standley, J., Rigdon, J., Hecht, J., Scott, A., Wehby, G., Christensen, K., Czeizel, A., Deleyiannis, F., Schutte, B., Wilson, R., Cornell, R., Lidral, A., Weinstock, G., Beaty, T., Marazita, M., & Murray, J. (2015). Identification of Functional Variants for Cleft Lip with or without Cleft Palate in or near PAX7, FGFR2, and NOG by Targeted Sequencing of GWAS Loci. The American Journal of Human Genetics. DOI: 10.1016/j.ajhg.2015.01.004

The Value of the Cohort: 23andMe’s Research Portal

23andMe has been an interesting company to watch over the last five years. For a variety of reasons, they remain the most visible direct-to-consumer (DTC) genetic testing company, and have also become Illumina’s single biggest customer for high-density SNP arrays. As I’ve written about before, I underwent the 23andMe genetic testing service just months before the FDA’s cease-and-desist letter on the medical/health reporting aspects of that service. So I’ve been able to see things from the consumer side of it as well.

An article this month in the MIT Technology Review examines 23andMe’s new formula for business success: building up and selling access to their ever-growing database of willing research participants. This is not a new direction for the company, but is garnering more attention after they signed a deal with Genentech under which the pharma giant will pay up to $60 million for access to ~3,000 Parkinson’s disease patients in 23andMe’s database. That’s about $20,000 per sample, and a major coup for a company still reeling from the FDA crackdown.

23andMe’s Genetic Database

The company is branding this as their Research Portal Platform and the allure is fairly obvious:

  • They have banked and genotyped samples from 800,000+ paying customers
  • The database continues to grow, especially from customers outside the U.S. who can still get the “full” service
  • So far, about 600,000 customers have agreed (“consented”) to participate in research studies
  • 23andMe continues to collect phenotype data via customer outreach

In other words, 23andMe has a catalogue of 600,000 samples that are (1) already genotyped, (2) broadly consented for research, and (3) easy to recontact as needed. It’s the kind of cohort that genetics researchers are currently salivating over, especially in the era of large-scale sequencing studies.

Suffice it to say that the company stands to make a lot more money from this than from their $99 genetic testing kit.

Sample Consent for Research

23andMe: Hey, tell us everything

I will tell you this: when it comes to consenting its customers, 23andMe sure knows how to sell it. The text for the “Basic Research Consent” is as follows:

Giving consent means that your Genetic & Self-Reported Information may be used in an aggregated form, stripped of identifying registration information (such as name, email, address), by 23andMe for peer-reviewed scientific research. To learn more about how 23andMe safeguards your privacy, read our Privacy Statement.

You can change your mind at any time by changing your settings below. Contact us with any questions.

It’s interesting to note that the basic consent is for use by 23andMe for peer-reviewed scientific research. The Genentech deal would appear to fall outside both restrictions, since the data will be used by a third party and seems unlikely to undergo peer review. Of course, it’s possible (likely, even) that 23andMe drew up a different consent for Parkinson’s patients.

Genotype and Phenotype Data

23andMe Pops A Question

The high-density SNP array used for 23andMe genotyping is a custom design provided by Illumina. If memory serves, it’s the OmniExpress (700K) chip plus a few hundred thousand markers of interest examined by 23andMe. The company presented a poster at ASHG with a list of some of the types of data that would be available through their research portal:

  • Genotypes
  • Conditions/Diagnoses
  • Medication Usage
  • Response to Medication
  • Family History of Disease
  • Health Behaviors
  • Personality Traits
  • Environmental Exposures
  • Geographic Location

Importantly, nearly all of the phenotype information is freely offered up by 23andMe customers. The company collects it primarily through surveys. There’s a big health history survey when you sign up for the service, and then there are little follow-ups. Like the one at the right, found in the sidebar today when I logged in. It’s inviting and casual… sort of a “Hey, while you’re here, have you ever had….”

On the plus side, it’s a very non-invasive way of collecting information. The customer (me) is logged in and poking around already. With a little planning, 23andMe can ask countless questions like these and add them to a user’s profile. On the down side, this is completely self-reported. We’re not in a doctor’s office here; there’s no requirement for truth. So while I’m more likely to be answering questions like these, I might very easily (1) make a mistake, because I’m not a physician, or (2) lie just because I feel like it. I paid 23andMe, which gives me a little sense of entitlement.

Admittedly, we are willing participants. 23andMe makes no secret of its hopes to use my DNA for research purposes, and I have no problem with that. Then again, I have a better understanding of what it means than most of their customers.

Features of the 23andMe Cohort

The FDA brought the hammer down on 23andMe’s doling out of health-related findings to its customers, but as hinted at in the MIT Technology Review article, the company obviously has a second agenda. They’re assembling one of the most valuable human genetics research cohorts in the world. Some quick highlights of the 600k+ consented participants:

  • Ancestry: 77% European, 10% Latino, 5% African American, 4% Asian, 2% South Asian, 2% Other
  • Gender: 52% male, 48% female.
  • Cancer: 33,000 cases, comprising breast (6,000), prostate (5,000), colorectal (1,700) and other cancers. Of these, 5,000 have undergone chemotherapy. The cohort also has 405,000 “confirmed” controls, i.e. people who indicated they’ve never had cancer.
  • 120,000 APOE e4 allele carriers (the risk allele for Alzheimer’s)
  • 10,000 Parkinson’s patients
  • 10,000 patients with autoimmune diseases (rheumatoid arthritis, IBD, lupus, etc.).

Perhaps the most important consideration is that these individuals can be easily re-contacted. That means 23andMe can keep building phenotype data and recruiting candidates for genetic studies. They’ve banked saliva from every customer, and could presumably try to get blood or other tissue from agreeable participants.

It may not be the most ethnically diverse, carefully stratified, or rigorously phenotyped cohort. But at 600,000 individuals, it’s certainly one of the largest. We in the genetics community will be watching with great interest.


Illumina’s New HiSeq X Instruments

Ah, Illumina. You have to admire their marketing savvy. Last year at around this time, they announced the HiSeq X Ten system, a “factory installation” for human whole genome sequencing (only) at an incredible scale: 18,000 genomes per year, at a cost of $1,000 each (for consumables). It’s still the sequencing-by-synthesis technology employed by previous Illumina instruments, but with a new “patterned” flowcell that spaces the clusters more evenly, for a more efficient sequencing yield.

The cost, of course, was considerable: $10 million for 10 instruments, the required minimum. The operating costs are also considerable, not just for reagents (up to $18 million per year at 18,000 genomes and $1,000 each) but secondary things like disk, compute, and even internet bandwidth. It’s a massive amount of data to store, analyze, and submit.

There’s Always A Catch

Importantly, the HiSeq X systems can only be applied to human whole genome sequencing. Human means no plants, animals, or model organisms. Whole genome means no targeted/exome sequencing, no RNA-seq. All of those applications will have to go on other platforms.

According to their CEO, Illumina sold 18 HiSeq X Ten systems last year. That’s an impressive number, and even more than they expected. The total capacity exceeds 400,000 genomes per year (some installations got more than 10 instruments). Filling all of that capacity was (and remains) a major challenge, because human samples are precious. They require informed consent for sequencing, IRB review, privacy protections. Human samples are the new commodity.

The HiSeq X Five

The HiSeq X Five

Now there’s a second option: as announced earlier this month, Illumina will be selling HiSeq X Five systems (5 instruments) for $6 million each. The lower buy-in likely means that even more groups can adopt the HiSeq X technology. They’ll have the same restrictions as the HiSeq X Ten, but half of the capacity (9,000 genomes per year). That’s still a considerable number of whole genomes. Probably more than have been sequenced by the research community in the past five years.

The per-genome cost will be higher, too: $1,400 per sample. That’s 40% higher than the cost on the X Ten, but I think it’s around half of what it would cost on the HiSeq2500.
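The capacity and cost figures are easy to sanity-check. The sketch below uses only numbers quoted in this post; the variable names are mine, and the fleet figure assumes exactly 10 instruments per X Ten installation, which is a floor rather than the real total.

```python
genomes_per_year_x_ten = 18_000   # per HiSeq X Ten system, from the text
cost_per_genome_x_ten = 1_000     # USD per genome, from the text
systems_sold = 18                 # HiSeq X Ten systems sold, from the text

# Minimum fleet capacity if every system had exactly 10 instruments.
fleet_capacity = systems_sold * genomes_per_year_x_ten
print(f"fleet capacity (10 instruments each): {fleet_capacity:,} genomes/year")

# The X Five has half the instruments, so half the capacity.
genomes_per_year_x_five = genomes_per_year_x_ten // 2
cost_per_genome_x_five = 1_400    # USD per genome, from the text
premium = cost_per_genome_x_five / cost_per_genome_x_ten - 1
print(f"X Five capacity: {genomes_per_year_x_five:,} genomes/year")
print(f"X Five per-genome premium: {premium:.0%}")
```

The gap between this floor and the 400,000+ total quoted earlier is accounted for by installations that bought more than 10 instruments.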

HiSeq3000 and HiSeq4000

There’s also a new generation of the HiSeq2500 instrument to become available later this year. The HiSeq3000 will run a single patterned flowcell for 750 Gbp per run. The HiSeq4000 will run two patterned flowcells, for twice that capacity.

Eventually, these will supplant current HiSeq2500 instruments. I expect they’ll be busy, too, running the exomes, the targeted sequencing, the RNA-seq and bisulfite sequencing.

The Promise of Human WGS

But back to the HiSeq X systems. Personally, I don’t like the idea of a single company dominating the market, and essentially attempting to dictate how human genetics research should be conducted. At the same time, I can’t argue with the direction we’re headed. We had high hopes for SNP arrays and GWAS, but as I discussed in my previous post, sequencing at large scale is required to uncover the full scope of genetic variation underlying complex phenotypes.

And let’s face it, exome sequencing lets us conveniently avoid some of the most challenging aspects of human genomics, like detecting complex rearrangements (SVs) and interpreting noncoding regulatory variants. Both are undoubtedly important to human disease, but more difficult to study. Yet the only way we’ll make progress is to study them in large cohorts numbering thousands of samples. Now, at least, we have the tools to do that.

Common disease genomics by large-scale sequencing

Understanding the genetic basis of common disease is an important goal for human genetics research. Nothing that we do is easy, and the ~25% success rate of exome sequencing in monogenic (Mendelian) disorders is proof enough of that. But the challenges of complex disease genetics are more considerable still.

Cardiovascular and metabolic diseases in particular arise from a complex array of factors beyond genetics, such as age, diet, and lifestyle. We also expect that most of the genetic variants conferring risk will have small effect sizes, which makes their identification all the more difficult.

Common Variation: the GWAS

We do have some powerful tools. Over the last decade, researchers have leveraged high-density SNP array genotyping — which is relatively cheap, high-throughput, and captures the majority of common genetic variation in human populations — to conduct massive genome-wide association studies (GWAS) of common disease.

This approach has yielded thousands of genetic associations, implicating certain loci in the risk for certain diseases.

Rare Variation: Sequencing Required

Yet the variants identified (and genes implicated) explain only a fraction of the genetic component of these diseases, and they generally don’t interrogate rare variation, i.e. variants with a frequency of <1% in the population. The only way to get at these is by sequencing, and the rapid evolution of next-generation sequencing technologies has begun to make that feasible.

A new study in Nature describes such an effort: a search for rare variants associated with risk for myocardial infarction (MI), or in layman’s terms, heart attack. It not only yielded some key discoveries, but showcased some of the challenges and expectations we should have in mind when undertaking large-scale sequencing studies of common disease.

NHLBI’s Exome Sequencing Project

A few years ago, the National Heart, Lung, and Blood Institute of the NIH did something very wise: they funded a large-scale exome sequencing project (referred to by many as “the ESP”) comprising several thousand samples from a number of cohorts. As one of the earliest widely-available exome sequencing datasets at this scale, the NHLBI-ESP quickly became an important resource for the human genetics community.

At the most basic level, it tells us the approximate frequencies of hundreds of thousands of coding variants in European and African populations. Unlike the 1000 Genomes Project, however, the ESP collected deep phenotyping data, enabling genetic studies of many complex phenotypes.

First Pass: Association and Burden of Rare Variants

Discovery phase: case/control selection (R. Do et al, Nature 2015)

Ron Do and his 90+ co-authors designed a discovery study for the extreme phenotype of early-onset MI. Across 11 studies in the ESP, they identified 1,088 individuals who’d had a heart attack at an early age (<50 for men, <60 for women). As a control group, they selected 978 individuals who were at least a decade older than that but had had no heart attack. And with the exome data already in hand, they could search for rare variation associated with the phenotype (early-onset MI) in different ways:

  1. Individual variant associations. Among low-frequency (MAF 1-5%) coding variants, no single variant was significantly associated with the phenotype.
  2. Gene-based associations. Rather than considering individual variants, the authors looked at the “burden” of rare variants at the gene level. For each gene, the authors compared the fraction of samples with at least one rare (MAF<1%) coding variant between cases and controls. No genes had significant associations.

Importantly, gene-based association tests (also called “burden tests”) can be performed in a variety of ways. What frequency threshold should be used? What distinguishes a benign variant from a damaging one? The authors set a MAF ceiling of 1% and considered three sets of variants:

  • Nonsynonymous. All missense, splice site, nonsense, and frameshift variants.
  • Deleterious. The nonsynonymous set above, minus missense variants predicted to be benign by Polyphen2.
  • Disruptive. Nonsense, splice site, and frameshift variants only.
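The three variant sets and the per-gene carrier comparison can be sketched as follows. This is a toy illustration for a single gene: the record fields, consequence labels, and sample data are hypothetical, and a real analysis would follow the carrier counts with a formal test and permutation.

```python
# Hypothetical annotation records; field names are assumptions for illustration.
DISRUPTIVE = {"nonsense", "splice_site", "frameshift"}
NONSYNONYMOUS = DISRUPTIVE | {"missense"}


def in_mask(variant, mask, maf_ceiling=0.01):
    """Return True if a variant belongs to the named variant set."""
    if variant["maf"] >= maf_ceiling or variant["consequence"] not in NONSYNONYMOUS:
        return False
    if mask == "nonsynonymous":
        return True
    if mask == "deleterious":
        # Drop missense variants that PolyPhen-2 predicts to be benign.
        return not (variant["consequence"] == "missense"
                    and variant["polyphen"] == "benign")
    if mask == "disruptive":
        return variant["consequence"] in DISRUPTIVE
    raise ValueError(f"unknown mask: {mask}")


def carrier_fraction(samples, variants, mask):
    """Fraction of samples carrying at least one variant in the mask."""
    qualifying = {vid for vid, v in variants.items() if in_mask(v, mask)}
    carriers = sum(1 for carried in samples if carried & qualifying)
    return carriers / len(samples)


# Toy data for one gene: variant annotations, then variant IDs per sample.
variants = {
    "v1": {"consequence": "missense", "maf": 0.002, "polyphen": "benign"},
    "v2": {"consequence": "missense", "maf": 0.001, "polyphen": "damaging"},
    "v3": {"consequence": "nonsense", "maf": 0.0005, "polyphen": None},
    "v4": {"consequence": "missense", "maf": 0.03, "polyphen": "damaging"},  # too common
}
cases = [{"v1"}, {"v2"}, {"v3"}, set()]
controls = [{"v1"}, {"v4"}, set(), set()]

for mask in ("nonsynonymous", "deleterious", "disruptive"):
    print(mask, carrier_fraction(cases, variants, mask),
          carrier_fraction(controls, variants, mask))
```

Each stricter set throws away more variants, so it trades carrier counts (and therefore power) for a cleaner signal.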

These were reasonable choices, comparable to what we or other groups do in this kind of study. Still, there were no significant results so it was on to phase 2.

Genotype Imputation and Exome Chip

PCA analysis (R. Do et al, Nature 2015)

It’s very possible that there are individual variants and genes associated with the phenotype, but the authors didn’t examine enough samples to find them (by their own calculations, in a best-case scenario the power for a study of this size was about 0.2).


So they pursued a few strategies to increase the sample numbers substantially. Across the 11 cohorts there were over 25,000 early-onset MI cases (and an even larger number of suitable controls) but these samples only had SNP array data, and the vast majority of markers on SNP arrays are non-coding.

Low freq. variant follow-up (R. Do et al, Nature 2015)

So the authors undertook a major effort to impute (statistically predict) the genotypes of 400,000 coding variants based on the SNP array data and a reference panel of samples that had both SNP array and exome data. This was a herculean effort that only merited two sentences in the main text (there are severe restrictions on a “letter” to Nature) because it yielded no finding: no significant association, even with imputed genotypes for 28,068 cases and 36,064 controls.

The authors also performed high-throughput genotyping with the so-called “exome chip,” which looks at ~250,000 known coding variants, in about 15,000 samples. At the time, the cost of running that many exome chips likely exceeded $1 million. Yet there were no significant associations, so that, too, got two sentences in the main text.

Targeted Resequencing Follow-up

Sequencing follow up (R. Do et al, Nature 2015)

The authors needed sequencing data, but they also needed more samples. These things not being free, they decided to choose six of the most promising genes (based on not-entirely-disclosed biologic / statistical evidence) for targeted resequencing in about 1,000 more samples. Once that was done, and the analysis performed yet again, one of those (APOA5) looked promising. So the authors sequenced just that gene in three additional studies. This was a mix of PCR-based 3730 sequencing and multiplexed long-range PCR libraries on a MiSeq instrument.

Finally, after sequencing the exons of APOA5 in 6,721 cases and 6,711 controls, the authors had an association that reached genome-wide significance: p = 5 x 10^-7 when a burden test with all nonsynonymous variants was used (the threshold is 8 x 10^-7).

More Exome Sequencing Yields Most Obvious Gene Ever

The fourth and final follow-up strategy was simply to do more exome sequencing of ~7,700 individuals, bringing the total to 9,793 samples (4,703 cases and 5,090 controls). After applying a variety of burden test strategies, the authors found exactly one gene with significant evidence of association: LDLR, which encodes the low-density lipoprotein receptor. It’s been known for many years that mutations in LDLR cause autosomal-dominant familial hypercholesterolemia, and high LDL cholesterol is one of the top risk factors for MI, so this is both a biologically plausible and completely unsurprising hit.

About 6% of cases carry a nonsynonymous variant in LDLR, compared to 4% of controls, so the odds ratio is about 1.5. This is a classic GWAS result, isn’t it? Very obvious candidate gene achieves statistical significance and the odds ratio is very low.

Interestingly, however, if the authors apply more stringent criteria for variants, the effect becomes more dramatic:

  • Deleterious variants (i.e. removing Polyphen’s benign missense) were in 3.1% of cases, 1.3% of controls, yielding the best p-value (1 x 10^-11) and an odds ratio of 2.4.
  • Strictly-deleterious missense (requiring 5/5 programs to call a missense variant deleterious) were in 1.9% of cases and 0.45% of controls, yielding a slightly higher p-value of 3 x 10^-11 but an odds ratio of 4.2.
  • Disruptive variants had the highest odds ratio (13.0), but with a much higher p-value (9 x 10^-5) and affecting just 0.5% of cases. These are basically familial hypercholesterolemia carriers.
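The odds ratios follow directly from the carrier frequencies. Here is a small Python sketch using the rounded percentages quoted above (the function name and labels are mine); because the inputs are rounded, the results only approximate the published values, which come from the actual counts.

```python
def odds_ratio(case_freq, control_freq):
    """Odds ratio from the carrier frequency in cases and in controls."""
    case_odds = case_freq / (1 - case_freq)
    control_odds = control_freq / (1 - control_freq)
    return case_odds / control_odds


# (cases, controls) carrier frequencies, as rounded in the text.
carrier_freqs = {
    "nonsynonymous": (0.06, 0.04),
    "deleterious": (0.031, 0.013),
    "strictly deleterious missense": (0.019, 0.0045),
}
for label, (case_freq, control_freq) in carrier_freqs.items():
    print(f"{label}: OR ~ {odds_ratio(case_freq, control_freq):.1f}")
```

The disruptive set is left out because only its case frequency is quoted here, so the ratio can't be recomputed.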


At first glance, one might wonder how this came to be a Nature paper because there were no truly novel findings. LDLR was already well known, and APOA5, which encodes an apolipoprotein that regulates plasma triglyceride levels, was already a strong candidate gene for MI. In fact, two other genes related to APOA5 function had already been reported for association with plasma TG levels and early-onset MI, and the gene resides in a known locus for plasma TG levels identified by classic GWAS.

True, that region had extensive LD and it wasn’t clear which of the few genes there was involved in the phenotype. And technically, APOA5 had not yet been established as a bona-fide gene for early-onset MI. This is the final nail in the coffin, but look what it took to get here: exome sequencing, followed by targeted sequencing, and then even more targeted sequencing. It’s glossed over in the paper, but every step in the authors’ pursuit of APOA5 required timely, careful analysis of the genetic evidence.

In the last part of their letter, the authors discuss some of their power calculations for large-scale genetic studies of this nature. They sought to answer that pivotal question, “How many samples do we have to sequence to find something?”

Because of the challenge of distinguishing benign from deleterious alleles, and the extreme rarity of the latter, well-powered studies of complex disease will require sequencing thousands of cases. Here are the authors’ power calculations for a gene harboring a median number of nonsynonymous variants:

Power to detect gene with median # of variants (R. Do et al, Nature 2015)

  • In a best-case scenario — a gene harboring large numbers of nonsynonymous variants, each conferring the same direction of effect — we’re talking 7,500 samples to achieve >90% power.
  • In a more likely scenario (i.e. the power calculations above) — a gene harboring median numbers of nonsynonymous variants — it’s 10,000 or more samples.
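To see why the sample sizes climb so quickly, here is a rough power sketch for a carrier-based burden test, using a normal approximation to a two-proportion comparison. It is not the authors' simulation: the control carrier frequency (1%), the odds ratio (2.0), and the function name are assumptions picked for illustration, and alpha is set to the 8 x 10^-7 threshold mentioned earlier.

```python
from math import sqrt
from statistics import NormalDist


def burden_power(n_cases, n_controls, control_carrier_freq, odds_ratio, alpha):
    """Approximate power of a two-sided test comparing carrier fractions."""
    p0 = control_carrier_freq
    # Carrier frequency in cases implied by the odds ratio.
    case_odds = odds_ratio * p0 / (1 - p0)
    p1 = case_odds / (1 + case_odds)
    # Standard error of the difference in carrier fractions.
    se = sqrt(p1 * (1 - p1) / n_cases + p0 * (1 - p0) / n_controls)
    z_effect = abs(p1 - p0) / se
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)


# Assumed scenario: 1% of controls carry a qualifying variant, OR = 2.0.
for n in (1_000, 5_000, 10_000):
    power = burden_power(n, n, control_carrier_freq=0.01,
                         odds_ratio=2.0, alpha=8e-7)
    print(f"{n:>6} cases + {n:>6} controls: power ~ {power:.2f}")
```

The stringent alpha is what hurts: the effect has to clear roughly five standard errors, and with carriers this rare that only happens once the cohort runs into the thousands.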

Generating, managing, and analyzing exome or genome sequencing data for these sample numbers is a massive undertaking. Undoubtedly, this will be the mission for us and other large-scale sequencing centers for years to come.

Do R, Stitziel NO, Won H, Jørgensen AB, Duga S, et al. (2014). Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature. PMID: 25487149