Rare Variants in Complex Disease: ABCA7 and Alzheimer’s

Although the cost of sequencing continues to fall precipitously (cue the NIH sequencing-versus-Moore’s-Law figure), it’s still expensive relative to high-throughput genotyping. Whole-genome sequencing on the X Ten costs around $2500 per sample by the time you account for basic analysis and data storage. This means that a well-powered genetic association study for complex disease (10,000 samples) would cost about $25 million just for data generation. The same cohort genotyped on a high-density SNP array might only cost about $1 million. Undoubtedly, that’s why most large-scale genome-wide association studies to date (>50,000 samples) have relied primarily on SNP array data.
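
To put rough numbers on that comparison, here’s a quick sketch in Python that uses only the figures quoted above. The variable names are mine, and the per-sample array cost is simply what the ~$1 million total would imply, not a quoted price.

```python
# Back-of-the-envelope data-generation costs, using the figures quoted in the post.
N_SAMPLES = 10_000               # "well-powered" cohort size from the text
WGS_COST_PER_SAMPLE = 2_500      # USD per genome on the X Ten, incl. basic analysis/storage
ARRAY_TOTAL_COST = 1_000_000     # USD, "about $1 million" for the same cohort


def cohort_cost(n_samples: int, cost_per_sample: float) -> float:
    """Total data-generation cost for a cohort."""
    return n_samples * cost_per_sample


wgs_total = cohort_cost(N_SAMPLES, WGS_COST_PER_SAMPLE)
array_per_sample = ARRAY_TOTAL_COST / N_SAMPLES  # implied, not a quoted price

print(f"WGS total: ${wgs_total:,.0f}")
print(f"Implied array cost per sample: ${array_per_sample:,.0f}")
print(f"WGS costs {wgs_total / ARRAY_TOTAL_COST:.0f}x the array study")
```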

There is a growing body of evidence, however, that rare variants (especially ones not present on SNP arrays) might confer a significant proportion of the genetic risk for complex disease. In age-related macular degeneration (AMD), for example, sequencing studies of moderate size (~5,000 samples) were able to identify rare coding variants in C3 and CFH associated with risk of disease. An important advantage of a sequencing approach is the ability to perform aggregation tests of private and rare coding variants (e.g. with the sequence kernel association test, SKAT) to boost the power to detect association.

A recent paper in Nature Genetics illustrates the feasibility of this approach for sequencing studies of complex disease. Stacy Steinberg and colleagues from deCODE Genetics conducted a search for rare functional variants in the known risk loci for Alzheimer’s disease (AD) using a unique resource: whole-genome sequences of 2,636 Icelanders imputed into 104,220 long-range phased individuals and their relatives.

So here we have a rare variant association study (RVAS) that employs several strategies for an efficient design:

  1. Studying an isolated population (Iceland), whose genetic structure enabled accurate genotype imputation of a large sample set (>100k individuals) from sequencing data on only about 2,600 of them.
  2. Analyzing missense variants with SKAT, which aggregates rare variants (i.e. collapses them at the level of the gene) to boost power for association but allows for multiple directions of effect.
  3. Examining only regions known to be associated with AD — which seem likely to harbor [rare] functional variants — to reduce the multiple testing penalty.

Targeted Association Studies

There are, of course, disadvantages to limiting the scope of association testing to known regions. Obviously you won’t be discovering any new associations, especially ones that sequencing (but not genotyping) might be able to uncover. Even so, you’re stacking the deck in your favor because the known GWAS loci almost certainly harbor some functional variation that hasn’t yet been fully interrogated.

Sometimes, sequencing will only serve to replicate the common variant association signal (i.e. not find anything new). Yet these targeted approaches might help narrow the boundaries of the associated region (which could encompass dozens or hundreds of genes) or, even better, identify disruptive variants whose LD with the lead SNP makes them good candidates for causal variants. One might also uncover secondary independent association signals in GWAS loci, implying that multiple haplotypes influence disease risk.

Variant Annotation and Aggregation

As anyone who has done aggregation/burden testing in association studies can tell you, the analysis choices can have a significant impact on results. The annotation tool/source, MAF threshold, and variant mask (definition of what’s deleterious and should be included) can introduce a lot of variability. In this case, the authors tried two variant masks:

  1. Loss of function variants: nonsense, frameshift or canonical splice site variants. These are usually quite rare, and so the authors collapsed them to a single “meta variant” at the level of the gene.
  2. Missense variants: nonsynonymous variants or splice region variants. This latter one is an interesting choice, and not necessarily one I’d have thought to make at the discovery stage.

The burden tests included only variants with MAF<1% and information (call rate) >0.80. The authors tested about 80 genes across the 17 loci, and the top-scoring hit was ABCA7 (p=0.00020).
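
For the curious, here’s a minimal sketch of what collapsing rare variants into a single gene-level “meta variant” can look like. This is illustrative Python, not the authors’ pipeline: the variant names and genotypes are made up, and the toy call relaxes the MAF cutoff because six samples can’t resolve a 1% frequency.

```python
from typing import Dict, List


def collapse_to_meta_variant(genotypes: Dict[str, List[int]],
                             max_maf: float = 0.01) -> List[int]:
    """Collapse per-variant allele counts (0/1/2 per sample) into one
    carrier indicator per sample: 1 if the sample carries any variant
    whose minor allele frequency is below max_maf, else 0."""
    n_samples = len(next(iter(genotypes.values())))
    carrier = [0] * n_samples
    for dosages in genotypes.values():
        alt_freq = sum(dosages) / (2 * n_samples)
        maf = min(alt_freq, 1 - alt_freq)
        if maf >= max_maf:
            continue  # too common for the rare-variant mask
        for i, dosage in enumerate(dosages):
            if dosage > 0:
                carrier[i] = 1
    return carrier


# Hypothetical genotypes for six samples at three variants in one gene.
toy_genotypes = {
    "lof_1": [0, 1, 0, 0, 0, 0],       # rare
    "lof_2": [0, 0, 0, 1, 0, 0],       # rare
    "common_snp": [1, 2, 1, 0, 1, 1],  # common, excluded by the mask
}

# Cutoff relaxed to 0.2 only because the toy cohort is tiny.
print(collapse_to_meta_variant(toy_genotypes, max_maf=0.2))
```

The resulting 0/1 vector can then be tested against case/control status like any single variant, which is the appeal when each individual loss-of-function variant is too rare to test by itself.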

Splice Region Variation in ABCA7

ABCA7 encodes ATP-binding cassette transporter A7, a member of the family of ABC transporters that move lipids across membranes. The SKAT result was primarily driven by a single variant, c.5570+5G>C. Without it, the test had a p-value of 0.46. If you’re familiar with the notation, then you know that c.5570+5 indicates a noncoding variant 5 bases into an intron. We call this the “splice region” and, unlike at the canonical splice site (+/- 2bp), it’s not clear that variants here affect splicing.

But the authors had another NGS tool to look at this: RNA-seq. When they looked at the transcript sequences of c.5570+5G>C carriers, they found transcripts with a retained intron that eventually ran into a stop codon.

Intron retention in carriers (Steinberg et al, Nat. Gen. 2015, Fig S1)

The image here is from Supplemental Figure 1 (the main text had no figures) and shows the intron retention in c.5570+5G>C carriers. Side note: according to the legend, the coordinates are on NCBI build 36, which is practically a crime. But moving on, the RNA-seq results justified including the variant in the loss-of-function test (mask #1), which then yielded a p-value of 5.3e-10 with odds ratio of 1.97.

Follow-up and Replication of Association

With a possible causal variant in hand, the authors next examined the long-range haplotypes to see if this variant was on the same background as rs4147929, the common variant previously associated with AD by GWAS. It was never on the same allele, which is a fascinating result; the common variant signal and this rare variant association appear to be independent. It’s possible, therefore, that the mechanisms are different as well.

To replicate the association, the authors genotyped ABCA7 loss-of-function variants in study groups from Europe and the United States, finding a p-value of 0.0056 with OR of 1.73. When combined with the Icelandic data by meta-analysis, the OR was 2.03 and the p-value 6.8e-15.

What’s Next for AD and Common Disease

ABCA7 certainly merits future studies, both in the genetics realm and in the laboratory for functional evaluation. It’s strongly expressed in the brain, where it promotes the efflux of phospholipids and cholesterol to apoA-I and apoE. But the ortholog of ABCA7 in C. elegans and results from mouse models suggest that regulation of phagocytosis might be the primary function of the gene. The authors tested for correlation between variants in ABCA7 and two disease-associated alleles (in APOE and TREM2), but found none. Thus, the mechanism by which ABCA7 loss-of-function confers susceptibility to AD will need further investigation.

Still, it’s a promising start to detangling the etiology of a complex human disease, and a demonstration of the power of genome sequencing to uncover promising new leads.

Steinberg S, Stefansson H, Jonsson T, Johannsdottir H, Ingason A, Helgason H, Sulem P, Magnusson OT, Gudjonsson SA, Unnsteinsdottir U, Kong A, Helisalmi S, Soininen H, Lah JJ, DemGene, Aarsland D, Fladby T, Ulstein ID, Djurovic S, Sando SB, White LR, Knudsen GP, Westlye LT, Selbæk G, Giegling I, Hampel H, Hiltunen M, Levey AI, Andreassen OA, Rujescu D, Jonsson PV, Bjornsson S, Snaedal J, & Stefansson K (2015). Loss-of-function variants in ABCA7 confer risk of Alzheimer’s disease. Nature Genetics. PMID: 25807283

Science Fiction: Going Viral

The rapid advance of next-generation sequencing technologies, particularly in the last several years, has almost seemed like something out of a science fiction novel. Think about it: on a HiSeq X Ten instrument, we can sequence a complete human genome in less than a week, at a cost that’s roughly 0.0001% of what it took to fund the Human Genome Project.

It might surprise you to learn that, in addition to my blog posts here and the grant/paper writing I do for my job, I dabble in science fiction writing as well. If you think that scientific publication/success is hard (10% acceptance rate for top-tier journals, or 8% NIH funding level), you should look into the fiction side of publishing sometime.

The acceptance rate for most professional science fiction magazines (for short fiction) is generally below 1%. The pay is usually $0.05-$0.10 per word, meaning that a 4,000 word story might bring $200-400 in the (unlikely) event that you get it professionally published. The odds of landing a literary agent — which is required, if you want to have your novel shopped to most traditional publishing houses — are about 1 in 1,000.

A few months ago, Third Flatiron Publishing (which does quarterly science fiction anthologies) announced that their Spring 2015 anthology would be themed around world-altering events. As it happened, I’d written a science fiction story that seemed like it might fit — it was about a couple of researchers working in a dusty lab who stumble upon a universal cure for cancer (you remember I said science fiction, right?), and their struggle to make it available to the world.

The Time It Happened

I’m thrilled to say that the editors at Third Flatiron liked my story enough to choose it for their anthology The Time It Happened, which just came out and is available on Amazon in both Kindle and paperback versions. They’ve also bought audio rights, and intend to create a free podcast of my story (as well as a couple of others) sometime in the near future.

Since you readers enjoyed the non-fiction I write for MassGenomics, hopefully you’ll enjoy this as well.


Targeted Sequencing of GWAS Loci for Cleft Lip

In the last decade, genome-wide association studies (GWAS) enabled by cheap, high-throughput SNP genotyping have identified thousands of loci that influence disease susceptibility, quantitative traits, and other complex phenotypes. The genetic markers on high-density SNP arrays are carefully chosen to capture (or “tag”) most common haplotypes in human populations. Common SNPs tend to be more informative in this regard, and most of these fall outside the exons of protein-coding genes.

Credit: Leslie et al, AJHG 2015

This efficiency is both the strength and the weakness of SNP arrays: the markers are well-suited to represent variation across the human genome, but they’re unlikely to be causal variants themselves. In essence, the loci uncovered by GWAS are signposts that tell us where to look for functional variation that influences a trait of interest.

Following up GWAS hits (with sequencing and functional validation) will ultimately be required to understand the mechanism of disease. A paper online at The American Journal of Human Genetics offers an informative example of how that plays out.

Cleft Lip/Palate as a Complex Trait

Non-syndromic cleft lip with or without cleft palate (NSCL/P) affects about 1 in 700 live births, and represents a global health problem (particularly in the developing world). Multiple genetic and environmental risk factors give rise to a complex etiology for this trait. One candidate gene (IRF6) was known to harbor common variants associated with NSCL/P, and large-scale GWAS efforts have yielded 12 additional loci reproducibly associated with it.

To further investigate the genetic architecture of this phenotype, a group of researchers from several institutions (including the Genome Institute at WashU) sequenced those 13 regions in over 4,000 individuals. At the study design stage, the researchers made two key decisions that undoubtedly contributed to their success:

  1. A case-parent trio design. Most of the samples chosen came in the form of an affected child and two unaffected parents. This structure makes it possible to examine not just the presence or absence of alleles, but whether or not they’re transmitted to the affected child. It also permitted a search for de novo mutations that might contribute to susceptibility.
  2. A wide target region for each locus, including both coding and non-coding regions. The latter type are increasingly important as we delve into complex traits in which a significant fraction (if not a majority) of causal variants will be regulatory rather than coding in nature.

De novo Mutations

Just as one can’t truly identify somatic mutations in a tumor tissue without a matched normal, it’s nearly impossible to distinguish de novo mutations in a patient without sequencing both of his or her parents. This is the only way, people. Filtering against dbSNP is not going to get you there.

The thing about de novo mutations is that they’re exquisitely rare — according to estimates of the de novo mutation rate, any given individual should have around 34 mutations genome-wide. Since the target space for this study represented 0.19% of the genome, that’s a long shot. Then again, we’re talking about a lot of trios, and they’re selected for a trait that’s been linked to this target space.

My back-of-the-envelope calculations based on the amount of target space (6.3 Mbp), the de novo mutation rate (1e-08), and the number of trios sequenced here (1,409) suggest that we’d expect ~89 de novo mutations. The authors came up with 123, which is a little high. We do have an enriched population, but calling de novos is very likely to yield some false positives.
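
Here’s that back-of-the-envelope calculation spelled out, as a small Python sketch. It uses only the three numbers quoted above and applies the mutation rate once per trio, as the estimate in the text does; the variable names are mine.

```python
# Expected de novo mutations in the targeted regions, using the post's figures.
TARGET_BP = 6.3e6       # targeted sequence per individual (6.3 Mbp)
MUTATION_RATE = 1e-8    # de novo mutations per base per generation
N_TRIOS = 1_409         # case-parent trios sequenced
OBSERVED = 123          # de novo calls reported by the authors

expected_per_trio = TARGET_BP * MUTATION_RATE
expected_total = expected_per_trio * N_TRIOS

print(f"Expected per trio: {expected_per_trio:.3f}")
print(f"Expected across all trios: {expected_total:.1f}")
print(f"Observed / expected: {OBSERVED / expected_total:.2f}")
```

In other words, most trios are expected to carry no de novo mutation in the target space at all; it’s the sheer number of trios that gets the total into the dozens.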

They were able to design assays for 82 mutations and confirmed 66 (80%) by Sanger sequencing. That’s a good validation rate, and it suggests that about 98 of the predicted mutations would hold up (pretty close to my estimate).

Only 3 of 66 confirmed de novo mutations (4.5%) altered protein sequence. The majority (95%) were noncoding, though 11 of these mapped to a predicted regulatory element.

Common Variant Associations

To identify common functional variants, the authors used an allelic transmission disequilibrium test (TDT), which determines if an allele is transmitted more (or less) than we’d expect by chance. All but one of the GWAS regions (PAX7) showed evidence of association with p-values less than 1e-5. In general, the results supported the GWAS findings: the variants yielding the lowest p-value were either the lead GWAS SNP or were in perfect LD with the lead GWAS SNP.
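
As a rough illustration, the basic allelic TDT boils down to a McNemar-style chi-square on transmission counts from heterozygous parents. The sketch below assumes that textbook form; the counts are invented, and it makes no claim about the software or settings the authors actually used.

```python
import math
from typing import Tuple


def tdt(transmitted: int, untransmitted: int) -> Tuple[float, float]:
    """Allelic TDT for one variant.

    transmitted / untransmitted: how many times heterozygous parents did or
    did not pass the allele of interest to their affected child.
    Returns (chi-square statistic with 1 df, two-sided p-value).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 60 transmissions vs 40 non-transmissions.
chi2, p = tdt(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under the null, each heterozygous parent transmits the allele half the time, so any sustained excess (or deficit) of transmissions to affected children is evidence of association. Because the comparison is within families, the test is also robust to population stratification.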

A conditional analysis revealed only one locus (ARHGAP29) with evidence for secondary independent signals, suggesting more than one common functional variant.

Rare Variant Associations

The common variants explained only a fraction of the heritability for disease. Yet these GWAS regions were also logical candidates for rare variation that contributes to disease. The challenge with rare variants is that one needs thousands of samples to even see them, much less establish statistically significant association.

To address this, we often collapse individual rare variants to the gene, regulatory element, or genomic interval in which they occur to get the power up. These so-called burden tests can boost the power for detection, but that’s dependent on one’s ability to predict which variants are truly functional. In this study, neither gene-based nor regulatory element-based burden tests yielded significant associations.

A ScanTrio analysis of genomic intervals — after experimenting with different window sizes and overlaps — yielded signals for 2 of the 13 regions (NOG and NTN1).

Functional Validation

The challenge of genetic association studies — especially ones for complex phenotypes — is confirming statistical evidence of association with a functional assay. Sequencing and genotyping methods continually become faster and cheaper. With functional validation, the only paradigm shift is that more and more journals want to see it before they publish a genetic study.

PAX7 de novo Mutation

One of the few de novo coding mutations was predicted to disrupt the DNA-binding domain of PAX7. The authors designed an electrophoretic mobility shift assay (EMSA) to examine how the missense substitution affected PAX7’s ability to bind a target regulatory sequence. They also used quantitative reporter assays in HeLa cells, with co-transfection of a plasmid containing either wild-type or mutant PAX7.

PAX7 functional validation (Leslie et al, AJHG 2015)


Both experiments showed that the wild-type allele had greater DNA-binding capacity, and drove higher expression of the reporter gene.

FGFR2 de novo Mutation

One of the noncoding de novo mutations was 254 kilobases downstream of FGFR2, in a noncoding region that looks (according to chromatin marks) like a neural crest enhancer. FGFR2 is known to play a role in craniofacial development, and rare variants in it had been reported in cases of NSCL/P. Here, the authors leveraged a zebrafish model system to examine the role of that enhancer during development. In transient transgenic reporter studies of zebrafish embryos, the +254kb element holds up: the wild-type allele had enhancer activity in 41/82 embryos (50%), whereas the mutant allele had enhancer activity in 3/83 embryos (3.6%).
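
To get a feel for how lopsided those embryo counts are, one could run a Fisher’s exact test on the 2x2 table. To be clear, this is my own illustration and not an analysis reported here from the paper; the function below is a plain standard-library sketch of the usual two-sided test.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Embryos with / without enhancer activity, from the counts quoted above.
wild_type = (41, 82 - 41)
mutant = (3, 83 - 3)

print(f"Wild-type active: {wild_type[0] / 82:.1%}")
print(f"Mutant active: {mutant[0] / 83:.1%}")
p_value = fisher_exact_two_sided(*wild_type, *mutant)
print(f"Fisher's exact p-value: {p_value:.2e}")
```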

FGFR +254kb enhancer (Leslie et al, AJHG 2015)

This, in my opinion, is one of the most compelling parts of this study: in vivo functional validation of a single base change in a noncoding enhancer that’s hundreds of thousands of bases away from the gene it regulates.

Common Variant at 17q22 (NOG)

Multiple SNPs reached genome-wide significance in the 17q22 region. The greatest significance was detected at rs227727, about 105 kb downstream of the NOG transcriptional start site. This variant was in complete LD with the lead SNP from the prior GWAS. This was interesting because NOG encodes a BMP antagonist that’s expressed primarily in the epithelium during palatal development.

Tandem enhancer disruption (Leslie et al, AJHG 2015)

The authors confirmed NOG expression in the palatal epithelium in mouse embryos. They also noted that rs227727 mapped to one of two enhancers in the region, +105kb (the other being +87 kb). The variant allele disrupts predicted binding sites for two transcription factors (MEF2C and CDX2) and creates possible binding sites for at least two others.

Interestingly, the zebrafish assay did not show epithelial enhancer activity for the +105kb element by itself. However, a tandem construct with both enhancers (+87kb and +105kb) lit things up. The effect was at least additive, and constructs containing the risk allele of rs227727 showed significantly decreased enhancer activity.

Beyond GWAS for Complex Disease

What I like about this study is that it studied GWAS and candidate gene regions in careful investigations that included functional validation components. It’s so easy to take a GWAS hit, look for the nearest gene, and spin a story about how variation in that gene affects the phenotype of interest. Here, the authors have done the difficult and time-consuming work of (1) exhaustive sequencing to identify the possible functional variants, and (2) in vivo functional assays to prove that the implicated variants have a phenotypic effect. That’s a lot of work to pin down the genetic architecture and disease mechanism for a handful of disease loci.

The high-throughput nature of genotyping (and increasingly, sequencing) and the discovery power of large cohorts are going to yield promising new findings. With them comes a strong temptation to take the association hits and run with them. Write up some voodoo in the discussion about the gene and its role, get the paper out, and move on. The problem is that statistical genetic evidence is not enough. You don’t know that your lead SNP is functional, or that the nearest neighboring gene provides the mechanism of phenotypic effect.

More studies like these, with well-planned study designs and compelling functional assays, will be required as we continue to unravel the complex fabric of human genetics.


Leslie, E., Taub, M., Liu, H., Steinberg, K., Koboldt, D., Zhang, Q., Carlson, J., Hetmanski, J., Wang, H., Larson, D., Fulton, R., Kousa, Y., Fakhouri, W., Naji, A., Ruczinski, I., Begum, F., Parker, M., Busch, T., Standley, J., Rigdon, J., Hecht, J., Scott, A., Wehby, G., Christensen, K., Czeizel, A., Deleyiannis, F., Schutte, B., Wilson, R., Cornell, R., Lidral, A., Weinstock, G., Beaty, T., Marazita, M., & Murray, J. (2015). Identification of Functional Variants for Cleft Lip with or without Cleft Palate in or near PAX7, FGFR2, and NOG by Targeted Sequencing of GWAS Loci. The American Journal of Human Genetics. DOI: 10.1016/j.ajhg.2015.01.004

The Value of the Cohort: 23andMe’s Research Portal

23andMe has been an interesting company to watch over the last five years. For a variety of reasons, they remain the most visible direct-to-consumer (DTC) genetic testing company, and also became Illumina’s single biggest customer for high-density SNP arrays. As I’ve written about before, I underwent the 23andMe genetic testing service just months before the FDA’s cease-and-desist letter on the medical/health reporting aspects of that service. So I’ve been able to see things from the consumer side of it as well.

An article this month in the MIT Technology Review examines 23andMe’s new formula for business success: building up and selling access to their ever-growing database of willing research participants. This is not a new direction for the company, but is garnering more attention after they signed a deal with Genentech under which the pharma giant will pay up to $60 million for access to ~3,000 Parkinson’s disease patients in 23andMe’s database. That’s about $20,000 per sample, and a major coup for a company still reeling from the FDA crackdown.

23andMe’s Genetic Database

The company is branding this as their Research Portal Platform and the allure is fairly obvious:

  • They have banked and genotyped samples from 800,000+ paying customers
  • The database continues to grow, especially from customers outside the U.S. who can still get the “full” service
  • So far, about 600,000 customers have agreed (“consented”) to participate in research studies
  • 23andMe continues to collect phenotype data via customer outreach

In other words, 23andMe has a catalogue of 600,000 samples that are (1) already genotyped, (2) broadly consented for research, and (3) easy to recontact as needed. It’s the kind of cohort that genetics researchers are currently salivating over, especially in the era of large-scale sequencing studies.

Suffice it to say that the company stands to make a lot more money from this than from their $99 genetic testing kit.

Sample Consent for Research

23andMe: Hey, tell us everything

I will tell you this: when it comes to consenting its customers, 23andMe sure knows how to sell it. The text for the “Basic Research Consent” is as follows:

Giving consent means that your Genetic & Self-Reported Information may be used in an aggregated form, stripped of identifying registration information (such as name, email, address), by 23andMe for peer-reviewed scientific research. To learn more about how 23andMe safeguards your privacy, read our Privacy Statement.

You can change your mind at any time by changing your settings below. Contact us with any questions.

It’s interesting to note that the basic consent is for use by 23andMe for peer-reviewed scientific research. The Genentech deal would appear to fall outside both restrictions, since the data will be used by a third party and seems unlikely to undergo peer review. Of course, it’s possible (likely, even) that 23andMe drew up a different consent for Parkinson’s patients.

Genotype and Phenotype Data

23andMe Pops A Question

The high-density SNP array used for 23andMe genotyping is a custom design provided by Illumina. If memory serves, it’s the OmniExpress (700K) chip plus a few hundred thousand markers of interest examined by 23andMe. The company presented a poster at ASHG with a list of some of the types of data that would be available through their research portal:

  • Genotypes
  • Conditions/Diagnoses
  • Medication Usage
  • Response to Medication
  • Family History of Disease
  • Health Behaviors
  • Personality Traits
  • Environmental Exposures
  • Geographic Location

Importantly, nearly all of the phenotype information is freely offered up by 23andMe customers. The company collects it primarily through surveys. There’s a big health history survey when you sign up for the service, and then there are little follow-ups. Like the one at the right, found in the sidebar today when I logged in. It’s inviting and casual… sort of a “Hey, while you’re here, have you ever had….”

On the plus side, it’s a very non-invasive way of collecting information. The customer (me) is logged in and poking around already. With a little planning, 23andMe can ask countless questions like these and add them to a user’s profile. On the down side, this is completely self-reported. We’re not in a doctor’s office here; there’s no requirement for truth. So while I’m more likely to be answering questions like these, I might very easily (1) make a mistake, because I’m not a physician, or (2) lie just because I feel like it. I paid 23andMe, which gives me a little sense of entitlement.

Admittedly, we are willing participants. 23andMe makes no secret of its hopes to use my DNA for research purposes, and I have no problem with that. Then again, I have a better understanding of what it means than most of their customers.

Features of the 23andMe Cohort

The FDA brought the hammer down on 23andMe’s doling out of health-related findings to its customers, but as hinted at in the MIT Technology Review article, the company obviously has a second agenda. They’re assembling one of the most valuable human genetics research cohorts in the world. Some quick highlights of the 600k+ consented participants:

  • Ancestry: 77% European, 10% Latino, 5% African American, 4% Asian, 2% South Asian, 2% Other
  • Gender: 52% male, 48% female.
  • Cancer: 33,000 cases, comprising breast (6,000), prostate (5,000), colorectal (1,700) and other cancers. Of these, 5,000 have undergone chemotherapy. The cohort also has 405,000 “confirmed” controls, i.e. people who indicated they’ve never had cancer.
  • 120,000 APOE e4 allele carriers (the risk allele for Alzheimer’s)
  • 10,000 Parkinson’s patients
  • 10,000 patients with autoimmune diseases (rheumatoid arthritis, IBD, lupus, etc.).

Perhaps the most important consideration is that these individuals can be easily re-contacted. That means 23andMe can keep building phenotype data and recruiting candidates for genetic studies. They’ve banked saliva from every customer, and could presumably try to get blood or other tissue from agreeable participants.

It may not be the most ethnically diverse, carefully stratified, or rigorously phenotyped cohort. But at 600,000 individuals, it’s certainly one of the largest. We in the genetics community will be watching with great interest.