dbSNP exceeds a ridiculous 150 million variants

Earlier this week, I took a look at the dbSNP VCF file for build 147 (human) with Ben Kelly from the White Lab at NCH. Even summary statistics took a while to generate, and soon we realized why: dbSNP now contains a jaw-dropping 152.7 million reference variants. Roughly speaking, that’s one variant for every 20.5 base pairs in the human genome. They’re not all rare variants, either: 86 million variants are classified as common (G5, G5A, or COMMON), with minor allele frequencies >1% in at least one population.
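For readers who want to run a similar tally, here is a minimal Python sketch of the counting step. It is not the script we actually used. It assumes a standard tab-delimited VCF with the INFO field in the eighth column, treats G5 and G5A as bare flags, and assumes COMMON is written as a key=value pair where 1 means common. The function names are made up for illustration.

```python
import gzip
from typing import Dict, Iterable

# INFO flags named in the post; assumed to appear as bare flags.
COMMON_FLAGS = {"G5", "G5A"}


def is_common(info: str) -> bool:
    """Return True if an INFO string carries G5, G5A, or COMMON=1."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key in COMMON_FLAGS:
            return True
        # Assumption: COMMON is encoded as key=value, with 1 meaning common.
        if key == "COMMON" and value == "1":
            return True
    return False


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Count total and common records from an iterable of VCF lines."""
    total = 0
    common = 0
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        total += 1
        if is_common(fields[7]):
            common += 1
    return {"total": total, "common": common}


def summarize_file(path: str) -> Dict[str, int]:
    """Stream a gzip-compressed VCF so the whole file never sits in memory."""
    with gzip.open(path, "rt") as handle:
        return summarize(handle)
```

Streaming line by line matters here: with more than 150 million records, loading the file into memory is not a realistic option on most machines.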

dbSNP’s Ridiculous Growth

During the HapMap project in 2003-2004, we were astonished to see dbSNP hit 10 million variants. My boss at the time, Ray Miller, told me that some thought it could one day hit 50 million. We thought it might take decades, but dbSNP surpassed 50 million refSNPs just seven years later.

Figure: dbSNP Growth, 2002-2016

Even more astonishing are the 6+ million coding variants: depending on how you define the exome, that’s about one variant every 5 or 6 base pairs in coding regions. Compounded by the fact that a single variant might affect multiple transcripts/genes, the number of observed human coding variants exceeds 22 million.

Figure: dbSNP coding variants

That being said, the fact remains that the vast majority of known variants in our genome lie outside of protein-coding exons. When annotated with snpEff, there are more than 80 million variants within or nearby genes, where they might play a regulatory role (again, multiple transcripts = multiple annotations per variant).

Figure: Noncoding dbSNP annotations (build 147).

The Evolving Utility of dbSNP

In the early days of next-generation sequencing, dbSNP provided a vital discriminatory tool. In exome sequencing studies of Mendelian disorders, any variant already present in dbSNP was usually common, and therefore unlikely to cause rare genetic diseases. Some of the first high-profile disease gene studies therefore used dbSNP as a filter. Similarly, in cancer genomics, a candidate somatic mutation observed at the position of a known polymorphism typically indicated a germline variant that was under-called in the normal sample. Again, dbSNP provided an important filter.

Now, the presence or absence of a variant in dbSNP carries very little meaning. The database includes over 100,000 variants from disease mutation databases such as OMIM or HGMD. It also contains an appreciable number of somatic mutations that were submitted there before databases like COSMIC became available. And, like any biological database, dbSNP undoubtedly includes false positives.

On the bright side, however, the rapid generation of genomic data worldwide has enabled deeper characterization of the variants that we know about. The 1000 Genomes Project contributed genome-wide data for 2,504 individuals from several continental groups, while the Exome Aggregation Consortium (ExAC) has compiled gene-centric data from 60,706 individuals at the time of writing.

The Value of Variant Allele Frequencies

As a central repository for variant allele frequency (VAF) data, dbSNP can be a powerful resource for human genetics studies. Of particular relevance for rare disease genetics are the VAFs in worldwide populations. For a rare autosomal recessive disorder affecting 1 in 100,000 individuals, compound-heterozygous variants with VAFs of 0.01 in a certain population are too common: their combined frequency is 0.0001, or 1 in 10,000, which is ten times the prevalence of the disorder itself. In contrast, most known disease-causing variants (mutations that have been imported from OMIM, for example) are exceedingly rare.
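The arithmetic in that example can be written out as a short Python sketch. It follows the simplified calculation above (the product of the two allele frequencies) and is not a full population-genetic model. The function name is hypothetical, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction


def too_common_for_recessive(af1: Fraction, af2: Fraction, prevalence: Fraction) -> bool:
    """Apply the post's rule of thumb: a variant pair whose combined
    frequency (af1 * af2) exceeds the disease prevalence is too common
    to explain the disorder."""
    return af1 * af2 > prevalence


prevalence = Fraction(1, 100_000)   # disorder affects 1 in 100,000
af = Fraction(1, 100)               # each variant has a VAF of 0.01

combined = af * af
print("combined frequency:", combined)
print("fold over prevalence:", combined / prevalence)
print("too common:", too_common_for_recessive(af, af, prevalence))
```

Lowering each allele frequency to 0.001 brings the combined frequency to 1 in 1,000,000, which falls below the prevalence and would pass this check.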

Thus, while the mere presence of a variant in dbSNP is a blunt tool for variant filtering, dbSNP’s deep allele frequency data make it incredibly powerful for genetics studies: it can rule out variants that are too prevalent to be disease-causing, and prioritize ones that are rarely observed in human populations. This discriminatory power will only increase as ambitious large-scale sequencing projects like CCDG make their data publicly available.

Ancestry and Inclusion

Importantly, allele frequency data are most useful when the population matches the ancestry of the sample(s) being studied. In their current form, our databases are skewed towards major population groups (northwest European, West African, and East Asian). Many important geographic and ethnic groups are still under-represented. The reasons for this are complex, and not the focus of this post, but I think we can all agree that it’s vital to seek out and include samples from diverse ancestries as large-scale sequencing efforts move forward.
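To make the population-matching point concrete, here is a rough Python sketch of an ancestry-aware frequency filter. The record layout, the population labels (EUR, EAS), and the frequencies are invented for illustration; they are not real dbSNP fields or values. When no matching population is available, the sketch falls back to the highest frequency seen in any population, which is a stricter filter.

```python
from typing import Dict, List, Optional

# Hypothetical layout: variant ID -> allele frequency per population label.
Frequencies = Dict[str, Dict[str, float]]


def frequency_for_filter(pop_afs: Dict[str, float], population: Optional[str]) -> float:
    """Use the ancestry-matched frequency if present, else the maximum
    frequency observed in any population."""
    if population is not None and population in pop_afs:
        return pop_afs[population]
    return max(pop_afs.values(), default=0.0)


def rare_candidates(
    variants: Frequencies, max_af: float, population: Optional[str] = None
) -> List[str]:
    """Return IDs of variants at or below max_af, rarest first."""
    scored = [
        (frequency_for_filter(afs, population), vid)
        for vid, afs in variants.items()
    ]
    return [vid for af, vid in sorted(scored) if af <= max_af]


# Made-up frequencies, for illustration only.
example = {
    "var_a": {"EUR": 0.0002, "EAS": 0.03},
    "var_b": {"EUR": 0.0, "EAS": 0.0001},
    "var_c": {"EUR": 0.12, "EAS": 0.10},
}

print(rare_candidates(example, max_af=0.001, population="EUR"))
print(rare_candidates(example, max_af=0.001))
```

In this toy data, var_a survives when the sample is matched to EUR but is dropped by the fallback, because it is common in another population. That is the kind of discrepancy that makes the choice of reference population matter.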

In Summary

At 150+ million variants, dbSNP is a massive beast, and still offers a useful discriminatory tool when used correctly. Proceed with caution.


A New Era for MassGenomics

When I started MassGenomics in 2008, next-generation sequencing was in its infancy. We’d sequenced AML1 — the first cancer genome — with two nascent platforms: Illumina/Solexa (32-bp reads) and 454 FLX (450-bp reads). Already, we had a glimpse of the bioinformatics challenges that these technologies brought forth.

Sequencing for Common Disease

It’s astonishing how far the field has come in just eight years. Factory-scale sequencing now makes it practically and economically feasible to sequence tens of thousands of (whole human) genomes in a single year. Washington University and other large-scale sequencing institutions are currently applying it to ambitious studies of cardiovascular, autoimmune, and neurological conditions. By studying tens of thousands of genomes, it should be possible to comprehensively define the genetic architecture underlying each of these common complex diseases.

Rare Disease and Clinical Applications

Yet there are other important applications of next-gen sequencing, such as:

  • The identification of genes underlying rare inherited disorders
  • Molecular diagnosis and characterization of undiagnosed diseases
  • Utilization of genomic information to improve clinical care

The distribution model for these applications is different from the factory-scale sequencing operation required for common disease. The democratization of NGS has empowered hundreds of smaller labs to carry out such research, and enabled rapid clinical sequencing at the point of care. That’s where the rubber hits the road, and it’s also where I want to be.

A New Position: Nationwide Children’s Hospital

Thus, after 13 years at Washington University, I’ve accepted a position as Principal Investigator at Nationwide Children’s Hospital. If the name of that institution sounds familiar, it’s because they’ve recruited Rick Wilson and Elaine Mardis to establish a new Institute for Genomic Medicine (IGM). Under their leadership, I’ll help build up the research program for the genetic basis of rare pediatric disorders.

Future Directions

So, what does this mean for MassGenomics? The blog will continue, hopefully at a greater frequency, and with a new emphasis on pediatric and clinical genomics. I should state for the record that the blog does not represent the views of Nationwide Children’s Hospital or the Ohio State University (where I’m now an assistant professor).

The McDonnell Institute at Washington University will continue on, by the way. The talented faculty and staff have already begun work on the Centers for Common Disease Genomics (CCDG) program, while University leadership has initiated a search for a new director. They have capacity to spare, so if you’re looking for high-quality exome or genome sequencing (human or non-human), please reach out to Bob Fulton.

So that’s my news, and I hope to have more to share in the weeks to come.


The Genetic Architecture of Complex Disease

Figure: Genetics of complex disease (Fuchsberger et al, Nature 2016)

It’s no secret that while genome-wide association studies (GWAS) have implicated thousands of genetic loci in human phenotypes, the variants uncovered collectively explain only a fraction of the observed variance between individuals. The reasons for this “missing heritability” are a subject of vigorous debate in the scientific community. One possible explanation is that rare (low-frequency) variants — which are poorly represented on the arrays used for GWAS — underlie a substantial proportion of the variability.

This idea is intuitive: in theory, large-effect variants would be kept at low frequency by natural selection, a pattern that’s well established for mutations that cause rare single-gene disorders. It also makes a strong argument for large-scale sequencing for common complex disease, which is the purpose of the NHGRI’s flagship CCDG program. The problem, of course, is that we can’t really understand the contribution of low-frequency variants to human disease without actually performing such an experiment.

A new study in this week’s issue of Nature represents one of the first and highest-profile attempts to do so for a common disease. Type II diabetes (T2D) affects 29 million people in the United States (according to the CDC), which is about 9.3% of the entire population. It also has a strong genetic component, and has thus been a priority GWAS target for over a decade. So far, GWAS efforts have reported 80 robust associations, largely involving common (MAF>5%) variants that have very small effects on disease risk.

In the current study, Christian Fuchsberger and his 300+ co-authors used a combination of genome sequencing, exome sequencing, genotyping, and imputation to examine the genetic architecture of type II diabetes. This report is the fruit of a years-long collaboration between two consortium efforts: GoT2D, which applied whole-genome sequencing to individuals of European ancestry, and T2D-GENES, which performed exome sequencing in multi-ethnic cohorts. Here’s a summary of the data generated:

Genome-wide Data (European ancestry)          Cases    Controls
Low-coverage (5x) whole genome sequencing     1,326       1,331
Genotype imputation in 13 other cohorts      11,645      32,769
Total                                        12,971      34,100

Exome-centric Data (5 ancestry groups)        Cases    Controls
Deep (82x) exome sequencing                   6,504       6,436
SNP array genotyping (2.5 million sites)     28,305      51,549
Total                                        34,809      57,985

Genome Sequencing Coverage Matters

I think it’s important to point out the nuance of whole-genome sequencing coverage. Generally, we target 30x coverage for the whole genome of a germline (i.e. non-tumor) sample, which provides excellent power for variant detection. Some groups have touted 20x as a possible minimum threshold, and I’m comfortable with that.

But low-coverage (5x) whole genome sequencing is a whole different animal. WGS coverage is like a bell curve: while many positions will have 5x coverage, some will have 1-3x and some will have 7-10x. Even for this group of authors, which includes some of the top experts on NGS variant calling, this presents a significant challenge for variant detection.

Simply put, at 5x coverage, a number of rare and/or hard-to-call variants (e.g. SVs) will be missed.
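To put rough numbers on the bell-curve point, here is a small Python sketch that treats per-position depth as Poisson-distributed around the mean coverage. That is an idealized assumption of mine, not something taken from the paper; real coverage is usually more uneven, so these figures are, if anything, optimistic.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k reads at a position."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_depth_at_most(k: int, mean: float) -> float:
    """Probability that a position is covered by k or fewer reads."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


for mean_coverage in (5, 20, 30):
    p_low = prob_depth_at_most(3, mean_coverage)
    print(f"{mean_coverage}x mean: P(depth <= 3) = {p_low:.2e}")
```

Under this model, roughly a quarter of positions in a 5x genome are covered by three or fewer reads, while at 30x such positions are vanishingly rare. With so few reads, a heterozygous variant can easily go unseen or look like a sequencing error.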

Useful WGS Metrics

In spite of my concerns, I’m a sucker for summary metrics in large-scale WGS datasets. Here are some highlights from low-coverage WGS of 2,657 European-ancestry individuals:

  • 26.7 million variants were detected, genotyped, and phased, including 1.5 million small indels and 8,876 large (>100 bp) deletions.
  • Individuals harbored an average of 3.30 million genetic variants, including 271,245 indels and 669 deletions.
  • 420,473 common SNVs and 2.4 million low-frequency SNVs were poorly tagged by genotype arrays (r-squared < 0.30), and thus haven’t been interrogated by any T2D GWAS to date.
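As a reminder of what the r-squared tagging metric measures, here is a minimal Python sketch using the standard two-locus definition: r² = D² / (pA(1-pA)pB(1-pB)), where D = pAB - pA·pB. The frequencies below are made up for illustration and do not come from the paper; only the 0.30 threshold is taken from the text above.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation between alleles at two loci.

    p_ab is the frequency of the haplotype carrying both alleles;
    p_a and p_b are the individual allele frequencies.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must be strictly between 0 and 1")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


# Hypothetical case: a 2% variant that always sits on the haplotype
# carrying a 30% array SNP allele.
r2 = r_squared(p_ab=0.02, p_a=0.30, p_b=0.02)
print(f"r-squared = {r2:.3f}; poorly tagged: {r2 < 0.30}")
```

Even though the low-frequency variant in this toy case never occurs without the array allele, the mismatch in allele frequencies keeps r-squared low. That helps explain why so many low-frequency SNVs escape array-based GWAS.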

Genome-wide Single-variant Associations

The primary association analysis uncovered 126 variants at 4 loci that were associated with T2D, three of which were known. EML4 was novel, but when the authors imputed sequencing variants into a much larger sample collection (44,414 individuals from 13 other studies), the association didn’t hold up. Another novel signal (CENPW) did appear, and this was replicated in an independent cohort.

Figure: Associations with T2D (Fuchsberger et al, Nature 2016)

In summary, the meta-analysis of sequencing and imputed data examined 26.7 million variants in over 47,000 individuals of European ancestry. That’s a massive association study with extremely high resolution, yet it recapitulated only 13/80 loci (16%) known to be “robustly associated” with T2D, and uncovered only one new locus. I find that a bit discouraging, and I’m sure the authors did, too.

Coding Variation in Type II Diabetes

The analysis of exome data fared little better, I’m afraid. The authors combined exome sequencing data from 10,437 individuals representing five ancestry groups (European, South Asian, East Asian, Hispanic, and African American) with equivalent data from the WGS study for a joint dataset comprising 12,940 individuals. They identified:

  • 3.04 million variants overall, of which 1.19 million were protein-altering
  • ~9,243 synonymous, 7,636 missense, and 250 protein-truncating variants per individual

Single-variant testing yielded only a single significant result (PAX4 p.Arg192His, a.k.a. rs2233580), which was observed only in East Asian individuals. Gene-level aggregation testing yielded no exome-wide significant finding. Limiting the analysis to 634 genes in known associated loci uncovered an association (FES in South Asians, driven by a single likely-causal variant) that met the more forgiving threshold for significance.

To increase power, the authors integrated SNP genotypes from 2.5 million sites in about 79,000 additional cases and controls (all European ancestry) obtained using a custom Illumina SNP chip. Integrating these with the exome data yielded an exome-centric dataset of more than 90,000 individuals. Some 18 variants at 13 loci exceeded genome-wide significance, but all were common (MAF>5%), and only one (MTMR3) was outside of known GWAS loci.

No Evidence for Synthetic Association

Back in 2010, Goldstein and colleagues proposed the concept of “synthetic association” — the idea that common GWAS signals may be due to individually rare causal variants which cluster on certain common haplotypes. The thinking was that sequencing in GWAS regions might therefore reveal all of these causal variants. This would offer an intriguing explanation for the fact that most lead GWAS hits lie outside of coding regions. It might be possible that nearby rare causal variants were in LD with the tag SNP, and these (not the tag SNP) exerted the causal effect on disease risk.

The authors tested this hypothesis in T2D using the WGS dataset for 2,657 individuals, which they describe as having “near-complete ascertainment of genetic variation.” They took the 10 T2D GWAS loci with the strongest support in their study, and looked for low-frequency missense variants within 2.5 million base pairs of the common index SNV. None of the loci showed supporting evidence of “synthetic association,” and 8/10 were convincingly not consistent with the proposed phenomenon.

Thus, while synthetic association might well underlie common GWAS signals for other phenotypes, it does not appear to do so for T2D.

The Contribution of Rare and Common Variants

To model the disease architecture of T2D, the authors conducted an elegant experiment. They simulated three possible models which had seemed plausible prior to large-scale sequencing, and computed the number of associated low-frequency and rare variants that would be uncovered with their study design.

Figure: Simulated models of T2D genetics (Fuchsberger et al, Nature 2016)

In the first two models, low-frequency variants explain a significant proportion of the heritability, and over a hundred of them should have been uncovered at the more forgiving significance threshold. In a third model, where rare variants make a minority contribution, they’d uncover only a few dozen.

Figure: Actual results for T2D (Fuchsberger et al, Nature 2016)

Next, the authors compared these outcomes to their actual results. Only 23 low-frequency and rare variants achieved significance, which is nowhere close to the first two models (the ones that suggest a major role). It’s most similar to the common polygenic model of disease for T2D, suggesting that this study supports a minor role for rare and low-frequency variants.

In Summary

Overall, I found this to be a comprehensive and extremely well-written paper of the caliber we’d expect to see in Nature. It represents years of work by more than 300 contributing authors, and probably the first study of many to come. While the number of new discoveries may be a tad disappointing, the authors have uncovered novel loci and secondary signals. They’ve also done a great deal to shed light on the genetic architecture of this common complex disease, particularly as far as coding variants are concerned.


We will need, and I hope to see, many efforts like this to understand the genetic architecture of other diseases and important human traits.

Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, Rivas MA, Perry JR, Sim X, Blackwell TW, Robertson NR, Rayner NW, Cingolani P, Locke AE, Tajes JF, Highland HM, Dupuis J, Chines PS, Lindgren CM, Hartl C, Jackson AU, Chen H, Huyghe JR, van de Bunt M, Pearson RD, Kumar A, Müller-Nurasyid M, Grarup N, Stringham HM, Gamazon ER, Lee J, Chen Y, Scott RA, Below JE, Chen P, Huang J, Go MJ, Stitzel ML, Pasko D, Parker SC, Varga TV, Green T, Beer NL, Day-Williams AG, Ferreira T, Fingerlin T, Horikoshi M, Hu C, Huh I, Ikram MK, Kim BJ, Kim Y, Kim YJ, Kwon MS, Lee J, Lee S, Lin KH, Maxwell TJ, Nagai Y, Wang X, Welch RP, Yoon J, Zhang W, Barzilai N, Voight BF, Han BG, Jenkinson CP, Kuulasmaa T, Kuusisto J, Manning A, Ng MC, Palmer ND, Balkau B, Stančáková A, Abboud HE, Boeing H, Giedraitis V, Prabhakaran D, Gottesman O, Scott J, Carey J, Kwan P, Grant G, Smith JD, Neale BM, Purcell S, Butterworth AS, Howson JM, Lee HM, Lu Y, Kwak SH, Zhao W, Danesh J, Lam VK, Park KS, Saleheen D, So WY, Tam CH, Afzal U, Aguilar D, Arya R, Aung T, Chan E, Navarro C, Cheng CY, Palli D, Correa A, Curran JE, Rybin D, Farook VS, Fowler SP, Freedman BI, Griswold M, Hale DE, Hicks PJ, Khor CC, Kumar S, Lehne B, Thuillier D, Lim WY, Liu J, van der Schouw YT, Loh M, Musani SK, Puppala S, Scott WR, Yengo L, Tan ST, Taylor HA Jr, Thameem F, Wilson G, Wong TY, Njølstad PR, Levy JC, Mangino M, Bonnycastle LL, Schwarzmayr T, Fadista J, Surdulescu GL, Herder C, Groves CJ, Wieland T, Bork-Jensen J, Brandslund I, Christensen C, Koistinen HA, Doney AS, Kinnunen L, Esko T, Farmer AJ, Hakaste L, Hodgkiss D, Kravic J, Lyssenko V, Hollensted M, Jørgensen ME, Jørgensen T, Ladenvall C, Justesen JM, Käräjämäki A, Kriebel J, Rathmann W, Lannfelt L, Lauritzen T, Narisu N, Linneberg A, Melander O, Milani L, Neville M, Orho-Melander M, Qi L, Qi Q, Roden M, Rolandsson O, Swift A, Rosengren AH, Stirrups K, Wood AR, Mihailov E, Blancher C, Carneiro MO, 
Maguire J, Poplin R, Shakir K, Fennell T, DePristo M, Hrabé de Angelis M, Deloukas P, Gjesing AP, Jun G, Nilsson P, Murphy J, Onofrio R, Thorand B, Hansen T, Meisinger C, Hu FB, Isomaa B, Karpe F, Liang L, Peters A, Huth C, O’Rahilly SP, Palmer CN, Pedersen O, Rauramaa R, Tuomilehto J, Salomaa V, Watanabe RM, Syvänen AC, Bergman RN, Bharadwaj D, Bottinger EP, Cho YS, Chandak GR, Chan JC, Chia KS, Daly MJ, Ebrahim SB, Langenberg C, Elliott P, Jablonski KA, Lehman DM, Jia W, Ma RC, Pollin TI, Sandhu M, Tandon N, Froguel P, Barroso I, Teo YY, Zeggini E, Loos RJ, Small KS, Ried JS, DeFronzo RA, Grallert H, Glaser B, Metspalu A, Wareham NJ, Walker M, Banks E, Gieger C, Ingelsson E, Im HK, Illig T, Franks PW, Buck G, Trakalo J, Buck D, Prokopenko I, Mägi R, Lind L, Farjoun Y, Owen KR, Gloyn AL, Strauch K, Tuomi T, Kooner JS, Lee JY, Park T, Donnelly P, Morris AD, Hattersley AT, Bowden DW, Collins FS, Atzmon G, Chambers JC, Spector TD, Laakso M, Strom TM, Bell GI, Blangero J, Duggirala R, Tai ES, McVean G, Hanis CL, Wilson JG, Seielstad M, Frayling TM, Meigs JB, Cox NJ, Sladek R, Lander ES, Gabriel S, Burtt NP, Mohlke KL, Meitinger T, Groop L, Abecasis G, Florez JC, Scott LJ, Morris AP, Kang HM, Boehnke M, Altshuler D, & McCarthy MI (2016). The genetic architecture of type 2 diabetes. Nature PMID: 27398621

Transitions and Excuses

My sincere apologies to the dedicated MassGenomics readers who’ve noticed the recent decline in new posts here. It’s a busy and tumultuous time for our institute.

Leadership Transition at MGI

For those who missed the announcement earlier this month: our center’s director Rick Wilson and co-director Elaine Mardis announced that they’re leaving to establish a new Institute for Genomic Medicine at Nationwide Children’s Hospital / Ohio State University in Columbus, Ohio.

We are still figuring out the transition plan, but the Washington University School of Medicine remains very committed to supporting our center and the people who work here. In other words, the McDonnell Genome Institute will continue on.

Large-scale Sequencing Opportunities

In the meantime, we are in the midst of large-scale sequencing efforts for the Centers for Common Disease Genomics (CCDG), Alzheimer’s Disease Sequencing Project (ADSP), and Gabriella Miller Kids First (GMKF) initiatives. These are all ambitious projects in which I’m intimately involved, which means they consume a lot of my time. On the bright side, they keep me at the forefront of genomics where I can continue to be useful to you.

Important note for fellow scientists: Even with our current commitments, the HiSeq X Ten remains a hungry beast, so please get in touch if you’re looking for low-cost genome sequencing. With the X Ten and other instruments, we can provide custom-targeted, exome, whole genome, and/or transcriptome sequencing for humans and model organisms.

More Science Fiction

Last but not least, some personal news that may help explain why I’ve had less free time to write on MassGenomics. As you know, Harper Voyager (an imprint of HarperCollins) published my debut novel earlier this year. I’m thrilled to announce that I’ve accepted an offer from my publisher for two more books, effectively making The Rogue Retrieval into a trilogy.


All of you have been enormously supportive of my science fiction writing as well as my science writing, and I hope that will continue.

Once the dust settles from this transition period, I should be posting at a more regular schedule. So please stick around!