Contents: dbSNP Growth • Build 135 Stats • Variant Composition • Function Classes • SNPs and Indels • Coding/Noncoding Tiers
Less than a decade ago, the leading experts estimated that there were approximately 10 million SNPs in the human genome. Those were the early days of post-genome research, when “The SNP Consortium” was formed and began BAC overlap comparisons to routinely identify and report SNPs. Believe it or not, in my old lab there were binders full of paper records documenting the evidence for each newly discovered SNP. These variants were submitted to a central repository of human sequence variation hosted at NCBI, appropriately named dbSNP.
The database has grown substantially, already exceeding the 10 million mark by 2006:
I highlighted some of the key driving forces of this growth that I happen to know about. These include the “BAC overlap” project of the SNP Consortium and similar SNP discovery efforts (2001-2003), The HapMap Project Phases I (2003-2005) and II (2005-2007), the advent of next-generation sequencing, of course, and most recently the 1,000 Genomes Project. You probably noticed a few trends in the figure above:
- Less-frequent dbSNP updates. In 2003-2004 when the HapMap consortium direly needed new loci, dbSNP was updating almost every month. New build releases have slowed down considerably, probably because (1) they’re less critical, and (2) it’s a much bigger job.
- Overall, and quite obviously, there’s been a rapid increase in submissions over time, with some phases of near exponential growth.
- The relationship between submissions (blue) and unique refSNP clusters (red). You’ll note that dbSNP gets more and more submissions, of which a shrinking fraction are truly novel loci.
Still, by 2009, there were about 18 million unique SNPs, nearly twice the predicted number. And large variant discovery projects fueled by next-generation sequencing, such as the 1,000 Genomes Project, were just ramping up.
Downloading the dbSNP database is not for the faint of heart. Even for bioinformaticians, the file formats offered (ASN1?) are somewhat intractable compared to BED files. I prefer instead to wait until the excellent team at the UCSC Genome Browser Database releases their annotation tracks for dbSNP builds, which contain the necessary information in far more accessible formats. They have just done so for build 135, and I did some quick-and-dirty parsing to come up with some statistics.
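For anyone curious what that quick-and-dirty parsing might look like, here is a minimal sketch rather than the script I actually ran. It assumes a tab-separated table dump in which the RS number sits in column 5 and the variant class in column 12 (zero-based indexes 4 and 11); those positions, the function names, and the example rows are all assumptions for illustration, so check them against the track's schema before trusting any counts.

```python
from collections import Counter


def count_variant_classes(lines, name_col=4, class_col=11):
    """Count unique RS numbers per variant class from tab-separated rows.

    name_col and class_col are ASSUMED zero-based column positions;
    verify them against the schema of the table you downloaded.
    """
    seen = set()
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        rs_id = fields[name_col]
        if rs_id in seen:
            continue  # same RS number mapped to more than one location
        seen.add(rs_id)
        counts[fields[class_col]] += 1
    return counts


def make_row(rs_id, variant_class):
    """Build a made-up row with the ID and class in the assumed columns."""
    fields = ["."] * 12
    fields[4] = rs_id
    fields[11] = variant_class
    return "\t".join(fields)


# Made-up rows; "rs1" appears twice to mimic a variant with two mappings.
example_rows = [
    make_row("rs1", "single"),
    make_row("rs2", "deletion"),
    make_row("rs1", "single"),
    make_row("rs3", "single"),
]
example_counts = count_variant_classes(example_rows)
for variant_class, n in example_counts.most_common():
    print(variant_class, n)
```

On a real download you would pass an open file handle instead of the example list, for instance one returned by `gzip.open(path, "rt")`. Keeping every RS number in a set is memory-hungry at this scale, but it is the simplest way to count each variant once.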
You might be surprised to learn that dbSNP contains not just SNPs, but several types of DNA sequence variation:
In the current build there are 54,212,076 unique variants with RS numbers, of which 47.8 million, or 88%, are single nucleotide polymorphisms. The remainder comprises insertion-deletion variants (indels, 11%), multiple nucleotide polymorphisms (MNPs, 0.1%), and ~420,000 variants in other classes (named, mixed, and microsatellite). The named variants are old-school genetic markers (e.g. DS128384). Mixed polymorphisms are messy loci where multiple variant types (e.g. DNP and indel) are seen. Microsatellites, of course, are long stretches of repetitive sequence, such as di-nucleotide or tri-nucleotide repeats, whose length varies between individuals. Among these are the 15 short tandem repeats (STRs) utilized for forensic DNA profiling in CODIS, the FBI’s national DNA database.
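As a quick sanity check on those percentages, the shares can be recomputed from the counts quoted above. This is only a sketch: the SNP and "other" counts are the rounded figures from the text, not exact tallies, and the variable names are mine.

```python
# Counts as quoted in the text; the last two are rounded/approximate.
total_variants = 54_212_076   # unique variants with RS numbers
snp_count = 47_800_000        # "47.8 million"
other_count = 420_000         # named, mixed, and microsatellite


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


snp_share = percent(snp_count, total_variants)
other_share = percent(other_count, total_variants)

print(f"SNPs: {snp_share:.1f}%")             # SNPs: 88.2%
print(f"Other classes: {other_share:.1f}%")  # Other classes: 0.8%
```

Together with the 11% for indels and 0.1% for MNPs, these add up to roughly 100%, so the rounded figures are at least self-consistent.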
Variants in dbSNP are classified by their relationship to NCBI’s view of known protein-coding genes. There are about a dozen “function class” categories, but they can be grouped together into five types of sequences:
You will note that the vast majority have a function classification of “Unknown”, suggesting that these are non-coding variants not immediately adjacent to NCBI protein-coding genes. Even for variants in or around genes, 90% are classified as intronic. If we break down the variants that are in coding regions according to dbSNP:
You can see that the majority of coding variants (just over half a million) are classified as “missense”, meaning that they’re predicted to cause an amino acid substitution in the encoded protein. Most of the remainder are silent (synonymous), though there are also around 40,000 variants predicted to cause premature termination (nonsense) or a shift in translation frame (frameshift) in the encoded protein.
For next-generation sequencing analysis, I’m generally interested in two types of variation represented in dbSNP: SNPs and small (<50 bp) indels.
The other types are either uncommon or too large to be readily detected with short reads, and there are curated, dedicated databases that probably do a better job of representing them (e.g. Database of Genomic Variants for large indels and structural variants). Further, although the dbSNP functional classification is useful, we use an internal “tiering” system to represent variants according to their locations in the genome:
- Tier 1 variants affect coding sequences, including exons, splice sites, and non-coding RNA genes
- Tier 2 variants occur in evolutionarily conserved or putative regulatory sequences
- Tier 3 variants are in non-coding, non-conserved, unique regions of the genome
- Tier 4 variants are in repetitive regions of the genome
Every base in the reference sequence falls into one, and only one, tier. Build 36 (hg18) of the human reference sequence is broken down to the right. There are 44 megabases of “tier 1” coding sequence in the human genome; that’s 1.53%, straight out of the textbooks. Tier 2 comprises 248 megabases, or 8.6%, which is slightly higher than the 5% expected rate of evolutionary conservation, probably because we’re fairly inclusive with what constitutes a putative regulatory element.
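One way to guarantee that every base lands in exactly one tier is to test the categories in a fixed order and stop at the first match. The sketch below illustrates that idea only; it is not our internal pipeline. The function name, the three boolean flags, and especially the order of precedence (coding, then conserved/regulatory, then repetitive) are assumptions.

```python
def assign_tier(is_coding, is_conserved_or_regulatory, is_repetitive):
    """Assign a position to exactly one tier using first-match precedence.

    The precedence order here is an ASSUMPTION for illustration:
    coding beats conserved/regulatory, which beats repetitive.
    """
    if is_coding:
        return 1  # exons, splice sites, non-coding RNA genes
    if is_conserved_or_regulatory:
        return 2  # conserved or putative regulatory sequence
    if is_repetitive:
        return 4  # repetitive sequence
    return 3      # non-coding, non-conserved, unique sequence


# A base with no annotation at all falls through to tier 3.
print(assign_tier(False, False, False))
# A conserved base inside a repeat goes to tier 2 under this ordering.
print(assign_tier(False, True, True))
```

Because the checks are mutually exclusive by construction, the tier sizes always sum to the size of the genome, whichever precedence order is chosen.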
Next, we look at the distribution of dbSNP’s ~48 million SNPs and ~6 million small indels among the four tiers of genome space:
Strikingly, fewer than 10% of variants of either type fall into regions that are “interpretable”, whereas the rest are in noncoding regions. The proportions of variants in tier 1 (1.3% of SNPs, 1% of indels) remain lower than the tier 1 fraction of the genome (above right), presumably due to purifying selection against changes to coding sequences. Many studies have shown this through far more careful analyses that account for ascertainment bias, population allele frequency, and other factors. It’s just fascinating to see the signature of natural selection in your basic pie chart.
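To put a rough number on that depletion, divide the share of variants observed in tier 1 by the share of the genome that tier 1 occupies. This is a crude back-of-the-envelope ratio using the rounded percentages quoted in this post; it ignores ascertainment bias, allele frequency, and everything else the careful studies account for, and the names in the code are mine.

```python
tier1_genome_pct = 1.53                            # tier 1 share of the genome
tier1_variant_pct = {"SNPs": 1.3, "indels": 1.0}   # tier 1 share of variants


def observed_over_expected(observed_pct, expected_pct):
    """Ratio below 1 means fewer variants than the region's size predicts."""
    return observed_pct / expected_pct


ratios = {
    variant_type: observed_over_expected(pct, tier1_genome_pct)
    for variant_type, pct in tier1_variant_pct.items()
}

for variant_type, ratio in ratios.items():
    print(f"{variant_type}: {ratio:.2f}")
# SNPs: 0.85
# indels: 0.65
```

A ratio of 1.0 would mean variants land in tier 1 exactly as often as its size predicts; both variant types come in below that, indels more so than SNPs.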
I’m uncertain why the distributions in tiers 3 and 4 differ between variant types above, but there are likely a number of contributing factors. From a biological perspective, indels are both less frequent and subject to stronger natural selection than SNPs. From a technical perspective, SNP discovery algorithms are far more mature than indel discovery algorithms, owing in part to the difficulties of detecting the latter in relatively short sequence reads. We are currently, and have always been, better at finding SNPs than indels. With luck, the “accuracy gap” between SNPs and indels will diminish as sequencing technologies and detection algorithms continue to evolve.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, & Sirotkin K (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29(1), 308-311. PMID: 11125122