The Fruits of a Thousand Genomes

Last week saw the publication of the pilot phase of the 1,000 Genomes Project, which has characterized ~15 million SNPs, 1 million short insertions/deletions (indels), and 20,000 structural variants in seven human populations. This is discovery and genotyping at unprecedented scale, with an astonishing 4.9 terabases (trillion bases) sequenced – the equivalent of about 1,500 human genomes – across three pilot projects:

Deep whole-genome sequencing of trios (mother-father-daughter) from 2 populations
Low-coverage sequencing of 179 unrelated individuals from 4 populations
Exon sequencing of 906 randomly selected genes in 697 individuals from 7 populations
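
As a rough check on the "about 1,500 human genomes" figure, the arithmetic can be sketched as follows. The haploid genome size of roughly 3.2 billion bases is my own assumption, not a number taken from the paper.

```python
# Back-of-envelope check: how many human genomes is 4.9 terabases?
TOTAL_BASES_SEQUENCED = 4.9e12  # 4.9 terabases, as reported in the post
HAPLOID_GENOME_BASES = 3.2e9    # ASSUMPTION: approximate human genome size


def genome_equivalents(total_bases: float, genome_size: float) -> float:
    """Return how many genome-lengths of sequence the total represents."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


if __name__ == "__main__":
    n = genome_equivalents(TOTAL_BASES_SEQUENCED, HAPLOID_GENOME_BASES)
    print(f"{n:,.0f} genome equivalents")
```

Note that this counts raw sequence, not finished genomes: the bases are spread unevenly across deep, low-coverage, and exon-only data.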

The three pilots have shed new light on sequence variation in human genomes and its distribution among human populations. Perhaps unsurprisingly, variation was not evenly distributed in the genome – certain regions (e.g. HLA and sub-telomeres) show high rates of variation, whereas others (e.g. a 5 Mbp, gene-dense, highly conserved region on chromosome 3) show very little. At the chromosomal level, different forms of variation (e.g. SNPs and indels) were highly correlated, but some types of structural variants were exceptions, which points to different mechanisms of mutation.

Novelty and Population-Specificity

The vast majority of SNPs detected were already known to dbSNP. Among known variants, 56% were present in all population panels while 25% were found in only a single panel. In contrast, only 4% of novel variants were found in all panels and 84% were found in only one. This difference supports the notion that the majority of common SNPs in human populations have already been found. There’s more work to do for other forms of variation, though. Many of the novel structural variants (SVs) were detected in all population panels, and half of the common short indels had never been reported.
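
To make the panel-sharing comparison concrete, here is a minimal sketch of how the "found in all panels" and "found in only one panel" fractions could be tallied. The function name, the panel labels A/B/C, and the toy variants are all hypothetical; this is not the consortium's pipeline.

```python
from typing import Dict, Set, Tuple


def sharing_fractions(
    variant_panels: Dict[str, Set[str]], all_panels: Set[str]
) -> Tuple[float, float]:
    """Return (fraction seen in every panel, fraction seen in exactly one)."""
    if not variant_panels:
        raise ValueError("need at least one variant")
    total = len(variant_panels)
    # A variant is "in all panels" if its panel set covers every panel.
    in_all = sum(1 for p in variant_panels.values() if p >= all_panels)
    # A variant is panel-specific if it appears in exactly one panel.
    in_one = sum(1 for p in variant_panels.values() if len(p & all_panels) == 1)
    return in_all / total, in_one / total


if __name__ == "__main__":
    panels = {"A", "B", "C"}  # hypothetical panel labels
    toy_variants = {
        "v1": {"A", "B", "C"},  # shared by all panels
        "v2": {"A"},            # panel-specific
        "v3": {"B"},            # panel-specific
        "v4": {"A", "B"},       # shared by two panels
    }
    print(sharing_fractions(toy_variants, panels))
```

Running the same tally separately on known and novel variants would give the two pairs of percentages quoted above.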

The smallest two chromosomes – mitochondrial and Y – seemed to benefit the most from the new data. There was a lot of heteroplasmy in mitochondrial DNA within individuals – 79% of samples had length heteroplasmy, and 45% had substitution heteroplasmy. On the Y chromosome, there were 2,870 variable sites, most of which (74%) were novel to public databases. These new variants helped identify several clear, significant sub-clades within the 12 haplogroups represented in 1,000 Genomes samples.

Coding Regions and Loss-of-Function Variants

In total, the three pilots identified 68,300 nonsynonymous variants, almost half of which were novel. Genotyping a subset of these in 620 samples revealed that novel nonsynonymous variants had a dramatically lower minor allele frequency (2.2%) than known ones (26.2%). From this I can draw two conclusions: most novel nonsynonymous variants are rare, and the majority could only have been identified by population-scale sequencing projects like these.
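
For readers unfamiliar with the term, minor allele frequency is simply the frequency of the less common allele at a site. Below is a minimal sketch, assuming diploid genotypes coded as the number of non-reference alleles (0, 1 or 2) per sample. The function name and the toy genotypes are hypothetical, not data from the paper.

```python
from typing import Sequence


def minor_allele_frequency(genotypes: Sequence[int]) -> float:
    """Minor allele frequency from diploid genotypes coded 0/1/2
    (the count of non-reference alleles carried by each sample)."""
    if not genotypes:
        raise ValueError("need at least one genotype")
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be 0, 1 or 2")
    # Each diploid sample contributes two alleles to the denominator.
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


if __name__ == "__main__":
    # Toy site: 10 samples, one heterozygous carrier (1 of 20 alleles).
    rare_site = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    print(minor_allele_frequency(rare_site))
```

With only ten samples, a variant at 2% frequency would usually not be seen at all, which is why rare variants tend to surface only in larger sequencing panels.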

The authors estimate that an individual genome differs from the reference at 10,000 to 11,000 nonsynonymous sites and perhaps 12,000 synonymous sites. A typical genome harbors a much smaller number of loss-of-function (LOF) variants (in-frame/frameshift indels, early stops, and splice-site variants): perhaps 340-400 LOF variants per individual, affecting 250-300 genes. Compared to synonymous variants, putative functional variants (nonsynonymous and LOF) tend to have lower allele frequencies and be more population-specific, presumably due to the action of purifying selection against deleterious mutations. Which means, of course, that the really important variants are much harder to find.

Signatures of Natural Selection

Looking in and around genes, the authors found that diversity is lowest in exons (50% that of introns) and slightly reduced in 5′ and 3′ UTRs, compared to intronic and intergenic sequences. This signature of natural selection acting upon genes has a broad effect: diversity is reduced by 10% in the vicinity of genes compared to gene-distant loci, and that reduction extends up to 85 kbp away. Thus, selection on linked sites appears to restrict variation across the majority of the human genome. Looking across panels, the authors observed that SNPs with large allele frequency differences between populations were enriched for nonsynonymous sites, likely reflecting local adaptation and selection in different continental groups.
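
The exact diversity statistic is not described in this post, so the following is only a sketch of one common measure: per-base diversity computed as the sum of expected heterozygosity at variable sites, divided by region length. The function names, allele frequencies, and region lengths are all hypothetical.

```python
from typing import Iterable


def site_heterozygosity(alt_freq: float, n_chromosomes: int) -> float:
    """Unbiased expected heterozygosity at one biallelic site."""
    if n_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("alt_freq must be between 0 and 1")
    # n / (n - 1) corrects for the finite number of sampled chromosomes.
    correction = n_chromosomes / (n_chromosomes - 1)
    return correction * 2.0 * alt_freq * (1.0 - alt_freq)


def diversity_per_base(
    alt_freqs: Iterable[float], n_chromosomes: int, region_length: int
) -> float:
    """Sum of site heterozygosities over region length.
    Invariant sites contribute zero, so only variable sites are listed."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    total = sum(site_heterozygosity(p, n_chromosomes) for p in alt_freqs)
    return total / region_length


if __name__ == "__main__":
    # Toy regions of equal length; the "exon" has half as many variable sites.
    exon = diversity_per_base([0.1], n_chromosomes=100, region_length=1000)
    intron = diversity_per_base([0.1, 0.1], n_chromosomes=100, region_length=1000)
    print(exon / intron)
```

In this toy case the exon-to-intron ratio is one half by construction; the point is only to show what a statement like "diversity in exons is 50% that of introns" is comparing.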

Finally, the authors examined the trios to look at a different environment for mutation and selection – immortalized cell lines. Some 952 of 1,001 new mutations in the CEU daughter and 634 of 669 new mutations in the YRI daughter were not present in the germline, indicating that they occurred either in somatic cells or in the cell lines. Further, the higher number of mutations in the CEU sample may be related to the age of the lines – the CEU line is decades older than the YRI line.
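
To put those counts in proportion, here is a quick calculation using only the numbers quoted above. The function and variable names are mine.

```python
from typing import Dict, Tuple

# (non-germline mutations, all new mutations) per trio daughter, from the post
NEW_MUTATIONS: Dict[str, Tuple[int, int]] = {
    "CEU": (952, 1001),
    "YRI": (634, 669),
}


def non_germline_summary(non_germline: int, total: int) -> Tuple[float, int]:
    """Return (fraction not in the germline, remaining germline candidates)."""
    if total <= 0 or not 0 <= non_germline <= total:
        raise ValueError("counts must satisfy 0 <= non_germline <= total, total > 0")
    return non_germline / total, total - non_germline


if __name__ == "__main__":
    for sample, (non_germline, total) in NEW_MUTATIONS.items():
        fraction, germline = non_germline_summary(non_germline, total)
        print(f"{sample}: {fraction:.1%} non-germline, {germline} germline candidates")
```

In both daughters roughly 95% of the new mutations arose outside the germline, leaving only a few dozen candidate germline mutations per trio.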

Implications for Future Studies

The findings of the 1,000 Genomes Project thus far have an immediate, significant impact on genetic association studies. Using publicly available gene expression data and their expanded catalogue of variants, the authors identified 20-30% more significant expression quantitative trait loci (eQTLs) than had previously been detectable. Thus, it is clear that while existing SNP arrays represent the majority of common variation, a significant amount of rare, phenotypically relevant variation remains to be incorporated.

References
1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467(7319), 1061-1073. PMID: 20981092