ENCODE Reveals Regulatory Variation in the Human Genome

I’m still working through the outstanding series of articles published by members of the ENCODE Consortium. After my previous post highlighting their marker paper on genome content and function, I came across two companion papers that demonstrate the power of ENCODE data to characterize functional regulatory variants in the human genome. The first article, from authors at the University of Washington, combines genome-wide maps of regulatory DNA with whole-genome sequencing data to characterize patterns of regulatory variation. The second, from authors at Stanford University, investigates the association between multiple ENCODE data types and disease-associated SNPs from GWAS.

Personal and Population Genomics of Regulatory Variation

Benjamin Vernot and colleagues combined two powerful genome-wide datasets to explore the landscape of variation in regulatory regions of the human genome:

DNase I hypersensitive sites (DHSs) from 138 cell and tissue types obtained via DNase-Seq
Whole-genome sequencing data for 53 geographically diverse individuals obtained from Complete Genomics

DNase I Data

The DNase I data exploit a structural feature of regulatory DNA that’s been known for decades: the binding of sequence-specific transcriptional regulators in place of nucleosomes creates DNase I hypersensitive sites, or DHSs. Large-scale localization of these signatures via deep sequencing reveals a pattern within a pattern: within each DNase I “peak” (100 to 200 bp) are small “footprints” (6 to 20 bp), sequences that were bound by a regulatory protein and hence protected from cleavage. This is nicely illustrated in Figure 1A:

DNase I Footprinting (Vernot et al., Genome Res. 2012)

Across all cell and tissue types, ENCODE identified ~2.9 million DNase I peaks and 8.4 million footprints within them. Collectively, these span 577 megabases and 156 megabases (respectively) of unique sequence (18.7% and 5.1% of the genome).

Whole-genome Sequence Data

The authors obtained publicly available WGS data for 53 unrelated individuals representing 11 geographically diverse populations, shown in this phylogeny-like tree in Figure 1B:

WGS Individuals (Vernot et al., Genome Res. 2012)

Each individual was sequenced to ~40x coverage by Complete Genomics (now BGI-owned). These are samples from populations initially characterized by the International HapMap Project:

CEU CEPH Utah residents of north/west Europe ancestry (n=9)
YRI Yoruba in Ibadan, Nigeria (n=9)
CHB Han Chinese in Beijing, China (n=4)
JPT Japanese in Tokyo, Japan (n=4)
LWK Luhya in Webuye, Kenya (n=4)
MKK Maasai in Kinyawa, Kenya (n=3)
ASW African ancestry in Southwest USA (n=4)
TSI Tuscans in Italy (n=4)
GIH Gujarati Indian in Houston, Texas, USA (n=4)
MXL Mexican ancestry in Los Angeles, California (n=5)
PUR Puerto Rican in Puerto Rico (n=2)

The variants from each individual were filtered to remove variants with missing data, low coverage across samples, and departures from Hardy-Weinberg equilibrium. GERP scores, which measure evolutionary constraint, were obtained for each passing variant.
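The Hardy-Weinberg step can be sketched as a simple chi-square test on genotype counts. This is a minimal illustration, not the authors' pipeline: the function names, the missing-data limit, and the 3.84 cutoff (the 5% critical value for one degree of freedom) are assumptions, since the thresholds actually used are not given here. With only a few dozen samples, an exact test would usually be preferred over the chi-square approximation.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 d.f.) for departure from Hardy-Weinberg equilibrium.

    Arguments are the observed counts of the three genotypes at a biallelic site.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic sites).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_filters(n_aa, n_ab, n_bb, n_missing, max_missing=0, chi2_cutoff=3.84):
    """Hypothetical filter: reject sites with missing calls or strong HWE departure.

    max_missing and chi2_cutoff are illustrative defaults, not the paper's values.
    """
    if n_missing > max_missing:
        return False
    return hwe_chi_square(n_aa, n_ab, n_bb) <= chi2_cutoff
```

A site with genotype counts of 25, 50, and 25 matches the expected proportions exactly and passes, while a site with no heterozygotes at all (50, 0, 50) fails.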

Pervasive Regulatory Variation

The overlap between sequence variants (WGS data) and DNase I peaks/footprints (ENCODE data) was astonishing: 3.85 million and 1.01 million variants overlapped DHS peaks and footprints, respectively. There were far more variants in these regulatory regions than in the protein-coding exome (just 150,000 variants). However, if one considers only variants with high (>3) GERP conservation scores, the differences are far less dramatic: 146,570 variants, 61,933 variants, and 36,935 variants were found in DHS peaks, footprints, and the exome, respectively. The density of variants (an informal proxy for purifying selection, with lower density suggesting stronger constraint) differed as well: 6.7 and 6.5 variants per kilobase in DHS peaks and footprints, respectively, compared to 4.2 variants per kilobase in the exome.
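As a sanity check, the reported densities can be roughly recovered from numbers already quoted in this post: the variant counts above and the 577 Mb and 156 Mb spans given earlier for peaks and footprints. The sketch below also shows one simple way to count variants inside a set of intervals. The helper names and the toy coordinates are hypothetical, and the paper may well have computed density over callable bases only, so the agreement is approximate.

```python
from bisect import bisect_right


def variants_per_kb(n_variants, span_bp):
    """Variant density in variants per kilobase of sequence."""
    return n_variants / (span_bp / 1000)


def count_variants_in_intervals(positions, intervals):
    """Count positions that fall inside any half-open [start, end) interval.

    Assumes a single chromosome and sorted, non-overlapping intervals.
    """
    starts = [start for start, _ in intervals]
    hits = 0
    for pos in positions:
        i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits


# Densities implied by the totals quoted in this post.
peak_density = variants_per_kb(3_850_000, 577_000_000)
footprint_density = variants_per_kb(1_010_000, 156_000_000)
print(round(peak_density, 1), round(footprint_density, 1))  # 6.7 6.5

# Toy example with made-up coordinates: two 150-bp peaks, six variants.
toy_peaks = [(100, 250), (400, 550)]
toy_variants = [50, 100, 249, 250, 420, 600]
toy_hits = count_variants_in_intervals(toy_variants, toy_peaks)
```

Working backwards the same way, 150,000 exome variants at 4.2 per kilobase implies an exome span of roughly 36 Mb.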

These findings support the central conclusion of this paper: regulatory variants are pervasive in the human genome, far more numerous but less deleterious (on average) than coding variants. Interestingly, however, the DHS footprints for certain developmental regulators, such as HOX genes, exhibited high levels of evolutionary constraint comparable to those seen in coding regions.

Signatures of Positive Selection

To identify signatures of geographically restricted natural selection, the authors searched for DNase I peaks containing variants with significant allele frequency differences between individuals of African, European, and Asian ancestry. For each such peak, they identified genes within 50 kb as potential targets of regulation. This yielded about 3,000 genes in each population; a KEGG analysis identified a total of 15 significantly enriched pathways. Interestingly, the most significant pathway for European ancestry was melanogenesis, a process related to skin pigmentation. This signature of recent positive selection had previously been observed for coding variants in melanogenesis pathway genes. Now, evidence suggests that regulatory variants, too, were acted upon by selection and thus contribute to the phenotype.
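The step of assigning nearby genes to a peak is easy to picture in code. The sketch below extends each peak by 50 kb on either side and reports every gene on the same chromosome that overlaps the extended window. Whether the authors measured distance to the gene body or to the transcription start site is not stated here, so the gene-body choice is an assumption, and the gene names and coordinates are placeholders.

```python
def genes_near_peak(peak, genes, window=50_000):
    """Return names of genes within `window` bp of a peak.

    peak  -- (chrom, start, end), half-open coordinates
    genes -- iterable of (name, chrom, start, end), half-open coordinates
    Distance is measured to the gene body (an assumption, see the text above).
    """
    chrom, peak_start, peak_end = peak
    lo, hi = peak_start - window, peak_end + window
    return [
        name
        for name, g_chrom, g_start, g_end in genes
        if g_chrom == chrom and g_start < hi and g_end > lo
    ]


# Placeholder annotations, for illustration only.
example_genes = [
    ("GENE_A", "chr1", 120_000, 130_000),  # 20 kb downstream of the peak
    ("GENE_B", "chr1", 160_000, 170_000),  # about 60 kb away, too far
    ("GENE_C", "chr2", 100_000, 110_000),  # wrong chromosome
    ("GENE_D", "chr1", 40_000, 60_000),    # ends 40 kb upstream of the peak
]
example_peak = ("chr1", 100_000, 100_200)
nearby = genes_near_peak(example_peak, example_genes)
```

Here `nearby` contains GENE_A and GENE_D. The resulting gene lists, one per population, are what would then be handed to a pathway enrichment tool.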

Linking Disease to Regulatory Variation

The second study, undertaken by Marc A. Schaub and colleagues, took a complementary approach to characterize regulatory variants. They took 5,694 curated genotype-phenotype associations from the NHGRI GWAS catalog (comprising 4,724 unique SNPs associated with 470 different phenotypes), and performed a functional annotation of those variants using data from ENCODE and other sources.

Annotation of GWAS-Associated SNPs

First, they took the SNPs that were reported to be associated with a phenotype, and found that ~45% of them overlap at least some ENCODE data:

7.8% overlap the coding (4.7%) or noncoding (3.1%) portions of known exons
36.3% overlap a DHS peak while 7.5% overlap a DHS footprint
19.9% overlap a ChIP-seq peak, indicating a region bound by one of the 114 assessed transcription factors

Annotation of Variants in LD with Associated SNPs

Although the “associated” SNP identified by GWAS could be a sequence variant that contributes to the phenotype, it may simply be “tagging” a region harboring such functional variation. Remember that GWAS use “SNP chips” (high-density microarrays that genotype 500,000 to 1 million SNPs simultaneously), whose markers are generally chosen for their informativeness, not necessarily their potential causation. In fact, the most informative markers tend to have high minor allele frequencies, which suggests they are close to neutral with respect to selection and therefore less likely to be functional variants themselves. Yet those that are statistically associated with a phenotype must be linked to functional variation.

The authors exploited linkage disequilibrium (the tendency of variants on the same chromosome to be inherited together) to identify, in Europeans, variants that tend to be inherited together with associated SNPs. As expected, when one considers not just the associated SNP but the nearby variants it represents, a much larger proportion (58%) overlaps some kind of ENCODE-annotated regulatory region. When the authors restricted their analysis to GWAS of populations with European ancestry, and used European-only LD information, that proportion jumped to 81%. Figure 2 from the paper provides a colorful view of the annotation breakdown of lead (associated) SNPs, variants in LD with lead SNPs (r^2 > 0.8), and European-only variants in LD.

ENCODE annotation of GWAS SNPs (Schaub et al., Genome Res. 2012)
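For readers unfamiliar with the r^2 statistic used as the LD threshold, here is a minimal sketch of how it is computed for a pair of biallelic sites from phased haplotypes. It uses the standard definition, r^2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA*pB. The function name and the 0/1 allele coding are my own choices, and this is not necessarily how the authors obtained their LD values.

```python
def ld_r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic sites.

    haplotypes -- list of (a, b) pairs, one per phased chromosome,
                  with alleles coded 0 or 1 at site A and site B.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d * d / denom


# Two sites that always travel together, and two that are independent.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
r2_linked = ld_r_squared(linked)      # 1.0
r2_unlinked = ld_r_squared(unlinked)  # 0.0
```

A variant would count as a proxy for a lead SNP when its r^2 with that SNP exceeds 0.8, which is why the choice of reference population for the haplotypes matters so much.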

The authors go on to provide some nice vignettes of functional annotation of certain GWAS associations. Bottom line, their main findings are:

GWAS associations are significantly enriched for DHS peaks, DHS footprints, and ChIP-seq peaks
Candidate functional variants were found for as many as 81% of GWAS associations
Integrating multiple types of functional and expression data yields more likely candidate functional variants in an LD region.

Conclusion: The Landscape of Regulatory Variation

Taken together, these studies highlight a fact that I find myself repeating often: a significant fraction of functional variation in the human genome lies outside the exons of known protein-coding genes. ENCODE provides a resource for us to begin exploring possible regulatory roles of variants in human disease. As our understanding of the noncoding portion of the genome improves, it will become even more apparent that whole-genome sequencing (and not exome sequencing) will be required to characterize the full extent of phenotypically relevant genetic variation in humans.

References

Vernot B, Stergachis AB, Maurano MT, Vierstra J, Neph S, Thurman RE, Stamatoyannopoulos JA, & Akey JM (2012). Personal and population genomics of human regulatory variation. Genome Research, 22 (9), 1689-97. PMID: 22955981

Schaub MA, Boyle AP, Kundaje A, Batzoglou S, & Snyder M (2012). Linking disease associations with regulatory information in the human genome. Genome Research, 22 (9), 1748-59. PMID: 22955986