Two papers in Nature this month leverage the power of second-generation sequencing technologies to investigate gene expression variation in human cell lines. By performing RNA-Seq in HapMap cell lines, the authors generated the most extensive gene expression data to date for these samples, and were able to use publicly available HapMap genotypes to associate expression differences with genetic variation. This strategy was applied to the HapMap samples two years ago using expression microarrays. Using RNA-Seq instead of microarrays, however, offers a few key advantages:
- More accurate quantification of highly abundant transcripts, where microarrays reach saturation
- Access to rare transcripts below the sensitivity threshold for microarrays
- Detection of novel gene structure from alternative splicing and unannotated exons
- Identification of allele-specific expression
The first study, from Jonathan Pritchard’s lab at the University of Chicago, sequenced RNA from 69 Yoruban (African) individuals on the Illumina GAII platform. They generated at least two lanes per individual, for a total of 1.2 billion reads, of which 964 million (80%) mapped uniquely to the genome or to exon-exon boundaries. The second study, from Emmanouil Dermitzakis’s group at the Sanger center, sequenced RNA from 60 CEU (CEPH Europeans from Utah) individuals, also on the Illumina GAII platform. They generated one lane of paired-end data per individual, for a total of about 1.0 billion reads. Since neither study provided a table summarizing their data (which I’d have liked), I put one together:
Study | Pickrell et al. | Montgomery et al. |
Samples | 69, African descent (YRI) | 60, European descent (CEU) |
Sequencing | Illumina 1X35 or 1X46 | Illumina 2X37 |
Reads/Sample | 17.4 million | 16.9 million |
SNP Dataset | HapMap II/III | HapMap III |
Total SNPs | 3.8 million | 1.2 million |
Up to this point, the two studies sounded nearly identical. For the data analysis, however, each group went in a different (and interesting) direction.
Pooled Data for Discovery of Novel Gene Structures
Novel Exons. The Pritchard group pooled all data to examine the completeness of current gene annotations. Some 86% of uniquely mapped reads corresponded to known exons. Using conservation data from alignments of 28 vertebrate exomes, the authors identified 4,031 regions that are evolutionarily conserved and show evidence of transcription. About one-quarter of these appear to be part of spliced transcripts, but most appeared to be novel untranslated regions (UTRs). Some 115 regions, however, had sequences consistent with protein-coding exons. To investigate the possibility that their novel exons are real, the authors used RNA-Seq data from several human tissues and chimpanzee cell lines. The evidence suggests that their regions do represent novel exons, but ones that are expressed in a more tissue-specific fashion than annotated exons.
Novel Poly-A Sites. The authors next screened the ~70 million unmapped sequence reads for long runs of A or T nucleotides, which might indicate novel poly-adenylation sites. Of the ~8,000 novel sites that they identified, some 45% fell within 10 bp of a known cleavage site. To further validate their findings, they screened their poly-A regions for the binding site of the CPSF polyadenylation factor, and found a 32-fold enrichment for the CPSF target hexamer. The net result was a high confidence set of 3,481 cleavage sites that show evidence of poly-A (from RNA-Seq data) and CPSF binding.
RNA-Seq: 10 million Reads Is All You Need
The Dermitzakis study generated 16.9 million (+/- 5.9 million) reads per individual, which were mapped to the NCBI 36 reference sequence using Maq with a maximum insert size of 2 megabases). The resulting alignments were filtered to remove alignments with low mapping quality or to the X, Y, or MT chromosomes. Discordant read pairs (by distance or orientation) were also removed. To quantify the expression of known exons/transcripts/genes, the authors scaled read counts for each individual to a theoretical yield of 10 million reads, and only considered exons with data in >90% of individuals. This resulted in data for 90,064 exons from 10,777 genes, of which 95% had at least 10 reads (on average) per individual. While the normalization seems to reduce the dataset to less than half of known genes, it nevertheless provided an extensive view of gene expression across these 60 individuals.
Cis-Regulatory Effects on Gene Expression
Using HapMap genotypes for 1.2 million SNPs, the Dermitzakis group identified 836 genes associated with cis-regulatory variants (compared to 539 genes identified in microarray studies of the same individuals). Even when normalized for the number of genes tested, the increased resolution of RNA-Seq over microarrays yielded a larger number of genetic regulatory effects. The RNA-Seq exon eQTLs (expression quantitative trait loci) were enriched for abundant transcripts, suggesting that saturation of highly expressed exons reduces the sensitivity for microarrays to detect some cis-regulatory effects.
The Pritchard group searched for cis-regulatory variation with an even larger dataset – RNA-Seq for 69 individuals and 3.8 million HapMap SNPs. They identified 929 genes with local eQTLs (4.6% of annotated genes); consistent with previous findings, virtually all SNPs associated with expression level were near the corresponding gene. They also reported the overlap with the CEU study results: the top 500 associations reported in CEU samples were enriched 10 to 40-fold for significant eQTLs in YRI samples. Given the marked genetic differences between these two populations, this result suggests that these studies are identifying replicable cis-regulatory events.
Mechanism of Cis-Regulatory Effects
An important feature of RNA-Seq data is that it can be used not only to detect cis-regulatory variation, but to assess the mechanism by which these variants act. The Pritchard group looked at 222 of their 929 eQTLs for which the associated SNPs fell within the gene exons. They classified the RNA-Seq reads as originating from the high-expression haplotype or the low-expression haplotype, and found that for 195 of the genes (88%), more than 50% of the expressed transcripts carried the allele associated with high expression. Therefore, the modulation of gene expression is a direct result of the associated variation (probably by activating nearby cis-regulatory elements). In other words: the eQTL tells us that variants near the gene are associated with its expression. That means something nearby is regulating it. The fact that the haplotype associated with increased expression is the haplotype that predominates tells us that the high-expression allele is what drives the expression of its nearby gene. As opposed to, say, driving expression of the gene from both chromosomes.
Allelic Effects on Splicing
Finally, both groups looked at the actual content of expressed transcripts, to find SNPs associated with alternative splicing. The Pritchard group calls these splicing quantitative trait loci (sQTLs), and found 187 genes with significant associations. Binding sites for known splice factors (U1 snRNP and U2AF) were enriched for sQTLs, as were SNPs within 2 bp of a canonical splice site. The Dermitzakis group found 110 genes with significant associations, and stratified splicing-associated variants according to their position in the gene structure. When tested against the exons upstream and downstream of where they resided, splice donor variants were enriched 3.17-fold with the upstream (5′) exon, while splice acceptor variants were enriched 7.02 fold with the downstream (3′) exon. Thus, these SNPs affect the inclusion/exclusion of their exons in the mature transcript.
Dermitzakis’s group visually examined their most significant associations to characterize the mechanism of splicing regulation. Of the 110 significant sQTLs identified in CEU samples:
- 41% were single exon skipping events
- 17% created an alternate acceptor
- 13% were double or triple exon skipping events
- 6% created an alternate donor
- 5% were mutually exclusive exons
- 5% were retained introns.
In summary, these studies establish the feasibility of transcriptome sequencing to assess gene expression and characterize regulatory variation. Indeed, as the title of one study suggests, RNA sequencing is a powerful tool for studying the mechanisms underlying human gene expression variation, and will undoubtedly yield better understanding of the complex relationships between genotype and phenotype.
References
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, & Pritchard JK (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464 (7289), 768-72 PMID: 20220758
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, & Dermitzakis ET (2010). Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 464 (7289), 773-7 PMID: 20220756