Transcriptome Genetics with HapMap and RNA-Seq

April 22, 2010 by Dan Koboldt

Two papers in Nature this month leverage the power of second-generation sequencing technologies to investigate gene expression variation in human cell lines. By performing RNA-Seq in HapMap cell lines, the authors generated the most extensive gene expression data to date for these samples, and were able to use publicly available HapMap genotypes to associate expression differences with genetic variation. This strategy was applied to the HapMap samples two years ago using expression microarrays. Using RNA-Seq instead of microarrays, however, offers a few key advantages:

More accurate quantification of highly abundant transcripts, where microarrays reach saturation
Access to rare transcripts below the sensitivity threshold for microarrays
Detection of novel gene structure from alternative splicing and unannotated exons
Identification of allele-specific expression

The first study, from Jonathan Pritchard’s lab at the University of Chicago, sequenced RNA from 69 Yoruban (African) individuals on the Illumina GAII platform. They generated at least two lanes per individual, for a total of 1.2 billion reads, of which 964 million (80%) mapped uniquely to the genome or to exon-exon boundaries. The second study, from Emmanouil Dermitzakis’s group at the Sanger center, sequenced RNA from 60 CEU (CEPH Europeans from Utah) individuals, also on the Illumina GAII platform. They generated one lane of paired-end data per individual, for a total of about 1.0 billion reads. Since neither study provided a table summarizing their data (which I’d have liked), I put one together:

Study	Pickrell et al.	Montgomery et al.
Samples	69, African descent (YRI)	60, European descent (CEU)
Sequencing	Illumina 1X35 or 1X46	Illumina 2X37
Reads/Sample	17.4 million	16.9 million
SNP Dataset	HapMap II/III	HapMap III
Total SNPs	3.8 million	1.2 million

Up to this point, the two studies sounded nearly identical. For the data analysis, however, each group went in a different (and interesting) direction.

Pooled Data for Discovery of Novel Gene Structures

Novel Exons. The Pritchard group pooled all data to examine the completeness of current gene annotations. Some 86% of uniquely mapped reads corresponded to known exons. Using conservation data from alignments of 28 vertebrate exomes, the authors identified 4,031 regions that are evolutionarily conserved and show evidence of transcription. About one-quarter of these appear to be part of spliced transcripts, but most appeared to be novel untranslated regions (UTRs). Some 115 regions, however, had sequences consistent with protein-coding exons. To investigate the possibility that their novel exons are real, the authors used RNA-Seq data from several human tissues and chimpanzee cell lines. The evidence suggests that their regions do represent novel exons, but ones that are expressed in a more tissue-specific fashion than annotated exons.

Novel Poly-A Sites. The authors next screened the ~70 million unmapped sequence reads for long runs of A or T nucleotides, which might indicate novel poly-adenylation sites. Of the ~8,000 novel sites that they identified, some 45% fell within 10 bp of a known cleavage site. To further validate their findings, they screened their poly-A regions for the binding site of the CPSF polyadenylation factor, and found a 32-fold enrichment for the CPSF target hexamer. The net result was a high confidence set of 3,481 cleavage sites that show evidence of poly-A (from RNA-Seq data) and CPSF binding.

RNA-Seq: 10 million Reads Is All You Need

The Dermitzakis study generated 16.9 million (+/- 5.9 million) reads per individual, which were mapped to the NCBI 36 reference sequence using Maq with a maximum insert size of 2 megabases). The resulting alignments were filtered to remove alignments with low mapping quality or to the X, Y, or MT chromosomes. Discordant read pairs (by distance or orientation) were also removed. To quantify the expression of known exons/transcripts/genes, the authors scaled read counts for each individual to a theoretical yield of 10 million reads, and only considered exons with data in >90% of individuals. This resulted in data for 90,064 exons from 10,777 genes, of which 95% had at least 10 reads (on average) per individual. While the normalization seems to reduce the dataset to less than half of known genes, it nevertheless provided an extensive view of gene expression across these 60 individuals.

Cis-Regulatory Effects on Gene Expression

Using HapMap genotypes for 1.2 million SNPs, the Dermitzakis group identified 836 genes associated with cis-regulatory variants (compared to 539 genes identified in microarray studies of the same individuals). Even when normalized for the number of genes tested, the increased resolution of RNA-Seq over microarrays yielded a larger number of genetic regulatory effects. The RNA-Seq exon eQTLs (expression quantitative trait loci) were enriched for abundant transcripts, suggesting that saturation of highly expressed exons reduces the sensitivity for microarrays to detect some cis-regulatory effects.

The Pritchard group searched for cis-regulatory variation with an even larger dataset – RNA-Seq for 69 individuals and 3.8 million HapMap SNPs. They identified 929 genes with local eQTLs (4.6% of annotated genes); consistent with previous findings, virtually all SNPs associated with expression level were near the corresponding gene. They also reported the overlap with the CEU study results: the top 500 associations reported in CEU samples were enriched 10 to 40-fold for significant eQTLs in YRI samples. Given the marked genetic differences between these two populations, this result suggests that these studies are identifying replicable cis-regulatory events.

Mechanism of Cis-Regulatory Effects

An important feature of RNA-Seq data is that it can be used not only to detect cis-regulatory variation, but to assess the mechanism by which these variants act. The Pritchard group looked at 222 of their 929 eQTLs for which the associated SNPs fell within the gene exons. They classified the RNA-Seq reads as originating from the high-expression haplotype or the low-expression haplotype, and found that for 195 of the genes (88%), more than 50% of the expressed transcripts carried the allele associated with high expression. Therefore, the modulation of gene expression is a direct result of the associated variation (probably by activating nearby cis-regulatory elements). In other words: the eQTL tells us that variants near the gene are associated with its expression. That means something nearby is regulating it. The fact that the haplotype associated with increased expression is the haplotype that predominates tells us that the high-expression allele is what drives the expression of its nearby gene. As opposed to, say, driving expression of the gene from both chromosomes.

Allelic Effects on Splicing

Finally, both groups looked at the actual content of expressed transcripts, to find SNPs associated with alternative splicing. The Pritchard group calls these splicing quantitative trait loci (sQTLs), and found 187 genes with significant associations. Binding sites for known splice factors (U1 snRNP and U2AF) were enriched for sQTLs, as were SNPs within 2 bp of a canonical splice site. The Dermitzakis group found 110 genes with significant associations, and stratified splicing-associated variants according to their position in the gene structure. When tested against the exons upstream and downstream of where they resided, splice donor variants were enriched 3.17-fold with the upstream (5′) exon, while splice acceptor variants were enriched 7.02 fold with the downstream (3′) exon. Thus, these SNPs affect the inclusion/exclusion of their exons in the mature transcript.

Dermitzakis’s group visually examined their most significant associations to characterize the mechanism of splicing regulation. Of the 110 significant sQTLs identified in CEU samples:

41% were single exon skipping events
17% created an alternate acceptor
13% were double or triple exon skipping events
6% created an alternate donor
5% were mutually exclusive exons
5% were retained introns.

In summary, these studies establish the feasibility of transcriptome sequencing to assess gene expression and characterize regulatory variation. Indeed, as the title of one study suggests, RNA sequencing is a powerful tool for studying the mechanisms underlying human gene expression variation, and will undoubtedly yield better understanding of the complex relationships between genotype and phenotype.

References
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, & Pritchard JK (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464 (7289), 768-72 PMID: 20220758

Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, & Dermitzakis ET (2010). Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 464 (7289), 773-7 PMID: 20220756

AGBT: Focus on Cancer Genomics

February 26, 2010 by Dan Koboldt

As usual, the quality of the scientific presentations at this meeting has been outstanding. The weather, too, has improved at last:

p_00014

There are too many to cover (or even attend) completely, but one area of interest with a strong focus this year is cancer genomics. Yesterday during plenary sessions, Stacey Gabriel of the Broad Institute of MIT and Harvard presented sequencing of multiple myeloma, a liquid tumor affecting 50,000 people in the US. Around 5,200 gigabases of sequence was generated across 26 tumor samples and matched controls, yielding ~30x average depth per genome. Their mutation detection pipeline achieved an admirable validation rate for somatic SNVs (95%). Short indels were more challenging (~50% validated), and candidate rearrangements even more so (30-50% validated). However, their study validated ~40 somatic mutations per tumor, implicating known MM genes (NRAS, KRAS, TP53) as well as novel ones (DIS3, FAM46C).

Elliott Margulies on Melanoma

Last night, there was a concurrent session devoted to cancer genomics. Eliott Margulies (NIH/NHGRI) led the lineup with his work sequencing the tumor genome and matched normal of a melanoma patient. Using the Illumina platform (2×100 bp), his group achieved 36x and 43x haploid coverage for tumor and normal, respectively, with ~99% of the genome covered by at least one read. Much of the talk was devoted to their analysis pipeline, summarized as:

Initial alignment of Illumina reads with ELAND
Partitioning the reads into “genome” bins of several kilobases
Local realignment with cross_match in highly parallelized fashion
SNV calling with their “Most Probable Genotype” (MPG) method
Removal of variants with any evidence in the Germline, or ones in dbSNP

The 175,768 novel tumor-specific SNVs were classified as coding (807) or noncoding (174,961). Some 513 of 807 coding variants were nonsynonymous. Of these, 101 were selected for validation; 84 got validation results and 75 somatic coding mutations (89%) were confirmed. Unsurprisingly, Dr. Margulies used his group’s expertise in comparative genomics to closely examine the noncoding variants as well. His group recently annotated “Chai” regions of the human genome, which bear evidence of evolutionary constraint that suggest functional relevance. Some 10,285 of the 174,961 fell within Chai regions, and among them were ~2,000 variants predicted to dramatically alter the local structure of DNA (suggesting regulatory changes).

Sequencing Pre- and Post-Treatment Lung Cancer

Ian Bosdet of BC Cancer Agency presented some very interesting work on mutational profiling of pre- and post-treatment lung cancer tumors. His group had the opportunity to participate in a clinical trial at BCCA in which carefully-selected, treatment-naive NSCLC patients underwent a standard therapeutic program. First, each patient underwent a pre-treatment evaluation and biopsy. Next, they received erlotinib (an EGFR inhibitor) until the disease inevitably progressed. Then, another biopsy that was sent for pathology review, as well as DNA/RNA extraction for sequencing. Transcriptome sequencing yielded some interesting findings. For example, the expression of one gene (IER5L or IER5C, it’s hard to read my own handwriting) was highly expressed in smokers that did not respond to treatment. A screen of unmapped transcript reads against viral genomes revealed the presence of Epstein-Barr Virus transcripts in one tumor that was later re-classified as EBV-positive lymphadenocarcinoma (?).

Mutational profiling for three patients was obtained via exome capture (Agilent) and sequencing of normal, pre-treatment tumor, and post-treatment tumor samples. Somatic mutations in PHACTR2 were seen only in pre-treatment samples. Mutations in a few genes (PRMT10, RanBP2) were found at both times, but a few (YY1AP1, SNX9) were only present after treatment, suggesting a role for these genes in progressive disease.

VarScan 2 Released on SourceForge

February 4, 2010 by Dan Koboldt

Accurate variant detection in massively parallel sequencing data is a significant bioinformatics challenge. Not only do new sequencers offer unprecedented breadth (whole genome) and depth (30x or more), but they suffer coverage biases and error rates that make variant calling difficult. Last year, we published VarScan, our in-house algorithm for SNP and indel detection on next-generation platforms. NGS analysis has changed somewhat since that time; SAM/BAM format was widely adopted, for one thing, and data throughput has skyrocketed.

To address these issues as well as the requests of many users, we have released VarScan 2 on SourceForge.net. The new version features many improvements and enhancements, including:

SAM/BAM compatibility. Rather than reading various native alignment formats, VarScan now accepts as input the “pileup” format of SAMtools. Since most widely used aligners can be converted to SAM format or output it directly, this makes VarScan compatible with a wide array of tools.
Java implementation. To increase speed and performance in the face of ever-increasing sequencing throughput, we’ve implemented the new VarScan in Java. This also means that it can run on any operating system through the Java virtual machine (VM).
New filtering and comparison tools. VarScan 2 now has commands to limit variants to a list of positions or chromosomal regions, which is useful for targeted sequencing projects. It also has a comparison tool that intersects or merges two sets of variants.
Somatic variant detection. This is the flagship feature of VarScan 2 – given pileup files from a tumor sample and matched control (normal), VarScan calls variants (SNPs and indels) and determines their somatic status (Germline, Somatic, LOH) using heuristic and statistical approaches.

Software and Documentation on SourceForge.net

VarScan joins the ranks of some of the most widely used tools for NGS analysis – Bowtie, Maq, BWA, SAMtools, and Picard – that are hosted on sourceforge.net. The download page, user’s manual, and Java documentation for VarScan are already online. There’s a new wiki site and discussion forum for VarScan as well, to help us developers keep in touch with users and the NGS community. The project page will have information about known issues and new software releases.

You’ll find it all at http://varscan.sourceforge.net.

References

Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, & Ding L (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics (Oxford, England), 25 (17), 2283-5 PMID: 19542151

SNP Discovery in NGS Data, Atlas-SNP2, and VarScan

December 29, 2009 by Dan Koboldt

A paper in this month’s Genome Research sheds light on predictors of sequencing error in next-generation sequencing. Using data from both 454 and Illumina platforms, Shen et al applied logistic regression models to identify sequence- and platform-related factors that contribute to substitution (SNP) errors.

atlas-snp2-paper

The results, I think, offer new insight into the challenge of accurate variant detection in massively parallel sequencing data. On the 454 platform, four factors were significantly correlated with erroneous SNP calls:

1.) Base quality of the substitution. No surprise there. Quality scores are inversely correlated with sequencing error rate.

2.) Dinucleotide polymorphism events (DNPs). Runs of 2+ consecutive substitutions are enriched for erroneous calls. In particular, so-called “swap-base” events, in which two adjacent positions invert their nucleotides, are evidently the result of “loss of synchrony” in 454 pyrosequencing.

3.) Neighboring quality standard (NQS). Requiring Q>20 at the SNP position and Q>15 at flanking positions (+/- 5bp) improved SNP calling accuracy. This is not a novel idea, as tools like ssahaSNP incorporated NQS into their algorithms a few years ago. However, I’m intrigued by the application of quality “windows” to reduce false positives in next-gen sequencing.

4.) Distance of base from 3′ end of read, normalized against the read length. This is an interesting source of erroneous SNP calls that I personally have observed in both 454 and Illumina datasets.

Not Significant: Homopolymers (!?), GC Content, Sequence Context

What I find most intriguing about this list is what’s missing: homopolymers, which have long plagued 454 pyrosequencing. The authors noted that the new Titanium base-caller had much improved performance calling homopolymers, so much improvement, in fact, that homopolymer-associated errors failed to reach statistical significance in their analysis. Other factors that were examined but not significant included GC content, sequence context, and substitution type. The Illumina platform had the same significant factors with one exception: swap-base events were not a significant source of error.

Variant Caller Comparison: Atlas-SNP2, Maq, and VarScan

The authors applied their linear regression models on E. coli and built sets of prior probabilities to tune their SNP caller, Atlas-SNP2. Then, they compared the performance of Atlas-SNP2 on to that of two other tools: VarScan and Maq. The comparison datasets included 454 data (about 1/2 region) and Illumina data (1 lane of 76-bp reads) from a highly-characterized strain of S. aureus. The variant calls from each tool were compared to known SNP positions to determine sensitivity and specificity.

DataSet	Caller	Sensitivity	Specificity
454	Atlas-SNP2	97.6%	88.4%
454	VarScan	97.6%	96.8%
Illumina	Atlas-SNP2	98.8%	99.9%
Illumina	VarScan	85.7%	99.9%
Illumina	Maq	4.8%-88.1%*	99.9%
* Maq was applied with various values of -D, from 100 (default) to 618

Please correct me if I’m wrong, but it seems that VarScan out-performed Atlas-SNP2 on 454 data. Baylor is (in my opinion) a leader in 454 sequencing, and published the first whole genome with that platform, so I’m pretty pleased. As one would expect, Atlas-SNP did shine in one area, the high-read-depth Illumina datasets. The authors’ tuning eventually produced a higher sensitivity than VarScan or Maq while maintaining similar specificity. While I acknowledge this achievement, I’m a little wary of any results that report 99.9% specificity for SNP calling in Illumina data.

Will Atlas-SNP2 Be Widely Adopted?

There do appear to be benefits of tuning variant detection software to account for platform-specific sources of error. It’s my opinion, however, that Atlas-SNP2 must overcome two obstacles prior to widespread adoption by the NGS community:

Ruby implementation. The software package comes as a set of Ruby scripts with little documentation. It will take a bioinformatician to run it. That limits the potential of expanding the tool, and/or integrating it with existing pipelines. Also, scripting languages like Perl and Ruby, while easy to write code in, are unlikely to scale to where NGS throughput is headed. No performance data was included in the study, so we don’t know how long it takes to train Atlas-SNP2 on a considerable dataset.
Archaic read mapping/alignment pipeline. This is the central limitation of Atlas-SNP2. The authors rely upon the same pipeline – anchoring with BLAT, then genome partitioning, then local alignment with cross_match – that was utilized in the Watson/454 paper. Come on, guys! This may have been feasible for their limited dataset and small bacterial genomes. But here’s the problem with using old-school aligners: BLAT and cross_match won’t scale to the large genomes and Illumina-like throughput (~20m reads per lane). They also cannot use read pairing information. It looks like there may be preliminary support for alternative alignments in SAM format, but only for 454 data.

The paper itself is well-written and clear, but there’s a lot missing. The methods are brief, and don’t describe the versions or parameters of any software that was used. The sequence datasets don’t appear to be publicly available. The numbers of expected and observed SNPs for S. aureus that were used for sensitivity/specificity calculations are not clear. It’s my understanding that there’s no limit on supplemental materials, so this information should have been provided. I’m surprised, in fact, that this publication made it to Genome Research. The novelty and scope are limited, and probably more appropriate for something like BMC Genomics. I say this, admittedly, with just a touch of envy ;-).

Insights into Next-Generation Sequencing

The publication was informative, however, in correlating several of the key factors contributing to sequencing error. While some of these (quality score, for example) were obvious, the swap-base phenomenon and NQS window approach are worth consideration for those of us developing variant detection algorithms. The linear regression training model may also have some value, especially as we continue to generate variant calls in large datasets and submit them for validation.

References
Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM, Wheeler DA, Gibbs RA, & Yu F (2009). A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome research PMID: 20019143

« Previous Page