Somatic Mutations in Four Human Cancers

July 30, 2010 by Dan Koboldt

In a letter to Nature this week, a group from Genentech presents an elegant analysis of 2,576 somatic mutations across 441 tumors comprised of breast, lung, ovarian, and prostate cancer types and subtypes. Using something called “mismatch repair detection” (MRD) technology, the authors surveyed 1,507 candidate genes spanning some 4 megabases of sequence, largely comprised of known cancer genes and “druggable” genes. MRD apparently uses E. coli to isolate amplicons that contain mutations relative to a reference sequence, which are then assessed for variations by a resequencing tiling array. Matched normal samples were also screened (in pools of five) to eliminate germline events.

I admit to knowing little about MRD or its capabilities, but I’m very familiar with the validation platform (Sequenom), which has proven its value in the HapMap, Cancer Genome Atlas (TCGA), and 1,000 Genomes projects.

Significantly Mutated Genes

Any doubts I had concerning a study from the private sector were quickly swept away, not just by the quality of the journal, but by the analysis that the authors presented. Simply put, I was enchanted. Figure 1, for example, illustrates the significantly mutated gene (SMG) analysis with a grid of eight bubble plots, one per cancer subtype. Significant genes are notable not just by their position on the Y-axis (Mutation q-score), but the size of the bubble, which corresponds to the number of mutations.

The set of SMGs varied across type and subtype, but some patterns immediately jump out. PIK3CA and TP53 were the most significant across three breast cancer subtypes. TP53, in fact, was significant across all eight subtypes, most strikingly in lung and ovarian cancer. KRAS stood out in pancreatic cancer and lung adenocarcinoma, but not squamous lung carcinoma.

On average across all tumors studied here, the authors found 1.8 protein-altering mutations per megabase, with the highest rates seen in lung adenocarcinomas (3.5/Mb) and squamous carcinomas (3.9/Mb). The lowest mutation rate (0.33/Mb) was in prostate tumors, 75% of which harbored the TMPRSS2-ERG gene fusion. These patterns are consistent with Figure 1, where prostate shows a sparse handful of significant genes, while lung cancers have large and diverse sets of them.

Integrated Copy Number and Mutation Analysis

Next, the authors integrated their mutations with Agilent 244K array CGH copy number data to identify genes that were significantly altered, either by mutation, copy number, or both. In Figure 2a, the authors plotted significantly altered genes by their copy number gain or loss, which nicely separated oncogenes and tumor-suppressor genes. The integrated analysis identified 35 additional cancer genes including STK11, EPHB1, and notably GNAS (the G-protein alpha subunit). GNAS proved an important finding, as it was mutated and amplified across several human cancers.

Pathway-based and Recurrency Analyses

The integrated dataset identified two pathways – RTK signaling and RAS/MAPK as the most significantly altered across all tumor types. Furthermore, when the authors compared their dataset with the COSMIC database and the findings of recent cancer sequencing studies, they pinpointed novel recurrent mutations in several genes including HER2, NOTCH4, and PIK3R1.

The authors conclude that their study “represents a substantial expansion of the knowledge base of cancer somatic mutations,” and I tend to agree. They not only generated a rich dataset, but also analyzed and presented it in comprehensive fashion. Furthermore, they (perhaps unsurprisingly) identify numerous cancer genes that are druggable targets, thereby translating these findings into actionable information.

References

Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, Moorhead M, Chaudhuri S, Tomsho LP, Peters BA, Pujara K, Cordes S, Davis DP, Carlton VE, Yuan W, Li L, Wang W, Eigenbrot C, Kaminker JS, Eberhard DA, Waring P, Schuster SC, Modrusan Z, Zhang Z, Stokoe D, de Sauvage FJ, Faham M, & Seshagiri S (2010). Diverse somatic mutation patterns and pathway alterations in human cancers. Nature PMID: 20668451

Not-so-whole Exome Sequencing

June 30, 2010 by Dan Koboldt

There is growing interest in applying next-generation sequencing to targeted regions of interest, particularly the “exome” – the set of coding exons in the human genome. A paper in Genome Biology from Matthew Bainbridge and colleagues at Baylor describes solution-phase exome capture and sequencing of a HapMap sample with just 3 GB of data. The 1,000 Genomes Project recently announced a new pilot study focused on exome sequencing for hundreds of individuals. A few studies of human exome resequencing to identify disease genes have been published, and more are sure to come as genome centers ramp up their exome capabilities.

Yet this week’s In Sequence magazine writes that there are concerns about what exome capture is missing. For example, at CHI’s Beyond Sequencing meeting this week, researchers from NCI reported that current exome capture projects omit some medically important genes, such as insulin, ABO blood group, and HLA. Of course, some of this can be attributed to GC-rich exons and other tough-to-capture regions. The concern is that many RefSeq coding sequences aren’t even targeted by the two commercial platforms – 23% are missing from Nimblegen’s 2.1m array, and 17% are missing from Agilent’s SureSelect (according to the NCI group).

Exome Sequencing on Illumina and SOLiD

Even so, exome sequencing is rapidly reaching maturity. The Baylor study, led by Matt Bainbridge, used a customized Nimblegen solution-phase capture product to target 36 Mbp of consensus coding sequence (CCDS), and sequenced capture libraries on both ABI SOLiD and Illumina GAII platforms. Six individual capture libraries were generated from HapMap sample NA12812. Four were sequenced as technical replicates on SOLiD, while two more libraries went to Illumina single-end and paired-end sequencing.

On average, some 49.6% of mappable reads from the four SOLiD libraries were derived from target regions, with the remainder mapping elsewhere in the genome. The target coverage correlation between the four replicates was 98%, suggesting that reproducibility across capture and SOLiD sequencing was pretty good.

Duplication Rates in Exome Capture

The authors performed a detailed analysis of duplication rates in their data, a metric that is critical to the unique coverage and downstream analysis. The duplication rate for three SOLiD libraries with 3GB of data was ~22%, and highly consistent between replicates. Duplication was higher (~33%) in the fourth SOLiD library, which is not surprising since it had more than three times (10 GB) the data.

Intriguingly, the authors used simulations to demonstrate that the “expected” duplication rates for 3GB and 10GB of data are 14% and 22% by random chance, suggesting that as many as one-third of observed duplicates are not artifactual, but chance events.

Paired-end sequencing offers the opportunity to identify duplicates using both reads in a read pair. Theoretically, this should help distinguish artifacts from chance events. Indeed, the authors observed a dramatic difference in duplication rate between the Illumina fragment-end (30.97%) and paired-end (8.3%) libraries, even though both generated about 2.5 GB of data. They surmised that the improved identification of duplicates from paired-end sequencing, not a difference in library construction, was the reason. When pairing information was ignored, the duplication rate in the PE library nearly quadrupled to 27.6%.

SNP Discovery and HapMap Concordance

Because this was a HapMap sample, the authors were able to compare SNPs identified in sequencing to known genotypes from the HapMap Project. Genotype concordance in the target regions was 82% for 3GB libraries and 92% for 10GB libraries, but importantly, this considered all sites regardless of coverage. When the authors limited comparisons to sites with >=9x unique read depth, concordance was ~95%. That’s still a bit low for my taste, but within the realm of expectation for sequence-to-genotype comparisons.

SOLiD Versus Illumina Sequencing

I was pleased that Bainbridge and his colleagues made some direct comparisons between SOLiD and Illumina sequencing. This is a delicate issue, from the point of view of the sequencing vendors, but one of great interest to the NGS community. The Illumina PE data yielded ~25% more SNP calls in target regions, with higher HapMap concordance (98%) than ABI SOLiD data (95%). The authors attribute this to the better mapping, higher coverage, and low duplication rate made possible by paired-end sequencing. Considering only HapMap heterozygous SNPs, SOLiD out-performed Illumina at low (<9x) coverage, but Illumina consistently yielded 2-3% higher concordance at high coverage.

In their concluding section, the authors write “Interestingly, Illumina sequencing consistently shows higher levels of enrichment than SOLiD sequencing. This is unexpected because both sequencing platforms yield similar coverage distributions in whole genome sequencing data… therefore we suspect that differences in efficiency are due to an increase in initial library complexity from better annealing efficiencies of the Illumina adapter.”

Such a frank conclusion, from a group that’s highly invested in SOLiD sequencers, is especially poignant. When it comes to exome sequencing, Illumina seems to have the advantage.

References
Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D’Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond TA, Jedeloh JA, Muzny D, Albert TJ, & Gibbs RA (2010). Whole exome capture in solution with 3Gbp of data. Genome biology, 11 (6) PMID: 20565776

Transcriptome Genetics with HapMap and RNA-Seq

April 22, 2010 by Dan Koboldt

Two papers in Nature this month leverage the power of second-generation sequencing technologies to investigate gene expression variation in human cell lines. By performing RNA-Seq in HapMap cell lines, the authors generated the most extensive gene expression data to date for these samples, and were able to use publicly available HapMap genotypes to associate expression differences with genetic variation. This strategy was applied to the HapMap samples two years ago using expression microarrays. Using RNA-Seq instead of microarrays, however, offers a few key advantages:

More accurate quantification of highly abundant transcripts, where microarrays reach saturation
Access to rare transcripts below the sensitivity threshold for microarrays
Detection of novel gene structure from alternative splicing and unannotated exons
Identification of allele-specific expression

The first study, from Jonathan Pritchard’s lab at the University of Chicago, sequenced RNA from 69 Yoruban (African) individuals on the Illumina GAII platform. They generated at least two lanes per individual, for a total of 1.2 billion reads, of which 964 million (80%) mapped uniquely to the genome or to exon-exon boundaries. The second study, from Emmanouil Dermitzakis’s group at the Sanger center, sequenced RNA from 60 CEU (CEPH Europeans from Utah) individuals, also on the Illumina GAII platform. They generated one lane of paired-end data per individual, for a total of about 1.0 billion reads. Since neither study provided a table summarizing their data (which I’d have liked), I put one together:

Study	Pickrell et al.	Montgomery et al.
Samples	69, African descent (YRI)	60, European descent (CEU)
Sequencing	Illumina 1X35 or 1X46	Illumina 2X37
Reads/Sample	17.4 million	16.9 million
SNP Dataset	HapMap II/III	HapMap III
Total SNPs	3.8 million	1.2 million

Up to this point, the two studies sounded nearly identical. For the data analysis, however, each group went in a different (and interesting) direction.

Pooled Data for Discovery of Novel Gene Structures

Novel Exons. The Pritchard group pooled all data to examine the completeness of current gene annotations. Some 86% of uniquely mapped reads corresponded to known exons. Using conservation data from alignments of 28 vertebrate exomes, the authors identified 4,031 regions that are evolutionarily conserved and show evidence of transcription. About one-quarter of these appear to be part of spliced transcripts, but most appeared to be novel untranslated regions (UTRs). Some 115 regions, however, had sequences consistent with protein-coding exons. To investigate the possibility that their novel exons are real, the authors used RNA-Seq data from several human tissues and chimpanzee cell lines. The evidence suggests that their regions do represent novel exons, but ones that are expressed in a more tissue-specific fashion than annotated exons.

Novel Poly-A Sites. The authors next screened the ~70 million unmapped sequence reads for long runs of A or T nucleotides, which might indicate novel poly-adenylation sites. Of the ~8,000 novel sites that they identified, some 45% fell within 10 bp of a known cleavage site. To further validate their findings, they screened their poly-A regions for the binding site of the CPSF polyadenylation factor, and found a 32-fold enrichment for the CPSF target hexamer. The net result was a high confidence set of 3,481 cleavage sites that show evidence of poly-A (from RNA-Seq data) and CPSF binding.

RNA-Seq: 10 million Reads Is All You Need

The Dermitzakis study generated 16.9 million (+/- 5.9 million) reads per individual, which were mapped to the NCBI 36 reference sequence using Maq with a maximum insert size of 2 megabases). The resulting alignments were filtered to remove alignments with low mapping quality or to the X, Y, or MT chromosomes. Discordant read pairs (by distance or orientation) were also removed. To quantify the expression of known exons/transcripts/genes, the authors scaled read counts for each individual to a theoretical yield of 10 million reads, and only considered exons with data in >90% of individuals. This resulted in data for 90,064 exons from 10,777 genes, of which 95% had at least 10 reads (on average) per individual. While the normalization seems to reduce the dataset to less than half of known genes, it nevertheless provided an extensive view of gene expression across these 60 individuals.

Cis-Regulatory Effects on Gene Expression

Using HapMap genotypes for 1.2 million SNPs, the Dermitzakis group identified 836 genes associated with cis-regulatory variants (compared to 539 genes identified in microarray studies of the same individuals). Even when normalized for the number of genes tested, the increased resolution of RNA-Seq over microarrays yielded a larger number of genetic regulatory effects. The RNA-Seq exon eQTLs (expression quantitative trait loci) were enriched for abundant transcripts, suggesting that saturation of highly expressed exons reduces the sensitivity for microarrays to detect some cis-regulatory effects.

The Pritchard group searched for cis-regulatory variation with an even larger dataset – RNA-Seq for 69 individuals and 3.8 million HapMap SNPs. They identified 929 genes with local eQTLs (4.6% of annotated genes); consistent with previous findings, virtually all SNPs associated with expression level were near the corresponding gene. They also reported the overlap with the CEU study results: the top 500 associations reported in CEU samples were enriched 10 to 40-fold for significant eQTLs in YRI samples. Given the marked genetic differences between these two populations, this result suggests that these studies are identifying replicable cis-regulatory events.

Mechanism of Cis-Regulatory Effects

An important feature of RNA-Seq data is that it can be used not only to detect cis-regulatory variation, but to assess the mechanism by which these variants act. The Pritchard group looked at 222 of their 929 eQTLs for which the associated SNPs fell within the gene exons. They classified the RNA-Seq reads as originating from the high-expression haplotype or the low-expression haplotype, and found that for 195 of the genes (88%), more than 50% of the expressed transcripts carried the allele associated with high expression. Therefore, the modulation of gene expression is a direct result of the associated variation (probably by activating nearby cis-regulatory elements). In other words: the eQTL tells us that variants near the gene are associated with its expression. That means something nearby is regulating it. The fact that the haplotype associated with increased expression is the haplotype that predominates tells us that the high-expression allele is what drives the expression of its nearby gene. As opposed to, say, driving expression of the gene from both chromosomes.

Allelic Effects on Splicing

Finally, both groups looked at the actual content of expressed transcripts, to find SNPs associated with alternative splicing. The Pritchard group calls these splicing quantitative trait loci (sQTLs), and found 187 genes with significant associations. Binding sites for known splice factors (U1 snRNP and U2AF) were enriched for sQTLs, as were SNPs within 2 bp of a canonical splice site. The Dermitzakis group found 110 genes with significant associations, and stratified splicing-associated variants according to their position in the gene structure. When tested against the exons upstream and downstream of where they resided, splice donor variants were enriched 3.17-fold with the upstream (5′) exon, while splice acceptor variants were enriched 7.02 fold with the downstream (3′) exon. Thus, these SNPs affect the inclusion/exclusion of their exons in the mature transcript.

Dermitzakis’s group visually examined their most significant associations to characterize the mechanism of splicing regulation. Of the 110 significant sQTLs identified in CEU samples:

41% were single exon skipping events
17% created an alternate acceptor
13% were double or triple exon skipping events
6% created an alternate donor
5% were mutually exclusive exons
5% were retained introns.

In summary, these studies establish the feasibility of transcriptome sequencing to assess gene expression and characterize regulatory variation. Indeed, as the title of one study suggests, RNA sequencing is a powerful tool for studying the mechanisms underlying human gene expression variation, and will undoubtedly yield better understanding of the complex relationships between genotype and phenotype.

References
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, & Pritchard JK (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464 (7289), 768-72 PMID: 20220758

Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, & Dermitzakis ET (2010). Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 464 (7289), 773-7 PMID: 20220756

Mapping Bias in Short Read Alignment

December 11, 2009 by Dan Koboldt

A recent paper in Bioinformatics investigates the effect of read-mapping biases on detecting allele-specific expression (ASE) from RNA-Seq data. The authors generated 16 million 36-bp cDNA reads in each of two HapMap individuals on the Illumina/Solexa platform. When evaluating known SNPs for evidence of ASE, they observed that heterozygous SNPs exhibited a mapping bias favoring the reference allele.

mapping-bias-header

This alone is perhaps not surprising, as we already knew that indels suffer from such bias. Initially, most short read aligners simply ignored gapped alignments. Now, even with aligners like BWA and Novoalign that allow for gaps when mapping short reads, alignments supporting the reference allele (ungapped) will be favored over alignments supporting an indel (gapped). The longer the indel, the larger the gap, and the less likely a short read would be to be mapped across it.

It is easy to see how SNPs might have a similar effect. Clusters of SNPs in close proximity, for example, may result in reads with more mismatches than are permitted by the aligner. In simulations, the authors found that random error (i.e. sequencing error) exacerbated the mapping bias. At an error rate of 0.01, some 51.4% of reads at heterozygous sites supported the reference allele, while an error rate of 0.05 increased the proportion to 59%. My own conclusion based on these results is that a variant allele, combined with nearby sequence changes that result from random error, pushes the mismatch profile of certain reads above the threshold at which alignments are discarded.

SNP-Masking Reveals Inherent Bias

What is surprising in this study by Degner et al is that even after they masked SNP positions in the reference sequence, some 5-10% of SNPs still had an inherent mapping bias favoring one allele. For 1.4% of SNPs, in fact, all of the reads came from a single allele. This obviously has important implications for evaluating ASE in RNA-Seq data, since the relative frequency of alleles from read mapping is used to infer allelic expression. It also affects the now-widespread application of Illumina/Solexa and ABI/SOLiD sequencing to characterize genetic variation from genomic DNA. Because virtually every variant calling algorithm relies on the ratio of reads supporting variant versus reference alleles, an inherent mapping bias favoring the reference allele will reduce the detection sensitivity.

Mapping Bias and Sequence Homology

To better understand the causes of inherent mapping bias, the authors investigated some of the most severely affected SNPs. The strongest biases occured among SNPs in regions of the genome with homology to other locations. When the SNP position was not masked, variant-containing reads matched another locus equally or even better than the true location. When the SNP position was masked, both reference- and variant-containing reads had a 1-bp mismatch to the reference, but either allele might match better elsewhere in the genome. In Figure 3, two examples of such SNPs demonstrate how variant-containing reads either mapped incorrectly or were “not mapped.” Some of these “not mapped” reads may have exceeded the number of allowable mismatches, while others may have become non-unique (i.e. matching multiple places). The authors filtered any alignments with mapping quality of 0, so it’s unclear which caused the mapping failure.

I should point out here that the masking approach may have contributed to this result. The authors “masked” heterozygous SNPs by changing the reference base to a third allele that matched neither reference nor the known variant. A superior approach might be to mask heterozygous SNPs to N, so that any base call at that position is considered a match. This would reduce the number of read mismatches overall, and might help improve the bias. Then again, some read aligners may consider any base at “N” to be a mismatch, which would have essentially no effect. What might have been interesting, though, is increasing the # and base-quality-sum of mismatches allowed by Maq to see if the read bias was removed.

Implications Moving Forward for ASMB

Your reaction might be to shrug, since Illumina/Solexa now routinely generates 76-bp and 100-bp reads. There are, however, a number of reasons why this might not address the bias issue. First, while read lengths are getting longer, alignment “seeds” for short reads are essentially unchanged, and if the SNP occurs in the ~22-25 bp alignment seed, it can still have an effect. Second, many published datasets these days are still based on read lengths of 50 bp or less, especially from groups running ABI/SOLiD or older Illuminas. Third, at least one promising single-molecule sequencer is still generating reads in the 30 bp range. And finally, there’s a practical reason that we’ll continue to see short read datasets: running a 75-bp or 100-bp Illumina flowcell takes several days and multiple kits – expenses of time and dollars that may not always be available. Thus, allele-specific mapping bias (ASMB) [acronym invented, D. Koboldt, 12/11/09] in short reads will remain a key issue in next-generation sequencing.

References
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, & Pritchard JK (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (Oxford, England), 25 (24), 3207-12 PMID: 19808877

« Previous Page