Integrative Copy Number Analysis of Cancer Exomes

May 25, 2012 by Dan Koboldt

Exome sequencing has proven to be an incredibly powerful tool in the field of cancer genomics. It’s not just for mutations, either. The extensive coverage and continued improvement of current exome kits enable them to survey >90% of coding regions, as well as microRNAs, untranslated regions (UTRs), and other parts of the genome likely to harbor functional variation. Enrichment for such regions, coupled with the growing throughput of next-gen sequencing, yields sufficient coverage to identify small sequence variants (SNVs and indels) with great accuracy. New analysis methods tailored to exome data have also shown that it’s possible to search for gene fusions and even copy number alterations (CNAs).

One of our programs, VarScan 2, offers a comprehensive suite of analysis tools for cancer exome sequencing. Given exome data for a tumor and its matched (usually blood) normal sample, VarScan 2 identifies SNVs and indels and classifies them according to somatic status: germline (inherited), loss of heterozygosity (LOH), or somatic (acquired). Also, by directly comparing exome sequence depth, VarScan 2 detects changes in copy number in the tumor relative to the normal. This works well because the two samples (tumor and normal) are largely identical at a genetic level, and usually they’re sequenced under identical protocols. We published the algorithm in Genome Research late last year.

Segmented Exome CNA analysis of chr17

The output of the algorithm is a set of regions, each with chromosomal coordinates and a log2-ratio of copy number change. It’s very similar to array-based copy number data, and amenable to the same segmentation methods. Thus, we apply circular binary segmentation (CBS) to smooth the data and detect significant copy number change-points.
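To make the log2-ratio idea concrete, here is a minimal sketch of how a depth ratio for one region could be computed from tumor and normal coverage. It is not VarScan 2's actual implementation: the function name, the read-total scaling, and the example regions are all assumptions chosen for illustration, and the segmentation step (CBS) is not shown.

```python
import math

def log2_depth_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """Log2 ratio of tumor to normal depth for one region, after scaling
    for the difference in overall sequencing output between the samples."""
    if tumor_depth <= 0 or normal_depth <= 0:
        return None  # no usable coverage in at least one sample
    # Scale the tumor depth as if both samples had the same total coverage.
    scale = normal_total / tumor_total
    return math.log2((tumor_depth * scale) / normal_depth)

# Hypothetical totals and regions (name, tumor mean depth, normal mean depth).
TUMOR_READS = 50_000_000
NORMAL_READS = 50_000_000
regions = [
    ("region_A", 30.0, 60.0),   # half the normal depth
    ("region_B", 240.0, 60.0),  # four times the normal depth
    ("region_C", 60.0, 60.0),   # unchanged
]
ratios = [
    log2_depth_ratio(t, n, TUMOR_READS, NORMAL_READS) for (_, t, n) in regions
]
print(ratios)  # [-1.0, 2.0, 0.0]
```

In a tumor sample with no normal-cell contamination, a ratio near -1 would be consistent with loss of one of two copies, and a ratio near 2 with a four-fold gain; real samples are rarely that clean.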

See the example at right, which is from a Her2+ breast cancer. You’re looking at several alterations on chromosome 17; the right-most peak is a focal amplification of ERBB2, the gene encoding the Her2 receptor.

Integrative Analysis of Tumor Exomes

Allele freqs of heterozygous SNPs in normal (blue) and tumor (green).

Cancer genomes often harbor numerous types of genetic alterations – mutations, structural variation, gene conversion events, etc. No single approach can survey everything at once, but exome sequencing is advantageous because mutations, copy number changes, and zygosity changes can be characterized simultaneously. A look at the allele frequency plot for the same chromosome offers some clarification:

A large hemizygous deletion with concurrent LOH (red lines) spanning the first ~20 Mbp of the chromosome including the TP53 gene.
Allele frequency shifts under the amplifications: it looks like ERBB2 has 2 extra copies, and the other amplification has 3 or 4.
Another possible event spanning the last ~25 Mbp of the chromosome, possibly a deletion with incomplete LOH.
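The link between copy number and allele frequency shifts at heterozygous SNPs can be sketched with simple arithmetic. The function below is an illustration only: it assumes a pure tumor sample (no normal-cell contamination) and known integer copies of each parental allele, neither of which holds exactly for real data.

```python
def b_allele_fraction(copies_a, copies_b):
    """Expected fraction of reads carrying allele B at a germline
    heterozygous SNP, given the tumor's copies of alleles A and B.
    Assumes a pure tumor sample (an idealization)."""
    total = copies_a + copies_b
    if total == 0:
        return None  # homozygous deletion: no reads expected
    return copies_b / total

print(b_allele_fraction(1, 1))  # 0.5  (normal diploid heterozygote)
print(b_allele_fraction(1, 3))  # 0.75 (two extra copies of one allele)
print(b_allele_fraction(1, 0))  # 0.0  (hemizygous deletion with LOH)
```

Under these assumptions, two extra copies of one allele would push heterozygous SNPs toward 25% and 75%, while a hemizygous deletion would push them toward 0% and 100%. Normal-cell admixture pulls the observed values back toward 50%.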

This type of analysis, integrated with the somatic mutation data, helps distinguish driver genes that are targeted for alteration in tumors, e.g. amplification of ERBB2 and deletion with LOH of TP53.

Why Survey Exome Copy Number?

I can think of at least three reasons why exome-based copy number analysis is appealing:

High resolution. Surprisingly, between the number of gene targets and the “shoulder effect” of hybrid capture, exome data provides fairly comprehensive coverage genome-wide. Thus, exome-based CNA analysis provides a high-resolution survey of copy number alterations in a tumor genome.
Unexploited datasets. In my post cancer genome and exome sequencing in 2011, I summarized dozens of studies of leukemia, melanoma, carcinoma, and other cancers enabled by exome sequencing. None of these, to my knowledge, included a copy number analysis from exome data. Most did array-based CNA analysis (usually SNP arrays), which provides more uniform coverage of the genome, but has far lower resolution than exome-based CNA analysis.
Lots of data will be out there. The exomes of thousands of tumors have been sequenced already, and many of these datasets are in the public domain. The Cancer Genome Atlas Consortium, for example, has sequenced >1,000 tumor-normal exome pairs. Its study of ovarian carcinoma was published last year; studies of breast carcinoma, endometrial cancer, acute leukemia, and other tumor types will likely be out within a year.

Just a friendly reminder for everyone mining these rich datasets: the data are made available with the understanding that you can’t publish on them before the “marker paper” from TCGA comes out for that dataset. But you can certainly get started.

References

Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, & Wilson RK (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research PMID: 22300766

Integrating copy number and gene expression data in breast cancer

May 11, 2012 by Dan Koboldt

A study in Nature reports the genomic and transcriptomic architecture of breast cancer from a survey of ~2,000 tumors. These samples were collected in Canada and the UK; what makes the collection particularly valuable is that they were fresh-frozen and clinically annotated, with long-term follow-up. Patients whose tumors were ER-negative and/or lymph-node-positive had received systematic chemotherapy, ER-positive or LN-negative patients had not, and none of the patients with Her2+ tumors received Herceptin (trastuzumab). Thus, the tumors were all clinically homogeneous within subgroups, making this a great resource to study the genomic landscape of breast cancer.

Breast Cancer Subtypes

A quick overview of breast cancer subtypes seems appropriate here. Most breast cancers are carcinomas, meaning that they arise from epithelial cells. A histology review typically classifies these as originating from the milk ducts (ductal) or the milk-producing glands (lobular) of the breast. Tumors can also be assigned to subgroups on the basis of gene expression: a 50-gene assay called PAM50 is widely used to classify tumors as one of 4-5 “intrinsic” subtypes. Among the most important genes from a clinical perspective are those encoding estrogen receptor (ER), progesterone receptor (PR), and Her2 (ERBB2) receptor. The four most common intrinsic subtypes:

Subtype	Typical ER/PR/Her2 Status	Prevalence	Notes
Luminal A	ER+ and/or PR+, Her2-	42-59%	Most common and best prognosis
Luminal B	ER+ and/or PR+, Her2+	6-19%	Slightly worse prognosis
Her2-enriched	ER-, PR-, Her2+	14-20%	Often poor prognosis
Basal-like/Triple-negative	ER-, PR-, Her2-	7-12%	Often aggressive, poorer prognosis
Source: Susan G. Komen Foundation

There is substantial but incomplete overlap between basal-like and triple-negative breast cancer. Their genetic basis is not as well understood, and they typically don’t respond to hormone or Her2-targeted therapies because they don’t express ER, PR, or Her2.

Integrating SNP and Copy Number Data with Gene Expression

In this study, the authors assessed the impact of SNPs, inherited copy number variants (CNVs), and acquired copy number alterations (CNAs) on the gene expression landscape. With the statistical power of 2,000 samples (half in a discovery set, half in a validation set), they were able to search for both cis-regulatory (variants affecting nearby genes) and trans-regulatory (variants affecting distant genes) relationships. Genome-wide analysis of variance (ANOVA) revealed that germline SNPs/CNVs and somatic CNAs influenced >39% of gene expression probes, roughly half acting in cis and half in trans.

Somatic CNAs dominated the regulatory picture, contributing to >96% of significant expression associations
On a gene-by-gene basis, however, germline SNPs rivaled CNAs in the proportion of expression variation they explained.
The contribution of inherited CNVs was minimal by comparison
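As a rough illustration of the kind of test behind these associations, the sketch below computes a one-way ANOVA F statistic for a single expression probe, with tumors grouped by copy number state at a locus. This is not the authors' model (their analysis was genome-wide and more elaborate); the grouping into loss/neutral/gain and the expression values are made up for the example.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over lists of expression values,
    one list per group (e.g. tumors with loss, neutral, or gain)."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation between group means versus variation within groups.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expression values for one probe, grouped by copy number state.
loss = [4.0, 5.0, 6.0]
neutral = [6.0, 7.0, 8.0]
gain = [9.0, 10.0, 11.0]
f_stat = one_way_anova_f([loss, neutral, gain])
print(f_stat)  # about 19: expression tracks copy number in this toy example
```

A large F statistic means the group means differ by more than the within-group scatter would explain; converting it to a p-value requires the F distribution, which is omitted here.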

Although the dominating influence of somatic CNAs is understandable, the relatively small contribution of CNVs to the expression picture is rather surprising. It’s possible that inherited regions of CNV with strong influence on gene expression are targeted for amplification/deletion by cancer cells, which might obscure their effect in an otherwise normal cell. Otherwise, it does seem to suggest that germline SNPs have a greater influence than CNVs when it comes to modulating gene expression.

Cis versus Trans Regulation

Some ~20% of loci examined exhibited cis-regulatory associations between somatic CNAs and gene expression. In other words, acquired copy number alterations influence the expression of genes within them or nearby. The authors undertook a higher-resolution survey of these associations within tumor subtypes, finding known driver events, such as amplifications of MYC, CCND1, ERBB2, and CCNE1 and deletions of PTEN and MDM2, as well as putative but suggestive events involving MDM1, MDM4, CDK3, CDK4, PI4KB, NCOR1, and others. They also highlight three apparently novel cis-regulatory associations that may influence breast cancer development and progression:

Loss of PPP2R2A, a regulatory sub-unit of a complex that governs mitotic exit. Somatic mutations in another subunit of the same complex (PPP2R1A) were recently identified in clear cell ovarian cancers and endometrioid cancers.
Frequent deletion of MTAP that co-occurs with deletion of known tumor suppressors CDKN2A and CDKN2B.
Recurrent deletion of MAP2K4 concomitant with outlying expression in ER-positive cases.

To examine trans-regulatory events, the authors plotted matrices of CNA-expression relationships by chromosome (gene location on the Y-axis, CNA location on the X-axis). Visualized in this manner, the diagonal is where a CNA influences a gene at the same location, so any patterns off of the diagonal indicate a trans-acting event. There was strong evidence of such patterns on chromosomes 1q, 7p, 8, 11q, 14q, 16, 17q, and 20q, all of which are the targets of frequent large-scale copy number alteration in breast cancer.
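The cis/trans distinction can be expressed as a simple rule on genomic coordinates. The sketch below is one hypothetical way to label an association; the 3 Mbp window is an arbitrary assumption for illustration, not a threshold taken from the study.

```python
def classify_association(cna_chrom, cna_pos, gene_chrom, gene_pos,
                         window=3_000_000):
    """Label a CNA-expression association as 'cis' when the CNA and the
    gene lie on the same chromosome within `window` bp, else 'trans'.
    The window size is an assumption, not a value from the paper."""
    same_chrom = cna_chrom == gene_chrom
    if same_chrom and abs(cna_pos - gene_pos) <= window:
        return "cis"
    return "trans"

# Made-up coordinates for illustration.
print(classify_association("17", 37_800_000, "17", 37_850_000))  # cis
print(classify_association("17", 37_800_000, "8", 128_700_000))  # trans
print(classify_association("17", 37_800_000, "17", 70_000_000))  # trans
```

In the matrix view described above, the first case sits on the diagonal and the other two sit off it.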

The “hotspots” of these trans associations, when grouped by pathway, highlight known targets of dysregulation in breast cancer such as ERBB2 and MYC. You might notice that these two were also in the cis-regulatory association list above, and make the intuitive leap to conclude that amplifications targeting ERBB2 (on chr17) and MYC (on chr8) increase the expression of these genes, which in turn drives expression changes for genes elsewhere in the genome.

Integrative Clustering Reveals Novel Subgroups

The authors next took 997 tumors in the discovery set, integrated copy number and gene expression data, and performed clustering analyses to identify subgroups of tumors with distinct features and clinical outcomes. They came up with 10 “integrative clusters”, which they replicated in the validation set (995 cases). Among these clusters are some interesting subsets:

A high-risk, ER-positive subgroup with a steep mortality trajectory (bad), composed of 11q13/14 cis-acting luminal tumors that harbor other common alterations. The authors note that 11q13 contains the CCND1 gene, frequently targeted for amplification in breast cancer. This is an important exception to the often favorable prognosis for ER+ tumors.
A subgroup of predominantly luminal A cases with low genomic instability that was enriched for histology types with good prognoses (e.g. lobular and tubular carcinomas).
Another subgroup with favorable prognosis, but containing a mixture of ER statuses and subtypes. Their common feature was a nearly flat copy number landscape. The authors note that this “CNA-devoid” subgroup is “ripe for mutational profiling.”
A stable, mostly high-genomic-instability subgroup comprising nearly all basal-like tumors with good long-term outcomes.
A group of Her2-enriched and ER-positive tumors with ERBB2 amplification. These patients were all enrolled before Herceptin (trastuzumab) became available, and had the worst disease-specific survival.

These findings demonstrate how useful it is to construct a cohort, not just of many cases, but with long-term follow-up so that researchers can link the genomic architecture of tumors to the eventual death or survival of the patients.

References

Curtis, C., Shah, S., Chin, S., et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups Nature DOI: 10.1038/nature10983

The genetic architecture of triple-negative breast cancer

April 19, 2012 by Dan Koboldt

Triple-negative breast cancer (TNBC), a tumor type defined by its lack of estrogen receptor, progesterone receptor, and Her2 (ERBB2) amplification, accounts for 16% of breast cancers. This clinically defined tumor type overlaps substantially but not completely with “basal-like” breast cancer, a classification based upon gene expression signature. This is a highly heterogeneous disease with a higher risk of recurrence in the absence of systemic therapy.

This month in Nature, researchers from BC Cancer Agency have characterized the landscape of genomic aberrations in 104 TNBC cases with a combination of whole-genome sequencing, exome sequencing, RNA-seq, and high-density SNP arrays. Using ultra-deep targeted resequencing, the authors validated ~2,500 somatic mutations and characterized their frequencies among heterogeneous clonal populations in each tumor.

Genomic architecture (Basal TNBC), Shah et al, Nature 2012

This is a complex study that’s hard to digest (the supplemental material had over 140 pages – come on, is that really necessary?) so I’ll do my best to break it down. I believe there are three highlights: frequent gene alterations in TNBC, under-representation of mutations in mRNA sequences, and a continuous distribution of mutation frequencies within tumors.

Genetic Alterations in Triple-Negative Breast Cancer

The most frequently mutated gene should be familiar to you: TP53, which harbored validated somatic mutations in 62% of basal and 43% of non-basal TNBCs. Other patterns of alteration were as follows:

Significantly mutated genes included TP53, PIK3CA, RB1, PTEN, MYO3A, and GH1. Here, significant means that the gene harbored more mutations than expected from background random mutation processes. The larger the gene, the more likely it is to catch a random mutation. That’s why USH2A, which was mutated in 9.2% of cases, was not significant (it’s a large gene).
Recurrent but not statistically significant mutations were observed in the nesprin genes (SYNE1/SYNE2), BRCA2, BRAF, NRAS, ERBB2, and ERBB3.
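The effect of gene size on significance can be sketched with a simple Poisson model: the expected number of background mutations scales with coding length, so the same observed count is far less surprising in a large gene. This is a deliberately simplified illustration, not the test used in the paper; the background rate, gene lengths, and counts below are hypothetical.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def mutation_excess_pvalue(observed, coding_bp, n_tumors, rate_per_bp):
    """Chance of seeing at least `observed` mutations in a gene if
    mutations fell at random at `rate_per_bp` per base per tumor."""
    expected = coding_bp * n_tumors * rate_per_bp
    return poisson_tail(observed, expected)

# Hypothetical inputs: 100 tumors, 2 background mutations per Mbp per tumor.
N_TUMORS, RATE = 100, 2e-6
p_small = mutation_excess_pvalue(5, 1_200, N_TUMORS, RATE)    # small gene
p_large = mutation_excess_pvalue(5, 15_000, N_TUMORS, RATE)   # large gene
print(p_small < p_large)  # True: same count, much stronger evidence
```

Published methods refine the background rate by mutation type, sequence context, and coverage, but the gene-size intuition is the same.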

Somatic Copy Number Alterations (CNAs)

The patterns of somatic copy number changes, as assessed by high-density SNP array, suggest widespread segmental CNA instability:

Shah et al, Nature 2012 (Supp Fig 3)

These results are largely consistent with a separate study in the same issue that examined CNAs in 2,000 breast tumors. I’ll have to cover that another time. Some of the known CNA patterns evident above include gains of chromosomes 1q, 3q, and 8q (where MYC is located). Note frequent deletions across many chromosome arms or entire chromosomes, many of which contain tumor suppressor genes (e.g. TP53 on chromosome 17).

Expression of Somatic Mutations

This group at BC Cancer Agency is a leader in transcriptome sequencing (RNA-Seq), which is a key component of this study. Strikingly, the authors found that just 36% of validated somatic mutations discovered in genomic DNA were present in mRNA transcripts. This number is a little deceptive, and I’ll tell you why. Supplementary figure 2 offers a summary of the expression of validated mutations across all cases with RNA-seq data:

Mutation Expression Patterns (Shah et al, Nature 2012)

Notably, 23% of somatic mutations occur in genes with no observed transcripts: there’s no allelic effect of the mutation; the genes just aren’t expressed in any form. That leaves:

40.56% of mutations where only the wild-type allele is expressed. Here, it’s possible that the mutation alters mRNA expression or stability and thus only the non-mutated allele is seen.
31.48% where both alleles are expressed. The mutation may not affect expression, but it could still alter the translation or function of the encoded protein.
5% where only the mutant allele is expressed. This could be due to genomic loss of the wild-type allele (LOH), mutations on the X-chromosome (one copy of which is inactivated), or even a gain-of-function mutation causing aberrant gene expression.

Bottom line, just over one-third of somatic mutations in the genome are present in the transcriptome. This has important implications for clinical cancer genome sequencing: just because a druggable mutation is present doesn’t mean it’s expressed.
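For readers who want to see where the 36% figure comes from, the arithmetic is just the sum of the two categories in which the mutant allele appears in RNA. The percentages are the ones quoted above; the dictionary keys are my own labels.

```python
# Percentages of validated somatic mutations, as quoted in the text.
fractions = {
    "gene not expressed": 23.0,
    "wild-type allele only": 40.56,
    "both alleles": 31.48,
    "mutant allele only": 5.0,
}

# The mutation is visible in the transcriptome only in the last two groups.
mutant_seen = fractions["both alleles"] + fractions["mutant allele only"]
total = sum(fractions.values())

print(round(mutant_seen, 2))  # 36.48
print(round(total, 2))        # 100.04 (rounding in the quoted figures)
```

Note that almost a quarter of the "missing" mutations fall in genes that are not expressed at all, which is why the headline number is a little deceptive.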

Continuous Distribution of Somatic Mutations

With ultra-deep targeted sequencing, it’s possible to estimate the allele frequency of a somatic mutation with high accuracy, and from that, to infer the relative proportion of tumor cells harboring that mutation. A heterozygous founder mutation, for example, would be present in virtually all tumor cells and have a mutation frequency of 50% in diploid cells. Perhaps surprisingly, the authors find that somatic mutations occur at a continuous distribution in TNBC, and this appears independent of copy number alterations and tumor cellularity.
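The inference from allele frequency to the proportion of tumor cells can be sketched as follows. This is a standard back-of-the-envelope relationship rather than the authors' statistical model; the function name and parameters are my own, and it assumes normal cells are diploid at the locus.

```python
def fraction_of_tumor_cells(vaf, purity=1.0, mutant_copies=1, total_copies=2):
    """Estimate the fraction of tumor cells carrying a mutation from its
    variant allele frequency (VAF).

    Assumes normal cells are diploid and that each mutated tumor cell
    carries `mutant_copies` of `total_copies` alleles at the locus.
    """
    # Average number of alleles per cell across tumor and normal cells.
    alleles_per_cell = purity * total_copies + (1.0 - purity) * 2
    fraction = vaf * alleles_per_cell / (purity * mutant_copies)
    return min(fraction, 1.0)  # sampling noise can push estimates above 1

# Heterozygous founder mutation in a pure diploid tumor: all cells.
print(fraction_of_tumor_cells(0.50))               # 1.0
# Lower-frequency mutation: a subclone in about 20% of tumor cells.
print(fraction_of_tumor_cells(0.10))               # 0.2
# Same founder mutation when only half the sample is tumor.
print(fraction_of_tumor_cells(0.25, purity=0.5))   # 1.0
```

The third case shows why tumor cellularity matters: a founder mutation in a sample that is half normal tissue appears at 25%, not 50%.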

Part of this observation may be technical in nature (i.e. false negatives in mutation discovery). However, this phenomenon has been noted in other epithelial cancers, suggesting that the mutation content of cells within a single tumor may be differently shaped by biological processes and mutational mechanisms. It reinforces the notion that tumors (and TNBC in particular) are not a homogeneous mass of identical cells, but a collection of distinct sub-populations of cells evolving somewhat independently of one another. This is probably why they’re sometimes difficult to eliminate: you might destroy most of the subpopulations with therapy, but one or more minor clones could persist.

References

Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, Turashvili G, Ding J, Tse K, Haffari G, Bashashati A, Prentice LM, Khattra J, Burleigh A, Yap D, Bernard V, McPherson A, Shumansky K, Crisan A, Giuliany R, Heravi-Moussavi A, Rosner J, Lai D, Birol I, Varhol R, Tam A, Dhalla N, Zeng T, Ma K, Chan SK, Griffith M, Moradian A, Cheng SW, Morin GB, Watson P, Gelmon K, Chia S, Chin SF, Curtis C, Rueda OM, Pharoah PD, Damaraju S, Mackey J, Hoon K, Harkins T, Tadigotla V, Sigaroudinia M, Gascard P, Tlsty T, Costello JF, Meyer IM, Eaves CJ, Wasserman WW, Jones S, Huntsman D, Hirst M, Caldas C, Marra MA, & Aparicio S (2012). The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature PMID: 22495314

Genetic Evolution of Secondary AML from MDS

March 14, 2012 by Dan Koboldt

Contents: Whole-genome Sequencing • Recurrently Mutated Genes • Clonal Evolution • References

Myelodysplastic syndromes (MDS) are a group of disorders of ineffective blood production and the most common cause of acquired bone marrow failure in adults. One-third of cases go on to develop secondary AML (sAML), yet there remains uncertainty among patients, insurers, and funding agencies about whether the myelodysplastic syndromes are actually cancers. A study online today at the New England Journal of Medicine has characterized the genetic evolution from MDS to sAML using whole-genome sequencing.

Whole-genome Sequencing of sAML

Matthew J. Walter and colleagues of the Washington University School of Medicine performed whole-genome sequencing of tumor samples and matched normal DNA from seven patients with secondary AML. For each subject, hundreds of somatic mutations were genotyped in sAML and MDS-stage samples to characterize the clonal architecture of each tumor. Figure 1A from the paper demonstrates the resolution that can be obtained from deep resequencing of somatic mutations in both sAML and MDS samples:

Notice the five clusters (differently colored) representing five clonal populations. In yellow (cluster 1) are mutations present in virtually all cells of both the MDS and the sAML sample. In orange (cluster 2) are mutations present at low frequency in MDS but enriched in sAML. Three more clusters (red, purple, and black) along the y-axis represent mutations that were absent in the MDS sample but acquired during the progression to sAML. The patterns of these mutations suggest that sAML evolved from a clonal population of MDS cells that acquired new mutations along the way.
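The logic of reading such a plot can be sketched as a rule on a mutation's allele frequency in the two samples. The paper's clusters came from a proper clustering analysis; the function below is only a crude threshold-based stand-in, and both cutoffs are assumptions picked for illustration.

```python
def classify_mutation(mds_vaf, saml_vaf, absent_below=0.02, founder_above=0.40):
    """Roughly label a somatic mutation by its variant allele frequency
    in the MDS-stage and sAML samples. Thresholds are hypothetical."""
    if mds_vaf < absent_below and saml_vaf >= absent_below:
        return "acquired during progression"
    if mds_vaf >= founder_above and saml_vaf >= founder_above:
        return "founder clone"
    if saml_vaf > mds_vaf:
        return "enriched in sAML"
    return "other"

# Made-up (MDS VAF, sAML VAF) pairs.
print(classify_mutation(0.48, 0.47))  # founder clone
print(classify_mutation(0.08, 0.42))  # enriched in sAML
print(classify_mutation(0.00, 0.30))  # acquired during progression
```

A heterozygous mutation present in nearly all cells of both samples sits near 50% on both axes, while mutations gained on the way to sAML hug the axis where the MDS frequency is zero.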

Identification of Recurrently Mutated Genes

In the very near future, it may become feasible and cost-effective to perform whole-genome sequencing (WGS) on hundreds or thousands of tumors of a certain type to exhaustively identify recurrently mutated genes. Until then, WGS of a discovery cohort followed by extension screening in a larger cohort offers a powerful and cost-effective strategy. Two genes were already recurrently mutated in the 7 WGS cases: RUNX1, a known myeloid tumor suppressor, and UMODL1, for which mutations were recently reported in multiple myeloma and ovarian cancer. The authors extended their findings via targeted screening for additional coding mutations in 200 AML cases. This enabled the identification of 9 more recurrently mutated genes, for a total of 11.

Recurrently Mutated Genes in MDS and sAML

Gene	Mutation(s)
CDH23	1235insL
NPM1	W288fs
PTPN11	G60R
RUNX1	G170fs; del21q22.11
SMC3	e8-1 splice
STAG2	H738fs
TP53	V272M
U2AF1	S34F
UMODL1	T533P; V882M
WT1	D436E
ZSWIM4	P18A

Notably, four of the genes (CDH23, SMC3, UMODL1, and ZSWIM4) had not been implicated in MDS or AML. A specific codon (34) in U2AF1 harbored missense mutations in multiple AML tumors, suggesting a gain-of-function for the splicing factor encoded by that gene. The recurrent mutations in STAG2, a gene located on the X-chromosome, were all protein truncation mutations (nonsense or frameshift) suggesting that a loss-of-function of this gene contributes to MDS and AML pathogenesis.

Clonal Evolution: from MDS to AML

By characterizing mutations from secondary AML tumors in the MDS precursors for the same patient, the authors reconstructed the clonal architecture of the disease from early to advanced stages. The findings are summarized in Figure 2A:

In all 7 cases, the results suggest a linear model of clonal evolution, in which progression from MDS to sAML was characterized by persistence of a single founder clone (defined by ~200-700 mutations) and the outgrowth of at least one new subclone which contained dozens or hundreds of additional mutations. In other words, a single population of MDS cells underwent multiple rounds of mutation and selection, giving rise to multiple subpopulations present in full-blown secondary AML.

Please go read this fascinating study at the New England Journal of Medicine.

References

Walter MJ, Shen D, Ding L, Shao J, Koboldt DC, Chen K, Larson DE, McLellan MD, Dooling D, Abbott R, Fulton R, Magrini V, Schmidt H, Kalicki-Veizer J, O’Laughlin M, Fan X, Grillot M, Witowski S, Heath S, Frater JL, Eades W, Tomasson M, Westervelt P, DiPersio JF, Link DC, Mardis ER, Ley TJ, Wilson RK, & Graubert TA (2012). Clonal architecture of secondary acute myeloid leukemia New England Journal of Medicine
