Driver Mutations and Metastasis

November 30, 2010 by Dan Koboldt

Two recent papers used very different approaches to shed light on the genetic alterations underlying tumor growth and progression in human cancers. Peter Campbell and colleagues from the Wellcome Trust Sanger Institute employed Illumina paired-end sequencing to survey the landscape of structural variation in metastatic pancreatic cancer. Ivana Bozic and colleagues from Harvard University took a different approach – they constructed mathematical models of tumor progression via the accumulation of driver and passenger mutations. I happened to read both papers on a long airplane ride, and learned a great deal about mutations and metastasis in human cancers.

Pancreatic Cancer: Bad News

You learn a lot from the introduction sections of these papers, even if the Letter to Nature format keeps them short. I knew that pancreatic cancer had, in general, a poor prognosis. It turns out that the five-year mortality for this cancer is 97-98%, usually due to “widespread metastatic disease.” These tumors also appear to carry a heavy mutational load. A 2008 survey of 24 pancreatic cancers (by Bert Vogelstein’s group at Johns Hopkins) found that tumors had ~63 genetic alterations on average, the majority of which were point mutations. Copy number changes are also common in this cancer type. Frequently mutated genes include tumor suppressors (TP53, SMAD4, CDKN2A) as well as oncogenes (KRAS, MYC). Less was known about the patterns of structural variation in pancreatic cancer.

Detecting Rearrangements by Paired-End Sequencing

Peter Campbell’s group has developed a very nice strategy for identifying somatically acquired rearrangements by massively parallel paired-end sequencing on the Illumina platform. They’ve already applied it to the characterization of SVs in several cancer cell lines. In this study, they generated 50-150 million read pairs (2 x 37 bp) per patient, which, in their experience, enables detection of 50-60% of rearrangements in a sample. Across the 13 pancreatic tumors, they identified 381 somatic and 177 germline rearrangements in seven categories: amplicon, deletion, tandem duplication, inversion, fold-back inversion, interchromosomal (translocation), and “other” intrachromosomal.
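For a sense of scale, here is a quick back-of-the-envelope calculation of the raw sequence yield implied by those read counts. It only multiplies the numbers quoted above (read pairs, two reads per pair, 37 bp per read); the function name is my own, and the result lines up with the ~4-10 Gbp per sample mentioned later in this post.

```python
READ_LENGTH_BP = 37   # 2 x 37 bp reads, as quoted in the post
READS_PER_PAIR = 2


def yield_gbp(read_pairs: int) -> float:
    """Total sequenced bases, in gigabase pairs, for a given number of read pairs."""
    return read_pairs * READS_PER_PAIR * READ_LENGTH_BP / 1e9


low = yield_gbp(50_000_000)    # lower end of the quoted range
high = yield_gbp(150_000_000)  # upper end of the quoted range
print(f"{low:.1f} to {high:.1f} Gbp per patient")  # prints "3.7 to 11.1 Gbp per patient"
```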

Many rearrangements corresponded with a change in copy number. In one metastasis, for example, numerous rearrangements (some inverted, some not) combine to amplify the KRAS oncogene.

Rearrangement/Amplification of KRAS (Credit: Nature).

Fold-back Inversions and Inter-Lesion Genetic Heterogeneity

One sixth of the rearrangements identified fell into a class the authors call “fold-back” inversions. These are genomic regions that are duplicated, but the two copies face in opposite directions from the breakpoint (as opposed to a tandem duplication). The authors suggest breakage-fusion-bridge cycles as the likely mechanism that creates such an event. Basically, a double-stranded break that occurs during G0-G1 phase is replicated (in S phase), creating two duplicated end sequences. These are fused together by DNA repair processes, resulting in a sort of inverted duplication (fold-back inversion) with two centromeres. These “dicentric” chromosomes are unstable, and frequently initiate the amplification of oncogenes.

Each rearrangement was [laboriously] genotyped by PCR in both the index tumor sample and matched normal control to verify the somatic status. Further, PCR and capillary sequencing were employed to resolve breakpoints, and some 206 rearrangements were genotyped across multiple lesions (metastases) in the 10 patients for whom metastatic samples were available. There was a considerable amount of genetic heterogeneity among samples from the same patient. While the majority of rearrangements were present in all samples but not the germline (omnipresent), several were present in some samples but not others (partially shared) or unique to the index tumor sample (private).
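The three labels boil down to a simple rule on which samples carry a given rearrangement. Here is a minimal sketch of that rule, assuming the rearrangement has already been confirmed as somatic (absent from the germline) and was detected in the index sample. The function and the sample names are hypothetical, not taken from the paper.

```python
def classify_rearrangement(present_in: set[str], all_samples: set[str], index_sample: str) -> str:
    """Label a somatic rearrangement by which of a patient's tumor samples carry it."""
    if present_in == all_samples:
        return "omnipresent"        # found in every sample from this patient
    if present_in == {index_sample}:
        return "private"            # found only in the index tumor sample
    return "partially shared"       # found in some samples but not others


# Hypothetical patient with an index lesion and two further metastases.
samples = {"index", "liver_met", "lung_met"}
print(classify_rearrangement({"index", "liver_met", "lung_met"}, samples, "index"))  # omnipresent
print(classify_rearrangement({"index", "liver_met"}, samples, "index"))              # partially shared
print(classify_rearrangement({"index"}, samples, "index"))                           # private
```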

Telomere Loss and Breakage-Fusion-Bridge Cycles

Fold-back inversions were significantly more likely than other classes of rearrangement to be omnipresent, suggesting that they occur early during tumor progression, before cancer cells disseminate. Because breakage-fusion-bridge cycles are often initiated by telomere loss, the activity of telomerase to maintain telomeres may play a pivotal role in the development of pancreatic cancer. Other studies have shown that telomerase expression is low in early tumor stages, but markedly increased in the invasive tumor. The increased expression likely suppresses breakage-fusion-bridge cycles, which may help explain why fold-back inversions are more likely to occur earlier in the development of the disease.

Ongoing Evolution in Tumors and Mets

In several patients, the authors found rearrangements that were in the primary tumor and some metastases, but not all of them. The most likely explanation for such a pattern is that the metastases were “seeded” by different cells from the primary tumor. This is intriguing, because it suggests ongoing clonal evolution, in the primary tumor, among cells capable of initiating metastases. There were also rearrangements in some metastases that weren’t detected in the primary tumor, suggesting that secondary lesions, too, are undergoing clonal evolution.

Overall, the authors demonstrated that pancreatic cancers and secondary invasions show a substantial amount of genetic heterogeneity within the same patient. There’s certainly more to be done to get the full picture of genetic alterations in these tumors, but at just ~4-10 Gbp of data per sample, the scope and nature of what the authors have uncovered is pretty impressive.

Drivers and Passengers

The other paper (contributed by Bert Vogelstein to PNAS) took a theoretical approach to modeling the accumulation of driver and passenger mutations during tumor progression. In contrast to previous models that account for only 1-2 mutations, the authors develop a model in which mutations occur sequentially in tumor cells, with each new driver mutation conferring a slightly faster growth rate. This more closely reflects recently-characterized solid tumors, which harbor 40-100 coding gene alterations, of which 5-15 are considered “driver” mutations.

Based on the assumption that any human cell contains 286 tumor suppressor genes and 91 oncogenes, the authors estimate that ~34,000 positions in the human genome could host a driver mutation. By this estimate, the driver mutation rate is approximately 3.4 x 10^-5 per cell division. Under the authors’ assumption that each driver speeds tumor growth, the rate at which drivers accumulate becomes faster and faster, because the more drivers a cell has, the faster it divides. Not every driver mutation takes hold, because a driver only reduces the probability that a cell will senesce or die; it doesn’t guarantee survival. The authors considered a mutation in a tumor suppressor gene to be the central rate-limiting factor, since the other working copy tends to be lost relatively quickly due to large-scale LOH events.

Six simulated patients were modeled and presented in this study. All of them started with one driver mutation. Strikingly, though all of the input values (mutation rate, division rate) were the same, there was enormous variation in the rates of tumor progression between simulated patients. Patient 1, for example, went 20 years before acquiring a second driver mutation, and the size of the tumor remained small (<5 g). In contrast, patient 6 had a secondary driver mutation in less than 5 years; by the end of the simulation, that tumor weighed hundreds of grams. While this model is undoubtedly an oversimplification, it does highlight the importance of, well, random chance. Given the large size of the human genome and the relatively small number of potential driver mutations, an individual’s fate hinges on stochastic processes. If you’re lucky, you go decades without picking up that crucial second hit. If you’re unlucky, you don’t.
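To get a feel for how identical inputs can produce very different waiting times, here is a toy branching-process sketch. It is not the authors’ model: the only number taken from the post is the driver rate of 3.4 x 10^-5 per division. The growth advantage, the form of the death probability, the cap on clone size, and all function names are assumptions I chose to keep the example short and fast. It also records only the first appearance of a second driver, not whether that new clone goes on to survive.

```python
import random

DRIVER_RATE = 3.4e-5   # driver mutations per cell division, as quoted in the post
SELECTION = 0.1        # toy growth advantage of the first driver (assumption)
MAX_CELLS = 20_000     # toy cap on clone size to keep the loop fast (assumption)


def wait_for_second_driver(rng: random.Random, max_generations: int = 2_000):
    """Return the generation in which a second driver first appears, or None
    if the clone dies out (or the generation limit is reached) first."""
    death_prob = (1.0 - SELECTION) / 2.0  # assumed form: slightly below 1/2
    cells = 1
    for generation in range(1, max_generations + 1):
        # Each cell either divides into two daughters or dies this generation.
        divisions = sum(1 for _ in range(cells) if rng.random() >= death_prob)
        if divisions == 0:
            return None
        # Probability that at least one of these divisions yields a new driver.
        p_hit = 1.0 - (1.0 - DRIVER_RATE) ** divisions
        if rng.random() < p_hit:
            return generation
        cells = min(2 * divisions, MAX_CELLS)
    return None


def simulate_patients(n_patients: int, seed: int) -> list[int]:
    """Waiting times, in generations, for clones that gain a second driver.
    Clones that die out first are discarded and a new one is started."""
    rng = random.Random(seed)
    waits = []
    while len(waits) < n_patients:
        wait = wait_for_second_driver(rng)
        if wait is not None:
            waits.append(wait)
    return waits


waits = simulate_patients(6, seed=2010)
print(sorted(waits))
```

Every simulated patient here uses exactly the same parameters, so any spread in the printed waiting times comes from chance alone, which is the point the paper’s six patients make.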

Intuitively, this seems reasonable, given the anecdotal evidence of de novo cancers, which seem to strike somewhat randomly. Of course, the older you are, the more times your cells divide, and the better chance you have of picking up additional driver mutations. And environmental exposures (like smoking and radiation exposure) certainly have a role to play, because they increase cellular mutation rates. Even so, if you believe in the model, chance plays a significant role.

Here’s to hoping you’re one of the lucky ones.

References

Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, Karchin R, Kinzler KW, Vogelstein B, & Nowak MA (2010). Accumulation of driver and passenger mutations during tumor progression. Proceedings of the National Academy of Sciences of the United States of America, 107 (43), 18545-50 PMID: 20876136

Campbell PJ, Yachida S, Mudie LJ, Stephens PJ, Pleasance ED, Stebbings LA, Morsberger LA, Latimer C, McLaren S, Lin ML, McBride DJ, Varela I, Nik-Zainal SA, Leroy C, Jia M, Menzies A, Butler AP, Teague JW, Griffin CA, Burton J, Swerdlow H, Quail MA, Stratton MR, Iacobuzio-Donahue C, & Futreal PA (2010). The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature, 467 (7319), 1109-13 PMID: 20981101

Genetics and Epigenetics of Leukemia

November 10, 2010 by Dan Koboldt

A study online at the New England Journal of Medicine reports that DNMT3A mutations in acute myeloid leukemia are common and associated with poor outcome for intermediate-risk patients. Previously, our group had characterized the genomes of two patients with cytogenetically normal AML (AML1 and AML2). The first genome (AML1) was initially sequenced with Illumina short reads (1×36 bp), revealing eight novel acquired (somatic) mutations but none that were recurrent. The second genome, AML2, harbored a recurrent mutation in the isocitrate dehydrogenase 1 gene (IDH1), which had recently been implicated in glioblastoma. Subsequent work has demonstrated that mutations in IDH1 and the related gene IDH2 are highly recurrent in AMLs with intermediate-risk karyotypes (20-30% frequency).

Resequencing the Relapse with Current NGS Technology

For this study, we resequenced the relapse tumor from patient AML1 using current Illumina technology (2×100 bp paired-end reads), achieving higher diploid coverage and enabling the identification of several novel nonsynonymous mutations. One of these was a 1-bp deletion in the DNA methyltransferase gene DNMT3A predicted to cause a frameshift resulting in a truncated protein.

DNMT3A/DNMT3A Complex with DNA

Resequencing showed that the deletion was present in the original tumor sample, and had probably been missed due to alignment difficulties for short reads. Screening the exons of DNMT3A in 281 additional AML tumors revealed that 61 (22.1%) also had DNMT3A mutations with translational consequences. The most common of these was a missense mutation at residue 882, found in 37 tumors.

Mutation, Methylation, and Disease

When we realized how common this mutation was, and considered that the gene involved is a DNA methyltransferase, I have to admit that a tantalizing picture emerged. In my mind, at least. Mutation could lead to aberrant methylation of the tumor genome. Demethylation unmasks oncogene expression. Hyper-methylation leads to genome-wide instability, causing more mutations that activate oncogenes and disable tumor suppressors. DNA methylation has long been suspected in cancer, but the relationship between mutation, methylation, and disease progression has not been definitively established. At last, it seemed like we would bridge that gap.

We performed a number of experiments to determine if DNMT3A mutation status affected mutation rate, genome-wide methylation, or gene expression. First, we examined the 38 AML tumors that had undergone whole-genome sequencing (WGS) to ~25x coverage. Eleven of these carried DNMT3A mutations. There was no apparent correlation, however, between DNMT3A mutation status and the number of high-confidence mutations called genome-wide. Next, we assessed gene expression in 188 AML tumors and matched (normal) controls on microarrays. DNMT3A was expressed in all 188 tumors and matched normals, regardless of mutation status. Unsupervised clustering of gene expression patterns did reveal distinct clusters, but none correlated with DNMT3A status. We further performed targeted cDNA resequencing in tumors with mutations, and confirmed expression of most mutant alleles at the expected 50% frequency (though some were not seen in any cDNA, probably due to nonsense-mediated decay).

So no effect on mutation, and no changes in gene expression. Hold your breath, and let’s look at methylation. MeDIP assays revealed 182 regions that were differentially methylated between DNMT3A-mutated and non-mutated tumors. All were hypomethylated in the mutated samples. But there was no consistent effect on the expression of nearby genes. And, sadly, there was no global effect of DNMT3A mutation on DNA methylation. We were 0 for 3. Last but not least, we turned to the clinical data.

Clinical Correlation: DNMT3A and Prognosis

When we stratified AML patients by risk (based on cytogenetics) and DNMT3A mutation status, some interesting patterns emerged. First, DNMT3A mutations were completely absent from the favorable-prognosis group. Mutations were enriched, however, among patients classified as “intermediate risk” – normal or unclear karyotypes. And the outcome for DNMT3A-mutated patients was significantly poorer. The adverse-outcome association was independent of age, although older patients with DNMT3A mutations had the worst outcomes of any group. And the association held true regardless of the presence of other commonly-mutated AML genes (NPM1, FLT3, IDH1/IDH2). Thus, DNMT3A mutation clearly contributes to AML pathogenesis, even if the mechanism by which it does so remains elusive. The fact that DNMT3A mutations are selected against in favorable-outcome patients suggests a true biological association.

A lot of work remains to be done. We still need to uncover the mechanistic effect of DNMT3A mutations that underlies the pathogenesis. But this work has furthered our understanding of AML, by identifying a highly recurrently mutated gene and providing a marker to help stratify patients of intermediate risk. As highlighted in a perspective by Shannon and Armstrong, clinical trials of DNA methyltransferase inhibitors in AML are already under way. It may not be long before genomic discoveries are translated into actionable information for the treatment of cancer patients.

Related Articles

Researchers discover key mutation in acute myeloid leukemia (NIH News)

Mutations in single gene predict poor outcomes in adult leukemia (WashU Record)

References
Ley, T., Ding, L., Walter, M., McLellan, M., Lamprecht, T., Larson, D., Kandoth, C., Payton, J., Baty, J., Welch, J., Harris, C., Lichti, C., Townsend, R., Fulton, R., Dooling, D., Koboldt, D., Schmidt, H., Zhang, Q., Osborne, J., Lin, L., O’Laughlin, M., McMichael, J., Delehaunty, K., McGrath, S., Fulton, L., Magrini, V., Vickery, T., Hundal, J., Cook, L., Conyers, J., Swift, G., Reed, J., Alldredge, P., Wylie, T., Walker, J., Kalicki, J., Watson, M., Heath, S., Shannon, W., Varghese, N., Nagarajan, R., Westervelt, P., Tomasson, M., Link, D., Graubert, T., DiPersio, J., Mardis, E., & Wilson, R. (2010). DNMT3A Mutations in Acute Myeloid Leukemia. New England Journal of Medicine DOI: 10.1056/NEJMoa1005143

Shannon, K., & Armstrong, S. (2010). Genetics, Epigenetics, and Leukemia. New England Journal of Medicine DOI: 10.1056/NEJMe1012071

The Fruits of a Thousand Genomes

November 1, 2010 by Dan Koboldt

Last week saw the publication of the 1,000 Genomes Project, which has characterized ~15 million SNPs, 1 million short insertions/deletions (indels), and 20,000 structural variants in seven human populations. This is discovery and genotyping at unprecedented scale, with an astonishing 4.9 terabases (trillion bases) sequenced – the equivalent of about 1,500 human genomes – across three pilot projects:

Deep whole-genome sequencing of trios (mother-father-daughter) from 2 populations
Low-coverage sequencing of 179 unrelated individuals from 4 populations
Exon sequencing of 906 randomly-selected genes in 697 individuals from 7 populations.

The three pilots have shed new light on sequence variation in human genomes and its distribution among human populations. Perhaps unsurprisingly, variation was not evenly distributed in the genome – certain regions (e.g. HLA and sub-telomeres) show high rates of variation, whereas others (e.g. a 5 Mbp, gene-dense, highly-conserved region on chromosome 3) show very little. At the chromosomal level, different forms of variation were highly correlated (e.g. SNPs and indels), but there were exceptions for some types of structural variants, implicating different mechanisms of mutation.

Novelty and Population-Specificity

The vast majority of SNPs detected were already known to dbSNP. Among known variants, 56% were present in all population panels while 25% were found in only a single panel. In contrast, only 4% of novel variants were found in all panels and 84% were found in only one. This difference supports the notion that the majority of common SNPs in human populations have already been found. There’s more work to do for other forms of variation, though. Many of the novel SVs were detected in all population panels. Half of the common short indels had never been reported.

The smallest two chromosomes – mitochondrial and Y – seemed to benefit the most. There was a lot of heteroplasmy in mitochondrial DNA within individuals – 79% of samples had length heteroplasmy, and 45% had substitution heteroplasmy. On the Y-chromosome, there were 2,870 variable sites, most of which (74%) were novel to public databases. These new variants helped identify several clear, significant sub-clades within the 12 haplotype groups represented in 1,000 Genomes samples.

Coding Regions and Loss-of-Function Variants

In total, the three pilots identified 68,300 non-synonymous variants, almost half of which were novel. Genotyping a subset of these in 620 samples revealed that novel nonsynonymous variants had a dramatically lower minor allele frequency (2.2%) than known ones (26.2%). From this I can draw two conclusions: most novel nonsynonymous variants are rare, and the majority could only have been identified by population-scale sequencing projects like these.
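As a reminder of what that statistic measures, here is a small sketch of how minor allele frequency is computed from genotype counts at a biallelic site. The genotype counts in the example are invented; I picked them only so that a 620-sample panel lands near the 2.2% figure quoted above.

```python
def minor_allele_frequency(hom_ref: int, het: int, hom_alt: int) -> float:
    """Frequency of the less common allele at a biallelic site, from genotype counts."""
    total_alleles = 2 * (hom_ref + het + hom_alt)  # diploid: two alleles per sample
    if total_alleles == 0:
        raise ValueError("no genotyped samples")
    alt_freq = (het + 2 * hom_alt) / total_alleles
    return min(alt_freq, 1.0 - alt_freq)


# Invented counts for a 620-sample panel: 594 + 25 + 1 = 620 samples.
maf = minor_allele_frequency(hom_ref=594, het=25, hom_alt=1)
print(f"MAF = {maf:.1%}")  # prints "MAF = 2.2%"
```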

The authors estimate that an individual genome differs from the reference at 10,000 to 11,000 nonsynonymous sites and perhaps 12,000 synonymous sites. A typical genome harbors a much smaller number of loss-of-function (LOF) variants — inframe/frameshift indels, early stops, and splice-site variants — perhaps 340-400 LOF variants per individual, affecting 250-300 genes. Compared to synonymous variants, putative functional variants (nonsynonymous and LOF) tend to have lower allele frequencies and be more population-specific, presumably due to the action of purifying selection against deleterious mutations. Which means, of course, that the really important variants are much harder to find.

Signatures of Natural Selection

Looking in and around genes, the authors found diversity is lowest in exons (50% that of introns) and slightly reduced in 5′ and 3′ UTRs, compared to intronic and intergenic sequences. This signature of natural selection acting upon genes actually has a broad effect; diversity is reduced by 10% in the vicinity of genes compared to gene-distant loci, and that reduction extends up to 85 kbp away. Thus, selection on linked sites appears to restrict variation across the majority of the human genome. Looking across panels, the authors observed that SNPs with large allele frequency differences between populations were enriched for nonsynonymous sites, likely reflecting local adaptation and selection by different continental groups.

Finally, the authors examined the trios to look at a different environment for mutation and selection – immortalized cell lines. Some 952/1001 new mutations in the CEU daughter and 634/669 new mutations in the YRI daughter were not present in the germline, indicating that they occurred either in somatic cells or in the cell lines. Further, the higher number of mutations in the CEU sample may be related to the age of the lines – the CEU line is decades older than the YRI line.

Implications for Future Studies

The findings of the 1,000 Genomes Project thus far have immediate, significant impact on genetic association studies. Using publicly available gene expression data and their expanded catalogue of variants, the authors identified 20-30% more significant expression quantitative trait loci (eQTLs) than had previously been detectable. Thus, it is clear that while existing SNP arrays represent the majority of common variation, a significant amount of rare, phenotypically-relevant variation remains to be incorporated.

References
1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-73 PMID: 20981092

Great Mutation. Is It Functional?

October 22, 2010 by Dan Koboldt

As promised, NGS instruments are yielding thousands of new genome sequences. Read lengths and throughputs are increasing. Alignment and analysis algorithms are getting more mature. Databases of sequence variants are growing exponentially. Things are looking pretty good, right? Sure, there are lots of variants still waiting to be discovered. Sure, some of those already reported simply aren’t real. But I think we’re rapidly approaching a point where finding the variants won’t be much of a problem.

Instead, we are facing two significant challenges. First, identifying the subset of variants that have functional significance – separating the wheat from the chaff, if you will. Second, understanding how these functional variants contribute to a phenotype. This is soon to be the frontier in genetics and genomics. It merits, I think, a discussion of some of the strategies that have been used to go beyond variant detection, to isolate disease-causing variants and assess their functional impact.

Strategy 1: Process of Elimination

This approach (to my knowledge) is best demonstrated in whole-genome, exome, or pooled sequencing of samples from individuals with rare inherited diseases. It’s essentially a filtering strategy where you start with a list of candidate variants and whittle it down using several criteria:

Pedigree information, to exclude variants that do not segregate with the disease in Mendelian disorders.
Control variants, usually identified in HapMap samples or other individuals not affected by the disease.
Gene structure information, which serves to eliminate synonymous or non-coding variants.
Evolutionary conservation, to prioritize variants in sequences that are conserved across species.

This strategy has worked well for a handful of rare, inherited diseases like Miller syndrome and severe hypercholesterolemia. There are, however, so many things that can go wrong. The pedigree or assumed mode of inheritance could be wrong. The causal variant might be synonymous or even noncoding (e.g. in a transcription factor binding site). The conservation trick in particular worries me. True, many of the known disease-causing mutations map to conserved amino acid residues, but certainly not all of them.
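In code, the process of elimination is little more than a chain of filters followed by a sort. Here is a minimal sketch under my own assumptions: the Variant fields, the list of coding consequences, and the 0-1 conservation score are all hypothetical stand-ins for whatever annotation a real pipeline provides. Note that conservation is used only to rank the survivors, not to discard them.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    segregates_with_disease: bool  # from pedigree analysis
    seen_in_controls: bool         # e.g. present in unaffected individuals
    consequence: str               # e.g. "missense", "synonymous", "intronic"
    conservation_score: float      # hypothetical 0-1 score, higher = more conserved


# Hypothetical set of consequences treated as protein-altering.
CODING_CHANGES = {"missense", "nonsense", "frameshift", "splice_site"}


def filter_candidates(variants: list[Variant]) -> list[Variant]:
    """Drop variants that fail the pedigree, control, or gene-structure filters,
    then rank the survivors by conservation (most conserved first)."""
    survivors = [
        v for v in variants
        if v.segregates_with_disease
        and not v.seen_in_controls
        and v.consequence in CODING_CHANGES
    ]
    return sorted(survivors, key=lambda v: v.conservation_score, reverse=True)


# Invented candidates for illustration.
candidates = [
    Variant("var1", True, False, "missense", 0.9),
    Variant("var2", True, True, "missense", 0.8),     # seen in controls
    Variant("var3", False, False, "nonsense", 0.7),   # does not segregate
    Variant("var4", True, False, "synonymous", 0.95), # not protein-altering
    Variant("var5", True, False, "frameshift", 0.4),
]
shortlist = filter_candidates(candidates)
print([v.name for v in shortlist])  # ['var1', 'var5']
```

Each filter encodes one of the assumptions that can go wrong: here var4 is thrown out for being synonymous, which is exactly the kind of call that would hide a causal synonymous variant.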

Strategy 2: Recurrence

This is a developing strategy to identify key mutations and pathway alterations in cancer genomes. Because tumors are genetically unique, and often possess thousands of acquired (somatic) mutations, pedigree analysis and control samples are less informative. Instead, we reason that passenger mutations should occur randomly, whereas mutations key to tumor development and progression are likely to be recurrent (i.e. found in other tumors of the same type). By this reasoning, the more important a mutation, the higher its rate of recurrence. TP53 mutations are a good example of this; in ovarian cancer, more than 80% of tumors carry a TP53 mutation. This is why databases like Sanger’s Catalogue of Somatic Mutations in Cancer (COSMIC) are such powerful tools. As these catalogues grow, having an available panel of additional tumors to screen for novel mutations may become less critical.
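Measuring recurrence is, at its simplest, a matter of counting the fraction of tumors in which each gene is mutated. Here is a toy sketch; the tumor IDs, the mutation calls, and the placeholder gene names (GENE_A and so on) are invented, and the function name is my own.

```python
from collections import Counter


def recurrence_rates(mutated_genes_by_tumor: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of tumors carrying at least one mutation in each gene,
    ordered from most to least recurrent."""
    counts = Counter(
        gene for genes in mutated_genes_by_tumor.values() for gene in genes
    )
    n_tumors = len(mutated_genes_by_tumor)
    return {gene: count / n_tumors for gene, count in counts.most_common()}


# Invented calls for five tumors of the same type.
tumors = {
    "tumor_1": {"TP53", "GENE_A"},
    "tumor_2": {"TP53"},
    "tumor_3": {"TP53", "GENE_B"},
    "tumor_4": {"TP53"},
    "tumor_5": {"GENE_C"},
}
rates = recurrence_rates(tumors)
print(rates["TP53"])  # 0.8
```

A real analysis would also need to account for gene length and background mutation rate, since long genes pick up more passenger mutations by chance; this sketch ignores both.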

Strategy 3: Computational Evaluation

A growing suite of tools and annotation databases enable computational assessments of putative variants to predict their effect in vivo. SIFT and PolyPhen are well-known examples of these. The UCSC Genome Browser Database contains dozens of genome-wide annotation datasets (both computational and experimental); many of these are presumed-regulatory regions that form the basis for our “Tier 2” classification (non-coding conserved/regulatory variants). There are also motif-scoring algorithms that evaluate a mutation’s effect on the binding affinity of transcription or splicing factors. These types of inferences are both interesting and helpful when assessing a mutation’s functional effect. They’re not convincing, however, without supporting experimental evidence.

Strategy 4: Molecular Validation

This may be the most difficult strategy, but potentially the most informative one. A myriad of experimental techniques can be applied to assess a mutation’s functional effect in vivo or in vitro. For coding mutations, the first thing we typically assess is mRNA expression (by RT-PCR or RNA-Seq), to determine (1) if the affected gene is expressed in the tissue of interest (e.g. the retina for studies of retinitis pigmentosa) and (2) whether the mutant allele affects it. Many known disease-causing mutations ablate expression of the mature mRNA, because they introduce splicing defects, mRNA instability, or other effects. A number of other molecular biology tools can also be applied:

Western blot, to determine protein expression
Enzyme activity assays, such as the complex I rescue technique that has been applied to characterize mutations in patients with complex I deficiency (see my last post).
Recombinant DNA techniques, such as a luciferase assay to assess mutations in gene promoters
Colony growth assays, especially for somatic mutations, to determine if mutations confer a growth advantage or invasion potential.

Specialized Sequencing Techniques

A number of recently-developed applications of massively parallel sequencing can be used to assess the functional impact of candidate mutations. RNA-Seq can detect allele-specific expression and alternative splicing. ChIP-Seq can assess protein-DNA interactions and theoretically detect allele-specific DNA binding. Methyl-Seq can be used to profile DNA methylation, either at specific loci or (for methylation pathway mutations) genome-wide. MiRNA-Seq and HITS-CLIP, techniques that measure microRNA expression or isolate miRNA-transcript interactions, also have potential for characterizing mutation effects. Many of these high-throughput techniques stand poised to supplant their traditional experimental counterparts.
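As one concrete example, allele-specific expression at a heterozygous site can be screened by asking whether the mutant allele’s share of RNA-Seq reads departs from the 50% expected when both alleles are expressed equally. Here is a minimal sketch using an exact two-sided binomial test. The choice of test is mine, not something taken from a specific pipeline, the read counts are invented, and the sketch ignores real-world complications such as reference mapping bias and overdispersion.

```python
from math import comb


def binomial_two_sided_p(alt_reads: int, total_reads: int, expected: float = 0.5) -> float:
    """Exact two-sided binomial p-value: the total probability of all outcomes
    that are no more likely than the observed alt-read count."""
    def pmf(k: int) -> float:
        return comb(total_reads, k) * expected**k * (1.0 - expected) ** (total_reads - k)

    observed = pmf(alt_reads)
    # Small relative tolerance so ties survive floating-point rounding.
    tail = sum(
        pmf(k) for k in range(total_reads + 1) if pmf(k) <= observed * (1 + 1e-9)
    )
    return min(1.0, tail)


# Invented read counts at two heterozygous sites, 100 reads each.
balanced = binomial_two_sided_p(alt_reads=50, total_reads=100)
skewed = binomial_two_sided_p(alt_reads=10, total_reads=100)
print(f"balanced site p = {balanced:.3f}, skewed site p = {skewed:.2e}")
```

A tiny p-value at the skewed site would flag it for follow-up (nonsense-mediated decay is one possible explanation), but it is a screen, not proof of a functional effect.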

Given the wide array of experimental tools, it’s disappointing when reports of new (possible) disease-causing mutations lack sufficient functional validation. I find myself unconvinced when the answer is supported by “it segregates with the disease” or worse, “we filtered everything else.” So when I read new papers that claim to have identified disease-causing variants, my answer is this: Great mutation. Is it functional?
