Functional Validation of Genomic Discoveries

July 12, 2013 by Dan Koboldt

Credit: Riken Research

Next-gen sequencing technologies have enabled rapid identification of many genes contributing to human disease. Rapid, inexpensive exome sequencing quickly gave us access to the low-hanging fruit: rare Mendelian disorders with single, highly penetrant coding mutations. Since 2009, we’ve seen an avalanche of reports of disease-causing mutations and novel disease genes. Family studies, case-control studies, and population cohorts are picking up this kind of signal everywhere.

The trouble, as anyone who’s analyzed this kind of data understands all too well, is that there are a lot of possibilities out there. You can take just about any gene from a sequencing study or GWAS and — with the assistance of a nice resource like Gene Cards — come up with a story that might connect mutations/variants in that gene to your phenotype of choice. But the burden of proof remains.

Functional Validation Required

It should now be obvious to most that publication of novel disease genes in top-tier journals requires more than just genetic or genomic data. It requires some kind of functional validation, an assay that demonstrates how genetic differences have a measurable phenotypic effect that makes sense for the disease. Genomic and statistical approaches are hypothesis generation tools. Those hypotheses, well-supported as they may be, must be tested in vitro or in vivo to see if they hold up. Because, as I said, you can spin a story about almost any gene.

Let’s say that you’ve identified a new possible cancer susceptibility gene, a candidate tumor-suppressor. You found it by looking for rare germline variants in a cohort of patients with a specific form of cancer. You’ve already done the genomics to establish that:

Rare variants in the gene are enriched among cases (maybe 5% of patients harbored rare deleterious variants in that gene, compared to 0.1% of 1000 Genomes or NHLBI-ESP populations).
In tumors, the gene is a target for biallelic inactivation by somatic mutation, deletion, LOH, or epigenetic silencing.
Expression of the gene is reduced or ablated in affected patients or tissues.
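
The enrichment in the first criterion is the kind of claim a one-sided Fisher's exact test can put a number on. Here is a minimal sketch, not a full burden-testing pipeline: the cohort sizes and carrier counts are hypothetical values chosen to match the 5% versus 0.1% example, and the exome-wide threshold assumes roughly 20,000 genes were tested.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where a is the number
    of variant carriers among cases.
    """
    cases = a + b
    carriers = a + c
    total = a + b + c + d
    denom = comb(total, carriers)
    upper = min(cases, carriers)
    tail = sum(
        comb(cases, k) * comb(total - cases, carriers - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Hypothetical counts (assumptions for illustration only).
case_carriers, case_noncarriers = 50, 950        # 5% of 1,000 patients
control_carriers, control_noncarriers = 7, 6493  # ~0.1% of 6,500 controls

odds_ratio = (case_carriers * control_noncarriers) / (
    case_noncarriers * control_carriers
)
p_value = fisher_one_sided(
    case_carriers, case_noncarriers, control_carriers, control_noncarriers
)

# Bonferroni correction, assuming ~20,000 protein-coding genes were tested.
N_GENES = 20000
bonferroni_threshold = 0.05 / N_GENES

print(f"odds ratio = {odds_ratio:.1f}")
print(f"one-sided p = {p_value:.3g}")
print(f"passes exome-wide threshold: {p_value < bonferroni_threshold}")
```

Even a p-value that clears the corrected threshold only says the association is unlikely to be chance. It says nothing about mechanism.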

Everything looks right, it sure looks like a tumor suppressor, but where’s the proof? With over 20,000 known protein-coding genes, widespread genetic variation, and the continual accumulation of mutations in somatic tissues, there are plenty of candidates that will meet these criteria by chance alone. Editors and reviewers of top-tier journals know this, and they want more. They want functional tests demonstrating that defects in your gene improve the growth, survival, proliferation, or metastatic potential of cells. They want a null mouse for your gene that’s prone to tumors. As much as it pains me to say it, the following statement is true.

Genomics is not enough.

Options for Functional Validation

This bitter medicine undoubtedly tastes sweet to the molecular biologists and bench scientists whose efforts may have been overshadowed by genomics in recent years. Because now, after all of our fancy high-throughput instruments, robust informatics and clever statistics have provided some answers, we have to leave the computer and head back to the laboratory. And many of us, including the author, have little to no experience there.

Even so, I’ll do my best to summarize some of the options for functional validation, and ask you readers to comment with the things I’ve gotten wrong or forgotten.

Molecular Assays

Functional validation by subcellular localization

Subcellular localization (Weiqiao, PNAS 1998)

Additional evidence can be garnered at the molecular level by showing that your gene functions

mRNA expression. Genome-wide (RNA-seq) or targeted (RT-PCR) mRNA expression assays provide insight about gene expression at the transcript level, including exon usage and alternative splicing.
Transcript/protein localization. It has been possible for some time to examine the tissue and/or intracellular location of a protein using specific dye-tagged antibodies, which may lend support to the idea that your gene of interest plays an important role at that location.
Protein-DNA interaction. New, high-throughput chromatin immunoprecipitation and sequencing (ChIP-seq) makes it possible to identify sequences bound by specific proteins. This can be used to evaluate the protein that does the binding (showing that a variant alters when/where/how it binds) or the target regulatory sequence (showing that variants affect binding of an important regulatory protein, such as a transcription factor).
Protein-Protein interaction. Another intriguing possibility for functional validation is showing that your suspect gene encodes a protein that interacts with a known key player in your disease pathway, such as BRCA1/2 for homologous DNA repair in breast and ovarian cancers.

Biological Assays

Morpholino Knockdown (Wikipedia)

Functional validation of a candidate disease gene can also be performed in living cells or organisms. Often this garners more compelling evidence of a gene’s importance, because it demonstrates the relationship between a genetic entity and phenotype visible at the cellular level or above. Some of the approaches here include:

Human cell lines. Gene knockdown (by siRNA or other methods) or transfection (introduction of foreign DNA or RNA into cells; when an engineered virus delivers the gene, the process is called transduction) in cell lines serves to demonstrate a gene’s importance for measurable cellular phenotypes, such as apoptosis, growth, proliferation, and contact inhibition.
Animal models. We are lucky enough to control the fates of lesser organisms, which means we can use reverse genetics techniques to alter their genomes and see what happens. The advantage here is that you get to study a gene’s effect on a complete organism, which more closely resembles what could be happening in humans. Mouse models are often the method of choice, though some other model organisms provide good experimental systems for certain phenotypes, such as morpholinos (antisense oligos) in zebrafish.
Human patients. This generally isn’t possible, but in some cases genetic information (i.e. specific tumor alterations) has been used to tailor treatment to individual patients, in which case the outcome of the treatment validates the genomic finding. Case in point: the use of whole-genome sequencing to diagnose a cryptic PML-RARA fusion. This approach obviously has many ethical and legal hurdles, and probably wouldn’t be approved for truly novel discoveries.

A Call to Reviewers

In closing, I would like to appeal to peer reviewers of those journals who now wish to see functional validation of genomic findings. Asking the authors to “provide some functional validation” of their findings may be a valid critique, but it’s not terribly helpful. It would be better to outline what kind of experiments you’d like to see to become convinced. Because the odds are, you’ll be reading this manuscript again at some point, and wouldn’t it be nice if they performed the validation that you were looking for?

In fairness, some of those who work in the field of next-gen sequencing, even to tackle genetic diseases, do not have knowledge of (or even access to) laboratory techniques that could functionally validate their findings. It would benefit the entire research community if we took a moment to outline potential avenues of functional validation so that we “dry lab” scientists can begin to explore them.

A comprehensive atlas of breast cancer genomes

September 24, 2012 by Dan Koboldt

Now online at Nature is the most comprehensive molecular portrait of human breast tumors published to date. The Cancer Genome Atlas study encompasses more than 500 primary tumors representing the four intrinsic subtypes of breast cancer: Luminal A, Luminal B, Her2-enriched, and Basal-like. Using a suite of high-throughput platforms, TCGA characterized various aspects of these tumors at the molecular level:

Somatic mutations and germline susceptibility variants (exome sequencing)
Genomic copy number (SNP and CGH arrays)
DNA methylation
microRNA expression (sequencing)
Gene expression (mRNA arrays)
Protein expression (reverse-phase protein arrays)

Many of the key insights in this study were enabled through integrative analysis across these platforms. The findings suggest a highly heterogeneous disease whose intrinsic subtype was apparent not only by gene expression, but by mutational and DNA copy number profiles as well. Only three genes (TP53, PIK3CA, and GATA3) were mutated in 10% or more of all tumors; grouping them by gene expression (intrinsic) subtype uncovered a number of subtype-specific events. New recurrently mutated genes, two novel protein expression subtypes, and a previously unrecognized connection to ovarian cancer are just a few of the tantalizing results. Let’s get started.

Primary Breast Tumors

There were 507 primary tumors in the main analysis set; these were the focus of most analyses in this study. The breakdown by intrinsic [gene expression] subtype is generally representative of the disease incidence as a whole:

Luminal A tumors are the most prevalent (44% in this study), followed by luminal B (24%). Both of these subtypes are estrogen receptor (ER) positive, with luminal A carrying the most favorable prognosis. Basal-like tumors compose 19% of the set; these overlap considerably with so-called triple negative breast cancers, which are clinically negative for estrogen receptor (ER), progesterone receptor (PR), and Her2 receptor (HER2). Another 11% are of the Her2-enriched subtype, a group whose prognosis has improved since the development of targeted therapies against Her2 (e.g. Herceptin). Finally, 2% of tumors were of the rare normal-like subtype, which couldn’t be comprehensively studied with so few samples.

Another way to look at breast cancer tumors is by histology type, which reflects the cell of origin:

Most often, breast cancer arises from the mammary ducts (ductal) or milk-producing lobules (lobular), though other less common histology types were represented as well.

Somatic Mutations in Breast Cancer

Figure 1 nicely encapsulates the characterization of breast cancer genomes as a whole and by intrinsic subtype.

Mutation heatmap (TCGA, Nature 2012)

The most frequently mutated gene overall was TP53 (37% of cases), followed closely by PIK3CA (36%). Mutually exclusive mutations in MAP3K1, MAP2K4, and GATA3 were prevalent in luminal A and luminal B subtypes, but almost absent from the others. In contrast, TP53 mutations were frequently observed in Her2 (72%) and Basal-like (80%) subtypes. Mutations in E-cadherin (CDH1) were frequent across subtypes but correlated almost perfectly with lobular tumors.

The observed mutation rate in coding regions seems to reflect the aggressiveness of breast cancer subtypes: lower in luminal A (0.84 mutations/Mbp) than luminal B (1.36 mutations/Mbp), higher in Her2-enriched (2.05 mutations/Mbp) and Basal-like (1.68). The number and extent of copy number alterations, too, followed this trend.
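
For readers new to the unit, mutations per megabase is just the number of somatic mutations divided by the amount of sequence in which mutations could have been called. A minimal sketch, with a hypothetical tumor (the counts below are made up for illustration and are not from the TCGA data):

```python
def mutations_per_mb(n_mutations, callable_bases):
    """Somatic mutation rate in mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_mutations / (callable_bases / 1e6)


# Hypothetical tumor: 42 coding mutations across 30 Mb of adequately
# covered coding sequence (both numbers are assumptions).
rate = mutations_per_mb(42, 30_000_000)
print(f"{rate:.2f} mutations/Mb")  # prints "1.40 mutations/Mb"
```

The denominator matters: using the full target size instead of the bases with enough coverage to call mutations will understate the rate.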

Analysis with the Mutation Significance In Cancer (MuSiC) package identified 35 significantly mutated genes (SMGs). Among these were nearly all genes previously implicated in breast cancer (PIK3CA, PTEN, AKT1, TP53, GATA3, CDH1, RB1, MLL3, MAP3K1 and CDKN1B) and a number of novel SMGs, including:

TBX3, which is mutated in ulnar-mammary syndrome and involved in mammary gland development
Ubiquitous transcription factors CTCF and FOXA1
RUNX1 and CBFB, both rearranged in acute myeloid leukaemia and interfering with haematopoietic differentiation
PIK3R1, in which mutations clustered in the PIK3CA interaction domain (similar to glioma and endometrial cancer).
Splicing factor SF3B1, previously reported in myelodysplastic syndromes and CLL
Protein tyrosine phosphatases PTPN22 and PTPRD

A statistically significant exclusion pattern among PIK3R1, PIK3CA, PTEN and AKT1 mutations (P = 0.025) reflects the well-known activation of PI3K signaling in breast cancer.
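
One generic way to test for an exclusion pattern like this is a permutation test: if mutations in a gene set avoid each other, they cover more tumors than they would if scattered at random. The sketch below is an assumption-laden illustration and not the method used in the TCGA paper. The gene labels and sample assignments are hypothetical.

```python
import random


def coverage(gene_to_samples):
    """Number of samples mutated in at least one gene of the set."""
    covered = set()
    for samples in gene_to_samples.values():
        covered |= samples
    return len(covered)


def exclusivity_p_value(gene_to_samples, n_samples, n_perm=1000, seed=0):
    """Permutation p-value for mutual exclusivity.

    Each gene keeps its mutation count but is reassigned to random samples.
    The p-value is the share of permutations whose coverage is at least as
    high as the observed coverage.
    """
    rng = random.Random(seed)
    observed = coverage(gene_to_samples)
    all_samples = range(n_samples)
    hits = 0
    for _ in range(n_perm):
        shuffled = {
            gene: set(rng.sample(all_samples, len(samples)))
            for gene, samples in gene_to_samples.items()
        }
        if coverage(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical cohort of 100 tumors with perfectly exclusive mutations.
mutations = {
    "GENE_A": set(range(0, 30)),
    "GENE_B": set(range(30, 50)),
    "GENE_C": set(range(50, 60)),
}
p = exclusivity_p_value(mutations, n_samples=100)
print(f"exclusivity p-value = {p:.4f}")
```

Holding each gene's mutation count fixed keeps frequently mutated genes from looking exclusive simply because they are common.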

Basal-Ovarian Connection

One of the most intriguing findings comes from a comparison of breast cancer subtypes to other common cancers. Figure 5 shows how key pathways are altered in luminal (left box) and basal (middle box) breast cancer and ovarian cancer (right box).


Luminal and Basal breast cancer compared with ovarian cancer (TCGA, Nature 2012)

Among these and other results integrating multiple platforms, it’s apparent that Basal subtype tumors more closely resemble ovarian tumors than other breast cancer subtypes. This raises the exciting possibility that therapies for ovarian cancer might benefit patients with Basal breast tumors, and vice-versa.

I won’t spoil any more of this outstanding paper, but recommend that you read it yourself online at Nature where it’s open access.

References

The Cancer Genome Atlas Network (2012). Comprehensive Molecular Portraits of Human Breast Tumors. Nature. DOI: 10.1038/nature11412

Aging, Mutations, and Leukemia

July 26, 2012 by Dan Koboldt

A study now online at Cell has revealed new insights about the normal processes of aging and mutation, and their role in development of acute myeloid leukemia (AML).

Mutation and Evolution in AML: Full Disclosure

Most tumors harbor numerous somatic mutations, but only a fraction are believed to contribute to cancer development and growth. In fact, most malignancies are thought to arise after a single initiating event, which may lead to genomic instability and thereby cause additional mutations. Through a process called clonal evolution, tumor cells acquire mutations and undergo natural selection for growth advantages.

Thus, many tumors (especially solid tumors) are not a uniform mass of identical cells, but a heterogeneous mixture of different cell subpopulations. They share the initiating mutation thanks to their common ancestor, the “founding clone.” But they might have developed hundreds or thousands of mutations on their own. That’s one of the challenges faced by those of us who sequence cancer genomes. Among dozens of somatic coding mutations, which ones are the initiators, and which are just along for the ride?

Why Sequence AML?

AML offers an opportunity to study these processes because genomic instability is rare, and it’s believed that there are only a few initiating mutations. In this study, our group performed whole-genome sequencing on 24 AML cases. Twelve of these were FAB type M3, where the initiating event, a PML-RARA fusion, is known. The other twelve were FAB type M1 with normal cytogenetics, where the initiating event is usually unclear.

Mutations in AML

The success of this study in characterizing mutations in AML comes from two important components:

Systematic validation by custom capture and deep sequencing of somatic SNVs, indels, and structural variation. This not only confirmed which predicted events were valid somatic mutations (an important step because these are rare), but provided deep read counts by which we could compute accurate allele frequencies and examine tumor clonality.
Extension of the analysis to additional AML cases (53 M1 and 31 M3), in which we screened for additional mutations in the 384 mutated genes. This provided a better picture of the prevalence of recurrent mutations in AML.
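
To make the link between deep read counts and clonality concrete, here is a minimal sketch. It assumes a heterozygous mutation in a diploid, copy-neutral region, so a mutation present in every tumor cell sits near 50% variant allele frequency (VAF) in a pure sample. The read counts are hypothetical and the function names are mine, not from the study.

```python
def variant_allele_frequency(variant_reads, total_reads):
    """Fraction of reads at a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def estimated_cell_fraction(vaf, tumor_purity=1.0):
    """Rough fraction of tumor cells carrying a heterozygous mutation.

    Assumes a diploid, copy-neutral site; capped at 1.0 to absorb
    sampling noise around 50% VAF.
    """
    return min(1.0, 2.0 * vaf / tumor_purity)


# Hypothetical deep-sequencing read counts: (variant reads, total reads).
sites = {
    "founding-clone mutation": (480, 1000),
    "subclonal mutation": (120, 1000),
}

for name, (variant_reads, total_reads) in sites.items():
    vaf = variant_allele_frequency(variant_reads, total_reads)
    fraction = estimated_cell_fraction(vaf)
    print(f"{name}: VAF = {vaf:.2f}, cell fraction = {fraction:.2f}")
```

Mutations whose VAFs cluster together likely belong to the same clone, which is why read depth matters: at 30x coverage the sampling noise would blur those clusters.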

Among the 108 genomes assessed, there were 23 genes harboring non-silent mutations in at least three independent tumors, suggesting an important role in tumor development or growth. We observed an average of 14.5 “tier 1” (coding) mutations per tumor in the 24 WGS cases; on average, ~3 of these affected recurrently mutated genes in M1 tumors, and ~2 affected recurrently mutated genes in M3 tumors. This fits nicely with the idea that M1 genomes harbor an initiating mutation, a “driver”, analogous to the known driver (PML-RARA) in M3 genomes.
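
The recurrence criterion above (non-silent mutations in at least three independent tumors) is simple to express in code. This is a minimal sketch with hypothetical mutation calls and placeholder gene names. Note that it counts tumors, not mutations, so two hits in the same tumor count once.

```python
from collections import defaultdict


def recurrently_mutated(calls, min_tumors=3):
    """Return {gene: n_tumors} for genes with non-silent mutations
    in at least min_tumors distinct tumors.

    calls is an iterable of (tumor_id, gene, is_silent) tuples.
    """
    tumors_by_gene = defaultdict(set)
    for tumor_id, gene, is_silent in calls:
        if not is_silent:
            tumors_by_gene[gene].add(tumor_id)
    return {
        gene: len(tumors)
        for gene, tumors in tumors_by_gene.items()
        if len(tumors) >= min_tumors
    }


# Hypothetical calls (assumptions for illustration only).
calls = [
    ("T1", "GENE_A", False), ("T2", "GENE_A", False), ("T3", "GENE_A", False),
    ("T1", "GENE_B", False), ("T1", "GENE_B", False), ("T2", "GENE_B", False),
    ("T1", "GENE_C", True), ("T2", "GENE_C", True), ("T3", "GENE_C", True),
]

recurrent = recurrently_mutated(calls)
print(recurrent)  # prints {'GENE_A': 3}
```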

Driver Mutations in M1 Leukemia

Because the driver in M3 genomes is known, any mutations we find in M3 genomes aren’t likely to be initiating events. FLT3 mutations, for example, often co-occurred with PML-RARA in M3 genomes and are already known to be cooperating (but not initiating) alterations. Therefore, we looked for genes recurrently mutated only in M1 genomes, and there were three:

NPM1, a classic mutation known to be pathogenic in AML
DNMT3A, encoding a DNA methyltransferase and found in the last couple of years to be recurrently mutated across various liquid tumors and even solid tumors
IDH1, encoding isocitrate dehydrogenase 1, a gene with a “mutational hotspot” at amino acid 132, first observed in glioblastoma, and since found to be recurrent in numerous tumors including AML.

Background Mutations in Hematopoietic Stem Cells

Welch et al, Cell 2012.

Now comes a second part of the study, something that sets it apart from other cancer sequencing efforts. We obtained human hematopoietic stem cells (the progenitors of blood cells) from 7 healthy volunteers of different ages (cord blood from newborns all the way to 70-year-olds). For each sample, single HSCs were isolated and grown into colonies of homogeneous cells to obtain enough DNA for exome sequencing. The exomes of each HSC colony were then compared to those of a matched blood normal, whose cells are produced by an estimated 1,000 operating HSCs. The idea is to determine, in healthy people, whether HSCs accumulate background mutations over time.

As it happens, yours truly had the task of analyzing HSC exomes and identifying somatic mutations. Here’s the thing about somatic mutation detection: the more mutations there are, the easier it gets. That’s why it’s possible to have a very high validation rate (>90%) for somatic mutations in solid tumors like lung cancer. There are just so many good, high-confidence mutations that the majority of them validate.

The HSC exomes, however, tended to have very few coding mutations (bad news for me, but good news for the healthy volunteers). Even with such tools as VarScan 2 and SomaticSniper, it was not an easy task. But we got it done, and the findings were striking: the number of coding mutations in HSC exomes directly correlated with the age of the volunteer. Cord blood cells from a newborn had virtually none, young people in that key disposable income marketing demographic had relatively few. Only over-the-hill volunteers in their 40’s and 50’s had appreciable numbers (5-9 mutations), and the 70-year-old had the most.
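
A correlation like this is usually summarized with a Pearson coefficient and a fitted slope (mutations gained per year). The sketch below shows the arithmetic only. The ages and mutation counts are hypothetical values that follow the general trend described above, not the measurements from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


# Hypothetical data for 7 volunteers (assumptions, not the study's values).
ages = [0, 25, 30, 45, 50, 55, 70]
coding_mutations = [0, 2, 3, 6, 7, 8, 11]

r = pearson_r(ages, coding_mutations)
slope = least_squares_slope(ages, coding_mutations)
print(f"r = {r:.2f}, slope = {slope:.2f} coding mutations per year")
```

With only seven donors, any such estimate carries wide uncertainty, so the trend matters more than the exact coefficient.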

Mutations Correlate with Age in HSCs and AML

So mutations in the hematopoietic stem cells of healthy volunteers seemed to accumulate over time. Intriguingly, when AML patients were segregated by age, the correlation was also apparent. In other words, the number of mutations in both AML tumors and the stem cells from which they arise is correlated with age. Not only that, they’re similar in number and distribution across the mutation spectrum:

Welch et al, Cell 2012

Background Mutation, Tumor Initiation, and Progression

This suggests that the majority of mutations in AML are random background mutations that occurred in HSCs. Then, an initiating mutation was acquired (e.g. NPM1 or PML-RARA). At that point, the background mutations in the cell were “captured” and carried forward as the progenitor gave rise to AML.

Welch et al, Cell 2012

Cells from the founding clone sometimes acquired additional cooperating mutations (e.g. FLT3), yielding subclones that may have contributed to progression or relapse.

These findings reinforce many things that we already know: that mutations accumulate gradually with age, that most of the mutations in AML (and likely other tumors) are random background events not contributing to tumorigenesis, and that subsequent mutation and evolution can give rise to subclones that ultimately determine cancer progression and response to therapy.

References
Welch JS, Ley TJ, Link DC, Miller CA, Larson DE, Koboldt DC, Wartman LD, Lamprecht TL, Liu F, Xia J, Kandoth C, Fulton RS, McLellan MD, Dooling DJ, Wallis JW, Chen K, Harris CC, Schmidt HK, Kalicki-Veizer JM, Lu C, Zhang Q, Lin L, O’Laughlin MD, McMichael JF, Delehaunty KD, Fulton LA, Magrini VJ, McGrath SD, Demeter RT, Vickery TL, Hundal J, Cook LL, Swift GW, Reed JP, Alldredge PA, Wylie TN, Walker JR, Watson MA, Heath SE, Shannon WD, Varghese N, Nagarajan R, Payton JE, Baty JD, Kulkarni S, Klco JM, Tomasson MH, Westervelt P, Walter MJ, Graubert TA, Dipersio JF, Ding L, Mardis ER, & Wilson RK (2012). The origin and evolution of mutations in acute myeloid leukemia. Cell, 150 (2), 264-78 PMID: 22817890

Breast Cancer Sequencing by Aromatase Inhibitor Response

June 29, 2012 by Dan Koboldt

This month, Matthew J. Ellis and colleagues reported the whole-genome and/or exome sequencing of 77 estrogen-receptor-positive (ER+) breast cancer patients enrolled in a clinical trial of aromatase inhibitors. Their findings provide new insights into the genetic mechanisms of AI resistance, and may help pave the way to personalized cancer treatment in breast cancer.

Breast Cancer and Aromatase Inhibitors

Many breast cancers are “estrogen sensitive”, meaning that they need the hormone to grow. These tumors usually over-express the estrogen receptor (ER). ER-positive tumors comprise two “intrinsic subtypes”: Luminal A, which generally has a favorable prognosis, and Luminal B, which generally has a worse prognosis. Although ER-positive tumors represent the majority of breast cancer cases, they exhibit a wide range of prognoses, histological growth patterns, and treatment outcomes. Because of their dependence on estrogen, these tumors are often treated by estrogen deprivation therapy.

The well-known breast cancer drug Tamoxifen, for example, blocks a tumor’s ability to use estrogen. Aromatase inhibitors (AIs), in contrast, lower the amount of estrogen in the body by preventing non-ovary tissues from producing it. Some ER-positive tumors respond to AI therapy, and some do not. We don’t yet understand why this is. A clinical trial of AI therapy response paired with next-gen sequencing offers the opportunity to profile the genetic alterations of AI-sensitive and AI-resistant tumors.

Next-Gen Sequencing in a Clinical Trial

Seventy-seven cases from two neoadjuvant aromatase inhibitor clinical trials underwent next-gen sequencing (46 whole-genome, 31 exome). Of these, 29 were AI-resistant and 48 were AI-sensitive. The authors examined interactions between tumor proliferation levels (Ki67), histological categories, intrinsic subtype, and somatic alterations in these two categories of tumors.

As an author, I’m admittedly biased, but the sequencing and analysis in this study were state-of-the-art. Tumor whole genomes were sequenced with Illumina paired-end reads to 30x haploid coverage, and the exomes reached 20x coverage over more than 80% of coding sequence (CDS). Mutations were detected using SomaticSniper, VarScan, GATK, and Pindel. SVs were detected with BreakDancer and SquareDancer. High-level analyses were performed with our PathScan and MuSiC packages, as well as GeneGo, MetaCore, and PARADIGM.

Candidate somatic mutations and SVs were experimentally validated by custom capture and deep resequencing. This not only enabled confirmation of the predicted events, but provided deep coverage to examine the clonal architecture of tumors. That’s critical when looking at therapy response. Further, the mutated genes were screened for recurrence in 233 additional breast cancer cases, for a total of 310 tumors surveyed.

Patterns of Somatic Alterations

Interaction Networks in 77 Breast Cancers (Ellis et al, Nature)

This study also represents one of the largest surveys of breast cancer by next-gen sequencing, certainly by whole-genome sequencing. So what was the mutational landscape of luminal-type breast cancer?

An overall mutation rate of 1.18 mutations per megabase (higher than AML, but lower than melanoma, liver, and lung cancers).
The most frequently mutated gene was PIK3CA (41.3% of cases)
8 significantly mutated genes that are known breast cancer genes: PIK3CA, TP53, GATA3, CDH1, RB1, MLL3, MAP3K1, and CDKN1B
9 cancer genes not previously observed in clinical breast cancer samples: TBX3, RUNX1, MYH9, STMN2, SF3B1, CBFB, LDLRAP1, AGTR2, and STMN2

One of the relatively new findings is the recurrence of mutations in MAP3K1, a serine-threonine kinase that activates the ERK and JNK kinase pathways. Thirteen tumors had two non-silent MAP3K1 mutations (biallelic loss), and most of the mutations are highly deleterious (nonsense, frameshift, etc.) suggesting that this gene may act as a tumor suppressor. Across 310 cases, some 15.5% had mutations in MAP3K1 or MAP2K4.

Correlating Mutations with Clinical Data

Somatic landscapes of AI responses (Ellis et al, Nature)

A number of fancy clinical-correlation and pathway analyses revealed some interesting patterns of somatic mutations between AI-sensitive and AI-resistant tumors:

TP53 mutations were significantly enriched in luminal B and high-grade tumors, and correlated with higher tumor proliferation (both at baseline and after AI therapy).
MAP3K1 mutations, in contrast, were significantly enriched in luminal A tumors and correlated with lower proliferation at baseline.
Several pathways were enriched in AI-resistant tumors, including TP53 signalling, DNA replication, and mismatch repair.
ESR1 and FOXA1 were among activated hubs in the entire cohort, while MYC, FOXM1, and MYB were activated in AI-resistant tumors.
GATA3 mutations were not associated with Ki67 levels, but did correlate with reduced Ki67 over therapy, suggesting it may be a positive predictive marker for response to aromatase inhibition.

In summary, this study sheds light on the somatic alteration landscape of ER-positive breast cancers and offers insight into some of the mechanisms of aromatase inhibitor resistance.

References
Ellis MJ, Ding L, Shen D, Luo J, Suman VJ, Wallis JW, Van Tine BA, Hoog J, Goiffon RJ, Goldstein TC, Ng S, Lin L, Crowder R, Snider J, Ballman K, Weber J, Chen K, Koboldt DC, Kandoth C, Schierding WS, McMichael JF, Miller CA, Lu C, Harris CC, McLellan MD, Wendl MC, DeSchryver K, Allred DC, Esserman L, Unzeitig G, Margenthaler J, Babiera GV, Marcom PK, Guenther JM, Leitch M, Hunt K, Olson J, Tao Y, Maher CA, Fulton LL, Fulton RS, Harrison M, Oberkfell B, Du F, Demeter R, Vickery TL, Elhammali A, Piwnica-Worms H, McDonald S, Watson M, Dooling DJ, Ota D, Chang LW, Bose R, Ley TJ, Piwnica-Worms D, Stuart JM, Wilson RK, & Mardis ER (2012). Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature, 486 (7403), 353-60 PMID: 22722193
