Great Mutation. Is It Functional?

October 22, 2010 by Dan Koboldt

As promised, NGS instruments are yielding thousands of new genome sequences. Read lengths and throughputs are increasing. Alignment and analysis algorithms are getting more mature. Databases of sequence variants are growing exponentially. Things are looking pretty good, right? Sure, there are lots of variants still waiting to be discovered. Sure, some of those already reported simply aren’t real. But I think we’re rapidly approaching a point where finding the variants won’t be much of a problem.

Instead, we are facing two significant challenges. First, identifying the subset of variants that have functional significance – separating the wheat from the chaff, if you will. Second, understanding how these functional variants contribute to a phenotype. This is soon to be the frontier in genetics and genomics. It merits, I think, a discussion of some of the strategies that have been used to go beyond variant detection, to isolate disease-causing variants and assess their functional impact.

Strategy 1: Process of Elimination

This approach (to my knowledge) is best demonstrated in whole-genome, exome, or pooled sequencing of samples from individuals with rare inherited diseases. It’s essentially a filtering strategy where you start with a list of candidate variants and whittle it down using several criteria:

Pedigree information, especially to eliminate variants that do not segregate with the disease in Mendelian disorders.
Control variants, usually identified in HapMap samples or other individuals not affected by the disease.
Gene structure information, which serves to eliminate synonymous or non-coding variants.
Evolutionary conservation, to prioritize variants in sequences that are conserved across species.
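The four criteria above can be sketched as a simple filter. This is only an illustration of the idea, not any published pipeline: the Variant fields, the 0-to-1 conservation score, the 0.8 cutoff, and the gene names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    effect: str             # e.g. "missense", "nonsense", "synonymous", "noncoding"
    segregates: bool        # consistent with the pedigree and assumed inheritance mode
    seen_in_controls: bool  # observed in HapMap or other unaffected individuals
    conservation: float     # hypothetical score from 0 (variable) to 1 (conserved)


def filter_candidates(variants, min_conservation=0.8):
    """Keep only variants that survive all four elimination criteria."""
    kept = []
    for v in variants:
        if not v.segregates:                         # pedigree information
            continue
        if v.seen_in_controls:                       # control variants
            continue
        if v.effect in {"synonymous", "noncoding"}:  # gene structure
            continue
        if v.conservation < min_conservation:        # evolutionary conservation
            continue
        kept.append(v)
    return kept


# Toy candidate list; every value here is made up for illustration.
candidates = [
    Variant("GENE_A", "missense", True, False, 0.95),
    Variant("GENE_B", "synonymous", True, False, 0.99),
    Variant("GENE_C", "missense", True, True, 0.90),
    Variant("GENE_D", "nonsense", False, False, 0.97),
    Variant("GENE_E", "missense", True, False, 0.40),
]
survivors = filter_candidates(candidates)
print([v.gene for v in survivors])
```

Each `continue` is a hard cut, so a true causal variant that fails any single criterion is lost for good.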

This strategy has worked well for a handful of rare, inherited diseases like Miller syndrome and severe hypercholesterolemia. There are, however, so many things that can go wrong. The pedigree or assumed mode of inheritance could be wrong. The causal variant might be synonymous or even noncoding (e.g. in a transcription factor binding site). The conservation trick in particular worries me. True, many of the known disease-causing mutations map to conserved amino acid residues, but certainly not all of them.

Strategy 2: Recurrence

This is a developing strategy to identify key mutations and pathway alterations in cancer genomes. Because tumors are genetically unique, and often possess thousands of acquired (somatic) mutations, pedigree analysis and control samples are less informative. Instead, we reason that while passenger mutations should occur randomly, mutations key to tumor development and progression are likely to be recurrent (i.e. found in other tumors of the same type). By this reasoning, the more important a mutation, the higher its rate of recurrence. TP53 mutations are a good example of this; in ovarian cancer, more than 80% of tumors carry a TP53 mutation. This is why databases like Sanger’s Catalogue of Somatic Mutations in Cancer (COSMIC) are such powerful tools. As these catalogues grow, having an available panel of additional tumors to screen for novel mutations may become less critical.
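The recurrence argument boils down to counting, for each gene, the fraction of tumors that carry at least one mutation in it. Here is a minimal sketch. The tumor IDs and most gene names are invented, and a real analysis would also need to correct for gene length and background mutation rate, which this toy example ignores.

```python
from collections import Counter


def recurrence_by_gene(tumor_mutations):
    """Map each gene to the fraction of tumors with at least one mutation in it.

    tumor_mutations: dict of tumor ID -> iterable of mutated gene names.
    """
    n_tumors = len(tumor_mutations)
    counts = Counter()
    for genes in tumor_mutations.values():
        counts.update(set(genes))  # count each gene once per tumor
    return {gene: count / n_tumors for gene, count in counts.items()}


# Toy cohort; all data below are hypothetical.
toy_cohort = {
    "T1": ["TP53", "GENE_X"],
    "T2": ["TP53", "TP53", "GENE_Y"],
    "T3": ["TP53"],
    "T4": ["GENE_Z"],
    "T5": ["TP53", "GENE_X"],
}
rates = recurrence_by_gene(toy_cohort)

# Rank genes from most to least recurrent (ties broken alphabetically).
ranked = sorted(rates.items(), key=lambda item: (-item[1], item[0]))
for gene, rate in ranked:
    print(f"{gene}: mutated in {rate:.0%} of tumors")
```

Collapsing each tumor's list into a set matters: two hits in the same gene in one tumor should not count as recurrence across tumors.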

Strategy 3: Computational Evaluation

A growing suite of tools and annotation databases enables computational assessment of putative variants to predict their effect in vivo. SIFT and Polyphen are well-known examples of these. The UCSC Genome Browser Database contains dozens of genome-wide annotation datasets (both computational and experimental); many of these are presumed-regulatory regions that form the basis for our “Tier 2” classification (non-coding conserved/regulatory variants). There are also motif-scoring algorithms that evaluate a mutation’s effect on the binding affinity of transcription or splicing factors. These types of inferences are both interesting and helpful when assessing a mutation’s functional effect. They’re not convincing, however, without supporting experimental evidence.

Strategy 4: Molecular Validation

This may be the most difficult strategy, but potentially the most informative one. A myriad of experimental techniques can be applied to assess a mutation’s functional effect in vivo or in vitro. For coding mutations, the first thing we typically assess is mRNA expression (by RT-PCR or RNA-Seq), to determine (1) if the affected gene is expressed in the tissue of interest (e.g. the retina for studies of retinitis pigmentosa) and (2) whether the mutant allele affects it. Many known disease-causing mutations ablate expression of the mature mRNA, because they introduce splicing defects, mRNA instability, or other effects. A number of other molecular biology tools can also be applied:

Western blot, to determine protein expression
Enzyme activity assays, such as the complex I rescue technique that has been applied to characterize mutations in patients with complex I deficiency (see my last post).
Recombinant DNA techniques, such as a luciferase assay to assess mutations in gene promoters
Colony growth assays, especially for somatic mutations, to determine if mutations confer a growth advantage or invasion potential.

Specialized Sequencing Techniques

A number of recently-developed applications of massively parallel sequencing can be used to assess the functional impact of candidate mutations. RNA-Seq can detect allele-specific expression and alternative splicing. ChIP-Seq can assess protein-DNA interactions and theoretically detect allele-specific DNA binding. Methyl-Seq can be used to profile DNA methylation, either at specific loci or (for methylation pathway mutations) genome-wide. MiRNA-Seq and HITS-CLIP, techniques that measure microRNA expression or isolate miRNA-transcript interactions, also have potential for characterizing mutation effects. Many of these high-throughput techniques stand poised to supplant their traditional experimental counterparts.

Given the wide array of experimental tools, it’s disappointing when reports of new (possible) disease-causing mutations lack sufficient functional validation. I find myself unconvinced when the answer is supported by “it segregates with the disease” or worse, “we filtered everything else.” So when I read new papers that claim to have identified disease-causing variants, my answer is this: Great mutation. Is it functional?

Mutation Detection in Rare Disease by Pooled Sequencing

October 13, 2010 by Dan Koboldt

When it comes to massively parallel sequencing, few areas of human health stand to benefit as much as rare genetic diseases. Indeed, both whole-genome and exome sequencing strategies have identified disease-causing mutations in probands with Charcot-Marie-Tooth disease, Miller syndrome, severe brain malformations, and a few other disorders. The Mito10K project took a different approach. They assembled a cohort of mostly unrelated individuals with complex I deficiency (n=103), the most common cause of human respiratory chain diseases.

Mitochondrial Electron Transport Chain (Wikipedia)

Forty-two HapMap samples were included as controls. Instead of employing a whole-genome or exome strategy, they performed deep resequencing of carefully-chosen candidate genes in pools of ~20 samples. And they did it all using a single Illumina flowcell.

Pooled Sequencing of Candidate Genes

The candidates included 103 genes that (i) encoded known complex I proteins, (ii) were implicated in the disease, or (iii) were identified by phylogenetic profiling. The 145 kb target space comprised 653 exons from nuclear genes (138 kb) and two mtDNA regions (7 kb). About 90% of target regions achieved at least 100x coverage; the median redundancy was 3,359x per pool, which works out to ~168x per individual. Next, the authors developed a method (“Syzygy”) to model sequencing error and call variants at very low frequencies. A comparison of calls for the HapMap samples to existing genotype data suggested 92% sensitivity and 99.6% specificity, at sites where coverage was 100x or greater.
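To see why such deep coverage is needed, it helps to work through the pooling arithmetic. The sketch below assumes exactly 20 samples per pool (the post says ~20), equal representation of every sample, and diploid nuclear DNA; none of these is guaranteed in practice.

```python
# Figures from the post.
pool_depth = 3359      # median redundancy per pool
samples_per_pool = 20  # assumption: exactly 20, equally represented

# Average depth attributable to each individual.
per_individual_depth = pool_depth / samples_per_pool

# A variant carried by one heterozygous individual sits on 1 of the
# 2 * 20 = 40 nuclear chromosomes in the pool.
chromosomes_in_pool = 2 * samples_per_pool
singleton_allele_fraction = 1 / chromosomes_in_pool

# Expected number of reads carrying that singleton allele.
expected_variant_reads = pool_depth * singleton_allele_fraction

print(f"Depth per individual: ~{per_individual_depth:.0f}x")
print(f"Singleton allele fraction: {singleton_allele_fraction:.1%}")
print(f"Expected variant-supporting reads: ~{expected_variant_reads:.0f}")
```

Under these assumptions a singleton heterozygous variant appears in only about 2.5% of reads, which is why the variant caller has to model sequencing error carefully.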

Although the pooling strategy worked well for nuclear DNA, there were some problems with the targeted regions in mtDNA. Basically, the distribution of mtDNA was not uniform between samples. That may be due to the fact that while each cell contains exactly 2 copies of each nuclear chromosome, it contains numerous mitochondria and thus numerous copies of the MT chromosome (possibly 20-25 per cell, by one estimate). The resulting shift in sample representation can be quite dramatic. In one pool, for example, 96% of the mtDNA came from a single individual (5% of the pool). The bottom line is that sensitivity to call mutations in pooled samples is going to be lower for mtDNA.

Variant Calling and “Deleteriousness” Prioritization

The unfortunately-named Syzygy method identified 652 variants (high confidence); to boost sensitivity, the authors also employed an ad-hoc approach that called 246 more variants supported by at least 3 reads on each strand (low confidence). The 898 calls were filtered to prioritize variants that seemed likely to underlie a rare and devastating phenotype. In short, the authors removed:

Variants present in healthy individuals (HapMap controls) or public databases (dbSNP, mtDB, 1000 Genomes).
Synonymous or noncoding variants, unless they affected tRNA or splice sites.
Missense variants at positions of low evolutionary conservation

Of 898 detected variants, 216 remained and were validated by multiplexed Sequenom genotyping. Some 82 sites were also Sanger-sequenced to assess the accuracy of the genotyping platform. The comparison revealed 11% false positives and 2% het/hom miscalls, for an overall error rate of 13% for Sequenom assays. Ouch. As for the variant calls, the validation rate was pretty good for high-confidence calls (91/109, or about 83%) but rather abysmal for the low-confidence ones (12/107, or 11%). Intriguingly, validation assays identified 12 additional pathogenic variants that were missed by the discovery screen. Based on these data, the sensitivity of the Syzygy method alone was 79.1% (91/115). That’s not bad, but probably not enough for a study whose goal is to identify rare disease-causing variants.
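The fractions quoted above are easy to recompute. One caveat: the post does not spell out where the denominator of 115 comes from. Reading it as 91 high-confidence + 12 low-confidence + 12 missed variants is my assumption, although the numbers do add up.

```python
def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


# Counts taken from the post.
high_conf_validated, high_conf_tested = 91, 109
low_conf_validated, low_conf_tested = 12, 107
missed_by_discovery = 12

# Assumption: the 115 true variants are the sum of these three groups.
total_true = high_conf_validated + low_conf_validated + missed_by_discovery

high_conf_rate = pct(high_conf_validated, high_conf_tested)
low_conf_rate = pct(low_conf_validated, low_conf_tested)
syzygy_sensitivity = pct(high_conf_validated, total_true)

print(f"High-confidence validation rate: {high_conf_rate}%")
print(f"Low-confidence validation rate:  {low_conf_rate}%")
print(f"Sensitivity of Syzygy alone:     {syzygy_sensitivity}%")
```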

New Diagnoses from Validated Mutations

Some 60 of the sequenced cases lacked a previous molecular-genetic diagnosis. Among these, the authors were able to provide 11 new diagnoses based on mutations in known disease-causing genes. Several lines of evidence supported the diagnoses:

6 patients had mutations that were previously known to be disease-causing.
3 patients were homozygous for deleterious mutations that caused splicing defects (observed in cDNA) and no detectable protein (by SDS-PAGE and protein blot).
2 patients had mutations in highly conserved protein domains.

Intriguingly, half of the cases with known mutations (3/6) were compound heterozygotes; that is, they inherited a different defect in the same gene from mother and father. This apparent prevalence of compound hets in monogenic disease is unsettling because they tend to make pedigree analysis complicated and require detection of both variants in heterozygous form, which is more difficult to do by sequencing.

Detection and Characterization of Novel Disease Genes

The key finding of this paper (as suggested by the title) was the implication of two new genes in complex I deficiency: NUBPL and FOXRED1. Pathogenicity of each mutated gene was confirmed by a “rescue” assay in which introduction of wild-type cDNA into patient fibroblasts restored complex I activity. In the absence of rescue, residual complex I activity was markedly reduced (19-40%) in the NUBPL-mutated fibroblasts and strikingly reduced (9-15%) in the FOXRED1-mutated fibroblasts.

The case with NUBPL mutations was particularly interesting. RT-PCR showed that the dominant mRNA species was truncated, and the full-length transcript was hardly expressed at all. Sequencing revealed that the shortened fragment had a branch site mutation that likely caused exon 10 skipping, as well as a missense mutation (Gly56Arg), both on the paternal chromosome. The maternal allele wasn’t expressed. Array-based copy number analysis, however, showed that the maternal chromosome had a complex rearrangement of NUBPL in which exons 1-4 were deleted and exon 7 was duplicated. Obviously this structural variation was not detected in the discovery screen. I think this highlights two things: the importance of structural variation in human disease, and the limitations of targeted sequencing on NGS platforms.

Success and Limitations

As the authors note in their discussion, key to the success of this study was the availability of cellular models of disease, with which the pathogenicity of newly discovered mutations in individual patients could be established. With the two new findings, the 11 newly diagnosed cases, and the 40 or so already-diagnosed cases, the authors now have identified the genetic defect for about half of the cases in their cohort. What about the rest? The authors admit that the causal mutations were likely missed because:

They occur in genes not targeted in this study
They affect targeted genes, but reside in noncoding regulatory regions or novel/unknown exons
They were targeted, but not detected due to limited sensitivity (especially in mtDNA)
They were detected, but filtered out as not likely to be deleterious
They are large-scale deletions or rearrangements, which this approach can’t detect

Despite these limitations, the authors have demonstrated that sequencing carefully-chosen candidate genes in pooled samples, with follow-up validation and experimental support, can successfully identify disease-causing mutations in a good-sized patient cohort. Not bad for a single flowcell.

References

Calvo, S., Tucker, E., Compton, A., Kirby, D., Crawford, G., Burtt, N., Rivas, M., Guiducci, C., Bruno, D., Goldberger, O., Redman, M., Wiltshire, E., Wilson, C., Altshuler, D., Gabriel, S., Daly, M., Thorburn, D., & Mootha, V. (2010). High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nature Genetics, 42 (10), 851-858 DOI: 10.1038/ng.659

Ng SB, Buckingham KJ, Lee C, et al (2010). Exome sequencing identifies the cause of a mendelian disorder. Nature genetics, 42 (1), 30-5 PMID: 19915526

Bilgüvar K, Oztürk AK, Louvi A, et al (2010). Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature, 467 (7312), 207-10 PMID: 20729831

Lupski JR, Reid JG, Gonzaga-Jauregui C, et al (2010). Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. The New England journal of medicine, 362 (13), 1181-91 PMID: 20220177

Lalonde E, Albrecht S, Ha KC, et al (2010). Unexpected allelic heterogeneity and spectrum of mutations in Fowler syndrome revealed by next-generation exome sequencing. Human mutation, 31 (8), 918-23 PMID: 20518025

CSHL 2010: Genomes Get Personal

September 22, 2010 by Dan Koboldt

Last week I attended the third annual “Personal Genomes” meeting at Cold Spring Harbor. The meeting opened with a keynote talk by NHGRI director Eric Green, who reminded us that finding the pathway to genomic medicine is the central mission of NHGRI. He mentioned several of the past successful initiatives that have yielded key findings concerning human genetic variation and its relationship to phenotype: the HapMap Project (common variation), the ENCODE Project (functional variation), and the 1000 Genomes Project (rare variation), to name a few. He showed the absolutely stunning growth of the NHGRI-hosted genome-wide association study (GWAS) catalog, which currently holds ~2,600 associations from 780 publications.

Dr. Green also discussed the dichotomy of genetic architecture underlying human diseases, and took the position that while we’ve made substantial progress studying rare, monogenic, mendelian disorders (predominantly caused by coding mutations), we face a more daunting task with common, complex, multigenic diseases because he believes that these arise from primarily noncoding mutations.

Theme 1: Human Mutation Rates

Several talks addressed the topic of mutation rate in human genomes. Donald Conrad, who will be joining the WashU Genetics Department next year, presented mutation rate as a quantitative trait based on 1000 Genomes Project trio data. Three of the primary sources of variation in mutation rate are age (males have 3x-6x higher rates), environment, and genetic variation (e.g. inherited aging disorders).

Lee Hood gave an excellent keynote on “Systems Genetics and P4 Medicine”, part of which was a discussion of mutation rate. His group uses whole-genome sequencing (WGS) of family cohorts (in this case, the Miller syndrome family quartet), focusing on the ~2.3 Gbp of non-repetitive reference sequence. Using the family information and inheritance modeling, they identify de novo mutations in the offspring, which manifest as errors of Mendelian inheritance. Validation using a custom capture array for 60,000 candidate sites followed by deep sequencing showed that only 1/1,000 “new” mutations in the offspring were real; the vast majority proved to be sequencing errors. That works out to a mutation rate of 1.1 x 10^-8, or roughly 70 mutations per child.
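The step from a rate of about 1.1 x 10^-8 to roughly 70 mutations per child is a one-line calculation. It assumes the rate is per base pair per generation and that a child inherits two haploid genomes of about 3.2 Gbp each. Neither figure was stated in the talk as reported here, so treat both as my assumptions.

```python
# Mutation rate quoted in the talk (assumed: per base pair, per generation).
mutation_rate = 1.1e-8

# Assumption: a haploid human genome of ~3.2 billion base pairs.
haploid_genome_bp = 3.2e9

# A child inherits one haploid genome from each parent.
diploid_genome_bp = 2 * haploid_genome_bp

expected_de_novo = mutation_rate * diploid_genome_bp
print(f"Expected de novo mutations per child: ~{expected_de_novo:.0f}")
```

With these inputs the expectation lands at about 70, which matches the figure quoted in the talk.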

Lynn Jorde (Univ. of Utah) later gave a talk on directly estimating human mutation rate by WGS, also using the Miller syndrome quartet. Sequencing by Complete Genomics yielded >50x coverage per subject; there were ~4 million positions in the 1.8 Gbp of “useful” reference sequence in which at least one subject differed from the reference. Only 330,000 or so SNPs were novel (not known to dbSNP), and 20% of these proved to be sequencing errors. More array validation, more calculations, and the same answer as given by Dr. Hood: a mutation rate of 1.1 x 10^-8.

Theme 2: Personal Cancer Genomes

Cancer genomes were another focus of the meeting. Sean Grimmond (Univ. of Queensland, Brisbane, Australia) presented some of his group’s work on pancreatic cancer as part of the International Cancer Genome Consortium (ICGC). Pancreatic cancer is one of the deadliest forms of cancer; about 90% of patients diagnosed die within one year. Brisbane has assembled a very nice workflow from sample collection to sequencing that includes pathology review, tumor dissection, QA, and microarray analysis to determine tumor cellularity. The sequencing strategy (WGS, exome, and RNA-seq) differs between high-cellularity (70-100%) and low-cellularity (~30%) tumors. The ultimate deliverable is a “tumor report” documenting cellularity estimates, microarray findings, cytogenetics, what sequencing was done, and what mutations were found.

James Brugarolas (UT Southwestern Medical Center) described the genome evaluation and functional studies of a patient with clear cell renal carcinoma. I learned a bit more about this form of cancer – 85% of tumors prove to be the “clear cell” carcinoma; common lesions include 3p loss (VHL gene) and 5q35 gain. This particular tumor underwent Illumina whole-genome sequencing to 35x coverage; some 46 somatic mutations were validated. One of these was in a gene whose protein product complexes with mTOR, the central player in a known cancer pathway. The tumor was successfully xenografted to a mouse model; some 43/46 somatic mutations were retained, and all had higher frequencies (similar to our findings on basal-like breast cancer). The xenograft let them test a few different cancer drugs – erlotinib (an EGFR inhibitor that had no effect), sunitinib (the front-line therapy for these patients, also no effect), and others. Intriguingly, however, the tumor was sensitive to an mTOR inhibitor compound.

Rick Wilson (The Genome Center at Washington University) gave a talk on whole-genome sequencing of leukemia patients at WashU. Of the 50+ leukemia patients sequenced to date, most have fewer than 20 valid protein-altering mutations. For most patients, low-resolution cytogenetic screens are the paradigm for disease classification and treatment decisions. Favorable-risk patients (17% of cases) undergo light chemotherapy. For adverse-risk patients (22% of cases), an allo-matched bone marrow transplant is the standard of care. That leaves a large body of patients (~61%) with “intermediate” risk according to cytogenetics; here, the correct treatment decision is harder to make. Better stratification of intermediate-risk patients is the first goal. Dr. Wilson related a fascinating case study, a 39-year-old female with suspected acute promyelocytic leukemia, in which rapid-turnaround WGS was able to provide an accurate diagnosis that was not obtained by conventional FISH, and ultimately guided her treatment.

Theme 3: Genome Regulation and Epigenetics

Peter Laird (Univ. Southern California, LA) led us out of the genome to the epigenome with his talk on mining the cancer methylome. He argued that the first steps in oncogenesis may be epigenetic changes, specifically, the dysregulation of genes due to abnormal methylation. Dr. Laird presented what he’s calling the first cancer methylome – a tumor sample and matched normal control that underwent bisulfite treatment and sequencing to ~30x coverage. As expected, bisulfite sequencing yielded very accurate estimates of DNA methylation (r=0.97 with Illumina Infinium) but was able to do so across the complete human genome with base-pair resolution.

Theme 4: Exome Sequencing

There is a ton of exome sequencing going on. I saw at least two posters describing “whole” exome sequencing in 1,000 cases and 1,000 controls. I put “whole” in quotes because it’s not true at this point; people really shouldn’t be going around saying that the “whole exome” was sequenced. It’s more like 80-90% of known genes. Rick Lifton spoke about some of the valuable applications of exome sequencing – finding dominant reproductive lethal mutations, unraveling recessive traits with high locus heterogeneity, characterizing somatic mutations in cancer, and identifying rare variants associated with common disease. He described recently published work in which recessive mutations in WDR62 were linked to severe brain malformations by exome sequencing. Matt Bainbridge gave a nice overview of the exome sequencing currently under way at Baylor. So yes, it turns out that groups outside of WashU are doing exome sequencing too.

The Four Dimensions of a Breast Cancer Genome

April 15, 2010 by Dan Koboldt

Published today in the journal Nature is the whole-genome sequencing of a basal-like breast cancer tumor, metastasis, and xenograft. There’s also a News and Views article by Joe Gray of Lawrence Berkeley National Laboratory, as well as a news feature on large-scale cancer projects.

This study is a bit unlike our previous cancer genomes (AML1 and AML2). By my count it is the sixth cancer genome to be sequenced, and the third to come out of the Genome Center at Washington University. Obviously, it’s our first solid tumor. What’s particularly interesting about this study, however, is that we sequenced four DNA samples from a single patient with “double-negative” breast cancer: the primary tumor, peripheral blood (normal), a brain metastasis, and a mouse xenograft derived from the primary tumor. The xenograft is a success story in itself – we managed to create a human-in-mouse (HIM) transplant of the primary tumor that was >90% pure when harvested 101 days after engraftment.

The genomes of these four samples (tumor, normal, metastasis, and xenograft), examined with the incredible power of Illumina massively parallel sequencing, offer an unprecedented view of the somatic changes that underlie breast cancer development, growth, and metastasis.

Repertoire of Somatic Mutations

We validated a total of 50 somatic sites in at least one of the three cancer genomes, including:

28 missense mutations predicted to alter the sequence of an encoded protein
11 synonymous (silent) mutations in coding sequences
4 small insertions ranging in size from 1 to 6 bp
3 small deletions ranging in size from 1 to 13 bp
2 splice site mutations at intron-exon junctions
1 nonsense mutation predicted to result in a truncated protein
1 RNA mutation in a gene encoding a signal recognition particle (SRP) RNA.

We employed deep Illumina sequencing of PCR amplicons to assess the frequencies of each mutation across all four tissues. Intriguingly, more than half of them exhibited differential frequencies between primary tumor, metastasis, and/or xenograft. Two mutations (a nonsense mutation in MYCBP2 and a missense mutation in TGFBI) were significantly enriched in the primary tumor (88-89% vs 14-44%). Some 26 mutations were significantly enriched in the metastasis and/or xenograft. Perhaps most interesting, however, were two sites (a missense mutation in SNED1 and a silent mutation in FLNC) that appear to be de novo mutations unique to the metastasis.

Acquired Structural Variation

Using our internally developed tools for structural variant prediction (BreakDancer) and de novo assembly (TIGRA), we predicted 59 deletions and 18 inversions that were putative somatic events. Validation by PCR and 454/3730 sequencing showed that 73/77 (94.8%) were real structural variants, of which 34 (28 deletions and 6 inversions) were somatic alterations not present in the normal genome. Among them was a 46.5 kbp heterozygous deletion affecting FBXW7 (a known cancer gene) and two overlapping 500-kb deletions affecting CTNNA1 and a handful of other genes. The latter was particularly interesting, because loss of CTNNA1 has been shown to result in global loss of cell adhesion in human breast cancer cell lines.

We also validated seven translocations with a combination of manual review (Pairoscope), assembly, and PCR/3730 sequencing. One translocation that we assembled in all three tumor samples involves a long terminal repeat (LTR) from the ERVL-MaLR family on chromosome 4 and the ABCA2 gene on chromosome 9. Two other validated translocations that assembled in all three tumors are on chromosome 2, and separated only by a 393-bp TcMar-Tigger repeat.

Insights from Comparisons of Tumor, Metastasis, and Xenograft

One of the most intriguing findings from our study was the differential mutation frequencies and structural variation patterns that we observed in the metastasis and xenograft, compared to the primary tumor. More than half of the somatic mutations (26/50) were significantly enriched in the metastasis and xenograft, while observed at relatively low frequencies in the primary tumor. This suggests that a sub-population of tumor cells, not the primary clone, gave rise to the cerebellar metastasis that eventually killed the patient.

Is there a fitness cost to the mutations that enabled metastasis? Can we develop sensitive tests to detect the cells that are likely to spread? Genome sequencing has brought us to a point where we can begin to ask these questions, and answering them brings us one step closer to unraveling the complex, devastating, deadly disease that is cancer.

References
Li Ding, Matthew J. Ellis, Shunqiang Li, David E. Larson, Ken Chen, John W. Wallis, Christopher C. Harris, Michael D. McLellan, Robert S. Fulton, Lucinda L. Fulton, Rachel M. Abbott, Jeremy Hoog, David J. Dooling, Daniel C. Koboldt, Heather Schmidt, Joell (2010). Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature, 464 (15), 999-1005 DOI: 10.1038/nature08989