AGBT: Cancer Genomics at St. Jude, Harvard, WashU

February 26, 2010 by Dan Koboldt

Today’s plenary session included some great talks on cancer genomics. Keynote speaker Jim Downing of St. Jude Children’s Research Hospital gave a talk on acute leukemia, in which he openly admitted that he would show no next-gen sequencing data. Instead, he gave a very nice overview of the four biological processes that are dysregulated in acute leukemia:

Self renewal. With a few exceptions, pre-leukemic cells have only limited self-renewal capacity. AML1-ETO is often altered to overcome this limitation.
Response to growth factor signals. The BCR-ABL gene fusion is a classic example of an alteration that lets cells grow in the absence of growth factors.
Differentiation. Leukemic cells block this process via alterations in PML-RARA, PAX5, EBF, BTLA, and others.
Apoptosis. This normal pathway of cell death is circumvented in leukemia via alterations in CDKN2A/B, BT6, and the RB pathway.

Non-NGS Molecular Profiling

Dr. Downing’s group uses several molecular techniques to characterize pediatric leukemias, including Affy SNP-chip (for copy number alterations), cytogenetics/FISH, and targeted sequencing in a handful of genes. In a study of 242 pediatric acute lymphoblastic leukemia (ALL) tumors with matched controls, a surprisingly small number of copy number alterations were observed.

There were a few significantly altered genes, however. PAX5 was deleted or amplified in 30% of B-cell ALLs; some apparent 3′ deletions proved to be fusion events with ETV6, FOXP1, or other genes. Another gene, IKZF1, was deleted in 83.7% of ALLs that were BCR-ABL positive. These and other findings convinced the audience, I think, that there is much to be learned, even about the best-characterized human cancer, and even without next-generation sequencing technologies.

Cancer Genomes and Translational Oncology

Levi Garraway of Harvard Medical School spoke about how next-generation sequencing can be applied to translational oncology. He offered a clinical perspective on cancer genomics, which has somewhat different requirements from basic research:

Targeted. The mutations and genes to be assessed in clinical samples must already be known and well-characterized.
Resource-efficient. To minimize costs, clinicians are interested in tests that make efficient use of sample and equipment resources.
Actionable. Only mutations and biomarkers that give actionable information, i.e., “the patient has X mutation, so we should administer drug Y” are valuable in a clinical setting.

A resource compiled by Dr. Garraway and others, called OncoMap, offers a database of known oncogenic mutations that can be tested (on frozen or FFPE samples) for just $200 per patient. Granted, it includes only 46 mutations from 34 cancer genes, but each provides a validated, actionable course in regard to treatment.

The speaker admitted that ideally, a systematic mutational profiling method would have high sensitivity and specificity, testing both oncogenes and tumor suppressors. It would also detect multiple alteration types (SNVs, CNAs, etc.) and be able to use either DNA or RNA, or both. And it would have an “acceptable” turnaround time, say 2 weeks. This is what clinicians want, and hybrid capture approaches may offer the best solution. More on that in another post.

Elaine Mardis: Single Molecule Sequencing in Cancer

My favorite talk of the day (obviously) was by genome center co-director Elaine Mardis, who presented WashU’s pipeline for detecting and validating somatic mutations from whole-genome sequencing. Our pipeline has evolved over the course of AML1, AML2, and other cancer whole-genome sequencing projects, and now has the highly automated capacity to handle the coming 600 tumor-normal pairs to be sequenced for the Pediatric Cancer Genome Project (PCGP).

Dr. Mardis also discussed our methods for systematically assessing the prevalence of somatic mutations (within a tumor population) as well as their recurrence in tumors of the same or other types. Prevalence is important because the greater the fraction of tumor cells that share a mutation, the more likely it is that the mutation occurred early during progression. By similar reasoning, assessing the recurrence of mutations in a tumor type provides a measure of their importance for disease development.
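As a rough illustration of prevalence, the fraction of reads supporting a mutation can be converted into an estimated fraction of tumor cells. The sketch below is my own and not part of the WashU pipeline; it assumes a heterozygous mutation in a copy-neutral diploid region and a known tumor purity, and the function name and read counts are hypothetical.

```python
def estimated_cell_fraction(variant_reads, total_reads, purity=1.0):
    """Estimate the fraction of tumor cells carrying a mutation.

    Assumption: the mutation is heterozygous in a diploid, copy-neutral
    region, so a mutation present in every tumor cell gives a variant
    allele fraction of purity / 2.
    """
    if total_reads <= 0 or not 0 < purity <= 1:
        raise ValueError("need positive depth and purity in (0, 1]")
    vaf = variant_reads / total_reads
    # Cap at 1.0 because sampling noise can push the estimate above 100%.
    return min(1.0, 2 * vaf / purity)


# Hypothetical read counts at two somatic sites, 100x coverage each.
print(estimated_cell_fraction(45, 100))  # close to clonal
print(estimated_cell_fraction(10, 100))  # present in a minority of cells
```

Under these assumptions, the first mutation would be in roughly 90% of tumor cells and probably arose early, while the second would be in about 20% and probably arose later.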

The Importance of Recurrence Testing

IDH1 demonstrates this principle well. Initially identified as a key cancer gene in glioma by Bert Vogelstein’s group at Johns Hopkins, the isocitrate dehydrogenase 1 (IDH1) gene was also mutated in AML2, and, in a screen of hundreds of AML samples, proved to be recurrent. At least two large-scale studies of AML have since replicated the common incidence of IDH1 mutations in AML and other cancers.

Third Generation Sequencing in Cancer

Finally, the speaker presented some recent experiments that we’ve performed using the Pacific Biosciences Single Molecule Real Time sequencer on in-house cancer samples. In work that’s part of a manuscript in submission, the accuracy and sensitivity of the SMRT sequencer were assessed on GBM and AML tumor samples that had already been characterized by whole genome sequencing. In general, the results were promising – 25 of 25 known somatic mutations were identified in SMRT sequencing of PCR products, although 6 were detected at lower-than-expected prevalence.

Somatic mutations from AML2 were also used to create mixed PCR libraries of various tumor cellularities from 50% to 100%. It was apparent that “tier 1” somatic coding mutations were more reliably detected on Pac Bio than tier 2 and tier 3 mutations, and that there’s a slight bias against detecting C to T mutations. That said, the ability of SMRT sequencing to detect somatic mutations even at low tumor cellularities is promising.

Capture and Subassembly with Jay Shendure

January 15, 2010 by Dan Koboldt

Yesterday our 2010 Genetics Seminar Series kicked off with Jay Shendure (Univ. Washington) whose twelve-exome paper landed in Nature late last year. His talk covered three very different applications of next-generation sequencing: high-throughput mutational studies of core promoters, sub-assembly of Illumina reads to 454-length contigs, and exome capture to unravel Mendelian disorders.

Mutational Profiling

First, Dr. Shendure described some interesting experiments under way in his lab to elucidate the function of non-coding regulatory variants – specifically, single nucleotide changes in the core promoter that alter gene transcription. The approach is called “saturation mutagenesis” and involves generating every possible mutant in a construct, and then assaying the effect of each construct on transcription. By leveraging high-density Agilent arrays and next-generation sequencing, Shendure and his colleagues performed saturation mutagenesis in vitro in high-throughput fashion. Their process involves three steps:

Synthesize mutant constructs on an Agilent array. The oligos (probably ~150 bp) include the core promoter region surrounding a gene’s transcription start site (TSS). They generate a single mutation (SNP or single-base indel) per construct, and label each construct with a sequence barcode downstream of the TSS.
Cleave mutant templates from the array, amplify, and sequence on Illumina to measure relative construct abundance.
Perform in vitro transcription, then Illumina RNA-Seq, to measure the expression of each construct.

Dr. Shendure noted that there was some sequencing bias between barcodes, so they used multiple barcodes (6) per mutant construct and normalized the results. Then, by combining the construct abundance data (Seq) and the expression data (RNA-Seq) for mutants and comparing them to the results for the wild-type construct, they could assess the functional impact of each synthesized mutation on transcription.
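The normalization described above can be sketched roughly as follows. The talk did not say how the barcodes were combined, so taking the median RNA/DNA ratio across barcodes is my assumption, and the function names and read counts are hypothetical.

```python
from math import log2
from statistics import median


def construct_activity(dna_counts, rna_counts):
    """Median RNA/DNA ratio across the barcodes of one construct."""
    ratios = [rna / dna for dna, rna in zip(dna_counts, rna_counts) if dna > 0]
    if not ratios:
        raise ValueError("no barcode with DNA coverage")
    return median(ratios)


def mutation_effect(mutant, wild_type):
    """log2 fold change in activity of a mutant relative to wild type."""
    return log2(construct_activity(*mutant) / construct_activity(*wild_type))


# Hypothetical read counts for six barcodes: (DNA counts, RNA counts).
wild_type = ([100, 120, 90, 110, 95, 105], [200, 240, 180, 220, 190, 210])
tata_mutant = ([100, 80, 110, 90, 120, 100], [50, 40, 55, 45, 60, 50])

# A negative value means the mutation reduced transcription.
print(mutation_effect(tata_mutant, wild_type))
```

Dividing RNA by DNA counts corrects for how abundant each construct was in the template pool, and using several barcodes per construct dampens barcode-specific sequencing bias.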

As far as results go, Dr. Shendure showed a histogram: on the X-axis was each base of the core promoter region that they evaluated, and on the Y-axis, the effect of mutating that position on transcription. Most of the values were negative, indicating that mutations reduced transcriptional activity, particularly around the TATA box and INR site. Essentially, the plot neatly described the footprint of RNA polymerase binding, with the most effective mutations centered on the TSS. Intriguingly, the single-base deletion mutants consistently showed the greatest reduction of transcription, suggesting, perhaps, that indels in promoter regions are likely to be functional variants.

Short Read Subassembly

The next area of interest was very pertinent to groups with access to next-generation sequencing, but not the 454 “length matters” platform. While Illumina read lengths are still growing (most groups currently run 75- or 100-bp protocols), they still cannot rival the ~450 bp reads consistently produced on 454 Titanium. And yet, many applications of NGS benefit from longer reads – de novo assembly, metagenomics, and the core promoter assays I’ve just described, to name a few. Thus, Shendure and his group sought to combine some Tech D cleverness with Illumina’s incredible read depth to generate localized assemblies of kilobase-length fragments.

First, they sheared DNA into fragments a few kilobases long, ligated adapters to the ends of each fragment, and did a round of amplification, yielding many copies of each fragment with adapters on each end. The fragments were concatemerized, then somehow randomly sheared into variable-length pieces of the original fragment such that each piece had one of the original adapters on one end. A new adapter was ligated to the sheared end. Then came another round of PCR, followed by Illumina paired-end sequencing. The resulting paired-end reads (75-mers) have a “read2” that’s the same for all pieces of the same kilobase fragment, but a “read1” that comes from some random location within the fragment.
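The bookkeeping implied here is simple: because read2 is shared by every piece of a given fragment, it can serve as a tag for binning read1 sequences before assembly. Here is a minimal sketch of that grouping step only, assuming exact tag matches and using made-up toy sequences; it is not Shendure's code, and a real pipeline would have to tolerate sequencing errors in the tag.

```python
from collections import defaultdict


def group_by_tag(read_pairs):
    """Group read1 sequences by their shared read2 (tag) sequence."""
    groups = defaultdict(list)
    for read1, read2 in read_pairs:
        groups[read2].append(read1)
    return dict(groups)


# Hypothetical toy pairs (real reads would be 75-mers).
pairs = [
    ("ACGTAC", "TTAGGC"),
    ("GTACCA", "TTAGGC"),
    ("CCATGA", "GGCATA"),
]

for tag, reads in group_by_tag(pairs).items():
    # Each group would be handed to an assembler separately.
    print(tag, len(reads))
```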

Then, it’s possible to perform a localized assembly for each kilobase fragment. It’s an interesting approach, but here’s the problem: after assembly, in their proof-of-principle experiment, they achieved a median contig size of 350 bp. Granted, the per-base quality was very high (85% of bases had Q>40), but the lengths are unimpressive. As Dr. Shendure joked, they managed to get similar read lengths to a 454 run and make it cost just as much. There’s still a lot of work to do. Or they could just pick up one of those cute little GS-Juniors.

Human Exomes and Mendelian Disease

Finally, Dr. Shendure gave an overview of last year’s elegant Nature paper, in which exome sequencing of four individuals, followed up by careful downstream informatics, correctly identified the causative gene. Their defined “exome” was 30 Mb, which they targeted using two solid-phase array capture chips. Illumina sequencing of the exome capture generated about 6.4 gigabases per individual. Exome sequencing makes a lot of sense in certain Mendelian disorders, where (1) the pattern of inheritance, e.g. autosomal recessive, is known, and (2) the causative mutations occur in a single gene.

By sequencing the exomes of multiple individuals, isolating what we’d call “tier 1” variants – nonsynonymous, nonsense, splice site, or frameshift-indel – and then removing all known common variants from public databases, Dr. Shendure and colleagues can reduce 20,000 gene candidates down to a handful. It worked out beautifully in the Nature paper – all four individuals had rare, tier 1 mutations in the same gene.
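The filtering logic amounts to a set intersection. The sketch below is illustrative only: the tuple layout, annotation labels, gene names, and variant IDs are all hypothetical, and it assumes the causal gene carries a qualifying variant in every affected individual.

```python
TIER1 = {"nonsynonymous", "nonsense", "splice_site", "frameshift_indel"}


def candidate_genes(variants_by_individual, known_variants):
    """Genes with a rare tier 1 variant in every individual.

    Each variant is a (variant_id, gene, annotation) tuple; known_variants
    is a set of variant IDs already present in public databases.
    """
    per_individual = []
    for variants in variants_by_individual:
        genes = {
            gene
            for variant_id, gene, annotation in variants
            if annotation in TIER1 and variant_id not in known_variants
        }
        per_individual.append(genes)
    return set.intersection(*per_individual) if per_individual else set()


# Hypothetical calls for three affected individuals.
individuals = [
    [("v1", "GENE_A", "nonsense"), ("v2", "GENE_B", "synonymous"),
     ("rs1", "GENE_C", "nonsynonymous")],
    [("v3", "GENE_A", "nonsynonymous"), ("rs1", "GENE_C", "nonsynonymous")],
    [("v4", "GENE_A", "splice_site"), ("v5", "GENE_D", "frameshift_indel")],
]
known = {"rs1"}

print(candidate_genes(individuals, known))  # {'GENE_A'}
```

Note that the individuals need not share the same variant, only the same gene, which is why the intersection is taken over gene names rather than variant IDs.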

But in another cohort (4 individuals from 3 kindreds with Miller syndrome, a rare developmental disorder) Dr. Shendure and colleagues discovered the danger of overfiltering. They removed all variants from dbSNP 129, but when they limited the scope to only mutations predicted to be “damaging” or “deleterious”, the number of genes dropped to zero. Apparently the deleteriousness of at least one of the causal mutations wasn’t predicted correctly.

Obviously, the need is for better filters of common variants. But with projects like the 1,000 Genomes in full swing, I wonder, will filtering out using dbSNP get better, or worse? Already, as Shendure pointed out, certain genes have basically a SNP reported at every position. I know that TP53 does. What’s more, with the advent of next-generation sequencing, I hate to tell you, but people are going to be reporting a lot of false positives. I guarantee it. So when you filter all of the variants, you might actually remove the ones you’re looking for.

References
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571

Finding Recurrent CNVs in Cancer

January 6, 2010 by Dan Koboldt

Copy number aberrations (CNAs) represent one of the most prevalent genetic alterations in cancer cells. There is considerable interest in finding CNAs that affect the same chromosomal region in multiple tumor samples. Recurrent CNA (RCNA) implies the presence of key cancer genes; on chromosome 7, for example, we often see amplification of the region containing the EGFR gene.

Most common approaches to RCNA identification involve a two-step approach: first, call CNAs in each individual sample; second, perform cross-sample analysis to look for recurrence. Unfortunately, with large numbers of samples and increasingly dense genomic data, this two-step approach carries a significant computational burden.
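To make the two-step idea concrete, here is a deliberately simplified sketch. Real methods segment each sample rather than thresholding single sites; the thresholds, function names, and log-ratio values below are assumptions chosen for illustration.

```python
def call_gains(log_ratios, threshold=0.3):
    """Step 1: per-sample calls, True where a site looks amplified."""
    return [value > threshold for value in log_ratios]


def recurrent_sites(samples, threshold=0.3, min_fraction=0.5):
    """Step 2: sites called as gains in at least min_fraction of samples."""
    calls = [call_gains(sample, threshold) for sample in samples]
    n_sites = len(samples[0])
    return [
        site
        for site in range(n_sites)
        if sum(c[site] for c in calls) / len(calls) >= min_fraction
    ]


# Hypothetical log2 intensity ratios: 3 samples x 5 sites.
samples = [
    [0.0, 0.5, 0.6, 0.1, 0.0],
    [0.1, 0.4, 0.0, 0.0, 0.1],
    [0.0, 0.7, 0.5, 0.0, 0.4],
]

print(recurrent_sites(samples))  # [1, 2]
```

Even in this toy form, every sample is processed on its own before any cross-sample comparison happens, which is where the cost grows with sample count and probe density.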

Enter the Matrix: Correlation Matrix Diagonal Segmentation

Now online at Bioinformatics Early Access is a paper describing CMDS, a population-based method for detecting RCNA in cancer that was developed here at Washington University by Qunyuan Zhang and his colleagues.


CMDS uses raw intensity ratio data (from SNP arrays, CGH, etc.) and adopts a diagonal transformation strategy to identify RCNAs via between-chromosomal-site correlation. Not only does this reduce the computational burden of RCNA identification, but it increases the detection power as well.
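The intuition behind between-site correlation can be shown with a toy example. To be clear, this is not the published CMDS algorithm: it is a sketch that scores each site by its average Pearson correlation with neighboring sites across samples, that is, by the entries near the diagonal of the site-by-site correlation matrix. The window size, function names, and data are all my assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def diagonal_scores(matrix, window=1):
    """Score each site by its mean correlation with sites within `window`.

    matrix[s][i] is the intensity ratio of sample s at site i; the matrix
    must contain at least two sites.
    """
    n_sites = len(matrix[0])
    columns = [[row[i] for row in matrix] for i in range(n_sites)]
    scores = []
    for i in range(n_sites):
        neighbors = range(max(0, i - window), min(n_sites, i + window + 1))
        corrs = [pearson(columns[i], columns[j]) for j in neighbors if j != i]
        scores.append(sum(corrs) / len(corrs))
    return scores


# Hypothetical data: sites 2-4 are amplified in samples 0 and 1 only.
matrix = [
    [0.1, -0.1, 1.0, 1.1, 0.9, 0.0],
    [-0.1, 0.1, 0.9, 1.0, 1.1, 0.1],
    [0.0, 0.1, 0.0, -0.1, 0.1, -0.1],
    [0.1, 0.0, 0.1, 0.0, -0.1, 0.1],
]

scores = diagonal_scores(matrix)
print([round(s, 2) for s in scores])
```

Sites inside a recurrently altered region rise and fall together across samples, so they correlate strongly with their neighbors, while sites carrying only noise do not. No per-sample calling is needed.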

Done in 13 Seconds

CMDS has a speed advantage as well. Qunyuan compared its execution time to that of AWS-STAC, SBS-STAC, and pREC-A on a dataset composed of 10,000 sites in 100 samples. The R version of CMDS finished in 13 seconds. The other algorithms took more than 300 times longer on the same dataset, indicating that CMDS represents a substantial performance gain. There’s also a C version of CMDS that runs even faster.

Application to Real Data: Lung Cancer and Glioblastoma

To evaluate CMDS on real data, Qunyuan applied it to lung adenocarcinoma and glioblastoma (brain cancer) datasets that were generated as part of the Tumor Sequencing Project (TSP) and the Cancer Genome Atlas (TCGA), respectively. CMDS called 39 significant RCNA regions in lung cancer and 37 in brain cancer. All of the significant regions had been previously reported/validated; they included or were proximal to a number of well-known cancer genes including EGFR, CCND1, KRAS, MDM2, PDGFRA, and others.

When the two datasets were combined, a few key RCNA regions emerged – amplification of EGFR, CDK4, and MDM2, and deletion of CDKN2A – that were shared by both cancers. This, to me, demonstrates one of the most powerful aspects of CMDS – its population-based approach can compare not only samples of the same cancer type, but also pools of samples across sample types. It makes a great addition to our arsenal of cancer genomics tools at Washington University.

CMDS is implemented in R and C programs which are available from Qunyuan’s web site.

References
Zhang Q, Ding L, Larson DE, Koboldt DC, McLellan MD, Chen K, Shi X, Kraja A, Mardis ER, Wilson RK, Borecki IB, & Province MA (2009). CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics (Oxford, England) PMID: 20031968

The Search for Somatic Changes

October 29, 2009 by Dan Koboldt

As cancer genome sequencing ramps up here and pretty much everywhere around the world, I got to thinking about strategies for identifying somatic changes, with confidence, from massively parallel sequencing data. As part of the Cancer Genome Atlas Project (TCGA), we’ve been applying both targeted (capture-based) and whole-genome sequencing approaches to tumor samples and matched normal controls. Ideally, the resulting data will yield high (>20x) coverage in both tumor and normal across our positions of interest. What happens next, at least at WashU, is the culmination of a multiple-year effort to develop a comprehensive pipeline for detecting somatic variants.

First up: Single Nucleotide Variants (SNVs)

With more than 15 million entries in dbSNP, single nucleotide polymorphisms (SNPs) remain the most common form of DNA sequence variation in humans. In cancer, most of the well-characterized somatic mutations are single nucleotide changes as well. Conceptually, SNVs should be the easiest things to find in next-gen sequencing data. They occur at a single position that can be directly compared between tumor and normal. They should have minimal effects on sequence alignments to the reference genome. For example, here’s a putative somatic variant in TP53:

What you see above is SAMtools “pileup” output at a single position (7518990 on chr17), for Normal and Tumor. The Normal shows 4 reads that all support the reference on the – strand (,,,,). The Tumor, however, shows 6 reads that all support a G variant, 2 on the + strand (GG) and 4 on the – strand (gggg). It seems reasonable that, given this output across the entire genome for Normal and Tumor, one can compare them at every position and look for differences such as these.
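The position-by-position comparison can be sketched as a small parser for the read-base column of pileup output. This is a simplified illustration rather than our pipeline: the function names and the minimum-read threshold are assumptions, and base qualities are ignored. The two input strings are the Normal and Tumor read bases described above.

```python
from collections import Counter


def count_bases(read_bases):
    """Count reference and variant alleles in a pileup read-base string."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            # Read start: skip the marker and its mapping-quality character.
            i += 2
            continue
        if ch in "+-":
            # Indel: skip the length digits and the inserted/deleted bases.
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
            continue
        if ch in ".,":
            counts["ref"] += 1
        elif ch.upper() in "ACGTN":
            counts[ch.upper()] += 1
        # Anything else ("$" read ends, "*" deletion placeholders) is ignored.
        i += 1
    return counts


def somatic_candidates(normal_bases, tumor_bases, min_reads=3):
    """Variant alleles well supported in the tumor and absent from the normal."""
    normal, tumor = count_bases(normal_bases), count_bases(tumor_bases)
    return [
        allele
        for allele, n in tumor.items()
        if allele != "ref" and n >= min_reads and normal[allele] == 0
    ]


print(somatic_candidates(",,,,", "GGgggg"))  # ['G']
```

Note that four reference-supporting reads in the normal is thin evidence that the variant is truly absent there, which is one way germline variants end up looking somatic.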

Yet we struggle to validate even high-confidence SNVs that look to be somatic. Some are real, but germline (probably under-sampled or missed in the Normal); most are simply false positives in the tumor. These might arise from a number of causes – homopolymers, paralogs, repeats, sequencing error, alignment error, etc. Only a small fraction of variants that appear somatic in NGS data will validate as such.

Why is that? In general, it’s because by screening for somatic variants, we remove all of the variants that are most likely to be real. First, we exclude any variants that are present in the normal (germline) – which account for the majority of true sequence variations. We also exclude known variants from dbSNP and 1,000 Genomes databases, which are also likely to be real but almost certainly germline events. Then, we prioritize variants that are predicted to have functional effects – on protein coding, on splicing, in conserved regions, etc. Such regions are often under negative selection for damaging mutations, meaning that variants should be exceedingly rare. Every one of these filters selects for variants that are less likely to be valid.

Small Indels

With longer (>50 bp) fragment-end reads and/or paired-end libraries, it’s possible to detect small insertion/deletion variants (indels) in next-gen sequencing data. Here, detection and specificity are the challenges. In 454 data, the reads are [hopefully] of sufficient length (250 bp) for accurate gapped alignment to a reference sequence, and indeed, aligners commonly used with 454 data (Newbler, BLAT, cross_match, SSAHA2) do so. Unfortunately, indels are both the strength and the weakness of 454 data – due to the underlying pyrosequencing, homopolymeric regions are often under- or over-called, resulting in numerous false positives. Many can be filtered, but often homopolymer-associated errors cause mis-alignment of reads, yielding indels that might not look like homopolymer artifacts.

Indel detection is also possible with Illumina data, though the shorter read lengths make this challenging. Few short read aligners can handle the throughput of Illumina data and allow for gaps in read alignments, because speed and gapped alignment are at odds with one another. Fortunately, paired-end sequencing on Illumina offers a solution implemented by Maq some time ago – first, map all reads that you can without gaps, and then, look for gapped alignments in unplaced reads whose mate is mapped nearby. This reduces the search space considerably for gapped alignment, and also limits the query space to reads that likely contain indels (gaps).
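The search-space reduction is easy to picture in code. The following is a sketch of the idea only, not Maq's implementation: the record layout, the function name, and the maximum insert size are hypothetical. It selects unplaced reads whose mate is mapped and defines the window in which a gapped aligner would then search.

```python
from collections import namedtuple

# Hypothetical minimal read record; real aligners track much more.
Read = namedtuple("Read", "name mapped mate_mapped mate_chrom mate_pos")


def gapped_alignment_windows(reads, max_insert=500):
    """Return (read_name, chrom, start, end) for unplaced reads with a mapped mate."""
    windows = []
    for read in reads:
        if not read.mapped and read.mate_mapped:
            start = max(1, read.mate_pos - max_insert)
            end = read.mate_pos + max_insert
            windows.append((read.name, read.mate_chrom, start, end))
    return windows


reads = [
    Read("r1", True, True, "chr17", 7518000),    # already placed without gaps
    Read("r2", False, True, "chr17", 7518990),   # candidate for gapped alignment
    Read("r3", False, False, None, None),        # no anchor, cannot be rescued
    Read("r4", False, True, "chr1", 200),        # window clipped at position 1
]

print(gapped_alignment_windows(reads))
```

Only r2 and r4 go on to the expensive gapped alignment step, and each is searched against a window of about a kilobase rather than the whole genome.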

In cancer sequencing, small indels present one additional problem – determining whether they are present in the normal. Even the best aligners can’t always precisely define where an indel starts or stops. Thus, a germline indel might have different coordinates in the tumor than in its matched control; when comparing the samples, it might appear to be somatic.
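One common way to reduce this problem is to shift every indel to its leftmost equivalent position before comparing samples. The sketch below shows this for deletions only; it is a generic normalization technique, not something our pipeline is claimed to do, and the function name and sequence are made up.

```python
def left_align_deletion(reference, start, length):
    """Shift a deletion of reference[start:start + length] as far left as possible.

    Positions are 0-based. Moving the deletion one base to the left leaves the
    resulting sequence unchanged whenever the base entering the deleted span
    on the left equals the base leaving it on the right.
    """
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    return start


# A CA dinucleotide repeat: deleting any one "CA" unit gives the same sequence.
reference = "GGCACACACATT"

# The same 2-bp deletion reported at different coordinates in tumor and normal.
tumor_start = left_align_deletion(reference, 8, 2)
normal_start = left_align_deletion(reference, 4, 2)
print(tumor_start, normal_start)  # 2 2
```

After normalization both calls land on the same coordinate, so the germline deletion is no longer mistaken for a somatic one.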

Loss of Heterozygosity (LOH)

It is well known that the genomes of tumors show extensive loss of heterozygosity (LOH). Generally, this occurs because a position that is heterozygous in the germline is affected by some kind of structural event – deletion, gene conversion, chromosome loss, etc. – that results in the loss of one allele. Of course, to detect LOH, one needs a variant that’s heterozygous in the Normal, and to precisely define the region of LOH, one needs a dense set of heterozygotes. Even so, the maximum precision for the start and stop of an LOH region is the inter-SNP distance, since only SNPs can inform on LOH, and that can be hundreds or thousands of bases. But LOH calls do tend to cluster, and detection of LOH regions is not really the problem. Even lower-resolution array technologies identify recurrent LOH regions in tumor samples.
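A bare-bones version of LOH detection from sequencing data might look like the sketch below. The allele-fraction cutoffs, function name, and site data are assumptions chosen for illustration, and a real tumor sample would need the cutoffs relaxed to allow for normal-cell contamination.

```python
def loh_runs(sites, het_range=(0.4, 0.6), hom_cutoff=0.1, min_sites=2):
    """Find runs of consecutive informative sites showing LOH.

    sites: position-sorted (position, normal_vaf, tumor_vaf) tuples from one
    chromosome. Returns (first_position, last_position) for each run.
    """
    runs, current = [], []
    for position, normal_vaf, tumor_vaf in sites:
        if not het_range[0] <= normal_vaf <= het_range[1]:
            continue  # uninformative: not heterozygous in the normal
        if tumor_vaf <= hom_cutoff or tumor_vaf >= 1 - hom_cutoff:
            current.append(position)  # one allele lost in the tumor
        else:
            if len(current) >= min_sites:
                runs.append((current[0], current[-1]))
            current = []
    if len(current) >= min_sites:
        runs.append((current[0], current[-1]))
    return runs


# Hypothetical variant allele fractions at six SNP positions.
sites = [
    (100, 0.50, 0.48),   # heterozygosity retained
    (250, 0.52, 0.95),   # LOH
    (900, 0.00, 0.00),   # homozygous in normal: uninformative
    (1400, 0.47, 0.03),  # LOH
    (2100, 0.50, 0.97),  # LOH
    (3000, 0.55, 0.50),  # heterozygosity retained
]

print(loh_runs(sites))  # [(250, 2100)]
```

The reported boundaries are the outermost LOH SNPs, so the true breakpoints lie somewhere between 100 and 250 on one side and between 2100 and 3000 on the other, which is the inter-SNP limit on precision described above.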

But what exactly does LOH mean in terms of cancer development and growth? It’s hard to say. Quite possibly, a tumor suppressor gene was deleted, or an oncogenic allele was duplicated. Unfortunately, LOH regions tend to be kilobases or megabases in size, containing dozens or hundreds of genes, and identifying which ones are truly affected in terms of cancer remains challenging. We see a lot of LOH in cancer, but sadly, it never seems to get anyone excited.

Structural and Copy Number Variation


Last and most difficult to characterize are the sub-microscopic structural changes – insertions, deletions, inversions, translocations, duplications, etc. – that often occur in tumor genomes. These tend to be large, complex events that are tough to infer from NGS data. We run Ken Chen’s BreakDancer, of course, and it predicts numerous SVs. But how do you validate a massive, complex variant spanning thousands of bases? We do our best with PCR and 3730/454 sequencing, but until read lengths get really really long (perhaps on single-molecule sequencing), validating such events and determining their breakpoints is tough.

There are well-characterized, recurrent copy number alterations in cancer, like EGFR amplification on chromosome 7. Here’s my question: where are all of those extra copies? Are they just tandem duplications of part of a chromosome, or are they duplications that get inserted elsewhere in the genome? In the absence of a complete, linear, high-confidence genome, I’m not sure we can tell.

Fruits of Our Labors

It occurs to me that this is a bit of a negative article – focusing entirely on the challenges and failures, without highlighting the successes. And there are many successes. Every cancer genome tells us something, and every new piece of knowledge goes into our arsenal in the war against cancer. As sequencing ramps up, we’ll see exponential growth in the number of known somatic mutations across a wide array of cancers. With the help of cancer biologists, these data will be leveraged to better understand the genes, proteins, and pathways underlying tumorigenesis. Greater understanding will undoubtedly improve the detection, diagnosis, prognosis, and treatment of cancer patients.
