The Four Dimensions of a Breast Cancer Genome

April 15, 2010 by Dan Koboldt

Published today in the journal Nature is the whole-genome sequencing of a basal-like breast cancer tumor, metastasis, and xenograft. There’s also a News and Views article by Joe Gray of Lawrence Berkeley National Laboratory, as well as a news feature on large-scale cancer projects.

This study is a bit unlike our previous cancer genomes (AML1 and AML2). By my count it is the sixth cancer genome to be sequenced, and the third to come out of the Genome Center at Washington University. Obviously, it’s our first solid tumor. What’s particularly interesting about this study, however, is that we sequenced four DNA samples from a single patient with “double-negative” breast cancer: the primary tumor, peripheral blood (normal), a brain metastasis, and a mouse xenograft derived from the primary tumor. The xenograft is a success story in itself – we managed to create a human-in-mouse (HIM) transplant of the primary tumor that was >90% pure when harvested 101 days after engraftment.

The genomes of these four samples (tumor, normal, metastasis, and xenograft), examined with the incredible power of Illumina massively parallel sequencing, offer an unprecedented view of the somatic changes that underlie breast cancer development, growth, and metastasis.

Repertoire of Somatic Mutations

We validated a total of 50 somatic sites in at least one of the three cancer genomes, including:

28 missense mutations predicted to alter the sequence of an encoded protein
11 synonymous (silent) mutations in coding sequences
4 small insertions ranging in size from 1 to 6 bp
3 small deletions ranging in size from 1 to 13 bp
2 splice site mutations at intron-exon junctions
1 nonsense mutation predicted to result in a truncated protein
1 RNA mutation in a gene encoding a signal recognition particle (SRP) RNA.

We employed deep Illumina sequencing of PCR amplicons to assess the frequencies of each mutation across all four tissues. Intriguingly, more than half of them exhibited differential frequencies between primary tumor, metastasis, and/or xenograft. Two mutations (a nonsense mutation in MYCBP2 and a missense mutation in TGFBI) were significantly enriched in the primary tumor (88-89% vs 14-44%). Some 26 mutations were significantly enriched in the metastasis and/or xenograft. Perhaps most interesting, however, were two sites (a missense mutation in SNED1 and a silent mutation in FLNC) that appear to be de novo mutations unique to the metastasis.
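To make "significantly enriched" concrete, here is a minimal sketch of how read counts from deep amplicon sequencing might be compared between two samples. The post does not say which statistical test was used, so Fisher's exact test is an assumption here, and the read counts are made up for illustration (they are not from the paper).

```python
from math import comb

def vaf(variant_reads, total_reads):
    """Variant allele frequency: fraction of reads supporting the mutation."""
    return variant_reads / total_reads

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Rows are samples; columns are (variant reads, reference reads).
    Sums the probability of every table with the same margins that is
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x variant reads in sample 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts (assumption, not real data): 200 reads per sample
primary = (176, 24)       # variant reads, reference reads
metastasis = (28, 172)
p_value = fisher_exact_two_sided(*primary, *metastasis)
print(f"primary VAF {vaf(176, 200):.0%}, "
      f"metastasis VAF {vaf(28, 200):.0%}, p = {p_value:.3g}")
```

In practice one would also correct for testing 50 sites at once; that step is left out of the sketch.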

Acquired Structural Variation

Using our internally developed tools for structural variant prediction (BreakDancer) and de novo assembly (TIGRA), we predicted 59 deletions and 18 inversions that were putative somatic events. Validation by PCR and 454/3730 sequencing showed that 73/77 (94.8%) were real structural variants, of which 34 (28 deletions and 6 inversions) were somatic alterations not present in the normal genome. Among them was a 46.5 kbp heterozygous deletion affecting FBXW7 (a known cancer gene) and two overlapping 500-kb deletions affecting CTNNA1 and a handful of other genes. The latter was particularly interesting, because loss of CTNNA1 has been shown to result in global loss of cell adhesion in human breast cancer cell lines.
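The numbers in the paragraph above are easy to cross-check. This short sketch only reproduces the arithmetic from the counts quoted in the text; the variable names are mine.

```python
# Counts quoted in the text above
predicted = {"deletion": 59, "inversion": 18}   # putative somatic events
validated_total = 73                            # confirmed as real SVs
somatic = {"deletion": 28, "inversion": 6}      # absent from the normal genome

total_predicted = sum(predicted.values())
validation_rate = validated_total / total_predicted
somatic_total = sum(somatic.values())
somatic_fraction = somatic_total / validated_total

print(f"validated: {validated_total}/{total_predicted} = {validation_rate:.1%}")
print(f"somatic among validated: {somatic_total}/{validated_total} "
      f"= {somatic_fraction:.1%}")
```

So a little under half of the validated structural variants turned out to be somatic; the rest were real variants that were also present in the normal genome.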

We also validated seven translocations with a combination of manual review (Pairoscope), assembly, and PCR/3730 sequencing. One translocation that we assembled in all three tumor samples involves a long terminal repeat (LTR) from the ERVL-MaLR family on chromosome 4 and the ABCA2 gene on chromosome 9. Two other validated translocations that assembled in all three tumors are on chromosome 2, and separated only by a 393-bp TcMar-Tigger repeat.

Insights from Comparisons of Tumor, Metastasis, and Xenograft

One of the most intriguing findings from our study was the differential mutation frequencies and structural variation patterns that we observed in the metastasis and xenograft, compared to the primary tumor. More than half of the somatic mutations (26/50) were significantly enriched in the metastasis and xenograft, while observed at relatively low frequencies in the primary tumor. This suggests that a sub-population of tumor cells, not the primary clone, gave rise to the cerebellar metastasis that eventually killed the patient.

Is there a fitness cost to the mutations that enabled metastasis? Can we develop sensitive tests to detect the cells that are likely to spread? Genome sequencing has brought us to a point where we can begin to ask these questions, and answering them brings us one step closer to unraveling the complex, devastating, deadly disease that is cancer.

References
Li Ding, Matthew J. Ellis, Shunqiang Li, David E. Larson, Ken Chen, John W. Wallis, Christopher C. Harris, Michael D. McLellan, Robert S. Fulton, Lucinda L. Fulton, Rachel M. Abbott, Jeremy Hoog, David J. Dooling, Daniel C. Koboldt, Heather Schmidt, Joell (2010). Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature, 464 (7291), 999-1005. doi: 10.1038/nature08989

Next-Gen Sequencing in 2010

March 9, 2010 by Dan Koboldt

On the shuttle from Marco Island to the airport last week, I happened to sit next to a very nice gentleman from Illumina. We got to talking, of course, and I asked him if they saw a threat from any of the new sequencing platforms presented at AGBT. I’m aware that Illumina currently enjoys a greater-than-50% share of the next-gen sequencing market, so I was curious about his impressions.

“We definitely see a segmentation of the market,” he admitted.

Something had been bothering me about the sequencing-company presentations this year, and I finally realized what it was. During AGBT 2009, every player was gunning to take over the world. This year it seems like every sequencing platform has a niche in mind.

General Sequencing: Illumina vs. Life Technologies

Illumina’s HiSeq2000 and Life Tech’s SOLiD 4 are after the general sequencing market – whole genome, transcriptome, and targeted (capture) sequencing. It’s a constant game of one-upmanship in throughput and claimed accuracy. In February this year, Illumina launched the HiSeq2000 with expected throughput of 200 Gb per run. Life Technologies launched SOLiD 4 with 100 Gb per run, but promised 300 Gb per run later this year. On the read length front, Illumina remains the clear winner – 2×100 bp is in production at many genome centers, and even longer reads have been promised. Life Tech, to their credit, is pushing the SOLiD 4 platform pretty hard.

When Length Matters: 454

Roche/454 has wisely backed away from large-scale sequencing, and instead seems to be targeting applications where longer (450 bp) reads are a requirement. At AGBT, Henry Erlich (Roche) gave an interesting talk about genotyping and haplotyping human HLA regions to improve donor matching for organ transplants. Here’s a key challenge of modern medicine where sequencing can offer tangible benefits. Here at the genome center, we use 454 runs for validation and for small-scale targeted sequencing. There are many applications where relatively inexpensive long-read sequencing runs are ideal; full-length cDNA sequencing, for example, comes to mind.

Complete Genomics: Sequencing as a Service

The business model of Complete Genomics seems a bit of a gamble to me. They aim to be the provider of relatively inexpensive, start-to-finish sequencing services. No technology or reagent sales for these guys. Instead, they want to take your samples and give you back the SNPs. In the coming years, they hope to build as many as 10 facilities throughout the world that provide these services. I’m a bit leery of Complete Genomics, not only because their proprietary technology lags behind others (currently it’s at 2×35 bp), but because they’ll need to do something like 10,000 genomes a year just to stay in business. I don’t think we’re ready for that.

Sequencing for the Masses: IonTorrent

Many of us were impressed by IonTorrent this year at AGBT. The incredibly low cost of their instrument ($50K) and sequencing runs ($300-500) means that nearly any lab could write a grant around this technology. The sample prep, accuracy, and throughput are still a grey area, but if they prove to be good enough, high-throughput sequencing will suddenly be available to just about everyone.

Single Molecule Applications: Pac Bio and Oxford Nanopore

The true single-molecule sequencing platforms that are close to market are certainly getting everyone excited. In the next few years, however, it’s unlikely that Pacific Biosciences, Oxford Nanopore, mystery-Chinese-platform, or other companies will displace massively parallel sequencing. No, I think Illumina and SOLiD will remain the “workhorses” for discovery, certainly at major genome centers. Where SMS technologies can excel, however, is ultra-long reads – think about PacBio’s strobe sequencing to resolve structural variation or finish assemblies – and lots of molecule-kinetics stuff that I don’t understand.

I think that 2010 will be an exciting and telling time for all of these platforms. In a year’s time, we should have results in hand from HiSeq, SOLiD4, PacBio, and even IonTorrent, and be able to distinguish between marketing claims and sequencing reality.

Marco Island Meeting Preview

February 22, 2010 by Dan Koboldt

The Advances in Genome Biology and Technology (AGBT) meeting begins this week at Marco Island. I’ll be there to present a poster on our somatic mutation detection pipeline, and also to learn about what’s to come in next-generation and next-next-generation sequencing.

Some of the companies are already ramping up. Last week Pac Bio announced the initial members of their partnership program to provide complete solutions for single molecule real-time sequencing. Microfluidics company Caliper Life Sciences formed a scientific advisory board for next-gen sequencing that included WashU’s own Vince Magrini. Other companies – Illumina, Complete Genomics, and RainDance Technologies, for example – are hosting workshops or other events at AGBT.

AGBT Sessions Not To Miss

Day 1 of the meeting will be very strong, with opening remarks from Len Pennacchio (JGI), Kelly Frazer (UCSD) on genomic enrichment, Mike Snyder (Stanford) on paired-ends for SVs/assembly, and Barbara Wold on ChIP-Seq. On Day 2, Stacey Gabriel of the Broad Institute will discuss applications of new sequencing technology to medical and cancer genetics. Carlos Bustamante of Stanford will present the complete genome sequencing and analysis of African-American and Mexican-American individuals. WashU’s David Wang will give a talk on metagenomic approaches to pathogen discovery.

Some friends of mine are giving talks later that evening. Jeff Reid (Baylor College of Medicine) has what looks to be a very interesting talk on miRNA precursor variants in schizophrenia. Daniel MacArthur, of Sanger and Genetic Future fame, will present “Loss-of-Function Mutations in Healthy Human Genomes,” likely based on his work with the 1,000 Genomes Project.

Cancer Genomics and Sequencing

I’m very excited about an entire session devoted to cancer genomics. Elliott Margulies (NHGRI) will discuss the sequencing and analysis of a melanoma genome. In what may be the first application of single-molecule sequencing to cancer, the sequencing of Ewing’s Sarcoma on a Heliscope instrument will be presented by Timothy Triche of Childrens Hospital Los Angeles. Two speakers from BC Cancer Agency will discuss rearrangements in follicular lymphoma and capture/transcriptome sequencing in lung cancer.

Whole Genome Sequencing

There are to be big-picture sequencing talks as well. Genome center co-director Elaine Mardis will present “Single Molecule Sequencing to Detect and Characterize Somatic Mutations in Cancer Genomes.” Stan Nelson of UCLA will give a talk, presumably on his group’s recent publication – whole genome sequencing of a glioblastoma cell line on ABI SOLiD.

I’ll be there, and posting regular updates, as the latest and greatest in sequencing technologies unfolds at Marco Island.

Capture and Subassembly with Jay Shendure

January 15, 2010 by Dan Koboldt

Yesterday our 2010 Genetics Seminar Series kicked off with Jay Shendure (Univ. Washington) whose twelve-exome paper landed in Nature late last year. His talk covered three very different applications of next-generation sequencing: high-throughput mutational studies of core promoters, sub-assembly of Illumina reads to 454-length contigs, and exome capture to unravel Mendelian disorders.

Mutational Profiling

First, Dr. Shendure described some interesting experiments under way in his lab to elucidate the function of non-coding regulatory variants – specifically, single nucleotide changes in the core promoter that alter gene transcription. The approach is called “saturation mutagenesis” and involves generating every possible mutant in a construct, and then assaying the effect of each construct on transcription. By leveraging high-density Agilent arrays and next-generation sequencing, Shendure and his colleagues performed saturation mutagenesis in vitro in high-throughput fashion. Their process involves three steps:

Synthesize mutant constructs on an Agilent array. The oligos (probably ~150 bp) include the core promoter region surrounding a gene’s transcription start site (TSS). They generate a single mutation (SNP or single-base indel) per construct, and label each construct with a sequence barcode downstream of the TSS.
Cleave mutant templates from the array, amplify, and sequence on Illumina to measure relative construct abundance.
Perform in vitro transcription, then Illumina RNA-Seq, to measure the expression of each construct.

Dr. Shendure noted that there was some sequencing bias between barcodes, so they used multiple barcodes (6) per mutant construct and normalized the results. Then, by combining the construct abundance data (Seq) and the expression data (RNA-Seq) for mutants and comparing them to the results for the wild-type construct, they could assess the functional impact of each synthesized mutation on transcription.
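The normalization described above can be sketched in a few lines. This is only an illustration under my own assumptions: the talk did not specify how the six barcodes were combined (the median is my choice), and every read count below is invented.

```python
from math import log2
from statistics import median

def construct_activity(barcode_counts):
    """Median RNA/DNA read ratio across the barcodes of one construct.

    barcode_counts: list of (dna_reads, rna_reads) tuples, one per barcode.
    Dividing by DNA reads corrects for how abundant each barcoded template
    was; the median damps the effect of a single biased barcode.
    """
    ratios = [rna / dna for dna, rna in barcode_counts if dna > 0]
    return median(ratios)

def mutation_effect(mutant_counts, wildtype_counts):
    """log2 of mutant activity relative to wild type.

    Negative values mean the mutation reduced transcription. A real
    pipeline would need a pseudocount to handle zero RNA reads.
    """
    return log2(construct_activity(mutant_counts)
                / construct_activity(wildtype_counts))

# Hypothetical counts (dna_reads, rna_reads) for six barcodes each
wildtype = [(100, 200), (50, 100), (200, 400), (80, 160), (120, 240), (10, 60)]
mutant = [(100, 50), (60, 30), (200, 100), (40, 20), (90, 45), (150, 15)]

effect = mutation_effect(mutant, wildtype)
print(f"log2(mutant / wild type) = {effect:.2f}")
```

Plotting this value for every mutated position would give the kind of per-base profile described in the next paragraph.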

As far as results go, Dr. Shendure showed a histogram: on the X-axis was each base of the core promoter region that they evaluated, and on the Y-axis, the effect of mutating that position on transcription. Most of the values were negative, indicating that mutations reduced transcriptional activity, particularly around the TATA box and INR site. Essentially, the plot neatly described the footprint of RNA polymerase binding, with the most effective mutations centered on the TSS. Intriguingly, the single-base deletion mutants consistently showed the greatest reduction of transcription, suggesting, perhaps, that indels in promoter regions are likely to be functional variants.

Short Read Subassembly

The next area of interest was very pertinent to groups with access to next-generation sequencing, but not the 454 “length matters” platform. While Illumina read lengths are still growing (most groups currently run 75- or 100-bp protocols), they still cannot rival the ~450 bp reads consistently produced on 454 Titanium. And yet, many applications of NGS benefit from longer reads – de novo assembly, metagenomics, and the core promoter assays I’ve just described, to name a few. Thus, Shendure and his group sought to combine some Tech D cleverness with Illumina’s incredible read depth to generate localized assemblies of kilobase-length fragments.

First, they sheared DNA into fragments a few kilobases long, ligated adapters to the ends of each fragment, and did a round of amplification, yielding many copies of each fragment with adapters on both ends. The fragments were then concatemerized and somehow randomly sheared into variable-length pieces of the original fragment, such that each piece carried one of the original adapters on one end. A new adapter was ligated to the sheared end, followed by another round of PCR and Illumina paired-end sequencing. The resulting paired-end reads (75-mers) have a “read2” that’s the same for all pieces of the same kilobase fragment, but a “read1” that comes from some random location within the fragment.

Then, it’s possible to perform a localized assembly for each kilobase fragment. It’s an interesting approach, but here’s the problem: after assembly, in their proof-of-principle experiment, they achieved a median contig size of 350 bp. Granted, the per-base quality was very high (85% of bases had Q>40), but the lengths are unimpressive. As Dr. Shendure joked, they managed to get similar read lengths to a 454 run and make it cost just as much. There’s still a lot of work to do. Or they could just pick up one of those cute little GS-Juniors.
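The step that makes localized assembly possible is simply bucketing read pairs by their shared read2. Here is a minimal sketch of that grouping, assuming exact-match tags and using toy sequences of my own invention; a real implementation would have to tolerate sequencing errors in the tag read and then hand each bucket to an assembler.

```python
from collections import defaultdict

def group_by_tag(read_pairs):
    """Bucket read1 sequences by their read2 tag.

    read_pairs: iterable of (read1, read2) strings. All pairs derived from
    the same kilobase fragment share read2, so each bucket holds the reads
    that belong in one local assembly.
    """
    groups = defaultdict(list)
    for read1, read2 in read_pairs:
        groups[read2].append(read1)
    return dict(groups)

# Toy read pairs (hypothetical, far shorter than real 75-mers)
pairs = [
    ("ACGTTGCA", "TAGAAAAA"),
    ("TTGCAGGC", "TAGAAAAA"),
    ("GGCATCGA", "TAGCCCCC"),
    ("CAGGCTTA", "TAGAAAAA"),
]

groups = group_by_tag(pairs)
for tag, reads in groups.items():
    print(tag, len(reads), "reads to assemble")
```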

Human Exomes and Mendelian Disease

Finally, Dr. Shendure gave an overview of last year’s elegant Nature paper, in which exome sequencing of four affected individuals, followed up by careful downstream informatics, correctly identified the causative gene (MYH3, for Freeman-Sheldon syndrome). Their defined “exome” was 30 Mb, which they targeted using two solid-phase array capture chips. Illumina sequencing of the exome capture generated about 6.4 gigabases per individual. Exome sequencing makes a lot of sense in certain Mendelian disorders, where (1) the pattern of inheritance, e.g. autosomal recessive, is known, and (2) the causative mutations occur in a single gene.

By sequencing the exomes of multiple individuals, isolating what we’d call “tier 1” variants – Nonsynonymous, nonsense, splice site, or frameshift-indel – and then removing all known common variants from public databases, Dr. Shendure and colleagues can reduce 20,000 gene candidates down to a handful. It worked out beautifully in the Nature paper – all four individuals had rare, tier 1 mutations in the same gene.
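The filtering logic above boils down to three set operations. The sketch below is my own illustration of the idea, not the authors' pipeline; the variant records, field names, and gene names are all hypothetical.

```python
TIER1 = {"nonsynonymous", "nonsense", "splice_site", "frameshift_indel"}

def candidate_genes(individuals, known_variants):
    """Genes carrying a rare tier 1 variant in every affected individual.

    individuals: list of variant lists; each variant is a dict with keys
    'id', 'gene', and 'effect'. known_variants: set of variant ids seen in
    public databases, which are assumed to be common and therefore benign.
    """
    per_person = []
    for variants in individuals:
        genes = {v["gene"] for v in variants
                 if v["effect"] in TIER1 and v["id"] not in known_variants}
        per_person.append(genes)
    return set.intersection(*per_person) if per_person else set()

# Hypothetical toy data for two affected individuals
known = {"var2", "var5"}
person_a = [
    {"id": "var1", "gene": "GENE_A", "effect": "nonsense"},
    {"id": "var2", "gene": "GENE_B", "effect": "nonsynonymous"},  # known
    {"id": "var3", "gene": "GENE_C", "effect": "synonymous"},     # not tier 1
]
person_b = [
    {"id": "var4", "gene": "GENE_A", "effect": "nonsynonymous"},
    {"id": "var5", "gene": "GENE_B", "effect": "splice_site"},    # known
    {"id": "var6", "gene": "GENE_D", "effect": "frameshift_indel"},
]

candidates = candidate_genes([person_a, person_b], known)
print(candidates)
```

Note that the two individuals need not share the same mutation, only a qualifying mutation in the same gene.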

But in another cohort (4 individuals from 3 kindreds with Miller syndrome, a rare developmental disorder) Dr. Shendure and colleagues discovered the danger of overfiltering. They removed all variants from dbSNP 129, but when they limited the scope to only mutations predicted to be “damaging” or “deleterious”, the number of genes dropped to zero. Apparently the deleteriousness of at least one of the causal mutations wasn’t predicted correctly.
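A toy example shows how a single mispredicted variant can empty the candidate list. Everything here is hypothetical (gene names, predictions, and the two-person cohort); it only illustrates the failure mode, not the actual Miller syndrome data.

```python
def shared_genes(individuals, keep):
    """Genes with at least one variant passing `keep` in every individual."""
    per_person = [{v["gene"] for v in person if keep(v)}
                  for person in individuals]
    return set.intersection(*per_person) if per_person else set()

# Hypothetical rare variants remaining after removing known SNPs.
# GENE_X is the true causal gene, but one individual's mutation in it
# was wrongly predicted to be benign.
cohort = [
    [{"gene": "GENE_X", "predicted": "damaging"},
     {"gene": "GENE_Y", "predicted": "damaging"}],
    [{"gene": "GENE_X", "predicted": "benign"},
     {"gene": "GENE_Z", "predicted": "damaging"}],
]

rare_only = shared_genes(cohort, keep=lambda v: True)
damaging_only = shared_genes(cohort,
                             keep=lambda v: v["predicted"] == "damaging")

print("rare variants only:", rare_only)
print("rare and predicted damaging:", damaging_only)
```

Because the final step is an intersection across individuals, one false negative from the prediction tool is enough to lose the causal gene entirely.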

Obviously, the need is for better filters of common variants. But with projects like the 1,000 Genomes in full swing, I wonder, will filtering against dbSNP get better, or worse? Already, as Shendure pointed out, certain genes have basically a SNP reported at every position. I know that TP53 does. What’s more, with the advent of next-generation sequencing, I hate to tell you, but people are going to be reporting a lot of false positives. I guarantee it. So when you filter out all of the known variants, you might actually remove the ones you’re looking for.

References
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571
