Next-generation sequencing technologies are said to be ushering in a new era of cancer genomics. A powerful demonstration of the new paradigm for cancer research came out today in Nature. It’s the much-anticipated publication of our AML project, in which we used the Illumina/Solexa platform to sequence the entire genome of a woman who died of acute myeloid leukemia. See my colleague David Dooling’s post on Politigenomics for links to some of the news coverage. This study offers two important milestones to the field:
1.) The first complete genome of a woman. Patient 933124 follows James Watson and J. Craig Venter into the archives of whole-genome history.
2.) The first cancer genome to be completely sequenced on a next-generation platform.
The basic biological problem is simple. This patient had AML. AML is a disease state that was initiated, and driven, by mutations in her genome. Her blood cells should be dominated by the clone of the most effective cancer genome. So, we sequence both tumor and germline DNA. We find any mutations in the tumor that are not in the germline, and identify which of those are novel, protein-coding variants. Among these should be, must be, the mutations that initiated and drove the development of cancer.
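The filtering logic described above boils down to a few set operations. Here is a minimal sketch in Python, not our actual pipeline: the `Variant` record, the function name, and the toy coordinates are all made up for illustration, and real variant calls would of course come from alignment and annotation tools rather than hand-typed sets.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (illustration only)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def candidate_somatic_variants(tumor_calls, germline_calls,
                               known_variants, coding_changes):
    """Return tumor-only, novel, protein-altering variants, sorted."""
    # Step 1: keep variants seen in the tumor but not in the germline.
    tumor_only = set(tumor_calls) - set(germline_calls)
    # Step 2: drop variants already catalogued (e.g. in a database like dbSNP).
    novel = tumor_only - set(known_variants)
    # Step 3: keep only variants annotated as changing a protein.
    return sorted(v for v in novel if v in coding_changes)


# Toy data with made-up coordinates.
tumor = {
    Variant("chr1", 100, "A", "G"),  # also in germline -> inherited
    Variant("chr2", 200, "C", "T"),  # tumor-only but already known
    Variant("chr3", 300, "G", "A"),  # tumor-only, novel, coding
    Variant("chr4", 400, "T", "C"),  # tumor-only, novel, non-coding
}
germline = {Variant("chr1", 100, "A", "G")}
known = {Variant("chr2", 200, "C", "T")}
coding = {Variant("chr1", 100, "A", "G"), Variant("chr3", 300, "G", "A")}

candidates = candidate_somatic_variants(tumor, germline, known, coding)
for v in candidates:
    print(v)
```

In this toy case only the chr3 variant survives all three filters. The hard part in practice is not the set arithmetic but getting trustworthy calls into each set in the first place.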
Major Informatic Challenges
Yet even with new technologies, this was no simple task. The sheer volume of data generated for this project is what amazes me. It took 98 full Solexa runs (4 libraries totaling ~5.86 billion reads) to reach our target diploid coverage in the tumor, which was 90%. The data was generated over a period of several months, during which both the sequencing technologies and the informatic algorithms were constantly evolving. The AML project offered me my first view of both 454 and Solexa data, and it presented our group with numerous challenges. Disk space. Computing power. Short read alignment. Variant calling. You name it.
The Power of the Unbiased Approach
It seems like a lot of work just for ten mutations. That’s how many validated, somatic, nonsynonymous mutations we found in this AML genome. Eight of these ten mutations, however, implicated new genes that were not previously linked to AML. Four of the genes are in gene families strongly associated with cancer pathogenesis (PTPRT, CDH24, PCLKC, and SLC15A1). The other four genes (KNDC1, GPR123, EBI2, and GRINL1B) are not known to contribute to cancer pathogenesis, but they have potential roles in metabolic pathways that may act to promote cancer growth. These are also four genes that would almost certainly have been excluded from a candidate gene approach.
Mutation Frequencies by 454 Read Count
Another interesting application of massively parallel sequencing was the estimation of mutation frequencies in the tumor sample using the Roche/454 platform. For each of the 10 somatic mutations, as well as 2 germline control variants, we performed PCR-targeted 454 resequencing in samples from the primary tumor, the relapse tumor, and the germline. The idea here was to profile the relative proportions of clonal cells that made up each sample. We got some results that the first author of the study, Tim Ley, described more than once (in our meetings) as “absolutely beautiful.” All of the somatic SNPs were at roughly 50% frequencies in the tumor, as you’d expect for heterozygous mutations present in essentially every tumor cell. They hovered slightly lower (around 40%) in the relapse sample, which was known to be less pure (~78% blasts), but if you correct for the blast count (that is, divide the observed frequency by the blast fraction), they reach about 50% as well. Intriguingly, the somatic variants were detected in the germline sample as well at frequencies of 5-13%, suggesting that the skin sample was contaminated by a small fraction of leukemic cells. The one non-beautiful result was FLT3, which had frequencies of around 35% in tumor and 31% in relapse. It may be that the FLT3 ITD mutation was not present in all tumor cells; perhaps it was introduced later than the others.
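The blast-count correction is simple arithmetic, and it is worth seeing once with the numbers plugged in. The sketch below is illustrative only: the function names are mine, and the second function rests on the assumption that a mutation is heterozygous, sits in a diploid region, and is present in every leukemic cell, which is a simplification rather than something we measured.

```python
def purity_corrected_frequency(observed_freq: float, tumor_fraction: float) -> float:
    """Scale an observed variant frequency by the fraction of tumor cells."""
    if not 0 < tumor_fraction <= 1:
        raise ValueError("tumor_fraction must be in (0, 1]")
    return observed_freq / tumor_fraction


def implied_tumor_fraction(observed_freq: float) -> float:
    """Tumor-cell fraction implied by a variant frequency.

    Assumes a heterozygous, fully clonal mutation in a diploid region,
    so each tumor cell contributes the variant on one of its two alleles.
    """
    return observed_freq / 0.5


# Relapse sample: ~40% observed frequency at ~78% blasts.
relapse_corrected = purity_corrected_frequency(0.40, 0.78)
print(f"relapse, corrected: {relapse_corrected:.3f}")

# Skin sample: somatic variants seen at 5-13%.
low = implied_tumor_fraction(0.05)
high = implied_tumor_fraction(0.13)
print(f"implied leukemic fraction in skin: {low:.2f} to {high:.2f}")
```

Under those assumptions the relapse frequency corrects to about 51%, right where a heterozygous clonal mutation should sit. The same reasoning applied to the skin sample gives only a rough sense of the contamination, since it inherits every simplification listed above.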
Yes, We Can Find Indels in Short Reads
One of the significant bioinformatic challenges in which I became intimately involved was the detection of indels, which is theoretically possible but practically very difficult in fragment (non-paired) reads that are only ~36bp long. We ended up combining a few different approaches and found over 700 putative small indels, more than half of which were already in dbSNP. We attempted to validate 28 of these by 3730 sequencing. Two were the previously-known mutations in FLT3 and NPM1. Two were false positives. The other 24 were real, but present in the germline, which was a bummer since we thought they’d be somatic. Those are the breaks. Fortunately, indel detection is one area that will be helped dramatically by improvements to the sequencing technologies, namely longer reads and paired-end protocols.
A New Paradigm for Cancer Genomics
I think that most of all, this work was important because it established the feasibility of sequencing entire genomes with massively parallel / short read technologies and getting valuable results from them. It also drove us to develop and apply new algorithms (like decision trees) to analyze the data. I expect that we’ll begin to see a number of whole-genome-sequencing approaches to the study of cancer and other diseases that take advantage of this new paradigm. The question of whether or not we can do science on a whole-genome scale has been answered. In the words of our next president, “yes we can!”
ben berman says
This is a really cool landmark paper! I was wondering: you guys chose a very permissive threshold, I think, for the decision tree SNP calls (rightly so, in order to minimize false negatives). Does the decision tree implementation yield overall probabilities so that this threshold could be tightened in order to minimize false positives instead? Are you making these decision tree probabilities available to the community in order to compare this dataset to their own AML data?