The False Positives in Deep Resequencing

At last the PNAS article previewed earlier this week by In Sequence is available on the journal’s site. Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing had two aspects that appealed strongly to me – the use of massively parallel sequencing to study leukemia, and a formalized algorithm to distinguish true variants from false positives.

The authors set out to examine clonal evolution in cancer with next-generation sequencing of B-cell chronic lymphocytic leukemia (CLL) samples. CLL was an appealing model for this study because of its high mutation rate in the short stretch of DNA that encodes the IG heavy chain (IGH). The short size of the locus was ideal for 454 sequencing, and because single-molecule reads are generated, the authors were able to identify haplotypes of somatic hypermutations carried by individual leukemic cells.

A key part of this study was the characterization of sequencing error rates and their causes. Three patterns of sequence errors were apparent:

Errors found near runs of 4 or more bases of the same nucleotide (homopolymers). This well-known artifact of pyrosequencing accounted for many false indel calls, and created false SNP calls as well.
Errors near the end of the sequence. These arise from a reduced signal-to-noise ratio after about 200 bases have been read.
Polymerase misincorporation during PCR. These are not sequencing errors, but random polymerase errors that created a low rate of substitutions throughout the length of the amplicon.
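
To make the first two patterns concrete, here is a minimal sketch of a position-level filter. It is not the authors' algorithm. The run length of 4 and the cutoff of about 200 bases come from the description above; the function names and the `FLANK` window that defines "near" a homopolymer are my own assumptions. The third pattern, PCR misincorporation, is spread randomly along the amplicon, so a positional check like this one cannot catch it.

```python
import re

HOMOPOLYMER_MIN = 4    # run length mentioned in the post
READ_END_START = 200   # approximate read position where signal-to-noise drops
FLANK = 1              # assumption: how many bases from a run still count as "near"

def homopolymer_spans(seq, min_len=HOMOPOLYMER_MIN):
    """Return half-open (start, end) spans of single-base runs of at least min_len."""
    pattern = re.compile(r"(A+|C+|G+|T+)")
    return [(m.start(), m.end())
            for m in pattern.finditer(seq.upper())
            if m.end() - m.start() >= min_len]

def flag_position(seq, pos, flank=FLANK):
    """Return the reasons a variant at 0-based read position pos looks suspect."""
    reasons = []
    for start, end in homopolymer_spans(seq):
        # Inside the run, or within `flank` bases on either side of it.
        if start - flank <= pos < end + flank:
            reasons.append("homopolymer")
            break
    if pos >= READ_END_START:
        reasons.append("read_end")
    return reasons
```

A call such as `flag_position(read_sequence, 57)` returns an empty list when neither pattern applies, so a caller can treat any non-empty result as grounds for closer scrutiny rather than as automatic rejection.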

Weeding out false positives is one of the greatest challenges facing those of us who analyze massively parallel sequencing data. Often this issue is addressed *after* the sequencing is done, with concordance estimates, decision trees, and the like. What I like about this study is that the authors looked at sequencing errors first, to precisely classify the sources of false positives, and then built their variant-calling algorithm around the results.

The evolutionary biology aspect of this study is fascinating as well. Cancer is a powerful micro-system for studying evolution, since subclones of cells carry a mixture of shared and private somatic mutations and compete with one another to grow. Subclones with the best evolutionary fitness will, in time, come to dominate the population. It’s Darwinian fitness at its best.

By identifying haplotypes from single-molecule reads, the authors were able to construct phylogenetic trees of the leukemic cells in a single patient, something that could only be done on the 454 platform. Intriguingly, the initiating driver mutation of leukemogenesis occurred before the earliest branching of the trees. Yet there were numerous different subclone haplotypes – one came to dominate, but the others persisted as well. This suggests that every subclone persisting in the population picked up at least one additional mutation that gave it a competitive advantage. Thus even the rare subclones carry driver mutations that contribute to cancer cell survival.
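
The idea of reading haplotypes off single-molecule reads can be illustrated with a toy sketch. It assumes reads that are already aligned to the reference, equal in length, and free of indels, which real 454 data would not be; the function names are hypothetical and this is not the authors' tree-building method. Each read is reduced to its set of substitutions, identical sets are counted as one subclone haplotype, and the mutations present in every haplotype are candidates for events that preceded the earliest branching.

```python
from collections import Counter

def haplotype(read, reference):
    """Substitutions in read relative to reference, as a frozenset of (pos, base)."""
    return frozenset(
        (i, b) for i, (a, b) in enumerate(zip(reference, read)) if a != b
    )

def tally_haplotypes(reads, reference):
    """Count how many reads carry each distinct haplotype."""
    return Counter(haplotype(r, reference) for r in reads)

def shared_mutations(haplotypes):
    """Mutations found in every haplotype: candidates for the trunk of the tree."""
    haps = list(haplotypes)
    return frozenset.intersection(*haps) if haps else frozenset()
```

Sorting the tally with `most_common()` puts the dominant subclone first, and subtracting the shared set from each haplotype leaves the private mutations that distinguish one branch from another.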

The more rare subclones we can detect, the more mutations we can find, and the better we can come to understand the complex set of disease mechanisms that play a role in cancer.