The False Positives in Deep Resequencing

August 22, 2008 by Dan Koboldt

At last the PNAS article previewed earlier this week by In Sequence is available on the journal’s site. “Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing” had two aspects that appealed strongly to me: the use of massively parallel sequencing to study leukemia, and a formalized algorithm to distinguish true variants from false-positives.

The authors set out to examine clonal evolution in cancer with next-generation sequencing of B-cell chronic lymphocytic leukemia (CLL) samples. CLL was an appealing model for this study because of its high mutation rate in the short stretch of DNA that encodes the IG heavy chain (IGH). The short size of the locus was ideal for 454 sequencing, and because single-molecule reads are generated, the authors were able to identify haplotypes of somatic hypermutations carried by individual leukemic cells.

A key part of this study was the characterization of sequencing error rates and their causes. Three patterns of sequence errors were apparent:

Errors found near runs of 4 or more bases of the same nucleotide (homopolymers). This well-known artifact of pyrosequencing accounted for many false indel calls, and created false SNP calls as well.
Errors near the end of the sequence. These arise from a reduced signal-to-noise ratio after about 200 bases have been read.
Polymerase misincorporation during PCR. These are not sequencing errors, but random polymerase errors that created a low rate of substitutions through the length of the amplicon.
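
The first two error patterns depend only on where a candidate variant sits in the read, so they are easy to flag. Here is a minimal sketch of that idea, not the authors' algorithm: the run length of 4 and the 200-base cutoff come from the description above, while the `NEAR_WINDOW` distance and all function names are my own assumptions. PCR misincorporation has no positional signature, so it is not flagged here.

```python
import re

HOMOPOLYMER_MIN = 4    # run length named in the post
LATE_READ_START = 200  # "after about 200 bases", per the post
NEAR_WINDOW = 2        # assumption: how close counts as "near" a run


def homopolymer_runs(seq, min_len=HOMOPOLYMER_MIN):
    """Return (start, end) half-open spans of single-base runs >= min_len."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end()) for m in pattern.finditer(seq)]


def error_flags(seq, pos):
    """List the error-prone contexts that a 0-based read position falls in."""
    flags = []
    for start, end in homopolymer_runs(seq):
        if start - NEAR_WINDOW <= pos < end + NEAR_WINDOW:
            flags.append("near_homopolymer")
            break
    if pos >= LATE_READ_START:
        flags.append("late_in_read")
    return flags
```

A variant caller could use flags like these to demand more supporting reads for a call in a suspect context, rather than discarding it outright.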

Weeding out false-positives is one of the greatest challenges facing those of us who analyze massively parallel sequencing data. Often this issue is addressed *after* the sequencing is done, with concordance estimates, decision trees, and the like. What I like about this study is that the authors looked at sequencing errors first, to precisely classify the sources of false-positives, and then built their variant-calling algorithm around the results.

The evolutionary biology aspect of this study is fascinating as well. Cancer is a powerful micro-system to study evolution, since subclones of cells have a mixture of shared and private somatic mutations and compete with one another to grow. Subclones with the best evolutionary fitness will, in time, come to dominate the population. It’s Darwinian fitness at its best.

By identifying haplotypes from single-molecule reads, the authors were able to construct phylogenetic trees of the leukemic cells in a single patient, something that could only be done on the 454 platform. Intriguingly, the initiating driver mutation of leukemogenesis occurred before the earliest branching of trees. Yet there were numerous different subclone haplotypes: one came to dominate, but the others persisted as well. This suggests that every subclone persisting in the population picked up at least one additional mutation that gave it a competitive advantage. Thus even the rare subclones carry driver mutations that contribute to cancer cell survival.

The more rare subclones we can detect, the more mutations we can find, and the better we can come to understand the complex set of disease mechanisms that play a role in cancer.

Fitness Effects of Amino Acid Mutations in Humans

June 4, 2008 by Dan Koboldt

The current issue of PLoS Genetics has an interesting article on the distribution of fitness effects (DFE) among new amino acid changing (nonsynonymous) mutations.

Adam R. Boyko et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4(5): e1000083. May 2008.

Call me old-fashioned, but I’m still impressed by strong datasets. The authors of this study resequenced the exons of 11,404 protein-coding genes in 35 individuals (20 European Americans and 15 African Americans), which provided a uniform ascertainment and frequency estimate for some 47,576 coding SNPs. The paper itself is very statistical in nature, with various “selection models” applied to determine the demographic and selective effects on amino acid variation in the human genome. Let me admit that I understand only the fundamentals of such things. While the authors only look at nonsynonymous and synonymous variants, they’ve done a lot of work to comprehensively investigate evolutionary models with their data. Let me hit you with the highlights:

They investigated the unfolded nonsynonymous site frequency spectra using 13 different selection models, including some complex two-parameter and three-parameter models.
The authors inferred a similar mean selection coefficient (-0.030) for newly arising mutations in European Americans as in African Americans, despite complications of demographic history (admixture) in both groups.
Various manipulations of the data showed that two major potential confounding factors, SNP ascertainment bias and weak selection at presumably-neutral sites, had little influence on the inferences from their data set.
The authors estimate that 10-20% of amino acid divergence between chimps and humans is due to positive selection. This figure holds in both African and European derived samples.
According to best-fit models, 27-29% of nonsynonymous changes are neutral, 30-42% are modestly deleterious, and the remainder highly deleterious. Due to the strength of purifying selection, however, deleterious mutations make up <1% of common segregating SNPs (MAF >= 0.05) in human populations.

It follows from the last point above that the vast majority of common human genetic variation, i.e. SNPs with derived allele frequencies of at least 5%, is neutral or nearly neutral with respect to fitness. If this is true, then there are important implications for genetic association studies, which often rely on surveys of common genetic variation in the human genome. Such studies may miss the rare, highly deleterious mutations that are both evolutionarily and medically relevant.

The authors conclude that “re-sequencing in large samples of phenotypically extreme individuals, on the other hand, is much more likely to discover rare, large-effect mutations that are predicted… to be deleterious.” As a HapMap consortium member I’m not sure that I agree outright, but as an employee of the WashU Genome Sequencing Center, I have to say, resequencing is not a bad way to go.

Cis-regs and Functional Noncoding Variation

April 24, 2008 by dkoboldt

On Tuesday I attended a very interesting thesis defense by Scott Doniger, a student in Justin Fay’s lab. I admit, I was lured in by the thesis title, “Comparing and Contrasting Cis-regulatory Sequences to Identify Functional Noncoding Sequence Variation.” While I do not know Scott personally, I’m certainly familiar with Justin Fay’s work on positive and negative selection in the human genome. His paper, in fact, is the foundation of my work on signatures of natural selection and the SNPseek project.

Scott proved a confident and articulate speaker, and laid the groundwork for his thesis by presenting three convincing motivations for this work:

The regulatory hypothesis of evolution. Despite the obvious phenotypic diversity of species on this planet, the DNA sequence diversity is surprisingly limited. More than twenty-five years before the completion of the human genome sequence, King and Wilson [1975] found that the chimpanzee and human genomes diverged by only 1.6%. From this seminal paper came the idea that regulation of gene expression, not differences in DNA sequence, drove phenotypic divergence.
The functional relevance of noncoding sequences. Despite the traditional view that functional variants in humans alter protein-coding sequence, it is becoming clear that the genetics underlying many traits extend into noncoding DNA, particularly for complex phenotypes like disease susceptibility and drug response.
The availability of numerous genome sequences. Draft genome sequences for at least 27 vertebrate species have been completed to date, and their availability has spurred wide interest in the field of comparative genomics.

Scott’s work is based on the reasonable premise that functional noncoding sequences are subject to purifying selection (fewer changes tolerated over time), and thus they should be conserved between genomes that share common ancestry. Thus, comparative genomics serves to guide us to functional variants, as SNPs in constrained positions are more likely to be deleterious. This works well for coding sequences in both humans and yeast (the Fay lab model organism). Scott looked at the 9 known quantitative trait nucleotides (QTNs) in yeast and sure enough, 8 of them were SNPs in highly conserved amino acid positions. Gravy.

Because deep sequence conservation approaches might not work for noncoding SNPs, they focused on a few closely related species of yeast, identifying 2,106 variant positions (13% of the total) that fell within conserved transcription factor binding sites (TFBS’s). Of those, 615 (29%) appear to be deleterious based on their conserved-nucleotide model. If I can extrapolate, by their approach about 3.8% of the SNPs between closely related yeast species are likely to be functional.
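
The extrapolation is just the product of two fractions: the share of variant positions in conserved TFBS's, times the share of those called deleterious. A quick check using only the numbers quoted above (the function and variable names are mine, and the implied total is approximate because the 13% figure is rounded):

```python
def extrapolate_functional_fraction(in_tfbs, deleterious, tfbs_share):
    """Return (deleterious share within TFBS's, share of all SNPs, implied total)."""
    deleterious_share = deleterious / in_tfbs        # 615 / 2,106
    overall_share = tfbs_share * deleterious_share   # fraction of all SNPs
    implied_total = in_tfbs / tfbs_share             # total variant positions
    return deleterious_share, overall_share, implied_total


del_share, overall, total = extrapolate_functional_fraction(2106, 615, 0.13)
print(f"deleterious within TFBS's: {del_share:.1%}")
print(f"functional share of all SNPs: {overall:.1%}")
print(f"implied total variant positions: about {round(total):,}")
```

This counts only SNPs inside annotated, conserved binding sites, so it is better read as a lower bound than as a full estimate of functional noncoding variation.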

The Model-Free Approach: PhyloNet-SNP

All of Scott’s work to this point relies on having good annotations of cis-regulatory TFBS’s in your genome of interest. Because you can’t always count on that, they developed a “model-free” approach to evaluating SNPs. With some help from Gary Stormo’s group, they devised an algorithm (PhyloNet-SNP) that uses each SNP plus 20 bp of flanking sequence in each direction as a query sequence to identify those within multi-copy conserved elements of a genome. By this approach, ~15% of the SNPs in their model system were called as functional.
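
To make the query step concrete, here is a rough sketch of how a window of that kind could be cut from a reference sequence. It covers only the query construction as I understood it from the talk. The search for multi-copy conserved elements, which is the heart of PhyloNet-SNP, is not shown, and the function name, 0-based coordinates, and edge handling are my own assumptions.

```python
def snp_query_window(chrom_seq, snp_pos, flank=20):
    """Return the SNP base plus up to `flank` bases on each side.

    snp_pos is 0-based. Windows near a sequence end are truncated
    rather than padded, so they may be shorter than 2 * flank + 1.
    """
    if not 0 <= snp_pos < len(chrom_seq):
        raise ValueError("snp_pos lies outside the sequence")
    start = max(0, snp_pos - flank)
    end = min(len(chrom_seq), snp_pos + flank + 1)
    return chrom_seq[start:end]
```

With the default flank, an interior SNP gives a 41 bp query with the SNP at the center.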

The Experimental Backup: Allele-specific Expression

The brief wet-lab portion of the thesis work was an allele-specific expression experiment where the ability of SNPs to alter gene expression levels was evaluated in vivo. Among randomly-chosen SNPs about 8% had a regulatory effect. However, using sequence conservation and/or PhyloNet-SNP to select SNPs brought this up to 25%, suggesting that the conservation approach yields a three-fold enrichment of SNPs that affect gene expression.

At the conclusion, Scott admitted that while comparative genomics does help identify functional sequences and variation, it doesn’t explain everything. Indeed, recent findings from the ENCODE project cast doubt on whether many conserved noncoding sequences are important at all. Yet until we have a better understanding of the dark matter of the human genome, using sequence conservation to identify SNPs of interest seems like a good way to go.
