A paper online at Bioinformatics describes our flagship algorithm for detecting somatic point mutations in whole-genome sequencing of tumor samples. This freely available software package, called SomaticSniper, performs a Bayesian comparison of the genotype likelihoods in tumor and normal samples at every [covered] position in the genome.
The study includes a detailed investigation of common sources of false positive mutation calls (usually from sequencing- or alignment-related artifacts) and describes a filtering strategy to remove them from mutation callsets.
Inception: First Cancer Genomes
Like many bioinformatics algorithms, SomaticSniper reached publication after a long and colorful history. It began in 2008 when we sequenced the first cancer genome, AML1. At the time, we were generating fragment-end, 32 bp reads on early Illumina GA instruments. It took over a hundred lanes to achieve ~30-fold coverage on each sample (tumor and normal). We were in dire need of a short read aligner that could handle this amount of data, and Maq answered the call (see my Maq Top Ten).
In addition to serving as one of the most widely used short read aligners, Maq included a probabilistic genotype calling model for detecting germline SNPs in a single genome. Dave Larson (the lead author) and others from our group developed an algorithm to compare genotype likelihoods between tumor and normal, to compute the probability that a site is not somatic given the sequence data. Putative somatic mutations receive a somatic score, a Phred-scaled value representing the quality of the call. Here’s something interesting: during the data generation phase for AML1, as we added more sequence, the number of candidate mutations went down. This is because only a tiny fraction of variants in a tumor genome are somatic; the vast majority are germline variants also present in the normal. As better coverage was achieved, more and more variants turned out to be germline. By the end, it turned out that there were just ten somatic coding mutations in the tumor genome of AML1, a cytogenetically normal leukemia. A lot of people were flabbergasted. Ten little changes, and a woman got leukemia.
More Genomes, Better Algorithm
This algorithm became the core of our cancer whole-genome sequencing analysis pipeline, evolving and improving over the course of the second cancer genome (AML2) in the New England Journal, a breast cancer genome (BRC1), and others. It found, among others, mutations in IDH1 and DNMT3A that we and others showed to be recurrent across many tumors. The algorithm’s name changed a few times, settling at last on SomaticSniper. It’s now a lean and hungry animal, capable of processing high-coverage whole-genome sequence pairs in a matter of hours.
Filtering Out the Noise
No matter how good the mutation caller, there are going to be some false positives. This is because you’re looking for a one-in-a-million event, a true somatic mutation. Raw SomaticSniper calls therefore undergo a series of Maq-inspired filters. Sites are retained if they meet these criteria:
- Covered by at least 3 reads
- Consensus quality of at least 20
- Called a SNP in the tumor sample with SNP quality of at least 20
- Maximum mapping quality of at least 40
- No high-quality predicted indel within 10 bp
- No more than 2 other SNVs called within 10 bp
Sites passing these criteria are subjected to two additional filters: a screen against germline variants from dbSNP (remove if matches position and allele of known non-cancer dbSNP) and an LOH filter (remove if normal is heterozygous and tumor homozygous for the same variant allele). Sites removed by the former are probably inherited variants under-sampled in the matched normal, while sites removed by the latter are likely due to large-scale structural changes (e.g. deletions) causing the loss of one allele. Finally, the filter-passed mutations are classified as high-confidence (HC) if the somatic score is at least 40 and the mapping quality is at least 40 (for BWA) or 70 (for Maq).
Frequent Sources of False Positives
Even sites that pass the filters above are vulnerable to certain sequencing and alignment artifacts that produce false positive calls. A detailed study revealed (as many in the field know already) a few common sources of false positives: strand bias, homopolymer sequences, paralogous reads (deriving from a paralogous region of the genome, but mapped to the wrong region, usually three or more substitutions), and the read position of the predicted variant. The latter type of artifact is something new; it turned out that variants only seen near the “effective” 3′ end of reads (the start of soft-trimmed bases or the actual end of the read if untrimmed) were more likely to be false positives. This may be a combination of sequencing error, which is higher at the 3′ end of reads, and alignment bias favoring mismatches over gaps near the ends of reads. In any case, false positives deriving from these common causes tend to have certain properties enabling them to be identified and removed while maintaining sensitivity for true mutations.
SomaticSniper adds to the growing arsenal of tools developed by our group to address the significant challenges presented by next-generation sequencing data analysis.
Larson, DE., Harris, CC., Chen, K., Koboldt, DC., Abbott, TE., Dooling, DJ., Ley, TJ., Mardis, ER., Wilson, RK., & Ding, L. (2011). SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data Bioinformatics, Online : doi: 10.1093/bioinformatics/btr665