My colleague David Larson just returned from CSHL’s Personal Genomes meeting, where he presented a poster on decision-tree filtering of variant predictions from Illumina/Solexa data. I don’t know much about machine learning, but I can see that it offers a useful approach to at least one aspect of next-generation sequencing analysis.
From my basic understanding, a decision tree is a machine-learning model that you “train” on a dataset where the correct classifications are known, and then apply to another dataset where they are not.

[Figure: A sample decision tree that uses weather attributes to determine whether a game will be played. Credit: Wikipedia]
For example, Dave’s poster described a decision tree that determines whether SNP predictions from Solexa are real (“Germline”) or false positives (“WildType”). As a training set, Dave used ~650 SNPs whose true status had previously been determined by 3730 sequencing. For each SNP, he provided several attributes (base quality, read count, etc.) as well as the correct “answer” (Germline or WildType) as determined by the 3730 data. These inputs went into the C4.5 program, which generated a decision tree to distinguish Germline from WildType based on those attributes.
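I haven’t seen the exact C4.5 invocation, but the training step looks roughly like the sketch below, which uses scikit-learn’s CART-based DecisionTreeClassifier as a stand-in for C4.5. The file name and attribute columns are hypothetical placeholders, not the actual inputs from Dave’s poster.

```python
# Minimal sketch of training a decision tree on labeled SNP calls.
# DecisionTreeClassifier (CART) stands in for C4.5; "training_snps.csv"
# and the column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: one SNP prediction with its attributes and the 3730-confirmed label.
training = pd.read_csv("training_snps.csv")
features = ["base_quality", "read_count", "allele_frequency"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(training[features], training["label"])  # label: "Germline" or "WildType"

# Dump the learned rules as text so they can be eyeballed, much like C4.5's output.
print(export_text(clf, feature_names=features))
```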
Dave applied the decision tree to whole-genome Solexa data for an individual we recently sequenced to over 10x coverage with Solexa fragment reads. Maq had predicted ~5 million SNPs; the decision tree filter cut that number in half. More promising still, the filter retained a substantially better data set: over 90% of the SNPs detected by array-based genotyping were among the Germline-classified SNPs, and concordance with dbSNP, one of our measures of specificity, was over 80% the last I heard.
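Those two checks are straightforward to compute once the calls are reduced to sets of genomic positions. The sketch below shows one way to do it; the inputs (sets keyed by chromosome and position, built from the array genotypes, dbSNP, and the tree-filtered Maq calls) are my own assumptions, not code from the poster.

```python
# Sketch of the two sanity checks described above. Inputs are hypothetical
# sets of (chromosome, position) tuples.
def sensitivity_vs_array(germline_calls, array_snps):
    """Fraction of array-genotyped SNPs recovered among Germline-classified calls."""
    return len(array_snps & germline_calls) / len(array_snps)

def dbsnp_concordance(germline_calls, dbsnp_positions):
    """Fraction of Germline-classified calls already present in dbSNP."""
    return len(germline_calls & dbsnp_positions) / len(germline_calls)
```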
It occurred to me that the decision tree approach has numerous applications for next-generation sequencing analysis. It could be used to distinguish true variants from false positives, or somatic mutations from germline variants. Decision trees might also be informative for short read alignments, where a number of attributes (read length, alignment score, alignment quality, mismatches, etc.) could be used to determine whether or not a read was correctly placed.
After talking with Dave, I spent half a day building decision trees that might be useful for 454 variant detection. One thing I realized very quickly is the importance of the training data set. First, I tried a training set of ~75 variants sequenced by 3730. This was way too small, yielding a tree with a single decision (allele frequency) to classify the data. Then I tried a training set of ~400,000 454 read alignments with several attributes. This was far too large, yielding a massive tree with hundreds of branches. I also worry about the correctness of the “answers” in my data sets. While 3730 sequencing is a gold standard, it tends to miss certain kinds of variants that 454 can detect; those real variants would end up mislabeled as WildType in the training data set. I think I’ll have to find a larger, more reliable training set before decision trees bear fruit for 454 variant detection.
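One thing I may try next is reining in the tree’s size directly. C4.5 does this with confidence-based pruning; in the scikit-learn stand-in I used above, the roughly analogous knobs look like the sketch below. The parameter values are illustrative guesses, not tuned settings.

```python
# Constrain tree growth so a large training set (e.g. ~400,000 read alignments)
# doesn't produce an unreadable tree with hundreds of branches.
# Values below are illustrative, not tuned.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=6,            # cap the number of questions along any decision path
    min_samples_leaf=500,   # refuse to split off tiny, noise-driven branches
    ccp_alpha=1e-4,         # cost-complexity pruning, loosely analogous to
                            # C4.5's confidence-based pruning
    random_state=0,
)
# Fitting then proceeds as in the earlier sketch: clf.fit(features, labels)
```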