False Positives from Next-Gen Sequencing

Recently I saw one of those presentations that seem common in the era of next-generation sequencing: the kind where the speaker puts up a list of genes that were variant, mutated, or significant in a sequenced cohort. A glance at the list told me that there were some true positives (one of the classic cancer genes was at the top), but there were other symbols I recognized, too: TTN, USH2A, MUC16... We know them as the bad-apple genes, the ones that seem to crop up in any analysis.

With TTN and USH2A, it’s simply a gene size issue: these genes are so physically large that they tend to accumulate a lot of mutations in any cohort. And we see a lot of mucin genes because they arose from recent gene duplication events, so reads from their close paralogs are often aligned to the wrong copy, which in turn gives rise to systematic false positive variant calls. For example, here’s an IGV screenshot of exome data for three samples at the MUC4 locus:

[Figure: NGS false positives, IGV screenshot of exome data at the MUC4 locus]

This represents what we call “paralogous alignment” and it’s one of the most frequent causes of false positive variant calls in NGS data. If you look at the pileup of these positions, the evidence for individual variants seems compelling because so many reads support the same variant at the same position. But these reads likely originate from another part of the genome, meaning that there’s probably not a real variant here.

There’s at least one notable exception to the expectation that a concentration of apparent variants in a small region is artifactual: the HLA locus on chromosome 6, which hosts the genes that encode human MHC proteins. In these regions there are many sequence variants, and most are probably real. However, the locus violates many of our assumptions about the human genome as a whole, which is why you often see a peak on chromosome 6 in a Manhattan plot of genetic association signals.

False Positive Gene Lists

It has gotten to the point where analysts build lists of these “usual suspect” genes as a quick filter to remove them from candidate gene lists, because treating TTN like it’s one of your top genes is kind of a rookie move. Incidentally, I learned that titin does have a role to play in humans, evidenced by the fact that targeted therapies against it cause heart failure.

I recently made a list of genes that had novel heterozygous coding variants in multiple (unrelated) families with a rare Mendelian disorder. Then I took the first three characters of every gene name and counted how often each string occurred. Here were the top recurrences:

Count  String  Description
91     LOC     LOC genes
22     ENS     Ensembl genes
21     FAM     FAM proteins
15     GOL     Golgi-like (GOLGA8E)
13     PRA     PRAMEF genes
9      NBP     Neuroblastoma breakpoint family
7      POT     POTE ankyrin domain family
6      DEF     Defensins
5      OR2     Olfactory receptors
5      MUC     Mucins
5      KRT     Keratins
4      WAS     WAS protein family homolog
4      ANK     Ankyrins
3      TRI     Tripartite motif containing
3      OR1     Olfactory receptors
3      FRG     FSHD region gene
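
For anyone who wants to run the same kind of tally on their own candidate list, here is a minimal sketch in Python. The function name, the de-duplication step, and the example symbols are my assumptions for illustration; this is not the script or the gene list behind the table above.

```python
from collections import Counter


def prefix_counts(gene_symbols, prefix_len=3):
    """Count how often each leading substring occurs among unique gene symbols."""
    # Assumption: normalize case and de-duplicate, so a gene seen in
    # several families is only counted once.
    unique = {s.strip().upper() for s in gene_symbols if s.strip()}
    return Counter(symbol[:prefix_len] for symbol in unique)


# Illustrative input only; not the actual candidate list from the analysis.
candidates = ["MUC4", "MUC16", "MUC4", "TTN", "USH2A", "GOLGA8E"]

for prefix, count in prefix_counts(candidates).most_common():
    print(count, prefix)
```

A three-character prefix is a blunt instrument (it lumps unrelated genes that happen to share letters), but it is a quick way to spot gene families that recur suspiciously often.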

The Danger of Filtering

Look at this list, and then think about how many papers you’ve read in which the “pathway analysis” of gene hits came up with olfactory or immune system pathways. Simply put, there’s a lot of protein sequence variation necessary for our sense of smell and our immune system, and underlying that is a great deal of genetic variation.

For rare Mendelian disorders, most of these genes are unlikely to harbor the causal variant. So it’s very tempting to simply filter them out and look at what’s left. There’s an inherent danger in that, however, since many of them are real genes that encode important proteins, and thus they could easily harbor variants that contribute to genetic disease.
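
One way to hedge against that danger is to flag the usual suspects and push them down the list instead of deleting them. The sketch below shows the idea in Python; the patterns, function names, and example symbols are hypothetical choices of mine, not a vetted blacklist, and any real list would need curation.

```python
import re

# Hypothetical "usual suspect" patterns, for illustration only.
SUSPECT_PATTERNS = [
    r"^LOC\d+$",          # uncharacterized LOC genes
    r"^MUC\d",            # mucins
    r"^OR\d+[A-Z]+\d+",   # olfactory receptors (does not match ORC1)
    r"^PRAMEF\d",         # PRAMEF genes
    r"^NBPF\d",           # NBPF genes
    r"^(TTN|USH2A)$",     # very large genes
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS))


def annotate(genes):
    """Pair each gene symbol with a flag saying whether it looks like a usual suspect."""
    return [(gene, bool(SUSPECT_RE.search(gene))) for gene in genes]


def rank(genes):
    """Put unflagged genes first; the sort is stable, so input order is kept within each group."""
    return sorted(annotate(genes), key=lambda pair: pair[1])


for gene, suspect in rank(["TTN", "BRCA1", "MUC4", "ORC1"]):
    print(gene, "(flagged)" if suspect else "")
```

Because nothing is thrown away, a flagged gene can still be rescued later if other evidence (segregation, phenotype match, manual review of the alignments) points to it.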

Reducing False Positives from NGS

The next-gen sequencing community is well aware of the challenges that false positives pose for the analysis of sequencing data. Many deterministic and/or statistical methods have been developed to help address them. For example, the significantly mutated gene (SMG) test in the MuSiC package considers gene size and local mutation rates in its assessment of whether a given gene is mutated more often than expected in a cancer dataset. And most variant calling pipelines employ advanced filtering strategies to remove systematic false positives that arise from common sequencing artifacts, like paralogous alignment.
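
To see why gene size matters so much, here is a deliberately simplified sketch in Python. It is not MuSiC’s SMG test; it is a plain Poisson upper-tail calculation that I am substituting to illustrate the principle, and the background rate, cohort size, gene lengths, and mutation counts are all made-up numbers.

```python
import math


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    below = sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(observed)
    )
    return max(0.0, 1.0 - below)


def gene_p_value(observed, coding_length_bp, n_samples, rate_per_bp):
    """Chance of seeing at least `observed` mutations given gene size and background rate."""
    expected = coding_length_bp * n_samples * rate_per_bp
    return poisson_upper_tail(observed, expected)


# Made-up values, for illustration only.
RATE_PER_BP = 2e-6   # assumed background mutations per base per sample
N_SAMPLES = 100

large_gene = gene_p_value(22, 100_000, N_SAMPLES, RATE_PER_BP)  # many hits, huge gene
small_gene = gene_p_value(5, 1_500, N_SAMPLES, RATE_PER_BP)     # few hits, small gene

print("large gene p-value:", large_gene)
print("small gene p-value:", small_gene)
```

Under these assumed numbers, 22 mutations in the very large gene are about what the background rate predicts, while 5 mutations in the small gene are far more than expected. A raw count ranks the large gene first; a size-aware test ranks it last.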

The Need for a Better Reference

Many of the problems we face in NGS analysis will be mitigated by a better human genome reference assembly. This is because many parts of the human genome simply aren’t represented accurately in the reference, causing BWA and most other aligners to map reads from those regions incorrectly. Worse, the aligners may not see anything wrong with such placements, and so they give the aligned reads a high mapping quality.

Of course, improving the human reference genome is no small request. Even with a better assembly in hand, releasing it, seeing it adopted by the NGS community, and updating all of our annotations of the human genome to its coordinate system requires a huge amount of work and time. In the end, however, it will give us better analysis results, and improve our ability to understand the genetic basis of human disease.
