The analysis of NGS data comes with many challenges — data management, read alignment, variant calling, etc. — that the bioinformatics community has tackled with some success. Today I want to discuss another critical component of analysis that remains an unsolved problem: annotation of genetic variants. This process, in which we try to predict the likely functional impact of individual sequence changes, is crucial for downstream analysis. Virtually every type of genetic study — family studies of rare disorders, case-control studies, population genetics surveys — relies on annotation to identify the variants that are most likely to influence a phenotype.
A paper currently in pre-print at Genome Medicine reports that the choice of transcripts and software has a large effect on variant annotation. I’ll talk about some of their findings as part of a wider discussion of the variant annotation problem.
Annotation Challenges and Complexities
Even in the protein-coding portions of the genome, where we know the most about gene structure and function, predicting the impact of a single base change is not always straightforward. Here are a few of the reasons why:
- Multiple isoforms. The ENCODE consortium’s extensive RNA sequencing revealed that the average protein-coding gene has something like 5 different isoforms with different transcription starts/stops or exon combinations. It’s very difficult to predict which of these will be active in the cell type at the time point of interest for a given disease.
- Overlapping genes. Even if you could handle the isoforms, there are still going to be variants that affect two or more different genes. The two genes might share an exon, or it could be one gene’s exon and another’s promoter. This one-to-many relationship of variants to genes can be problematic in the many downstream pipelines that expect exactly one annotation per variant.
- Competing annotation databases. There are at least three widely used annotation databases (ENSEMBL, RefSeq, and UCSC) that provide a set of human transcripts for annotation purposes. Their minimum evidence requirements and curation procedures differ, so the transcript sets they provide are not the same. RefSeq release 57 (REFSEQ) has 105,258 human transcripts, while ENSEMBL 69 (EMBL) has nearly twice as many (208,677).
- Ranking procedures. Even a variant in a single gene with one isoform can have multiple annotations: it could be a synonymous change that’s also in a splice site, or a nonsynonymous variant that disrupts the stop codon. Which annotation should be reported? “All of them” is too easy an answer. At some point, downstream analysis may require users to make a choice.
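To make the ranking problem concrete, here is a minimal Python sketch of collapsing several annotations for one variant into a single reported consequence. The severity order, the consequence labels, and the transcript names are all illustrative assumptions; this is not the ranking used by ANNOVAR, VEP, or any other tool.

```python
# Hypothetical severity order, most damaging first. Real tools define their own.
SEVERITY_ORDER = [
    "frameshift",
    "nonsense",
    "nonstop",
    "splice_site",
    "missense",
    "synonymous",
    "utr",
    "intronic",
    "intergenic",
]
RANK = {label: i for i, label in enumerate(SEVERITY_ORDER)}


def most_damaging(annotations):
    """Pick one (transcript, consequence) pair from a list of annotations.

    The pair with the most severe consequence wins. Labels missing from
    SEVERITY_ORDER sort last, and ties keep the first annotation seen.
    Returns None for an empty list.
    """
    if not annotations:
        return None
    return min(annotations, key=lambda a: RANK.get(a[1], len(SEVERITY_ORDER)))


# One variant hitting two isoforms of one gene plus an overlapping gene.
# Transcript names are made up for illustration.
example = [
    ("GENE1-tx1", "synonymous"),
    ("GENE1-tx2", "splice_site"),
    ("GENE2-tx1", "utr"),
]
best = most_damaging(example)
print(best)
```

Note what the collapse throws away: the synonymous call on the first isoform and the UTR hit in the second gene both disappear, and a different severity order would report a different answer for the same variant.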
Comparing Annotation Databases and Software Tools
In the paper I mentioned, McCarthy et al. analyzed 80 million genetic variants (SNPs and small indels) obtained from whole-genome sequencing of 274 individuals. We don’t know much about ancestry, but the cohort includes 80 patients with immune disease, 151 from families with Mendelian disorders (mostly trios), and 45 from cancer studies (germline DNA only). The authors compared variant annotations from two different tools (ANNOVAR and VEP) using the REFSEQ or EMBL transcript databases.
I’ve discussed this paper with a number of colleagues, and we share some concerns about how the comparison was conducted. Even so, I think that the work highlights some of the important differences between these tools and databases. If I distill it down to what I consider the highlights:
- ANNOVAR annotation of 80 million variants using either REFSEQ or EMBL transcripts returned matching annotations about 84% of the time.
- However, for variants considered “loss of function” (LOF: missense, nonsense, nonstop, frameshift, splice site), the concordance was only 44%.
- Much of the disagreement can be attributed to EMBL having twice as many transcripts: it yielded more exonic annotations, and also annotated many variants as UTR or noncoding RNA when REFSEQ considers them noncoding.
- VEP and ANNOVAR software tools did not always agree, even when using the same transcript set. VEP seems to provide better annotation of variants in and around splice sites.
- There are also differences in reporting between the tools: ANNOVAR reports the most-damaging annotation for a variant, whereas VEP tends to report all annotations. This forced the authors to apply a ranking system to VEP results in order to make comparisons, and that likely caused some mismatches as well.
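A concordance figure like the ones above comes down to a simple calculation: of the variants annotated by both sources, what fraction received the same label? The Python sketch below shows one way to compute it, overall and for a subset of categories. The variant keys, labels, and toy data are invented, and the rule for deciding which variants belong to the subset (either source assigns a category in the set) is an assumption rather than the paper's stated method.

```python
def concordance(calls_a, calls_b, categories=None):
    """Fraction of shared variants given the same annotation by both sources.

    calls_a, calls_b: dicts mapping a variant key to one annotation label.
    categories: optional set of labels; if given, only variants that either
    source places in one of these categories are counted.
    Returns None when no variants qualify.
    """
    shared = calls_a.keys() & calls_b.keys()
    if categories is not None:
        shared = {
            v for v in shared
            if calls_a[v] in categories or calls_b[v] in categories
        }
    if not shared:
        return None
    matches = sum(1 for v in shared if calls_a[v] == calls_b[v])
    return matches / len(shared)


# Toy data: the same five variants annotated against two transcript sets.
refseq_calls = {
    "chr1:1000A>G": "missense",
    "chr1:2000C>T": "synonymous",
    "chr2:3000G>A": "intronic",
    "chr2:4000T>C": "intergenic",
    "chr3:5000delA": "frameshift",
}
ensembl_calls = {
    "chr1:1000A>G": "missense",
    "chr1:2000C>T": "synonymous",
    "chr2:3000G>A": "splice_site",
    "chr2:4000T>C": "utr",
    "chr3:5000delA": "frameshift",
}

# Category set taken from the LOF definition quoted in the list above.
LOF = {"missense", "nonsense", "nonstop", "frameshift", "splice_site"}

overall = concordance(refseq_calls, ensembl_calls)
lof_only = concordance(refseq_calls, ensembl_calls, categories=LOF)
print(f"overall: {overall:.2f}, LOF subset: {lof_only:.2f}")
```

The subset rule matters: counting only variants that both sources call LOF would hide exactly the disagreements of interest, so any reported concordance should state which rule was used.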
Is There A Right Answer?
It needs to be said that comparative studies like these are extremely difficult to carry out. It’s easy to point out the flaws, but we should still applaud the authors, who put in a major effort to help us better understand how annotations can differ. Variant annotation is much like variant detection, in that the quality of the results depends on:
- The software tools (e.g. VEP vs. ANNOVAR, VarScan vs. GATK) and their underlying algorithms
- The quality of the input data (e.g. read alignments for variant calling, transcript sets for annotation)
It pains me to say this, but there are limits to what we can do computationally, and we’ll almost certainly need experimental data to determine the right answer. For variant detection, that might be validation of variant calls on an orthogonal platform. For variant annotation, that might mean RNA-Seq data or proteomics approaches. This is a hard problem to solve, and these are the regions of the genome that we probably know best.
Imagine what it will take to accurately annotate variants in regulatory and noncoding regions of the genome.
Davis J McCarthy, Peter Humburg, Alexander Kanapin, Manuel A Rivas, Kyle Gaulton, The WGS500 Consortium, Jean-Baptiste Cazier and Peter Donnelly (2014). Choice of transcripts and software has a large effect on variant annotation. Genome Medicine, 6(26). doi:10.1186/gm543