Next-generation sequencing has helped elucidate the genetic basis of numerous inherited diseases. Single-gene Mendelian disorders are the lowest-hanging fruit for such discoveries: they’re rare, they run in families, and they are presumably caused by a single mutation shared by all affected individuals. Getting the samples is usually the hard part; once that’s done, exome sequencing and a fairly straightforward analysis typically narrow the list of suspects to a few hundred variants.
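To make that filtering step concrete, here is a minimal Python sketch of the usual logic: keep variants that are rare, protein-altering, carried by every affected family member, and absent from unaffected ones. The `Variant` record, its field names, the consequence labels, and the 1% frequency cutoff are all assumptions for illustration rather than the output of any particular pipeline, and the segregation test assumes a simple dominant model.

```python
from dataclasses import dataclass

# Hypothetical record type; field names are illustrative, not from any real tool.
@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str          # e.g. "missense", "synonymous", "nonsense"
    population_freq: float    # allele frequency in a reference panel
    carriers: frozenset       # IDs of family members carrying the variant

# Assumed set of consequence labels treated as protein-altering.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}

def candidate_variants(variants, affected, unaffected, max_freq=0.01):
    """Keep rare, protein-altering variants carried by every affected
    sample and by no unaffected sample (a simple dominant model)."""
    affected, unaffected = set(affected), set(unaffected)
    return [
        v for v in variants
        if v.population_freq <= max_freq          # rare in the population
        and v.consequence in PROTEIN_ALTERING     # changes the protein
        and affected <= v.carriers                # shared by all affected
        and not (unaffected & v.carriers)         # absent from unaffected
    ]

# Toy data: only GENE_A passes every filter.
variants = [
    Variant("GENE_A", "missense", 0.0, frozenset({"a1", "a2"})),
    Variant("GENE_B", "synonymous", 0.0, frozenset({"a1", "a2"})),
    Variant("GENE_C", "missense", 0.20, frozenset({"a1", "a2", "u1"})),
    Variant("GENE_D", "nonsense", 0.001, frozenset({"a1"})),
]
hits = candidate_variants(variants, affected=["a1", "a2"], unaffected=["u1"])
print([v.gene for v in hits])
```

Every condition in the list comprehension encodes an assumption about what a causal variant looks like, which is worth keeping in mind when the filter comes back empty.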
If identifying the genetic cause of Mendelian disorders is that easy, why aren’t we finding more of them? Many genetic diseases remain unsolved, despite access to large families and high-throughput sequencing technologies. Some have even been traced, via linkage mapping, to part of a chromosome. And yet, even after you sequence the exomes of several family members (affected and unaffected), the causal variant or gene remains elusive. How can this be? I offer six possible explanations.
- The causal variant was found, but deemed non-pathogenic. Analysis frameworks like the one outlined above make numerous assumptions based on the expected properties of a disease-causing variant. We anticipate rare, protein-altering mutations at conserved residues of known protein-coding genes. What about a synonymous SNP that won’t change the amino acid, but requires a rare tRNA? That might easily reduce protein levels. Regulatory variants affecting splicing, transcription, or translation might similarly contribute to a phenotype.
- The causal variant was found, but exists in a public database. Variants already catalogued in public databases are usually filtered out as presumed benign polymorphisms, and there are many ways a causal variant can end up there. Someone reported a false positive. Or someone sequenced an affected individual without knowing it. Or the variant came from a mutation database, such as OMIM, which contains disease-causing variations.
- The causal variant was missed by your variant callers. Exome sequencing typically prioritizes SNPs and small indels, despite the fact that other types of variation (large indels, inversions, and other structural variants) exist, affect coding sequences, and have been linked to disease. Some of these can be found with exome sequence data, but you have to be looking for them.
- The causal variant wasn’t covered with sufficient sequence depth. Current exome reagents are able to capture about 90% of the known protein-coding exome. Certain regions, especially repetitive or GC-rich sequences, are under-represented. So there’s roughly a 1-in-10 chance that even an obvious disease-causing variant in a known gene won’t be captured.
- The region harboring the causal variant wasn’t targeted for sequencing. It might be an as-yet undiscovered gene, an alternative exon, or a regulatory element in a noncoding region. You won’t even have a chance to ask “Could this be the variant?” because the region simply wasn’t targeted, so the variant was never sequenced. This possibility nags at most of us who rely on exome sequencing.
- One of your assumptions is wrong. Even with perfect sequencing and analysis, success relies on a number of assumptions being correct: mode of inheritance, high penetrance, accurate diagnosis. Maybe you thought it was X-linked but it’s actually autosomal recessive. Or a digenic disorder. The diagnosis could be incorrect for either cases or controls. Family relationships could be “inaccurate”, to put it politely.
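Several of these failure modes can at least be audited after the fact. One possible approach, sketched below in Python, is to record why each call inside a linked interval would be discarded instead of silently dropping it, so that synonymous, database-listed, or thinly covered variants can be reviewed by hand. The `Call` record, the reason strings, and the depth threshold of 10 are assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical record for a variant call inside the linked interval.
@dataclass
class Call:
    name: str
    consequence: str     # e.g. "missense", "synonymous", "intronic"
    in_public_db: bool   # already catalogued in a public variant database
    depth: int           # read depth at the site

# Assumed set of consequence labels treated as protein-altering.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}

def exclusion_reasons(call, min_depth=10):
    """List every reason a typical filter would drop this call.
    An empty list means the call would survive filtering."""
    reasons = []
    if call.consequence not in PROTEIN_ALTERING:
        reasons.append("not protein-altering")
    if call.in_public_db:
        reasons.append("in public database")
    if call.depth < min_depth:
        reasons.append("low depth")
    return reasons

# Toy calls: one clean candidate and two that a strict filter would hide.
calls = [
    Call("var1", "missense", False, 40),
    Call("var2", "synonymous", False, 35),
    Call("var3", "missense", True, 6),
]
audit = {c.name: exclusion_reasons(c) for c in calls}
for name, reasons in audit.items():
    print(name, reasons or "kept")
```

An audit like this only helps with variants that were actually called. It does nothing for variants in untargeted regions, variant types the caller never looks for, or a wrong assumption about inheritance or diagnosis.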
Ultimately, exome sequencing itself may be the root of many negative results. Whole-genome sequencing and newer, better algorithms will eventually be required to find the genetic basis of many (even Mendelian) diseases.
Li MX, Gui HS, Kwan JS, Bao SY, Sham PC (2012). A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases. Nucleic Acids Research, 40(7). PMID: 22241780