Today, if you could find someone to do it, you might have your genome sequenced for around $10,000. You would receive a list of your ~3 million genetic variants: predominantly base substitutions (SNPs), but also insertions or deletions (indels) and larger structural changes (e.g. copy number variants). And you would probably have no idea what any of it means.
Genome interpretation, not genome sequencing, is the challenge of the coming decade. Consider this: the time required to sequence a human genome is currently 8-10 days, but the time required to analyze it ranges from months to years. Analysis and interpretation are already more expensive than sequencing when you consider the cost of salaries and computing cycles. Many who are cognizant of this reality see an opportunity there: just look at the number of “genome annotation” startups launched in the last year.
Functional Annotation is Hard
I just attended a talk by Daniel MacArthur, a fellow blogger (at Genetic Future) who’s just starting his own lab at Massachusetts General Hospital. The topic was loss-of-function variants in human genomes, and centered on his recent work in characterizing LOF variants using 1,000 Genomes Project data. Several key lessons from that study have important implications for human genome interpretation.
First, true loss-of-function variants are exquisitely rare, largely because they’re under purifying selection. This means that:
- Candidate LOF variants are enriched for false positives. Indeed, in the LOF study, the false positive rate for LOF variants was unusually high (21%).
- The more common a variant, the less likely it is to be functional. Common LOF variants (>3%) were found in genes that are less conserved, have more paralogues, and participate in fewer protein-protein interactions. In other words, they likely have a minor effect on phenotypic variation.
Second, Daniel and his colleagues uncovered some difficulties in the interpretation of variant annotations. Namely:
- Variant effects depend on other variants. A variant that appears to have a deleterious effect when considered alone might be far less damaging in the context of another nearby variant. For example, the combined effect of a 17 bp frameshift deletion and a 1 bp frameshift deletion in the same exon could be an 18 bp in-frame deletion, which is likely far less damaging. A number of potential LOF variants turn out to be not-so-damaging when considered in such contexts.
- Effects vary by position. While nonsense and frameshift events have the potential to be quite damaging, they’re not created equal. A premature stop codon early in the protein sequence tends to be far more devastating than, say, a premature stop that only removes 5% of the protein’s length.
- Haplotypes matter. Homozygous inactivation of a gene requires that both copies be disrupted. Without assigning variants to haplotypes, we don’t know if one copy of a gene carries two LOF variants (leaving the other copy intact), or if each copy is disrupted by a single variant.
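To make those three caveats concrete, here is a rough Python sketch. It is not the method used in the LOF study; the function names, the signed-length convention for indels, and the 0/1 haplotype labels are all assumptions made up for illustration.

```python
# Illustrative only: hypothetical helpers, not the pipeline from the LOF study.

def restores_frame(indel_lengths_bp):
    """Check whether indels on the SAME exon and SAME haplotype cancel out.

    Lengths are signed (assumed convention): negative = deletion,
    positive = insertion. The reading frame is preserved when the net
    change is a multiple of 3.
    """
    return sum(indel_lengths_bp) % 3 == 0


def fraction_truncated(stop_codon_position, protein_length):
    """Fraction of the protein lost to a premature stop.

    stop_codon_position is 1-based: a stop at position k keeps
    residues 1..k-1 and removes the rest.
    """
    return (protein_length - stop_codon_position + 1) / protein_length


def both_copies_disrupted(lof_haplotypes):
    """True only if LOF variants in a gene fall on both haplotypes.

    lof_haplotypes lists the phased haplotype (0 or 1, assumed labels)
    of each LOF variant in the gene.
    """
    return {0, 1} <= set(lof_haplotypes)


# A 17 bp deletion alone shifts the frame...
print(restores_frame([-17]))          # False
# ...but together with a 1 bp deletion it is an 18 bp in-frame deletion.
print(restores_frame([-17, -1]))      # True

# A stop at codon 951 of a 1,000-residue protein removes only 5%.
print(fraction_truncated(951, 1000))  # 0.05

# Two LOF variants on the same copy leave the other copy intact.
print(both_copies_disrupted([0, 0]))  # False
print(both_copies_disrupted([0, 1]))  # True
```

Note that every one of these checks needs information a flat list of variants doesn’t give you: which variants share an exon, where they fall in the protein, and which copy of the chromosome they sit on.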
Complexity of the Human Genome
One thing that’s become quite apparent since we sequenced the human genome is that it’s incredibly complex. The effect of a genetic variant depends on many factors beyond how it impacts protein structure: where the gene is expressed, when it’s expressed, how it interacts with other genes that themselves may harbor variants, etc. In his talk, Daniel admitted that the loss-of-function study went after the lowest-hanging fruit: truncation variants (nonsense, splice site, and frameshift) in known protein-coding genes.
Other categories of loss-of-function variants (noncoding regulatory variants, large-scale structural variation, etc.) are out there, and will be even harder to interpret. And let’s not forget the effect of environment. Diet, exercise, exposure to carcinogens, and other factors may have a far more dramatic effect on whether or not someone gets a particular disease than their constitutional genetic makeup.
Sure, we can sequence a human genome. But we’re a long, long way from being able to interpret one.
References
MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, 1000 Genomes Project Consortium, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB, & Tyler-Smith C (2012). A systematic survey of loss-of-function variants in human protein-coding genes. Science (New York, N.Y.), 335 (6070), 823-8. PMID: 22344438