In my last post, I reviewed a genome-wide association study highlighting the importance of rare genetic variants in complex disease, specifically age-related macular degeneration (AMD). Notably, that GWAS was conducted using a custom high-throughput SNP array combining classic GWAS variants (tag SNPs), a catalog of known protein-altering variants (exome chip), and several custom variant sets based on prior studies of AMD:
- All variants correlated with replicated GWAS hits for AMD
- Protein-altering variants within 500 kb of 22 “index SNPs” uncovered in targeted sequencing of GWAS loci
- Virtually all known variants in ABCA4 (in which recessive-acting mutations cause Stargardt disease), independent of consequence
- Predicted cysteine-altering substitutions in TIMP3, because the known cysteine mutations cause an AMD-like phenotype.
Altogether, the authors examined 440,000 unique variants in more than 43,000 samples (cases & controls). The genotyped markers accounted for 47% of variability in advanced AMD risk. Some of the associated variants were quite rare (MAF < 1%), suggesting that genotyping studies like this are well powered to detect associations even at allele frequencies below one percent. That leads some researchers to ask a difficult question:
Why Do We Need Sequencing?
Despite the plummeting costs afforded by newer instruments, sequencing studies remain far more expensive than genotyping studies: exome sequencing costs 3-5x more, and whole-genome sequencing costs 15-20x more. For genetic studies of common complex disease, many researchers now consider 10,000 samples the absolute minimum. A cohort like the one in the AMD study (44,000 samples) probably costs $2.2 million to genotype, compared to sequencing costs of $9.7 million (exome) to $45 million (whole genome).
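As a back-of-the-envelope check on those totals (the per-sample prices below are my own illustrative assumptions, chosen only to be consistent with the figures quoted above, not actual vendor quotes):

```python
# Back-of-the-envelope study-cost comparison for a 44,000-sample cohort.
# Per-sample prices are illustrative assumptions, not real quotes; they
# are picked to roughly reproduce the totals mentioned in the text.
SAMPLES = 44_000

COST_PER_SAMPLE = {
    "genotyping": 50,   # assumed ~$50/sample on a custom SNP array
    "exome": 220,       # assumed ~4.4x genotyping (within the 3-5x range)
    "genome": 1_000,    # assumed ~20x genotyping (within the 15-20x range)
}

totals = {platform: SAMPLES * price for platform, price in COST_PER_SAMPLE.items()}

for platform, total in totals.items():
    print(f"{platform:>10}: ${total / 1e6:.1f}M")
```

Under these assumptions the gap is stark: roughly $2.2M for genotyping versus $9.7M for exomes and $44M for whole genomes.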
Sure, you could sequence fewer samples, but then you lose the power to detect association in the lowest-frequency variant classes.
Despite these economic challenges, I believe there are several strong arguments for sequencing rather than genotyping.
1. Rare and private variants
No matter how comprehensive a SNP chip might be, the design still relies on known variant positions. The incredible growth of databases like dbSNP — fueled by large-scale discovery sequencing studies — certainly provides millions of variants to choose from. Yet the fact remains that 2-5% of genetic variants in an individual genome are novel with respect to public databases. These are generally very rare variants, which is precisely why they haven't been found yet. They might be private to a family (inherited) or even to an individual (de novo). SNP arrays will always miss this class of variants, because their positions and alleles aren't known before the experiment.
2. Large-scale variation
The argument for SNP arrays also conveniently ignores larger genomic variants: insertions, deletions, duplications, inversions, and other rearrangements. While far less prevalent than SNPs, structural variants (SVs) can affect more bases in an individual’s genome simply because of their size. These are generally not amenable to high-throughput array designs because of their size, and the imprecise nature of their boundaries. Although common SVs may be tagged reasonably well by SNP markers, the rare genetic variants will not be.
A significant proportion of SVs affect known protein-coding genes by altering coding sequence or gene dose. Although SV detection by short-read sequencing is by no means a solved problem, this class of variation is missed entirely by a SNP chip design.
3. Regulatory sequences and functional elements
A fundamental weakness of exome chip designs (and exome sequencing, for that matter) is their emphasis on known genes. Undoubtedly, many (if not most) of the variants underlying complex phenotypes are located outside of the 1.5% of our genome that codes for proteins. We expect that common variants in such elements will be well interrogated by classic GWAS approaches, but rare variants will not. And we don't yet know enough about the regulatory regions of the genome to select variants for a custom array.
4. Aggregation Tests
Aggregation tests (sometimes called burden tests) were developed to identify genetic associations driven by variants that are individually too rare to reach statistical significance. The theory and approaches of aggregation testing are too broad a topic to cover here, but the general concept is this: by grouping individual rare variants into a biological unit — most often, a gene or exon — it's possible to test the super-genotype (i.e. "has a rare variant in gene A") for association. Short of examining hundreds of thousands of samples, this is the only way to identify very rare trait-associated variants.
Although they rely on certain assumptions, such as the ability to define which variants truly affect gene function, aggregation tests have another thing going for them. The collective association of multiple independent variants in one locus strongly suggests that the locus itself is the functional element responsible for the association. In other words, while common associated variants (tag SNPs) only tell us the region of the genome where variation seems to affect a disease, aggregation tests identify the potential drug target.
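To make the collapsing idea concrete, here is a minimal sketch on toy data. The genotype matrix, the case/control split, and the crude odds-ratio summary are all invented for illustration; real burden tests (e.g. CMC, SKAT) add variant weighting and proper significance testing.

```python
# Minimal sketch of a burden-style collapsing test on invented toy data.
# Rows are rare variants in a hypothetical gene; columns are samples
# (0 = no rare allele, 1 = carries a rare allele at that site).
GENOTYPES = [
    [1, 0, 0, 0, 0, 0, 0, 0],  # variant 1
    [0, 1, 0, 0, 0, 0, 0, 0],  # variant 2
    [0, 0, 1, 0, 0, 1, 0, 0],  # variant 3
]
N_CASES = 4  # assume the first 4 samples are cases, the last 4 controls

def collapse_burden(genotypes):
    """Collapse per-variant genotypes into one 'super-genotype' per sample:
    1 if the sample carries a rare allele at ANY site in the gene, else 0."""
    return [1 if any(site) else 0 for site in zip(*genotypes)]

carriers = collapse_burden(GENOTYPES)
cases, controls = carriers[:N_CASES], carriers[N_CASES:]

# 2x2 table: carrier status vs. case/control status
a, b = sum(cases), len(cases) - sum(cases)          # cases: carriers / non-carriers
c, d = sum(controls), len(controls) - sum(controls) # controls: carriers / non-carriers
odds_ratio = (a * d) / (b * c)  # crude odds ratio for the collapsed genotype

print(f"carriers={carriers}, odds ratio={odds_ratio}")
```

Note that no single variant here appears in more than two samples, yet the collapsed carrier status (3 of 4 cases vs. 1 of 4 controls) yields a testable signal, which is exactly the point of aggregation.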
You Don’t Know What You’re Missing
All of this points to the heart of the problem with choosing genotyping over sequencing: many of the most interesting classes of variation are overlooked. These may or may not play a role in disease risk, but you can't know, because the GWAS didn't interrogate them. This is especially worrisome for diseases in which the genetic architecture of risk remains poorly understood. A well-powered GWAS that produces no hits might mean that genetic variation has little influence on the phenotype. But it might also mean that rare or non-SNP variants govern the genetic basis of disease.
We won’t know until we survey them all, and genome sequencing remains the only way to do that.