At last, some results from Evan Eichler’s SV project! The results of the first phase of the “Human Genome Structural Variation Project” were presented in today’s issue of Nature. I’ve been cognizant of this project for a couple of years and eager for the results, as it is really the first large-scale, sequence-based study of copy number and structural variants. As it happens, our Genome Center played a big role in the sequencing, and two of our researchers (Tina Graves and Rick Wilson) are among the authors.
In fairness, however, I should disclose another thing about the Evan Eichler project.
In 2006, just after three simultaneous papers in Nature Genetics brought structural variation to the forefront, my former lab began working on a grant proposal. In it, we proposed to mine existing trace data (from NCBI’s Trace Archive) for putative structural variants in the human genome. We were developing a sequence-based approach to identify reads spanning insertions, deletions, duplications, inversions, and translocations. It was an ambitious project and the timing was perfect, but, unfortunately, NIH sent our proposal back twice. Unscored. Later, I learned that a group headed by Evan Eichler pretty much locked up U.S. funding for this research in the form of a $40 million grant. With all of the NIH eggs in a single basket, I thought to myself, they had better deliver.
It looks like I won’t be disappointed. Dr. Eichler and colleagues constructed whole-genome libraries of ~1 million fosmids for each of 8 individuals whose samples were used in the HapMap Project. Four were African, two were CEPH European, one was Japanese, and one was Han Chinese. They used the fosmid end sequencing approach (described by Tuzun et al., 2004) in which you sequence the ends of each fosmid and map the sequences to the reference genome. Altogether, about 6.1 million end-sequence pairs (ESPs) were uniquely mapped to the genome. Of these, some 76,767 (~1.26%) were discordant by alignment distance or orientation, indicating a possible underlying structural variant. It’s a big paper (the Supplemental File alone was 57 pages) so I’ll hit you with the take-homes:
- The human genome is still incomplete. The fosmid approach yielded a number of novel sequences not present in the current human genome assembly. The sequences ranged in size from 2 to 130 kbp, were randomly distributed among genic and nongenic regions, and were often (40% of the time) copy-number variable. There’s still more human genome out there, folks.
- Current CNV databases are inflated. The higher resolution of sequence-based approaches served to refine the location of many previously reported CNVs. Generally speaking, CNVs on a haplotype are smaller (in bp) than copy-number-variable regions as a whole, meaning that fewer genes are affected. Array-CGH platforms, which to date are among the most widely used approaches to assess CNV, greatly exaggerate the size of these variants. And fewer than half of the SVs in this study can be adequately genotyped with existing platforms.
- NAHR (non-allelic homologous recombination) is the driving force behind most structural variation. About half of the SVs detected in this study were flanked by segmental duplications. This is nothing new, but it somewhat contradicts the study by Korbel and colleagues last year, which used 454 paired-end sequencing and reported that NAHR-mediated events were rare. Eichler et al. suggest that NAHR’s association with segmental duplications, combined with the difficulty short sequence reads have in resolving such regions, might mean that next-generation ESP approaches are missing a lot of variation. I’m not sure I agree, but that’s what they said.
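
For anyone curious what “discordant by alignment distance or orientation” looks like in practice, here is a rough Python sketch of how a single end-sequence pair might be classified once both ends are mapped. This is my own illustration, not the pipeline from the paper: the `End` class, the `classify_pair` function, the labels, and especially the 32-48 kbp span thresholds are all assumptions chosen for a typical ~40 kbp fosmid insert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class End:
    """One mapped fosmid end read (hypothetical structure for illustration)."""
    chrom: str
    pos: int      # leftmost mapped coordinate on the reference
    strand: str   # "+" or "-"


def classify_pair(a: End, b: End,
                  min_span: int = 32_000,
                  max_span: int = 48_000) -> str:
    """Label an end-sequence pair as concordant or as a putative SV type.

    The span thresholds are assumed values, not the ones used in the study.
    """
    # Ends on different chromosomes cannot come from one contiguous insert.
    if a.chrom != b.chrom:
        return "interchromosomal"

    left, right = sorted((a, b), key=lambda e: e.pos)

    # A normal pair has the left end on "+" and the right end on "-".
    if left.strand == right.strand:
        return "inversion"
    if (left.strand, right.strand) != ("+", "-"):
        return "everted"

    span = right.pos - left.pos
    if span > max_span:
        # Ends map too far apart: the sample lacks sequence the reference has.
        return "deletion"
    if span < min_span:
        # Ends map too close together: the sample carries extra sequence.
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    pair = (End("chr1", 100_000, "+"), End("chr1", 170_000, "-"))
    print(classify_pair(*pair))
```

A real analysis would set the thresholds per library from the observed insert-size distribution and would require several independent fosmids to support the same call before believing it. A single discordant pair is only a hint.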
Overall it seems like an impressive publication. They planned and executed a very careful study, and as a result, we learned quite a bit about the landscape of structural variation in the human genome. Not bad, Dr. Eichler.