The whole-genome sequencing of a Yoruban male on ABI SOLiD technology has at last been published in early access at Genome Research. A year ago, this might have merited a Nature or Science publication. That window seems to have closed for whole-genome sequencing of a single, undiseased individual. By my count, this is the sixth published individual genome sequenced on next-gen platforms. I begin to wonder if this ABI SOLiD paper is too little, too late.
Well, it’s probably not too little. The advance access PDF is over 60 pages, and I must admit that the authors did a substantial amount of work to identify, characterize, and discuss the sequence variation in this genome. Despite a relatively modest coverage level (18x), the combination of paired-end sequencing and two-base encoding made it possible to simultaneously detect SNPs, small indels (3-11 bp), large indels (30 bp-97 kbp), and structural variants.
Two-Base Encoding in Colorspace for Calling SNPs
My central interest, however, is how much the two-base encoding helps in distinguishing SNPs from sequencing errors. The ABI SOLiD study identified ~3.8 million SNPs in the genome, compared to 4.1 million SNPs identified by Illumina sequencing of the same individual, an anonymous African male from the HapMap collection. However, the ABI study did it with less than half the coverage (18x compared to 40x), and called a greater fraction of novel-to-dbSNP SNPs (19% compared to 12.7%). Experimental validation confirmed 280 of 299 (94%) novel SNPs, suggesting that most of these variants are real.
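For readers who haven't worked in colorspace, here is a minimal sketch of why two-base encoding helps. It is my own illustration, not the authors' SNP caller. Each color encodes a pair of adjacent bases, so a true SNP in the interior of a read changes two adjacent colors in a constrained way, while an isolated miscalled color changes only one. The function names (`encode_colors`, `classify_read`) and the toy sequences are hypothetical.

```python
# Two-bit codes for the bases; a dinucleotide's color is the XOR of its two codes.
# This reproduces the standard SOLiD color table (e.g. AA=0, AC=1, AG=2, AT=3).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_colors(seq):
    """Translate a base sequence into its colorspace representation."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

def classify_read(ref_seq, read_colors):
    """Classify the color mismatches between a read and the reference.

    Simplifying assumptions: the read aligns to ref_seq without gaps, and
    only interior positions are considered (a SNP at the very end of a
    read touches a single color and cannot be told apart from an error).
    """
    ref_colors = encode_colors(ref_seq)
    diffs = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        # One isolated color change is not consistent with any single-base change.
        return "likely sequencing error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # A real SNP leaves the flanking bases unchanged, so the XOR of the
        # two affected colors must equal the XOR of the reference colors.
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "candidate SNP"
    return "inconsistent"

reference = "ACGTACGT"
snp_read = encode_colors("ACGCACGT")   # T>C substitution at the fourth base
error_read = encode_colors(reference)
error_read[3] = 0                      # a single miscalled color

print(classify_read(reference, snp_read))
print(classify_read(reference, error_read))
```

The point of the XOR check is that only a fraction of adjacent two-color changes are compatible with a single-base substitution, which is what lets a caller discount most random color errors even at modest coverage.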
The authors performed a rather elegant comparison with HapMap data for this individual, comparing not only SNP genotypes but also the phase of those genotypes, which they inferred from mate pair information. Some 21.74% of HapMap-phased heterozygotes were covered by at least one ABI read pair, and the phase agreement was 98.95%. Thus, the read-pairing strategy employed by ABI can serve to produce more accurate and complete haplotyping of the sequenced individual. I find this side benefit of whole-genome sequencing to be very valuable, given the huge amount of money and effort spent to build the human haplotype map.
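To make the two numbers above concrete, here is a rough sketch of how a coverage fraction and a phase agreement rate could be computed. The paper's actual procedure may well differ. I'm assuming that phase is checked on pairs of heterozygous sites, that a pair counts as covered when at least one read pair spans both sites, and that it agrees when most spanning read pairs carry one of the two HapMap haplotypes. The function name, data layout, and site identifiers are all hypothetical.

```python
def phase_concordance(hapmap_haplotypes, read_pair_alleles):
    """Return (covered_fraction, agreement_fraction) for phased site pairs.

    hapmap_haplotypes: {(site_a, site_b): {(allele_a, allele_b), ...}}
        the two haplotypes HapMap assigns to each pair of het sites.
    read_pair_alleles: {(site_a, site_b): [(allele_a, allele_b), ...]}
        alleles seen together on individual read pairs spanning both sites.
    """
    covered = 0
    agree = 0
    for site_pair, haplotypes in hapmap_haplotypes.items():
        observations = read_pair_alleles.get(site_pair, [])
        if not observations:
            continue  # no read pair spans both sites
        covered += 1
        supporting = sum(1 for obs in observations if obs in haplotypes)
        # Majority vote: more than half of the read pairs match HapMap phase.
        if supporting * 2 > len(observations):
            agree += 1
    covered_fraction = covered / len(hapmap_haplotypes) if hapmap_haplotypes else 0.0
    agreement_fraction = agree / covered if covered else 0.0
    return covered_fraction, agreement_fraction

# Toy data: four phased site pairs, two of them spanned by read pairs.
hapmap = {
    ("s1", "s2"): {("A", "G"), ("C", "T")},
    ("s2", "s3"): {("G", "A"), ("T", "C")},
    ("s3", "s4"): {("A", "T"), ("C", "G")},
    ("s4", "s5"): {("T", "C"), ("G", "A")},
}
reads = {
    ("s1", "s2"): [("A", "G"), ("C", "T"), ("A", "G")],  # matches HapMap phase
    ("s3", "s4"): [("A", "G"), ("A", "G")],              # opposite phase
}
print(phase_concordance(hapmap, reads))
```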
Lots of Indels and Structural Variants
Perhaps the greatest strength of this study is that it represents, to my knowledge, the most extensive and detailed effort to characterize indels/SVs from WGS of a single individual. Small intra-read indels (<=13 bp) had a high dbSNP concordance (67%), perhaps aided by the terminating chemistry and two-base encoding of ABI SOLiD. Using mate pair information to identify discordant insert clones, the authors called 1,515 insertions (30-1,287 bp in size) and 4,075 deletions (86-96,957 bp in size), many of which were also detected in the Venter, Watson, and CHB (?) genomes.
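The general idea behind calling large indels from discordant mate pairs can be sketched in a few lines. This is a generic illustration rather than the authors' pipeline. The function name, the three-standard-deviation cutoff, and the library sizes in the example are assumptions of mine.

```python
def classify_mate_pair(mapped_span, expected_mean, expected_sd, n_sd=3.0):
    """Flag a mate pair whose mapped span deviates from the library insert size.

    mapped_span:   distance between the two mates on the reference (bp)
    expected_mean: mean insert size of the library (bp)
    expected_sd:   standard deviation of the insert size (bp)
    n_sd:          hypothetical cutoff, in standard deviations

    Returns (call, estimated_size_bp).
    """
    deviation = mapped_span - expected_mean
    if deviation > n_sd * expected_sd:
        # Mates map too far apart: the sample lacks sequence that the
        # reference has between them, i.e. a deletion.
        return ("deletion", deviation)
    if deviation < -n_sd * expected_sd:
        # Mates map too close together: the sample carries extra sequence
        # that is absent from the reference, i.e. an insertion.
        return ("insertion", -deviation)
    return ("concordant", 0)

# Example with a hypothetical 1,500 bp library (sd 100 bp).
print(classify_mate_pair(5000, 1500, 100))
print(classify_mate_pair(700, 1500, 100))
print(classify_mate_pair(1550, 1500, 100))
```

One consequence of this approach is that a spanning mate pair can only reveal insertions smaller than the library's insert size, whereas deletions have no such ceiling. That may be part of why the reported insertions top out near 1.3 kbp while the deletions run to almost 97 kbp. In practice a caller would also require several independent discordant pairs to support each event.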
Cross-WGS Comparisons: Key Illumina Study Ignored
In a direct comparison, 20% of the SNPs identified in the ABI study were also seen in Watson, Venter, and CHB genomes. Fewer structural variants were shared between genomes, but this very well may be related to the difficulty in calling such types of variation on different platforms, rather than true biological diversity. Here’s something I find both irritating and amusing. The ABI study authors made no comparisons whatsoever to the results from the Bentley et al. (Illumina WGS) study, which is surprising since BOTH STUDIES SEQUENCED THE SAME INDIVIDUAL. I refer you to:
“We sequenced the genome of a male Yoruba from Ibadan, Nigeria (YRI, sample NA18507).” [Bentley et al.]
“We compared the SNPs and structural variations identified in NA18507 to those found in the Venter (Levy et al. 2007), Watson (Wheeler et al. 2008) and YH (Wang et al. 2008) genomes.” [McKernan et al.]
I’m sorry, but when you do whole-genome sequencing on an individual that’s already been sequenced on a different technology, you have to do that comparison. Whatever their reasons, the ABI study authors’ decision to blatantly avoid comparisons with the Bentley et al. results is outright negligence.
Functional Consequences of Genetic Variation
The authors embarked on a long exploration of the putative phenotypic impact of variants in NA18507 using the OMIM and HGMD databases along with a comprehensive literature review. They developed a pipeline to map the poorly formatted OMIM entries to genomic coordinates, and successfully obtained 9,239 uniquely mapped nonsynonymous OMIM variants. I’d hoped for a supplemental table of these, or better yet that the results might be shared back with OMIM, but alas. No dice. NA18507 is apparently a carrier for over 50 disease-associated alleles, including five which appear to be homozygous. These are all listed in supplemental tables 4 and 5; however, no supplemental data appears to be available at present.
There were 2,477 large indels in NA18507 that potentially disrupted genes. Among 2,015 genes affected, some 303 were disease-associated genes from OMIM, HGMD, or the literature review. The authors conclude “we can see a trend for disruption events to cluster around genes, but no clear preference to cluster around disease genes. Further analysis of these disruption events along with an evaluation of whether an exon is disrupted is warranted.” This is why individual HapMap genomes no longer merit Nature papers. Without a phenotype to study, “further investigation is warranted” is as far as such studies can go to assess the functional impact of many mutations.
Signatures of Natural Selection
All gripes aside, the study did provide evidence of purifying selection, notably an under-representation of damaging nsSNPs, and an under-representation of variation inside exons in general. Using the Panther database, the authors identified several protein families with evidence of purifying selection (fewer than expected damaging nsSNPs): nucleic acid binding proteins, ligases, transferases, transcription factors, and of course kinases. There were also categories over-represented for damaging nsSNPs, which may reflect either higher mutation rates or positive selection. These included G-protein coupled receptors, extracellular matrix glycoproteins, cell adhesion molecules, as well as genes related to olfactory perception. Ah yes, sense-of-smell diversity.
The Outlook for ABI SOLiD
With a high-profile publication of an individual human genome, ABI SOLiD officially joins the ranks of WGS-enabling platforms. In my opinion, they’re a little late to the game. I recall seeing a poster presenting much of this data about a year ago, and even that was after Illumina had taken the lead in whole-genome sequencing. According to a report by Julia Karow on GenomeWeb, SOLiD accounts for just 17% of next-gen sequencers at major genome centers, just ahead of Roche/454 (14%) but well behind Illumina, which claims two-thirds of the market. ABI can’t compete with 454 on read length, and it can’t compete with Illumina on data throughput or market share. In short, SOLiD needs to find a niche, and find it quickly, or this platform will go the way of the dodo.