At last published in early access at Genome Research is the whole-genome sequencing of a Yoruban male on ABI SOLiD technology. A year ago, this might have merited a Nature or Science publication. That window seems to have closed for whole-genome sequencing of a single, undiseased individual. By my count, this is the sixth published individual genome sequenced on next-gen platforms. I begin to wonder if this ABI SOLiD paper is too little, too late.
Well, it’s probably not too little. The advance access PDF is over 60 pages, and I must admit that the authors did a substantial amount of work to identify, characterize, and discuss the sequence variation in this genome. Despite a relatively modest coverage level (18x), the combination of paired-end sequencing and two-base encoding made it possible to simultaneously detect SNPs, small indels (3-11 bp), large indels (30 bp-97 kbp), and structural variants.
Two-Base Encoding in Colorspace for Calling SNPs
My central interest, however, is how much the two-base encoding helps in distinguishing SNPs from sequencing errors. The ABI SOLiD study identified ~3.8 million SNPs in the genome, compared to 4.1 million SNPs identified by Illumina sequencing of the same individual, an anonymous African male from the HapMap collection. However, the ABI study did it with less than half the coverage (18x compared to 40x), and called a greater fraction of novel-to-dbSNP SNPs (19% compared to 12.7%). Experimental validation confirmed 280 of 299 (94%) of the novel SNPs, suggesting that most of these variants are real.
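For readers who haven’t worked in color space, here is a minimal sketch of why two-base encoding can help. It assumes the commonly described SOLiD scheme in which each color is the XOR of 2-bit codes for two adjacent bases. The function names and toy sequences are mine, not from the paper, and a real caller would also handle read ends, adjacent SNPs, and quality values. Because every interior base contributes to two colors, a true SNP changes two adjacent colors in a constrained way, while an isolated color change is more likely a measurement error.

```python
# Illustrative sketch only: a toy model of two-base (color-space) encoding.
# Assumes the usual 2-bit base codes; names here are hypothetical.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq):
    """Return one color (0-3) per pair of adjacent bases."""
    codes = [BASE_TO_BITS[b] for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def classify(ref, read_colors):
    """Crudely label a read's color differences against a reference.

    Simplification: ignores SNPs at the first or last base, which
    change only one color, and ignores multi-base variants.
    """
    ref_colors = encode_colors(ref)
    diffs = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        return "likely sequencing error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # Changing one base leaves the XOR of its two flanking colors intact.
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "candidate SNP"
    return "ambiguous"


reference = "ACGTAC"
snp_read = encode_colors("ACATAC")      # G-to-A change at the third base
error_read = encode_colors(reference)
error_read[3] ^= 3                      # corrupt a single color call

print(classify(reference, snp_read))
print(classify(reference, error_read))
```

In this toy case the base change alters two neighboring colors and passes the consistency check, while the single corrupted color does not look like any valid single-base change.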
The authors performed a rather elegant comparison with HapMap data for this individual, comparing not only SNP genotypes but also the phase of the genotypes, which they inferred from mate-pair information. Some 21.74% of HapMap-phased heterozygotes were covered by at least one ABI read pair, and the phase agreement was 98.95%. Thus, the read-pairing strategy employed by ABI can serve to produce more accurate and complete haplotyping of the sequenced individual. I find this side benefit of whole-genome sequencing to be very valuable, given the huge amount of money and effort spent to build the human haplotype map.
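To make the two numbers above concrete, here is a rough sketch of how a coverage fraction and a phase-agreement fraction could be computed. This is not the authors’ pipeline: the "cis"/"trans" labels, the majority vote, the per-pair bookkeeping, and the toy data are all assumptions of mine for illustration.

```python
# Illustrative sketch only: phase concordance between a reference panel
# and read-pair observations. All names and data are hypothetical.
from collections import Counter


def infer_phase(observations):
    """Majority vote over read pairs spanning two heterozygous SNPs.

    Each observation is (allele_at_snp1, allele_at_snp2), with alleles
    labeled "ref" or "alt". Same labels suggest "cis", mixed suggest "trans".
    Returns None when there is no data or the vote is tied.
    """
    votes = Counter("cis" if a == b else "trans" for a, b in observations)
    ranked = votes.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def phase_stats(panel_phase, pair_observations):
    """Return (fraction of panel pairs covered, agreement among covered)."""
    inferred = {}
    for pair_id, obs in pair_observations.items():
        call = infer_phase(obs)
        if pair_id in panel_phase and call is not None:
            inferred[pair_id] = call
    covered = len(inferred)
    agree = sum(panel_phase[p] == c for p, c in inferred.items())
    coverage = covered / len(panel_phase)
    agreement = agree / covered if covered else float("nan")
    return coverage, agreement


panel = {"p1": "cis", "p2": "trans", "p3": "cis", "p4": "trans"}
reads = {
    "p1": [("alt", "alt"), ("ref", "ref")],
    "p2": [("ref", "alt")],
    "p3": [("ref", "alt")],   # disagrees with the panel
}
coverage, agreement = phase_stats(panel, reads)
print(f"covered: {coverage:.2%}, agreement: {agreement:.2%}")
```

The point of the sketch is only that coverage and agreement are separate quantities: agreement is measured over the subset of sites that read pairs actually span.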
Lots of Indels and Structural Variants
Perhaps the greatest strength of this study is that it represents, to my knowledge, the most extensive and detailed effort to characterize indels/SVs from WGS of a single individual. Small intra-read indels (<=13 bp) had a high dbSNP concordance (67%), perhaps helped by the terminating chemistry and two-base encoding of ABI SOLiD. Using mate-pair information to identify discordant insert clones, the authors called 1,515 insertions (30-1,287 bp in size) and 4,075 deletions (86-96,957 bp in size), many of which were also detected in the Venter, Watson, and CHB (?) genomes.
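The general idea behind calling indels from discordant mate pairs can be sketched in a few lines. This is a generic illustration, not the method in the paper: the library mean, standard deviation, and the 3-SD cutoff are made-up parameters, and real callers cluster many supporting pairs before reporting anything. If two mates map farther apart on the reference than the library insert size allows, the sample is probably missing sequence (a deletion). If they map closer together, the sample probably carries extra sequence (an insertion).

```python
# Illustrative sketch only: classify one mate pair by its mapped span.
# Parameter values below are hypothetical, not taken from the paper.
def classify_pair(mapped_span, mean_insert, sd_insert, n_sd=3.0):
    """Return (label, estimated event size in bp) for a single mate pair."""
    delta = mapped_span - mean_insert
    if delta > n_sd * sd_insert:
        # Mates are too far apart on the reference: sample lacks sequence.
        return "deletion", delta
    if delta < -n_sd * sd_insert:
        # Mates are too close on the reference: sample has extra sequence.
        return "insertion", -delta
    return "concordant", 0


MEAN_INSERT = 1500   # assumed library insert size (bp)
SD_INSERT = 100      # assumed standard deviation (bp)

for span in (1500, 5000, 900):
    print(span, classify_pair(span, MEAN_INSERT, SD_INSERT))
```

One consequence of this approach is that an insertion longer than the library insert cannot be spanned by a single pair, which is presumably part of why the reported insertions are so much smaller than the largest deletions.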
Cross-WGS Comparisons: Key Illumina Study Ignored
In a direct comparison, 20% of the SNPs identified in the ABI study were also seen in Watson, Venter, and CHB genomes. Fewer structural variants were shared between genomes, but this very well may be related to the difficulty in calling such types of variation on different platforms, rather than true biological diversity. Here’s something I find both irritating and amusing. The ABI study authors made no comparisons whatsoever to the results from the Bentley et al. (Illumina WGS) study, which is surprising since BOTH STUDIES SEQUENCED THE SAME INDIVIDUAL. I refer you to:
“We sequenced the genome of a male Yoruba from Ibadan, Nigeria (YRI, sample NA18507).” [Bentley et al.]
“We compared the SNPs and structural variations identified in NA18507 to those found in the Venter (Levy et al. 2007), Watson (Wheeler et al. 2008) and YH (Wang et al. 2008) genomes.” [McKernan et al.]
I’m sorry, but when you do whole-genome sequencing on an individual who has already been sequenced on a different technology, you have to do that comparison. Whatever their reasons, the ABI study authors’ decision to blatantly avoid comparisons with the Bentley et al. results is outright negligence.
Functional Consequences of Genetic Variation
The authors embarked on a long exploration of the putative phenotypic impact of variants in NA18507 using the OMIM and HGMD databases along with a comprehensive literature review. They developed a pipeline to map the poorly formatted OMIM entries to genomic coordinates, and successfully obtained 9,239 uniquely mapped nonsynonymous OMIM variants. I’d hoped for a supplemental table of these, or better yet that the results might be shared back with OMIM, but alas. No dice. NA18507 is apparently a carrier for over 50 disease-associated alleles, including five which appear to be homozygous. These are all listed in supplemental tables 4 and 5; however, no supplemental data appears to be available at present.
There were 2,477 large indels in NA18507 that potentially disrupted genes. Among 2,015 genes affected, some 303 were disease-associated genes from OMIM, HGMD, or the literature review. The authors conclude “we can see a trend for disruption events to cluster around genes, but no clear preference to cluster around disease genes. Further analysis of these disruption events along with an evaluation of whether an exon is disrupted is warranted.” This is why individual HapMap genomes no longer merit Nature papers. Without a phenotype to study, “further investigation is warranted” is as far as such studies can go to assess the functional impact of many mutations.
Signatures of Natural Selection
All gripes aside, the study did provide evidence of purifying selection, notably an under-representation of damaging nsSNPs, and an under-representation of variation inside exons in general. Using the Panther database, the authors identified several protein families with evidence of purifying selection (fewer than expected damaging nsSNPs) – nucleic acid binding proteins, ligases, transferases, transcription factors, and of course kinases. There were also categories over-represented for damaging nsSNPs, which may reflect either higher mutation rates or positive selection. These included G-protein coupled receptors, extracellular matrix glycoproteins, cell adhesion molecules, as well as genes related to olfactory perception. Ah yes, sense-of-smell diversity.
The Outlook for ABI SOLiD
With a high-profile publication of an individual human genome, ABI SOLiD officially joins the ranks of WGS-enabling platforms. In my opinion, they’re a little late to the game. I recall seeing a poster presenting much of this data about a year ago, and even that was after Illumina had taken the lead in whole genome sequencing. According to a report by Julia Karow on Genomeweb, SOLiD accounts for just 17% of next-gen sequencers at major genome centers, just ahead of Roche/454 (14%) but well behind Illumina, which claims 2/3 of the market. ABI can’t compete with 454 on read length, and it can’t compete with Illumina on data throughput or market share. In short, SOLiD needs to find a niche, and find it quickly, or this platform will go the way of the dodo.
Keith Robison says
Maybe the comparison with the Illumina paper will be in another publication, though you’d think they’d at least say something.
Are both datasets in the Short Reads Archive? It would be a great project for a grad student or advanced undergraduate to compare the two sequences. Some of the contract vendors I have spoken to claim the SOLiD has lower error rates, and it would be great to get some hard numbers on that.
Kevin McKernan says
I’ve really enjoyed this thread but feel obligated to clarify a concern of the original author.
This was not a blatant omission on our end.
At the time of submission the ILMN genotypes were not available. They posted the coordinates of their variation but not the actual variant genotypes, making a direct comparison weak and probably not credible. Bit odd that a Nature paper can exist without such critical information being available, and I don’t see any reason why they would choose to keep these private for such a long period of time.
We have no such reservations and plan to have all variants public.
– it is the policy of Genome Research to withhold the supplemental data until the paper is in print
My subjective 2 cents on niches and dodos: I don’t think this race is over in the first 900 machines, considering over 12,000 CE instruments have been sold. Most analysts are predicting this market to be far bigger than the CE market and predicting SOLiD will have a part of it. For what little it’s worth, ILMN is down ~20% this week based on missing their numbers. LIFE is up.
Additionally, just as Bentley et al. was GAI data, this is all SOLiD V2.0 data in the paper. The field is moving quickly and 50Gb runs are happening quite frequently on SOLiD 3.0s (one presented @CSH this 5/09). Many would argue SOLiD has been the pace setter and has a history of always leading the throughput curve in terms of Gb/day.
i.e., the CSH Biology of Genomes meeting has for the last 2 years had SOLiD presentations from customers with 2-3X more data/day than ILMN, with reports of 99.91% accuracy (5-10X higher than the reported accuracy on ILMN at the time).
Obviously, mate pairs are not unique to SOLiD but the accuracy is and it enables the investigation of phasing (must be confident you have 2 SNPs in your mate pair, not a SNP and an error). If you have paired 50mers (100bp) and the reported >1% error on ILMN, you will always have at least 1 error in your paired end. Makes a phasing study a bit more cumbersome. We have not seen others phase structural variants like we have with SOLiD. These long insert high accuracy reads will become of increasing value as we move to more reference free de novo assemblies.
I don’t think there is a Next Gen genome done with this much physical coverage and insert size. Nor do I believe other platforms have yet bridged the gap between small indels and large indels to this extent. Of course, there are several SOLiD users with far deeper genomes than this now, so it’s an ephemeral title :)
sm says
Thanks Dan for the interesting post and Kevin for the clarification. It is always a concern of separating hype from science when so much moolah is at stake for the vendors!
Dan Koboldt says
The ABI SOLiD paper has seen additional scrutiny this week. Over at Genetic Future, a guest post by Luke Jostins criticizes the SOLiD publication for fudging its numbers in a few places, and wonders if the grey areas of the paper are why we read it in Genome Research rather than Science or Nature. GenomeWeb’s Daily Scan picked up the story in a posting entitled A Whole Genome Mess.
Gael Cristofari says
As a molecular biologist interested in genomic variation, I tried to see how I could actually use the data produced by any of these companies. To my surprise, I couldn’t find a way to get assembled genome sequences, which would make it easy to work on the sequences. Only short reads (and SNPs) are made publicly available through NCBI. This also prevents direct comparisons between these two data sets and reduces the use of these huge datasets by biologists without deep bioinformatics knowledge and big computers to repeat the assembly process.
By the way I discovered your blog today and read all the threads !! I loved your ‘journal club’-like analyses !!!