Not-so-whole Exome Sequencing

There is growing interest in applying next-generation sequencing to targeted regions of interest, particularly the “exome”, the set of coding exons in the human genome. A paper in Genome Biology from Matthew Bainbridge and colleagues at Baylor describes solution-phase exome capture and sequencing of a HapMap sample with just 3 Gbp of data. The 1000 Genomes Project recently announced a new pilot study focused on exome sequencing for hundreds of individuals. A few studies that use human exome resequencing to identify disease genes have been published, and more are sure to come as genome centers ramp up their exome capabilities.

Yet this week’s In Sequence magazine writes that there are concerns about what exome capture is missing. For example, at CHI’s Beyond Sequencing meeting this week, researchers from NCI reported that current exome capture projects omit some medically important genes, such as insulin, ABO blood group, and HLA. Of course, some of this can be attributed to GC-rich exons and other tough-to-capture regions. The bigger concern is that many RefSeq coding sequences aren’t even targeted by the two commercial platforms: 23% are missing from NimbleGen’s 2.1M array, and 17% are missing from Agilent’s SureSelect (according to the NCI group).

Exome Sequencing on Illumina and SOLiD

Even so, exome sequencing is rapidly reaching maturity. The Baylor study, led by Matt Bainbridge, used a customized NimbleGen solution-phase capture product to target 36 Mbp of consensus coding sequence (CCDS), and sequenced the capture libraries on both ABI SOLiD and Illumina GAII platforms. Six individual capture libraries were generated from HapMap sample NA12812. Four were sequenced as technical replicates on SOLiD, while the other two went to Illumina, one for single-end and one for paired-end sequencing.

On average, some 49.6% of mappable reads from the four SOLiD libraries were derived from target regions, with the remainder mapping elsewhere in the genome. The target coverage correlation between the four replicates was 98%, suggesting that reproducibility across capture and SOLiD sequencing was pretty good.
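For anyone who wants to compute this kind of on-target figure for their own data, here is a minimal Python sketch. It assumes each mapped read has already been reduced to a (chromosome, position) pair and that the capture targets are sorted, non-overlapping, half-open intervals. The function name and the toy coordinates are mine for illustration, not anything from the paper.

```python
from bisect import bisect_right

def on_target_fraction(read_positions, targets):
    """Fraction of reads whose position falls inside a target interval.

    read_positions: list of (chrom, pos) tuples (hypothetical input format).
    targets: dict mapping chrom -> sorted, non-overlapping (start, end)
             intervals, half-open: start <= pos < end.
    """
    if not read_positions:
        return 0.0
    on_target = 0
    for chrom, pos in read_positions:
        intervals = targets.get(chrom, [])
        # Index of the last interval whose start is <= pos.
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            on_target += 1
    return on_target / len(read_positions)

# Toy data: two targets on chr1, four reads.
targets = {"chr1": [(100, 200), (500, 600)]}
reads = [("chr1", 150), ("chr1", 250), ("chr1", 500), ("chr2", 150)]
print(on_target_fraction(reads, targets))  # 0.5
```

A real pipeline would count a read as on-target if any part of it overlaps a target (often with some padding), so treat the single-position test here as a simplification.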

Duplication Rates in Exome Capture

The authors performed a detailed analysis of duplication rates in their data, a metric that determines how much unique coverage is left for downstream analysis. The duplication rate for the three SOLiD libraries with 3 Gbp of data was ~22%, and highly consistent between replicates. Duplication was higher (~33%) in the fourth SOLiD library, which is not surprising since it had more than three times as much data (10 Gbp).

Intriguingly, the authors used simulations to show that the duplication rates expected by random chance for 3 Gbp and 10 Gbp of data are 14% and 22%, respectively. Set against the observed rates of ~22% and ~33%, that suggests roughly two-thirds of observed duplicates could be chance events rather than artifacts.
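To get a feel for how chance alone produces duplicates, here is a rough Python sketch. It is not the authors’ simulation: it assumes reads land uniformly at random across the target (real capture coverage is far from uniform, which pushes chance duplication higher), it treats a read as a duplicate when it shares a start position and strand with an earlier read, and all names and parameters are hypothetical.

```python
import random

def simulated_duplication_rate(n_reads, n_positions, seed=0):
    """Place reads uniformly at random; count those that repeat an
    already-seen (start position, strand) combination."""
    rng = random.Random(seed)
    seen = set()
    duplicates = 0
    for _ in range(n_reads):
        key = (rng.randrange(n_positions), rng.choice("+-"))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / n_reads

def expected_duplication_rate(n_reads, n_positions):
    """Closed-form expectation under the same uniform assumption."""
    slots = 2 * n_positions  # two strands per start position
    expected_unique = slots * (1 - (1 - 1 / slots) ** n_reads)
    return 1 - expected_unique / n_reads

# Toy scale only: tripling the reads over the same target raises the
# chance duplication rate, which is the trend described above.
for n in (20_000, 60_000):
    print(n, simulated_duplication_rate(n, 50_000),
          expected_duplication_rate(n, 50_000))
```

The numbers this prints will not match the 14% and 22% reported in the paper, since the toy target size and the uniform model are my assumptions. The point is only that duplication climbs with depth even with no PCR artifacts at all.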

Paired-end sequencing offers the opportunity to identify duplicates using both reads in a read pair. Theoretically, this should help distinguish artifacts from chance events. Indeed, the authors observed a dramatic difference in duplication rate between the Illumina single-end (30.97%) and paired-end (8.3%) libraries, even though both generated about 2.5 Gbp of data. They surmised that the improved identification of duplicates from paired-end sequencing, not a difference in library construction, was the reason. When pairing information was ignored, the duplication rate in the PE library more than tripled, to 27.6%.
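The logic is easy to see in a toy example. The Python sketch below assumes reads are plain dictionaries with hypothetical field names (chrom, start, strand, mate_start) and flags duplicates by an exact match on a key. That is a simplification of what real duplicate-marking tools do, not a description of the authors’ pipeline.

```python
from collections import Counter

def duplication_rate(reads, use_mate=True):
    """Fraction of reads that repeat an earlier read's key.

    Without mate information the key is (chrom, start, strand);
    with it, the mate's start position is added, so two fragments
    that merely begin at the same base are no longer confused.
    """
    counts = Counter()
    for r in reads:
        key = (r["chrom"], r["start"], r["strand"])
        if use_mate:
            key += (r["mate_start"],)
        counts[key] += 1
    total = sum(counts.values())
    return (total - len(counts)) / total if total else 0.0

# Four fragments start at chr1:100, but only two have the same mate position.
reads = [
    {"chrom": "chr1", "start": 100, "strand": "+", "mate_start": 350},
    {"chrom": "chr1", "start": 100, "strand": "+", "mate_start": 350},
    {"chrom": "chr1", "start": 100, "strand": "+", "mate_start": 410},
    {"chrom": "chr1", "start": 100, "strand": "+", "mate_start": 298},
    {"chrom": "chr1", "start": 250, "strand": "-", "mate_start": 80},
]
print(duplication_rate(reads, use_mate=False))  # 0.6
print(duplication_rate(reads, use_mate=True))   # 0.2
```

Ignoring the mate, three of the five reads look like duplicates. With the mate position in the key, only one does, because the other fragments have different lengths and so cannot be copies of the same molecule.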

SNP Discovery and HapMap Concordance

Because this was a HapMap sample, the authors were able to compare SNPs identified in sequencing to known genotypes from the HapMap Project. Genotype concordance in the target regions was 82% for the 3 Gbp libraries and 92% for the 10 Gbp library, but importantly, this considered all sites regardless of coverage. When the authors limited comparisons to sites with ≥9x unique read depth, concordance was ~95%. That’s still a bit low for my taste, but within the realm of expectation for sequence-to-genotype comparisons.
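The effect of a depth cutoff on concordance can be illustrated with a small Python sketch. The data structures, function name, and genotypes below are made up for illustration. I am also assuming that only sites present in both call sets are compared, which may differ from how the paper handled uncovered sites.

```python
def concordance(calls, known, min_depth=0):
    """Fraction of compared sites where the sequenced genotype matches
    the reference genotype, ignoring allele order ("AG" == "GA").

    calls: dict site -> (genotype string, unique read depth)
    known: dict site -> genotype string (e.g. from a genotyping array)
    """
    compared = matched = 0
    for site, (genotype, depth) in calls.items():
        if site not in known or depth < min_depth:
            continue
        compared += 1
        if sorted(genotype) == sorted(known[site]):
            matched += 1
    return matched / compared if compared else float("nan")

# Toy data: the low-depth site misses one allele of a heterozygote.
calls = {
    ("chr1", 1000): ("AG", 25),
    ("chr1", 2000): ("CC", 3),   # true genotype is CT
    ("chr2", 500): ("TT", 12),
    ("chr2", 900): ("GA", 9),
}
known = {
    ("chr1", 1000): "AG",
    ("chr1", 2000): "CT",
    ("chr2", 500): "TT",
    ("chr2", 900): "AG",
}
print(concordance(calls, known))               # 0.75
print(concordance(calls, known, min_depth=9))  # 1.0
```

Low-depth sites are where heterozygotes tend to be under-called, so dropping them raises concordance at the cost of comparing fewer sites.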

SOLiD Versus Illumina Sequencing

I was pleased that Bainbridge and his colleagues made some direct comparisons between SOLiD and Illumina sequencing. This is a delicate issue from the point of view of the sequencing vendors, but one of great interest to the NGS community. The Illumina PE data yielded ~25% more SNP calls in target regions, with higher HapMap concordance (98%) than the ABI SOLiD data (95%). The authors attribute this to the better mapping, higher coverage, and lower duplication rate made possible by paired-end sequencing. Considering only HapMap heterozygous SNPs, SOLiD outperformed Illumina at low (<9x) coverage, but Illumina consistently yielded 2-3% higher concordance at high coverage.

In their concluding section, the authors write “Interestingly, Illumina sequencing consistently shows higher levels of enrichment than SOLiD sequencing. This is unexpected because both sequencing platforms yield similar coverage distributions in whole genome sequencing data… therefore we suspect that differences in efficiency are due to an increase in initial library complexity from better annealing efficiencies of the Illumina adapter.”

Such a frank conclusion, from a group that’s highly invested in SOLiD sequencers, is especially telling. When it comes to exome sequencing, Illumina seems to have the advantage.

References
Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D’Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond TA, Jeddeloh JA, Muzny D, Albert TJ, & Gibbs RA (2010). Whole exome capture in solution with 3 Gbp of data. Genome Biology, 11 (6). PMID: 20565776

Comments

cheese says

July 9, 2010 at 3:48 pm

unfortunately, not apples to apples, but i think it shows the power of paired reads more than a system comparison. fragment vs paired….

both platforms can do paired reads for SNP calling, but i agree illumina has the advantage in the exome market.

German Leparc says

August 6, 2010 at 6:18 am

What I actually found most interesting is the PCR duplicates and how they significantly dropped when using paired-end reads vs single-end reads. This means that a lot of people who have been filtering out PCR duplicates on single-end reads have been throwing away a significant amount of data!