Drowning in the Flood of Next-Gen Data

April 18, 2008 by dkoboldt

Working at the WashU Genome Center, I expect to encounter datasets that are large even by bioinformatician standards. But as we transition from traditional 3730-based sequencing to next-generation platforms, I’m beginning to appreciate just how much additional infrastructure we’ll need to handle the data flow. In the Medical Genomics group we’re constantly pushing up against capacity – servers, disk space, and man hours. None of these are in adequate supply for what’s ahead.

This is not to say that we’re without resources. In fact, the infrastructure already in place is considerable. We have about 500 computational servers (1600 cores) and nearly a petabyte (1,000 terabytes) of disk space. There’s an LSF system through which we submit and monitor jobs on The Blades.

You Didn’t Need That Done TODAY, Did You?

I submit about 1,000 small jobs and notice they’re all pending.

No doubt that’s because there are 61,000 jobs in front of me. We have a few different “queues” into which jobs can be submitted. The “short” queue is for jobs that execute in less than 15 minutes. At one job per core, if every job takes the full 15 minutes, 61,000 jobs spread over 1,600 cores works out to about 38 rounds, so it looks like my jobs will start in roughly 9.5 hours. Oy.
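For anyone who wants to check the back-of-the-envelope math, here is a minimal sketch. It assumes that all 1,600 cores serve the short queue, that each core runs one job at a time, and that every job uses its full 15 minutes, so it is a worst-case estimate and not how LSF actually schedules anything. The function name is made up for illustration.

```python
def estimated_wait_hours(jobs_ahead, cores, minutes_per_job):
    """Rough worst-case wait before a newly submitted job starts.

    Assumes one job per core at a time and that every queued job
    runs for the full `minutes_per_job` (both are simplifications).
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    rounds = jobs_ahead / cores            # how many "waves" of jobs are ahead
    return rounds * minutes_per_job / 60   # convert minutes to hours


if __name__ == "__main__":
    # Figures quoted in the post: 61,000 pending jobs, 1,600 cores, 15-minute limit.
    wait = estimated_wait_hours(jobs_ahead=61_000, cores=1_600, minutes_per_job=15)
    print(f"Estimated wait: {wait:.1f} hours")
```

In practice the wait should be shorter, since jobs in the short queue finish in under 15 minutes, and longer if some of those cores are busy with other queues.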

The powers that be around here are rushing to build up our resources. As I’m not part of management, I can’t say for sure how long it will take to get the disk space and hardware we need. One thing I do know: we need a lot, and we need it soon.

Still Waiting for that ABI SOLiD Genome

April 8, 2008 by dkoboldt

One of the big announcements at this year’s AGBT was ABI’s sequencing of a complete human genome using the SOLiD system. It wasn’t just any genome, either – it was the genome of an African male of the Yoruba tribe in Nigeria (one of the HapMap samples). Perhaps I shouldn’t be surprised that the press releases flew months ago while we’ve yet to see a peer-reviewed publication. Still, I’m eager to read the results of their project, as it will be the first complete genome sequencing of an individual from the African continent. Many studies have seen higher incidence and allele frequencies of SNPs in African samples, consistent with population bottlenecks during out-of-Africa expansions. In fact, a recent genome-wide survey of genetic variation in 51 populations showed that humans formed a chain of colonies as they migrated out of Africa tens of thousands of years ago. That article’s a very interesting read.

But back to ABI. Perusing the SOLiD web site, I did find a poster on the genome-wide variation detected from their not-yet-completed SOLiD sequencing. From it I took these key pieces of information. They sequenced both fragment and mate-pair libraries to a coverage of about 4.9X. The mate-pair libraries allowed them to detect ~22,000 insertions and ~45,000 deletions, nearly all of which were heterozygous. At ~4X coverage on chromosome 7, some 75% of the SNPs detected were already in dbSNP. In the ENCODE regions (which have been extensively characterized), 91% of the SNPs detected were in dbSNP. To me, the fraction of novel SNPs seems low, but if it remains constant, this study will almost certainly add more SNPs to public databases than the Watson and Venter efforts.

Helicos Resequences M13 Virus Genome

April 7, 2008 by dkoboldt

The April 4th issue of Science had an article by Helicos BioSciences describing the single-molecule DNA sequencing of a viral genome. I knew about Helicos because they once gave a talk to our Genetics department on their planned strategy for single-molecule sequencing. As I recall, the talk was entirely theoretical, as they didn’t have much experimental data to show. Clearly things have gone well for Helicos since then: their article convincingly demonstrates the potential of the single-molecule approach for high-throughput, low-cost sequencing.

Introduction: The Problems with PCR

Why bother with single-molecule sequencing? The introduction briefly discussed three problems associated with PCR-based sequencing.

Bias in template representation. Due to thermodynamics and other factors I don’t fully understand, PCR efficiency is directly affected by characteristics of the template. Shorter products, for example, amplify more efficiently than longer products.
Library preparation complications. PCR-based sequencing methods require a lot of templates, and preparation of the libraries can be “onerous and expensive in terms of DNA manipulation,” according to the article. I don’t do library prep myself, but this sounds reasonable.
Error incorporation. Here is something that I do know about. Any time you use PCR, there’s a chance that mis-incorporation at an early cycle will introduce (and then amplify) errors in the sequence. We’ve seen some problems with 454 and Solexa sequencing that may be attributed to this. The idea of taking PCR-induced errors out of sequence reads appeals to me very much.

Results: Sequencing-by-synthesis of the M13 Viral Genome

The authors report sequencing the ~7 kbp M13 genome with 100% coverage and at an average depth of 150X. The read lengths averaged 23-27 bp, depending on the run and some post-processing; the authors claim to have performed runs with average read lengths of over 30 bp. According to alignment statistics in Table 1, there were 32,473 forward-orientation reads (relative to the reference) for an average coverage of 96X, and 34,109 reverse-orientation reads for an average coverage of 105X. Coverage in both orientations becomes important during their mutation-detection simulations.
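As a rough sanity check on numbers like these, average depth can be estimated as (number of reads × mean read length) / genome length. The sketch below uses that standard approximation with rounded inputs that are my assumptions: a 7,000 bp genome (from "~7 kbp") and a 25 bp mean read length (the middle of the 23-27 bp range). It is not how the authors computed Table 1.

```python
def mean_depth(num_reads, mean_read_length, genome_length):
    """Approximate average depth of coverage: total sequenced bases / genome size."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * mean_read_length / genome_length


# Assumed, rounded inputs (not taken from the paper's methods).
GENOME_LENGTH_BP = 7_000
MEAN_READ_LENGTH_BP = 25

if __name__ == "__main__":
    for label, reads in (("forward", 32_473), ("reverse", 34_109)):
        depth = mean_depth(reads, MEAN_READ_LENGTH_BP, GENOME_LENGTH_BP)
        print(f"{label}: ~{depth:.0f}X")
```

With these rounded inputs the estimates land somewhat above the 96X and 105X quoted from Table 1. That is what you would expect if the reported figures count only aligned bases or use a slightly different genome length, though I can't tell which from the numbers quoted here.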

Simulations of Mutation Detection

Because they sequenced the canonical strain of M13, there should be no sequence polymorphisms. So, to test the ability of this sequencing method to pick up mutations, the authors created “synthetic mutations” in the reference sequence and reran the alignments. The synthetically introduced mutations were picked up with an average sensitivity of ~98%. To me, this was the weaker part of the paper – mutations created in silico won’t accurately represent real variation – but at least it let the authors discuss the analysis and refinement steps that led to improved mutation detection.
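To make the idea concrete, here is a minimal sketch of that kind of simulation: plant known changes in a reference, then score a set of calls against them. It is my own toy illustration, not the authors' pipeline. It handles single-base substitutions only (an assumption; I'm not describing which mutation types the paper simulated), and the function names are hypothetical. The alignment and variant-calling step in between is left out entirely.

```python
import random

BASES = "ACGT"


def introduce_substitutions(reference, num_mutations, seed=0):
    """Return (mutated_reference, truth), where truth maps each mutated
    0-based position to its (original_base, new_base) pair."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    positions = sorted(rng.sample(range(len(reference)), num_mutations))
    mutated = list(reference)
    truth = {}
    for pos in positions:
        original = mutated[pos]
        # Pick any base other than the original one.
        new_base = rng.choice([b for b in BASES if b != original])
        mutated[pos] = new_base
        truth[pos] = (original, new_base)
    return "".join(mutated), truth


def sensitivity(truth_positions, called_positions):
    """Fraction of planted mutations that appear among the called positions."""
    truth = set(truth_positions)
    if not truth:
        raise ValueError("no planted mutations to score against")
    return len(truth & set(called_positions)) / len(truth)


if __name__ == "__main__":
    reference = "ACGT" * 25  # toy 100 bp stand-in for a real reference
    mutated, truth = introduce_substitutions(reference, num_mutations=5)
    # Pretend a variant caller found all but one of the planted positions.
    called = sorted(truth)[:-1]
    print(f"sensitivity: {sensitivity(truth, called):.0%}")
```

In a real experiment the reads would be aligned against the mutated reference, and `called` would come from whatever mismatch-detection step follows.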

Discussion: Caveats and Future Directions

I don’t think Helicos is yet a threat to established next-generation platforms like Roche/454 and Illumina/Solexa. At 25 bp, the reads are too short to be useful in eukaryotes. Like 454, the Helicos platform has some difficulties with homopolymers, especially runs of cytosine residues. The authors readily admit that “large genomes, heterogeneous samples, and genomic structural variations will likely require longer reads, reduced homopolymer run through, and enhanced alignment tools.”

Yet this publication is an important proof-of-principle for the Helicos method. As far as single-molecule DNA sequencing goes, it looks like Helicos BioSciences is the one to beat.
