Crossbow: NGS Informatics in the Cloud

November 23, 2009 by Dan Koboldt

Just online at Genome Biology is a new paper from the Steven Salzberg lab (UMD) on searching for SNPs with cloud computing. Using $85 of computing time rented from Amazon’s EC2, Langmead et al processed an entire human genome – 3.3 billion reads totaling 38x coverage – in three hours.


The “Cloud” Can Be Nebulous

Cloud computing is a term bandied about often these days. What it boils down to is this: places with huge banks of computers (Providers, e.g. Amazon) rent out processing time to people who need it (Users). The “cloud” refers to a software layer between providers and users that acts like a virtual operating system – it loads any software needed by the user, and also provides an access point for running highly parallelized tasks on the cluster. Next-gen sequencing data is well suited to this kind of processing, since a large NGS dataset can usually be broken into smaller subsets (e.g. Illumina lanes) and processed at the same time on different computers, without affecting the results.

Map, Sort, Reduce

Crossbow – the cloud computing software featured in this publication – cleverly breaks down the analysis into a series of map, sort, and reduce steps. It takes a large sequencing dataset, breaks the reads into subsets, and maps them to the human genome using Bowtie (map). Then, it divides the 3.2 gigabase human genome into 1,600 non-overlapping 2-megabase partitions and assigns every mapped read to a bin (sort). The SNP caller, in this case SOAPsnp, is applied to each of these smaller bins rather than to the entire genome (reduce).
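The three steps can be sketched in a few lines of Python. This is only an illustration of the map, sort, and reduce pattern described above, not Crossbow's actual code: the function names, the toy lookup-table "aligner", the read-counting "SNP caller", and the choice to key bins by chromosome plus bin index are all assumptions made for the example.

```python
from collections import defaultdict

BIN_SIZE = 2_000_000  # 2-megabase partitions, as described in the post


def map_step(reads, align):
    """Map: align each read and keep (chrom, pos, read) for reads that align."""
    for read in reads:
        hit = align(read)  # stand-in for a real aligner such as Bowtie
        if hit is not None:
            chrom, pos = hit
            yield chrom, pos, read


def sort_step(alignments):
    """Sort: assign each alignment to a genomic bin, ordered by position."""
    bins = defaultdict(list)
    for chrom, pos, read in alignments:
        bins[(chrom, pos // BIN_SIZE)].append((pos, read))
    for entries in bins.values():
        entries.sort()
    return bins


def reduce_step(bins, call_snps):
    """Reduce: run the caller independently on every bin."""
    return {key: call_snps(key, entries) for key, entries in bins.items()}


def toy_align(read):
    # Hypothetical lookup table; a real aligner searches a genome index.
    table = {
        "ACGT": ("chr1", 100),
        "TTGA": ("chr1", 2_500_000),
        "GGCC": ("chr2", 10),
    }
    return table.get(read)


def toy_call(key, entries):
    # Placeholder for a SNP caller: just count the reads in the bin.
    return len(entries)


reads = ["ACGT", "TTGA", "GGCC", "NNNN"]  # "NNNN" does not align and is dropped
result = reduce_step(sort_step(map_step(reads, toy_align)), toy_call)
print(result)
```

Because each bin is handled on its own in the reduce step, the bins can be farmed out to different machines, which is what makes the approach a good fit for a rented cluster.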

The Need for Parallelization

The CHB dataset is ~3.3 billion reads, with an average read length of 35 bp. Even with Bowtie’s multi-threading and incredible speed, this massive dataset would take months to process on a single computer. However, the authors divided the input reads into smaller subsets and aligned them in parallel, then processed the 2-Mbp genome “bins” in parallel as well. Throw all of these parallel tasks at Amazon’s Elastic Compute Cloud (EC2), and it eats them up. The high-performance EC2 cluster (40 nodes, each with 8 CPUs and 7 GB of RAM) finished all of the tasks in about 3 hours.

Digging into the Numbers

There are a couple of inconsistencies in the numbers that need to be ironed out. For example, the BGI study reported 36X coverage from 3.3 billion reads (2.02 billion single-end, 658 million paired-end), whereas Langmead et al downloaded 2.7 billion reads from the “YanHuang Site” and noted that it represented 38X coverage. Where did that extra 2X come from? Langmead et al do cite the Nature paper by Wang et al, and I believe it’s the same dataset.

At first I was concerned that the Salzberg group had only downloaded the mapped reads and run them, which would have been a biased test of alignment performance. However, I don’t believe this is the case. Instead, I believe they meant to say that they’d downloaded 2.02 billion single-end reads, and they’d also downloaded 657 million read pairs (1.314 billion paired-end reads). This would yield the correct total of 3.3 billion reads. I realize this is nitpicky.
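The bookkeeping behind that guess is easy to check. The short script below uses only the figures quoted in this post; the variable names are mine, and the coverage line is a naive estimate that assumes every read is exactly 35 bp and the genome is exactly 3.2 gigabases. Real coverage depends on the actual read lengths and on which reference length is used, so this is not meant to settle the 36X versus 38X question.

```python
READ_LENGTH_BP = 35              # average read length quoted in the post
GENOME_SIZE_BP = 3_200_000_000   # 3.2-gigabase genome, as quoted in the post

single_end_reads = 2_020_000_000
read_pairs = 657_000_000

# Counting each pair once gives the "2.7 billion" style figure.
pairs_counted_once = single_end_reads + read_pairs

# Counting both mates of every pair gives the "3.3 billion" style figure.
individual_reads = single_end_reads + 2 * read_pairs

# Naive coverage: total sequenced bases divided by genome size (assumption).
naive_coverage = individual_reads * READ_LENGTH_BP / GENOME_SIZE_BP

print(f"pairs counted once: {pairs_counted_once / 1e9:.2f} billion")
print(f"individual reads:   {individual_reads / 1e9:.2f} billion")
print(f"naive coverage:     {naive_coverage:.1f}X")
```

Counting pairs once lands at about 2.7 billion and counting individual reads lands at about 3.3 billion, which is consistent with the two totals describing the same dataset.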

More of a concern and hopefully less nitpicky are the SNP calling numbers. Langmead et al reported over 21% more SNPs (3.73 million) than BGI did (3.07 million) on the same dataset, and attributed the difference to less stringent filtering. Yet both groups used the same SNP caller, so is it possible that the Bowtie alignment, not the SNP filters, was responsible for what we presume are false positives? This is an important question that Heng Li and others are already considering.

Whole-genome Sequencing Analysis for the Masses

I like the Salzberg group because they’re all about the small lab, about putting NGS processing capabilities into the hands of people without substantial computing resources. Bowtie made it possible to map a lane of Illumina/Solexa data in a few hours, using only a laptop with 4 GB of RAM. Now, Crossbow offers anyone with $85 in their budget the ability to run entire WGS datasets on borrowed (or rented) CPU time. There’s no need to purchase, maintain, or continuously upgrade expensive computing hardware. Even the storage space can be rented (e.g. from Amazon S3, which the authors used). It is literally now possible for someone to analyze an entire human genome while sitting at their laptop in the local coffee house.

References
Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop and Steven L. Salzberg (2009). Searching for SNPs with cloud computing. Genome Biology, 10(11): R134. doi:10.1186/gb-2009-10-11-r134

WUCGI: WashU Cancer Genomics Initiative

August 27, 2009 by Dan Koboldt


Yesterday afternoon was the kickoff party launching WashU’s Cancer Genomics Initiative (CGI), better known as our goal to sequence 150 cancer genomes in the coming year.

Cancer Sequencing Ramps Up

Under the leadership of Wilson, Ley, and Elaine Mardis, our group sequenced the first cancer genome, from a woman who had died of AML (M1), and published the results in Nature last fall. Three weeks ago came the sequel. In the New England Journal of Medicine, we published the complete genome of another M1 leukemia, this time from a man who’s been treated and remains in full remission. In less than a year, the number of Illumina runs required to sequence a tumor genome dropped by over 80%, from 98 runs in AML1 to just 16.5 runs in AML2.

It’s not just the sequencing throughput that makes WUCGI a realistic effort. Many groups have Illumina sequencers, some even more than we do. Some of the most critical advances have taken place behind the scenes – for example, the variant detection pipelines developed by David Larson, Ken Chen, Chris Harris, and others. Sequencing on this scale would not be possible without the IT and informatics infrastructure, built under the leadership of David Dooling and Gary Stiehr, that gives us the computational firepower to run whole-genome analyses.

Two Genomes Down, 150 To Go

With two genomes published, the center leadership has set an ambitious goal: to sequence 150 cancer genomes in the coming year. Obviously, these will include more AML samples, hopefully some with therapy-related changes or abnormal cytogenetics. In collaboration with Matt Ellis and others at the Siteman Cancer Center, we’ll be tackling breast cancer as well. No doubt we’ll be revisiting lung cancer, for which we sequenced candidate genes as part of the Tumor Sequencing Project (TSP) consortium. As part of The Cancer Genome Atlas (TCGA) consortium, we’re working on glioblastoma multiforme (brain cancer) and ovarian cancer. Also, intriguingly, I hear rumors that there will be some sequencing of less common, largely unexplored cancers like multiple myeloma.

As Tim Ley said yesterday, it’s thrilling to be a part of this. We truly are entering the golden age of cancer genomics.

Drowning in the Flood of Next-Gen Data

April 18, 2008 by dkoboldt

Working at the WashU Genome Center, I expect to encounter datasets that are large even by bioinformatician standards. But as we transition from traditional 3730-based sequencing to next-generation platforms, I’m beginning to appreciate just how much additional infrastructure we’ll need to handle the data flow. In the Medical Genomics group we’re constantly pushing up against capacity – servers, disk space, and man hours. None of these are in adequate supply for what’s ahead.

This is not to say that we’re without resources. In fact, the infrastructure already in place is considerable. We have about 500 computational servers (1600 cores) and nearly a petabyte (1,000 terabytes) of disk space. There’s an LSF system through which we submit and monitor jobs on The Blades.

You Didn’t Need That Done TODAY, Did You?

I submit about 1,000 small jobs and notice they’re all pending.

No doubt that’s because there are 61,000 jobs in front of me. We have a few different “queues” into which jobs can be submitted. The “short” queue is for jobs that execute in less than 15 minutes. At one job per core, and assuming every job finishes in 15 minutes, it looks like my jobs will start in about 9 hours. Oy.
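For the curious, here is the back-of-the-envelope version of that estimate. It is a rough sketch with assumptions of my own: the function name is made up, all 1,600 cores are assumed to be available to the queue, and every job ahead of mine is assumed to take the full 15 minutes.

```python
def estimated_wait_hours(jobs_ahead, cores, minutes_per_job):
    """Rough wait time if jobs run one per core and each takes the full time."""
    rounds = jobs_ahead / cores          # how many "waves" of jobs must finish
    return rounds * minutes_per_job / 60


wait = estimated_wait_hours(jobs_ahead=61_000, cores=1_600, minutes_per_job=15)
print(f"Estimated wait: {wait:.1f} hours")
```

In practice the wait could be shorter if jobs finish early, or longer if some of those cores are tied up by other queues.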

The powers that be around here are rushing to build up our resources. As I’m not part of management, I can’t say for sure how long it will take to get the disk space and hardware we need. One thing I do know: we need a lot, and we need it soon.
