Crossbow: NGS Informatics in the Cloud

Just published online at Genome Biology is a new paper from the Steven Salzberg lab (UMD) on searching for SNPs with cloud computing.  Using $85 of computing time rented from Amazon’s EC2, Langmead et al. processed an entire human genome – 3.3 billion reads totaling 38x coverage – in about three hours.


The “Cloud” Can Be Nebulous

Cloud computing is a term bandied about often these days.  What it boils down to is this: places with huge banks of computers (providers, e.g. Amazon) rent out processing time to people who need it (users).  The “cloud” refers to a software layer between providers and users that acts like a virtual operating system – it loads whatever software the user needs, and provides an access point for running highly parallelized tasks on the cluster.  Next-gen sequencing data is well suited to this kind of processing, since a large NGS dataset can usually be broken into smaller subsets (e.g. Illumina lanes) and processed at the same time on different computers, without affecting the results.
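This kind of lane-level parallelism is easy to sketch. The snippet below uses Python's multiprocessing to fan lane subsets out across local CPUs; `align_lane` is a hypothetical stand-in for running an aligner on one lane, not Crossbow's actual code:

```python
from multiprocessing import Pool

def align_lane(lane_file):
    # Hypothetical stand-in for an alignment step (e.g., running Bowtie
    # on one Illumina lane); a real pipeline would invoke the aligner here.
    return f"{lane_file}.aligned"

if __name__ == "__main__":
    lanes = [f"lane{i}.fastq" for i in range(1, 9)]
    # Each lane is independent, so the subsets can be processed
    # concurrently without affecting the results.
    with Pool(processes=4) as pool:
        aligned = pool.map(align_lane, lanes)
    print(aligned[0], aligned[-1])  # lane1.fastq.aligned lane8.fastq.aligned
```

A cloud cluster does the same thing, just spread across hundreds of rented cores instead of one machine.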

Map, Sort, Reduce

Crossbow – the cloud computing software featured in this publication – cleverly breaks down the analysis into a series of map, sort, and reduce steps.  It takes a large sequencing dataset, breaks the reads into subsets, and maps them to the human genome using Bowtie (map).  Then, it divides the 3.2 gigabase human genome into 1,600 non-overlapping 2-megabase partitions and assigns every mapped read to a bin (sort).  The SNP caller, in this case SOAPsnp, is applied to each of these smaller bins rather than to the entire genome (reduce).
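The sort step amounts to hashing each mapped read by its 2-megabase partition. Here is a minimal sketch; the 2-Mb bin size comes from the paper, but the function names and toy reads are invented for illustration:

```python
from collections import defaultdict

BIN_SIZE = 2_000_000  # 2-megabase partitions, as in the paper

def bin_for(chrom, pos):
    """Return the partition key for a read mapped at (chrom, pos)."""
    return (chrom, pos // BIN_SIZE)

def sort_reads(mapped_reads):
    """Group mapped reads into genome bins; each bin would then be
    handed to the SNP caller (SOAPsnp) independently."""
    bins = defaultdict(list)
    for chrom, pos, read in mapped_reads:
        bins[bin_for(chrom, pos)].append((pos, read))
    return bins

reads = [("chr1", 150, "ACGT..."), ("chr1", 2_500_000, "TTAG..."), ("chr2", 99, "GGCC...")]
partitions = sort_reads(reads)
print(sorted(partitions))  # [('chr1', 0), ('chr1', 1), ('chr2', 0)]
```

Because no bin depends on any other, the reduce step parallelizes just as cleanly as the alignment step.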

The Need for Parallelization

The CHB dataset is ~3.3 billion reads, with an average read length of 35 bp.  Even with Bowtie’s multi-threading and incredible speed, this massive dataset would take months to process on a single computer.  However, the authors divided the input reads into smaller subsets and aligned them in parallel, then processed the 2-Mbp genome “bins” in parallel as well.  Throw all of these parallel tasks at Amazon’s Elastic Compute Cloud (EC2), and it eats them up.  The high-performance EC2 cluster (40 nodes, each with 8 CPUs and 7 GB of RAM) finished all of the tasks in about 3 hours.
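For a rough sense of the per-core workload implied by those figures (my arithmetic, not the paper's):

```python
# Back-of-envelope only: real task sizes in Crossbow are set by the
# Hadoop framework, not by an even split of reads across cores.
reads = 3.3e9                    # total reads in the dataset
nodes, cpus_per_node = 40, 8     # the EC2 cluster used in the paper
cores = nodes * cpus_per_node    # 320 cores total
reads_per_core = reads / cores
print(cores, round(reads_per_core / 1e6, 1))  # 320 10.3 (~10.3 million reads per core)
```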

Digging into the Numbers

There are a couple of inconsistencies in the numbers that need to be ironed out.  For example, the BGI study reported 36X coverage from 3.3 billion reads (2.02 billion single-end, 658 million paired-end), whereas Langmead et al downloaded 2.7 billion reads from the “YanHuang Site” and noted that it represented 38X coverage.  Where did that extra 2X come from?  Langmead et al do cite the Nature paper by Wang et al, and I believe it’s the same dataset.

At first I was concerned that the Salzberg group had only downloaded the mapped reads and run them, which would have been a biased test of alignment performance.  However, I don’t believe this is the case.  Instead, I believe they meant to say that they’d downloaded 2.02 billion single-end reads, and they’d also downloaded 657 million read pairs (1.314 billion paired-end reads).  This would yield the correct total of 3.3 billion reads.  I realize this is nitpicky.
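That reading of the read counts can be sanity-checked against the reported coverage. Assuming ~35 bp reads and a ~3.08 Gb hg18 reference including Ns (the reference size is my assumption, not a figure from either paper):

```python
single_end = 2.02e9                   # single-end reads
pairs = 657e6                         # read pairs (two reads each)
total_reads = single_end + 2 * pairs  # ~3.33 billion reads, matching BGI's total
total_bases = total_reads * 35        # average read length of 35 bp
coverage = total_bases / 3.08e9       # assumed hg18 size, Ns included
print(round(total_reads / 1e9, 2), round(coverage, 1))  # 3.33 37.9
```

which is consistent with the ~38x figure Langmead et al. report.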

More of a concern, and hopefully less nitpicky, are the SNP calling numbers.  Langmead et al. reported over 21% more SNPs (3.73 million) than BGI did (3.07 million) on the same dataset, and attributed the difference to less stringent filtering.  Yet both groups used the same SNP caller, so is it possible that the Bowtie alignment, not the SNP filters, was responsible for what we presume are false positives?  This is an important question that Heng Li and others are already considering.
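For reference, the size of that discrepancy works out as follows (simple arithmetic on the counts above):

```python
crossbow_snps = 3.73e6   # SNPs reported by Langmead et al.
bgi_snps = 3.07e6        # SNPs reported by BGI on the same data
pct_more = (crossbow_snps - bgi_snps) / bgi_snps * 100
print(round(pct_more, 1))  # 21.5
```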

Whole-genome Sequencing Analysis for the Masses

I like the Salzberg group because they’re all about the small lab – about putting NGS processing capabilities into the hands of people without substantial computing resources.  Bowtie made it possible to map a lane of Illumina/Solexa data in a few hours, using only a laptop with 4 GB of RAM.  Now, Crossbow lets anyone with $85 in their budget run entire WGS datasets on borrowed (or rented) CPU time.  There’s no need to purchase, maintain, or continuously upgrade expensive computing hardware.  Even the storage space can be rented (e.g. from Amazon S3, which the authors used).  It is now literally possible to analyze an entire human genome from a laptop at the local coffee house.

References
Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009). Searching for SNPs with cloud computing. Genome Biology 10:R134. doi:10.1186/gb-2009-10-11-r134

Comments
Gary Stiehr

Dan, how does the computational complexity of this process change as the read lengths increase? I.e., for a given human genome with a given coverage, would the computing required decrease over time as the read lengths increase?

Ben L

Hi Dan,

Thanks for this thorough post!

Some very quick responses:

- The 38x number is from the raw number of bases in all of the reads. To calculate, we downloaded all the reads from the EBI UK mirror of the YanHuang site and totaled all bases in all reads, then divided by the number of bases (incl. Ns) in the hg18 reference. That gives us 38.2x. That's basically the same number as if you use the total for "Total bases (Gb)" from Table 1 of the Wang et al paper. Frankly, I'm not sure where the 36x number from the text of the Wang et al paper comes from; if you go by their total for "Mapped bases (Gb)" in Table 1, it gives you about 33.5x.

- As for # SNPs, filtering is the most likely reason because it's by far the biggest difference between the two experimental setups. Note that the alignment policy we used for Bowtie (2-mismatch unique (in the "stratified" sense)) is the same as is used in the SOAPsnp study (see the "Read alignment" subsection of the "Methods" section in the SOAPsnp study). We double-checked that they used this policy by looking at their alignments, which are available at the YanHuang site. We agree that our too-simplistic filtering scheme is a target for criticism, but because it's easy to implement as a non-parallel post-pass over the final SNP data, we thought it was too peripheral to fret about in the paper. That said, we're going to be working on it soon.

Thanks again - great blog,

Ben
