Comments on: Crossbow: NGS Informatics in the Cloud

By: Gary Stiehr

Gary Stiehr — Tue, 24 Nov 2009 20:20:19 +0000

Dan, how does the computational complexity of this process change as the read lengths increase? I.e., for a given human genome with a given coverage, would the computing required decrease over time as the read lengths increase?

By: Gary Stiehr

Gary Stiehr — Tue, 24 Nov 2009 19:58:08 +0000

Hi Dan, great analysis of the data. You may be interested in my analysis of the computational numbers presented in that paper as well: http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/

By: PolITiGenomics » Blog Archive » Bioinformatics and cloud computing

PolITiGenomics » Blog Archive » Bioinformatics and cloud computing — Tue, 24 Nov 2009 19:54:58 +0000

[…] From the Using clouds for parallel computations in systems biology workshop at the recent SC09 conference (Informatics Iron writeup) to last month’s Genome Informatics meeting, everyone in bioinformatics is talking about cloud computing these days. Last week Steven Salzberg’s group published a paper on their Crossbow tool entitled Searching for SNPs with cloud computing (Cloudera blog post on Crossbow). In the paper the authors describe how they were able to analyze the human sequence data published last year by BGI using Amazon EC2. Specifically, they have developed an alignment (bowtie) and SNP detection (SoapSNP) pipeline that is executed in parallel across a cluster using the Hadoop framework (a free software implementation of Google’s MapReduce framework). Using a 40-node, 320-core EC2 cluster, they were able to analyze 38× coverage sequence data in about three hours. The whole analysis, including data transfer and storage on Amazon S3, cost about $125. You can find a more detailed cost breakdown and comparison on Gary Stiehr’s HPCInfo post and more detail on the SNP detection on Dan Koboldt’s Mass Genomics post. […]

By: Ben L

Ben L — Mon, 23 Nov 2009 22:27:10 +0000

Hi Dan,

Thanks for this thorough post!

Some very quick responses:

– The 38x number comes from the raw number of bases in all of the reads. To calculate it, we downloaded all the reads from the EBI UK mirror of the YanHuang site, totaled the bases across all reads, then divided by the number of bases (incl. Ns) in the hg18 reference. That gives us 38.2x. That’s basically the same number you get if you use the total for “Total bases (Gb)” from Table 1 of the Wang et al. paper. Frankly, I’m not sure where the 36x number in the text of the Wang et al. paper comes from; if you go by their total for “Mapped bases (Gb)” in Table 1, it gives you about 33.5x.

– As for the number of SNPs, filtering is the most likely reason for the difference, because it’s by far the biggest difference between the two experimental setups. Note that the alignment policy we used for Bowtie (2-mismatch, unique in the “stratified” sense) is the same one used in the SOAPsnp study (see the “Read alignment” subsection of the “Methods” section in the SOAPsnp study). We double-checked that they used this policy by looking at their alignments, which are available at the YanHuang site. We agree that our too-simplistic filtering scheme is a target for criticism, but because it’s easy to implement as a non-parallel post-pass over the final SNP data, we thought it was too peripheral to fret about in the paper. That said, we’re going to be working on it soon.
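
For anyone who wants to redo the coverage arithmetic from the first point, here is a minimal sketch of that calculation: total the bases over all reads, then divide by the reference length with Ns included. The function name `fold_coverage` and the toy inputs are made up for illustration; they are not the actual scripts or the YanHuang numbers.

```python
def fold_coverage(read_lengths, reference_length):
    """Raw fold coverage: total sequenced bases / reference bases (Ns included)."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    total_bases = sum(read_lengths)  # every base in every read, mapped or not
    return total_bases / reference_length


# Toy numbers for illustration only, not the real data:
# four 35-base reads over a 70-base reference.
toy_reads = [35, 35, 35, 35]
toy_coverage = fold_coverage(toy_reads, 70)
print(f"{toy_coverage:.1f}x")
```

Passing only the lengths of the mapped reads instead of all reads gives the lower "mapped bases" style of figure, which is the distinction behind the 38.2x versus 33.5x numbers.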

Thanks again – great blog,
Ben