    Crossbow: NGS Informatics in the Cloud

    November 23rd, 2009

    Just online at Genome Biology is a new paper from the Steven Salzberg lab (UMD) on searching for SNPs with cloud computing.  Using $85 of computing time rented from Amazon’s EC2, Langmead et al processed an entire human genome – 3.3 billion reads totaling 38x coverage – in three hours.

    The “Cloud” Can Be Nebulous

    Cloud computing is a term bandied about often these days.  What it boils down to is this:  places with huge banks of computers (Providers, e.g. Amazon) rent out processing time to people who need it (Users).  The “cloud” refers to a software layer between providers and users that acts like a virtual operating system – it loads any software needed by the user, and also provides an access point for running highly parallelized tasks on the cluster. Next-gen sequencing data is well suited to this kind of processing, since a large NGS dataset can usually be broken into smaller subsets (e.g. Illumina lanes) and processed at the same time on different computers, without affecting the results.

    Map, Sort, Reduce

    Crossbow – the cloud computing software featured in this publication – cleverly breaks down the analysis into a series of map, sort, and reduce steps.  It takes a large sequencing dataset, breaks the reads into subsets, and maps them to the human genome using Bowtie (map).  Then, it divides the 3.2 gigabase human genome into 1,600 non-overlapping 2-megabase partitions and assigns every mapped read to a bin (sort).  The SNP caller, in this case SOAPsnp, is applied to each of these smaller bins rather than to the entire genome (reduce).
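
    As a rough illustration of the map, sort, and reduce steps described above, here is a minimal Python sketch.  It is not Crossbow's code: the toy aligner, the toy "SNP caller" (which only counts reads per bin), and every function name are hypothetical stand-ins for Bowtie and SOAPsnp.  Only the 2-Mbp bin size comes from the post.

```python
from collections import defaultdict

BIN_SIZE = 2_000_000  # 2-Mbp partitions, as described in the post

# Hypothetical lookup table standing in for a real aligner such as Bowtie.
TOY_HITS = {
    "ACGT": ("chr1", 150),
    "GGCA": ("chr1", 2_500_000),
    "TTAG": ("chr2", 10),
}


def toy_align(read):
    """Stand-in aligner: return (chrom, pos, read), or None if unmapped."""
    hit = TOY_HITS.get(read)
    return None if hit is None else (hit[0], hit[1], read)


def toy_call_snps(bin_key, alignments):
    """Stand-in for a SNP caller: just report how many reads fell in the bin."""
    return len(alignments)


def map_reads(reads, align):
    """Map step: align each read independently and keep the ones that map."""
    for read in reads:
        hit = align(read)
        if hit is not None:
            yield hit


def sort_into_bins(alignments, bin_size=BIN_SIZE):
    """Sort step: assign each alignment to a genome bin, ordered by position."""
    bins = defaultdict(list)
    for chrom, pos, read in alignments:
        bins[(chrom, pos // bin_size)].append((pos, read))
    for key in bins:
        bins[key].sort()
    return bins


def reduce_bins(bins, call_snps):
    """Reduce step: run the caller on each bin separately."""
    return {key: call_snps(key, alns) for key, alns in bins.items()}


def run_pipeline(reads, align=toy_align, call_snps=toy_call_snps):
    return reduce_bins(sort_into_bins(map_reads(reads, align)), call_snps)


result = run_pipeline(["ACGT", "GGCA", "TTAG", "NNNN"])
print(result)
```

    Because each read is aligned on its own and each bin is called on its own, both the map and the reduce steps can be spread across as many machines as are available.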

    The Need for Parallelization

    The CHB dataset is ~3.3 billion reads, with an average read length of 35 bp.  Even with Bowtie’s multi-threading and incredible speed, this massive dataset would take months to process on a single computer.  However, the authors divided the input reads into smaller subsets and aligned them in parallel, then processed the 2-Mbp genome “bins” in parallel as well.  Throw all of these parallel tasks at Amazon’s Elastic Compute Cloud (EC2), and it eats them up.  The high-performance EC2 cluster (40 nodes, each with 8 CPUs and 7 GB of RAM) finished all of the tasks in about 3 hours.

    Digging into the Numbers

    There are a couple of inconsistencies in the numbers that need to be ironed out.  For example, the BGI study reported 36X coverage from 3.3 billion reads (2.02 billion single-end, 658 million paired-end), whereas Langmead et al downloaded 2.7 billion reads from the “YanHuang Site” and noted that it represented 38X coverage.  Where did that extra 2X come from?  Langmead et al do cite the Nature paper by Wang et al, and I believe it’s the same dataset.

    At first I was concerned that the Salzberg group had only downloaded the mapped reads and run them, which would have been a biased test of alignment performance.  However, I don’t believe this is the case.  Instead, I believe they meant to say that they’d downloaded 2.02 billion single-end reads, and they’d also downloaded 657 million read pairs (1.314 billion paired-end reads).  This would yield the correct total of 3.3 billion reads.  I realize this is nitpicky.

    More of a concern and hopefully less nitpicky are the SNP calling numbers.  Langmead et al reported over 21% more SNPs (3.73 million) than BGI did (3.07 million) on the same dataset, and attributed the difference to less stringent filtering.  Yet both groups used the same SNP caller, so is it possible that the Bowtie alignment, not the SNP filters, was responsible for what we presume are false positives?  This is an important question that Heng Li and others are already considering.
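
    The arithmetic behind both points can be checked in a few lines of Python.  This is only a sanity check on the figures quoted above (using the 657 million read-pair figure); the variable names are my own.

```python
# Read counts quoted in the post.
single_end_reads = 2.02e9
read_pairs = 657e6

# Counting each pair as two reads reproduces the ~3.3 billion total.
total_reads = single_end_reads + 2 * read_pairs

# Counting each pair only once gives roughly the 2.7 billion figure.
pairs_counted_once = single_end_reads + read_pairs

# SNP calls reported by the two groups.
snps_crossbow = 3.73e6
snps_bgi = 3.07e6
excess_percent = (snps_crossbow - snps_bgi) / snps_bgi * 100

print(f"total reads:        {total_reads / 1e9:.2f} billion")
print(f"pairs counted once: {pairs_counted_once / 1e9:.2f} billion")
print(f"extra SNP calls:    {excess_percent:.1f}%")
```

    Counting each pair once lands near 2.7 billion, which is consistent with the guess that the paper's figure counts read pairs rather than individual paired-end reads.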

    Whole-genome Sequencing Analysis for the Masses

    I like the Salzberg group because they’re all about the small lab, about putting NGS processing capabilities into the hands of people without substantial computing resources.  Bowtie made it possible to map a lane of Illumina/Solexa data in a few hours, using only a laptop with 4 GB of RAM.  Now, Crossbow offers anyone with $85 in their budget the ability to run entire WGS datasets on borrowed (or rented) CPU time.  There’s no need to purchase, maintain, or continuously upgrade expensive computing hardware.  Even the storage space can be rented (e.g. from Amazon S3, which the authors used).  It is literally now possible for someone to analyze an entire human genome while sitting with their laptop at the local coffee house.

    References
    Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop and Steven L. Salzberg (2009). Searching for SNPs with cloud computing. Genome Biology, 10, R134. doi:10.1186/gb-2009-10-11-r134

    NGS Informatics: Hail to the Chief

    September 17th, 2009

    Bio-IT World’s Kevin Davies has a nice interview with David Dooling, who heads informatics here at the Genome Center and still finds time for his PolITiGenomics blog.  Dooling joined the center in 2001, as the Human Genome Project was wrapping up.  Now, he oversees about half of our informatics group – including IT personnel as well as the developers of our LIMS and automated data pipelines.

    All three groups, now that I think about it, have had to address significant challenges during our transition to a next-generation sequencing center.  Our LIMS deals with tens of millions of transactions per month, with a back-end database whose tables sometimes have billions of records.  Our automated pipeline (or APIPE) group develops all of the data pipelines that make whole-genome sequencing feasible – primary data analysis, alignment, coverage reporting, mutation detection, etc.  And the IT group must address the exponentially growing needs of data transfer and compute time for all of it – not an easy job.

    Despite these monumental tasks, under the leadership of David and others we’re “on a good path” to handle the current generation of sequencing tools.  Of course, that may change in the next couple of years, when technologies like Pac Bio’s SMRT platform begin cranking out single-molecule sequences 1,000 bases long or longer.

    In-House and Open Source

    Bio-IT World is heavily read by providers of commercial informatics tools, and this is reflected somewhat in the interview.  Davies often asks whether we’re working with any specific vendors, or considering any commercial tools.  Often enough we are – certainly for storage and data transfer systems, things that can’t be built from the ground up.  Yet whenever possible, we opt for the open-source solution.  Every workstation here, for example, runs Linux.  We have but one Windows PC, and it’s not allowed to connect to the internet.  Most of our LIMS and many of our in-house tools were written in Perl.

    A Tough Nut for Commercial Vendors

    There are, of course, commercial alternatives to anything.  Yet vendors face significant hurdles in marketing products to large genome centers.  The tools that we use are often highly customized, and must continually evolve to address new technological developments.  Take aligners for example.  In the early days of Illumina sequencing, we licensed some commercial software – SLIMsearch and SXOG, for example – because there simply were no good alternatives to ELAND.  Then Maq came along, offering better functionality and performance in a free and open source program (offered, no less, by our trusted friends across the pond).  Needless to say, the exorbitantly priced licenses were not renewed.

    Now there are numerous commercial solutions, and we’re often wooed by companies like CLC bio.  Yet for every commercial aligner there are half a dozen free/open-source alternatives, developed by academic groups that we respect and trust (Maq/BWA from Sanger, Bowtie from UMD, etc.), and many of these tools are pretty damn good.  A commercial option would have to be vastly superior to what’s currently available for us to consider a paid license.  With Bowtie and BWA mapping lanes of 15 million reads in just a couple of hours, the bar is already set pretty high.

    Outsourcing Sequencing?

    David offers, I think, a polite response to the question of whether we’d ever outsource our sequencing to a third party.  Personally, I can offer two reasons why this will probably never happen.  First, we’re already pretty happy with Illumina, a platform that can deliver whole human genomes at high coverage in just a few weeks.  All available evidence suggests that throughput will only continue to grow, and before long I expect we’ll be doing a genome on a single flowcell or less.  Of course, cost is a consideration (Illumina runs aren’t cheap).  It’s very possible that a company like Complete Genomics might be able to offer similar yields at a substantially reduced cost.  We do use companies like IDT and Agilent, for example, to synthesize oligo sequences that we could otherwise make in house.  They can make them cheaper, and faster, than we can.

    There is a second, and perhaps more compelling, reason to keep sequencing in-house: we’re in the business of research, and data is precious.  With our current capacity we can track the progress of sequencing runs in real-time, monitor error rates and alignment rates, and assess results the moment data comes off the machines.  We maintain a forensics-lab-like “chain of custody” on the data from start to finish.  Doing so offers a certain sense of security, and confidence, when we use the results to tackle some of the most fundamental questions in biology.

    WUCGI: WashU Cancer Genomics Initiative

    August 27th, 2009

    Yesterday afternoon was the kickoff party launching WashU’s Cancer Genomics Initiative (CGI), better known as our goal to sequence 150 cancer genomes in the coming year.

    Cancer Sequencing Ramps Up

    Under the leadership of Rick Wilson, Tim Ley, and Elaine Mardis, our group sequenced the first cancer genome, from a woman who had died of AML (M1), and published the results in Nature last fall.  Three weeks ago came the sequel.  In the New England Journal of Medicine, we published the complete genome of another M1 leukemia, this time from a man who’s been treated and remains in full remission.  In less than a year, the number of Illumina runs required to sequence a tumor genome dropped by over 80%, from 98 runs in AML1 to just 16.5 runs in AML2.

    It’s not just the sequencing throughput that makes WUCGI a realistic effort.  Many groups have Illumina sequencers, some even more than we do.  Some of the most critical advances have taken place behind the scenes – for example, the variant detection pipelines developed by David Larson, Ken Chen, Chris Harris, and others.  Sequencing on this scale would not be possible without the IT and informatics infrastructure, built under the leadership of David Dooling and Gary Stiehr, that gives us the computational firepower to run whole-genome analyses.

    Two Genomes Down, 150 To Go

    With two genomes published, the center leadership has set an ambitious goal: To sequence 150 cancer genomes in the coming year. Obviously, these will include more AML samples, hopefully some with therapy-related changes or abnormal cytogenetics.  In collaboration with Matt Ellis and others at the Siteman Cancer Center, we’ll be tackling breast cancer as well.  No doubt we’ll be revisiting lung cancer, for which we sequenced candidate genes as part of the Tumor Sequencing Project (TSP) consortium.  As part of the Cancer Genome Atlas (TCGA) consortium, we’re working on glioblastoma multiforme (brain cancer) and ovarian cancer.  Also, intriguingly, I hear rumors that there will be some sequencing of less common, largely unexplored cancers like multiple myeloma.

    As Tim Ley said yesterday, it’s thrilling to be a part of this.  We truly are entering the golden age of cancer genomics.

    Drowning in the Flood of Next-Gen Data

    April 18th, 2008

    Working at the WashU Genome Center, I expect to encounter datasets that are large even by bioinformatician standards. But as we transition from traditional 3730-based sequencing to next-generation platforms, I’m beginning to appreciate just how much additional infrastructure we’ll need to handle the data flow. In the Medical Genomics group we’re constantly pushing up against capacity – servers, disk space, and man hours. None of these are in adequate supply for what’s ahead.

    This is not to say that we’re without resources. In fact, the infrastructure already in place is considerable. We have about 500 computational servers (1600 cores) and nearly a petabyte (1,000 terabytes) of disk space. There’s an LSF system through which we submit and monitor jobs on The Blades.

    You Didn’t Need That Done TODAY, Did you?

    I submit about 1,000 small jobs and notice they’re all pending:

    The 62,000 job backlog

    No doubt that’s because there are 61,000 jobs in front of me. We have a few different “queues” into which jobs can be submitted. The “short” queue is for jobs that execute in less than 15 minutes. At one job per core, if every job takes the full 15 minutes, the 61,000 jobs ahead of mine amount to about 38 rounds across our 1,600 cores, so it looks like my jobs will start in about 9.5 hours. Oy.
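
    That back-of-the-envelope estimate can be written out as a short Python sketch. It assumes, as above, one job per core, that all 1,600 cores serve this queue, and that every job ahead of mine runs for the full 15 minutes; the function name is my own.

```python
def estimated_wait_hours(jobs_ahead, cores, minutes_per_job):
    """Hours until the first of my jobs can start.

    Simplifying assumptions: one job per core, every core serves this
    queue, and every job ahead of mine takes the full time limit.
    """
    # Full rounds of jobs that must finish before a core frees up for me.
    full_rounds = jobs_ahead // cores
    return full_rounds * minutes_per_job / 60


wait = estimated_wait_hours(jobs_ahead=61_000, cores=1_600, minutes_per_job=15)
print(f"Estimated wait: {wait:.1f} hours")
```

    In practice many short-queue jobs finish well before the limit, so this is best read as an upper bound rather than a forecast.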

    The powers that be around here are rushing to build up our resources. As I’m not part of management, I can’t say for sure how long it will take to get the disk space and hardware we need. One thing I do know: we need a lot, and we need it soon.
