NGS Informatics: Hail to the Chief
September 17th, 2009
Bio-IT World’s Kevin Davies has a nice interview with David Dooling, who heads informatics here at the Genome Center and still finds time for his PolITiGenomics blog. Dooling joined the center in 2001, as the Human Genome Project was wrapping up. Now, he oversees about half of our informatics group – including IT personnel as well as the developers of our LIMS and automated data pipelines.
All three groups, now that I think about it, have had to address significant challenges during our transition to a next-generation sequencing center. Our LIMS deals with tens of millions of transactions per month, with a back-end database whose tables sometimes have billions of records. Our automated pipeline (or APIPE) group develops all of the data pipelines that make whole-genome sequencing feasible – primary data analysis, alignment, coverage reporting, mutation detection, etc. And the IT group must address the exponentially growing needs of data transfer and compute time for all of it – not an easy job.
Despite these monumental tasks, under the leadership of David and others we’re “on a good path” to handle the current generation of sequencing tools. Of course, that may change in the next couple of years, when technologies like Pac Bio’s SMRT platform begin cranking out single-molecule sequences 1,000 bases long or longer.
In-House and Open Source
Bio-IT World is heavily read by providers of commercial informatics tools, and this is reflected somewhat in the interview. Davies often asks whether we’re working with any specific vendors, or considering any commercial tools. Often enough we are – certainly for storage and data transfer systems, things that can’t be built from the ground up. Yet whenever possible, we opt for the open-source solution. Every workstation here, for example, runs Linux. We have but one Windows PC, and it’s not allowed to connect to the internet. Most of our LIMS and many of our in-house tools were written in Perl.
A Tough Nut for Commercial Vendors
There are, of course, commercial alternatives to anything. Yet vendors face significant hurdles in marketing products to large genome centers. The tools that we use are often highly customized, and must continually evolve to address new technological developments. Take aligners for example. In the early days of Illumina sequencing, we licensed some commercial software – SLIMsearch and SXOG, for example – because there simply were no good alternatives to ELAND. Then Maq came along, offering better functionality and performance in a free and open source program (offered, no less, by our trusted friends across the pond). Exorbitantly priced licenses, needless to say, were quickly not renewed.
Now there are numerous commercial solutions, and we’re often wooed by companies like CLC bio. Yet for every commercial aligner there are half a dozen free/open-source alternatives, developed by academic groups that we respect and trust (Maq/BWA from Sanger, Bowtie from UMD, etc.), and many of these tools are pretty damn good. A commercial option would have to be so incredible, so vastly superior to what’s currently available, for us to even consider a paid license. With Bowtie and BWA mapping lanes of 15 million reads in just a couple of hours, the bar is already set pretty high.
Outsourcing Sequencing?
David offers, I think, a polite response to the question of whether we’d ever outsource our sequencing to a third party. Personally, I can offer two reasons why this will probably never happen. First, we’re already pretty happy with Illumina, a platform that can deliver whole human genomes at high coverage in just a few weeks. All available evidence suggests that throughput will only continue to grow, and before long I expect we’ll be doing a genome on a single flowcell or less. Of course, cost is a consideration (Illumina runs aren’t cheap). It’s very possible that a company like Complete Genomics might be able to offer similar yields at a substantially reduced cost. We do use companies like IDT and Agilent, for example, to synthesize oligo sequences that we might otherwise make in house. They can make them cheaper, and faster, than we can.
There is a second, and perhaps more compelling, reason to keep sequencing in-house – because we’re in the business of research, and data is precious. With our current capacity we can track the progress of sequencing runs in real time, monitor error rates and alignment rates, and assess results the moment data comes off the machines. We maintain a forensics-lab-like “chain of custody” on the data from start to finish. Doing so offers a certain sense of security, and confidence, when we use the results to tackle some of the most fundamental questions in biology.