The rise of next-generation sequencing has been a boon for the field of genetics. It’s fueled invaluable “big science” projects, like The Cancer Genome Atlas and the 1000 Genomes Project. Now, in 2015, the throughput of the current instruments has a lot of people excited. At the high end, the Illumina HiSeq X Ten system puts large-scale sequencing efforts (10,000-50,000 whole genomes) within reach, and it also encourages investigators or groups with smaller cohorts to fund sequencing studies of their own.
At both ends of the spectrum, researchers and their funding agencies have answered the siren call of fast, cheap genome sequencing. So it seems like a good time to remind everyone that the $1,xxx-per-genome price tag only covers the consumables and technician labor. For that price, you get raw sequencing data in the form of FASTQ files. Large-scale sequencing centers like ours will align the reads to the human reference sequence and deliver the results in a BAM file, but that’s generally it. Going beyond data generation (variant calling, downstream quality control, annotation, etc.) requires analysis. And analysis is not free.
The Cost of NGS Analysis
Even basic analysis of raw sequencing data comes with several costs. Usually, these are not accounted for when people talk about the price of sequencing.
Long-term data storage
A single BAM file for a 30x whole genome takes about 100 gigabytes of disk space, so a modest-sized study of 500 samples will require 50 terabytes just for the BAMs. Even compressed data formats (e.g. CRAM) won’t solve the storage issue. When your sequencing system can crank out 18,000 genomes a year, the disk space requirements are simply enormous.
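The arithmetic is easy to sketch. Here’s a quick back-of-the-envelope estimate in Python, using the round numbers above (roughly 100 GB per 30x BAM); the CRAM savings factor is a rough assumption of mine, not a measured figure:

```python
# Back-of-the-envelope storage estimate for a whole-genome study.
# ~100 GB per 30x BAM comes from the discussion above; the CRAM
# savings factor below is a rough assumption, not a measured number.
GB_PER_BAM = 100
samples = 500

bam_tb = samples * GB_PER_BAM / 1000   # 1 TB = 1000 GB
cram_tb = bam_tb * 0.6                 # assume CRAM saves ~40%

print(f"BAM storage:  {bam_tb:.0f} TB")   # 50 TB for the BAMs alone
print(f"CRAM storage: {cram_tb:.0f} TB")  # compressed, still sizable
```

Compression helps, but it changes the prefix, not the problem: a cohort that was painful to store uncompressed is still painful at 60% of the size.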
Computing infrastructure
Processing the raw sequence data (mapping it to the reference sequence, marking duplicates, and generating BAM files) requires a lot of computing power. We’re planning studies of 10,000 whole genomes or more, where each sequenced genome comprises 90 billion base pairs of sequence (30x coverage). Variant calling, base quality recalibration, and annotation for large-scale genomic datasets also have high computational demands. Maintaining a computing cluster that can handle this burden is expensive, as are the usage-based costs of cloud computing.
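To put that 90-billion figure in context, here is a minimal sketch of the scale, assuming a human genome of roughly 3 billion base pairs (which is where the number comes from):

```python
# Rough scale of a large-cohort sequencing project.
# Assumes a ~3 Gbp human genome at 30x coverage, per the text.
genome_bp = 3_000_000_000
coverage = 30
samples = 10_000

bp_per_genome = genome_bp * coverage   # 90 billion bp per sample
total_bp = bp_per_genome * samples     # ~9e14 bp for the whole cohort

print(f"{bp_per_genome / 1e9:.0f} billion bp per genome")  # 90
print(f"{total_bp:.1e} bp across {samples:,} genomes")
```

Every one of those base pairs has to be aligned, duplicate-marked, and scanned by the variant caller, which is why the compute bill scales with the cohort just as relentlessly as the storage bill does.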
Bioinformatics expertise
The analysis itself must be carried out by someone with bioinformatics expertise. Even with robust, highly automated pipelines in place, an analyst is needed to collect the data as it comes from the production team, configure the right analysis (aligner, reference assembly, variant calling strategy, annotation sources, etc.), monitor progress, and compile results. Analysts are typically salaried employees, and a single project can consume weeks or months of their time.
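To make concrete what “configuring the right analysis” involves, here is a hypothetical sketch of the choices an analyst pins down before launching a pipeline. The field names and values are illustrative only, not any real pipeline’s API:

```python
# Hypothetical pipeline configuration: the kinds of decisions an
# analyst must make before an automated run can start. All of the
# keys and values here are illustrative assumptions.
analysis_config = {
    "aligner": "bwa-mem",
    "reference_assembly": "GRCh37",
    "mark_duplicates": True,
    "variant_caller": "gatk-haplotypecaller",
    "annotation_sources": ["dbSNP", "1000 Genomes", "ClinVar"],
}

def validate(config):
    """Fail fast if a required analysis choice was never made."""
    required = {"aligner", "reference_assembly", "variant_caller"}
    missing = required - config.keys()
    if missing:
        raise ValueError(f"unconfigured steps: {sorted(missing)}")
    return config

validate(analysis_config)  # raises if a required choice was skipped
```

Each of these decisions affects the results downstream, which is exactly why “just run the pipeline” still requires a person who understands the options.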
When Analysis Is Not Included
Here’s the problem with the rapidly decreasing costs of genome sequencing under the current funding climate: no one wants to pay for analysis. Look, I get it. You’ve heard all of the song and dance about $1,xxx genomes, and now you’re being told that the analysis might be as expensive as the sequencing (if not more so)! It’s the classic razor/razor-blade or printer/ink scam all over again!
Given that gut reaction, it’s tempting to go with one of these alternative strategies:
“No, we don’t need analysis”
If you go this route, you should expect to get BAM files. For some people (e.g. people named Gabor or Goncalo), this is fine: they have analysis expertise and personnel. For others, this means you’ll probably have to pay someone to do the analysis anyway. It might as well be the people who generated your data for you.
Also, I should point out that NIH-funded sequencing data must be submitted to public repositories, often without embargo. Every day that you delay the necessary analysis is more time for your competitors to make use of this public data.
“There will be a separate RFA for analysis”
Certain funding agencies have adopted this model for large-scale projects: they issue one RFA for generating the sequencing data, then another RFA for groups to analyze it. This de-coupling of data production and analysis serves no one. First, the researchers who generated the sequencing data are probably best equipped to analyze it: they understand its nuances, and they don’t have to go through dbGaP for access permissions (we all know this is a nightmare).
Second, this arrangement delays the results and interpretation phase of the project. The sequencing centers aren’t funded to analyze the data coming off their instruments, and the analysis centers often can’t get started until data production is complete. It’s an inefficient process, and I don’t understand the appeal of this model at all.
“It’s OK, we just bought [software name]”
There are commercial software solutions designed to help investigators and small labs analyze large NGS datasets. In my opinion, there are two main issues with these. First, their developers invest a huge amount of time building a pretty, easy-to-use interface, but the underlying algorithms are often ones the field was using a year ago. The second and more pressing issue is the high cost of these tools.
“We’ll use grad students or postdocs”
There’s always a temptation to rely on “free” labor in the form of graduate students or postdocs. But these hapless souls probably don’t yet have the required expertise, so they’ll spend two months learning how to run BWA (probably by asking people like us for guidance). Their time might be better spent helping write grants and papers, leaving the analysis to people who do it every day as their job.
The Benefits of Funding Analysis
There’s sometimes a bit of sticker shock when budgeting for a project’s analysis, but there are also many important benefits. First, it ensures that you’ll get high-quality results delivered in a format you can actually use. Second, it allows the optimal analyses to be selected and performed by people who specialize in them. In-house analysts can also be faster, because they can begin working as soon as data comes off the instruments, rather than waiting for it to be handed off to an external group.
Perhaps most importantly, funding the analysis portion of a project makes it far more likely to succeed because it’s a collaborative effort. It lets the sequencing experts contribute their expertise, while the disease/phenotype experts contribute their own. Everyone plays to their strengths, and everyone’s efforts are supported by the funds of the project. This, in my opinion, is the best way to do great science.