The Open Source Software Debate in NGS Bioinformatics

The rise of next-generation sequencing technology has been a boon for the field of bioinformatics: the unprecedented throughputs, along with the diversity of possible applications in research and healthcare, have brought forth a new generation of software tools for sequence analysis and interpretation. The fact that growth in sequencing throughput has outstripped Moore’s Law for several years running has forced incredible innovation in bioinformatics tool development, because we no longer have the luxury of more computing power than we could ever possibly use.

The demand for new and improved analytical tools has only increased as next-generation sequencing technologies have become accessible to the wider research community. NGS bioinformatics has become an industry of its own: researchers can now make a career of it, and countless private organizations are trying to sell it. Unlike the market for sequencing technology, which is dominated by Illumina, the market for sequence analysis tools and platforms remains wide open.

Open Source Software Innovation

Importantly, many of the most innovative and popular tools, such as the BWA-MEM aligner, are open source software packages developed by academic researchers. The free-to-use, open source license is undoubtedly a huge factor in their success, as it confers several key advantages:

  • Rapid adoption by the research community to establish a strong user base
  • Community-sourced code improvements and support
  • Incorporation and expansion into other tools and pipelines
  • Free, fully-featured hosting on sites like SourceForge

There’s also a general sentiment among programmers and bioinformaticians — the people who build, implement, and apply software tools — that free, open-source software is a good and noble thing, particularly when it provides a cheap alternative to commercial software monopolies (e.g. the rise of Linux as a competitor to Microsoft Windows).

Disadvantages of Free Open Source

Although choosing a free, open-source software model for bioinformatics tools has benefits, it also carries some disadvantages. Just as bioinformatics analysis is not free, neither is software development. Good bioinformatics software must be maintained, supported, and improved to remain competitive and useful in this rapidly evolving field. This can become a substantial burden for developers, one that takes time away from developing new tools, writing grants, publishing papers, etc. Often, promising bioinformatics tools don’t get this follow-through, which leads many researchers to hesitate before adding a new software component to their pipeline.

The open nature of the code can also be a disadvantage, because it allows one’s competitors to see exactly what was done, and how it was done — information that they can incorporate into their own competing tools. Most open source licenses also permit commercial entities to freely modify and adapt software into a “product” that they can sell for a profit. This is essentially the business model for many commercial NGS software providers.

The Financial Crunch

In theory, bioinformaticians can continue to develop and publish open source software because their work is supported by grant funding. This model works quite well in a world where there’s plenty of grant money to go around. But we don’t live in that world: research budgets are shrinking, and competition for grants is higher than ever. This means that the researchers who develop crucial bioinformatics tools may not be able to find the funding to improve and support them. That leaves two options:

  1. Stop supporting, developing, and improving the software tool, or
  2. Identify new, alternative funding sources that could support the work

A possible solution for door number two would be to find a way to generate revenue from bioinformatics tools. In theory, licensing them to other groups and organizations could provide the funds to sustain development. Of course, this sacrifices some of the advantages of free open-source software: it will limit adoption of the tool, and it can also damage the tool’s reputation.

The Middle Ground: Free for Non-Commercial Use

There is a middle ground: developing open source software packages that are free for non-commercial use. In essence, this allows researchers at academic and nonprofit institutions to use the tool without paying for it. Commercial users, however — biotechs, pharmaceutical companies, bioinformatics software/service providers — must negotiate and pay for a license. It’s not a perfect solution: such a license does not allow a tool to be hosted on SourceForge, and it may limit adoption by the private research community.

This is the licensing model that we use for VarScan, our tool for identifying germline variants and somatic mutations in NGS data. The license, as formalized in our publications in Bioinformatics and Genome Research, is “free for non-commercial use.” In other words, the binaries and source code are freely available, and researchers at nonprofit/academic institutions can use them however they’d like. Commercial users, however, must obtain a license through WashU’s Office of Technology Management (OTM).

We are not alone in this: you’ll notice that other widely used NGS analysis tools, such as SOAP2 and GATK, also have a license requirement for commercial users. We’re all motivated by the same thing: the desire to continue supporting our software for the research community, and the inability to sustain that work on grants alone. I’m sorry, but there’s just not enough public funding to support bioinformatics software maintenance and improvement. There’s not enough public funding in general. Call your elected officials and point this out, please. In the current funding climate, licensing software to commercial entities is often the only way to survive.

And I’ll go out on a limb here to argue that it’s only fair. Biotechs and pharmaceutical companies use next-gen sequencing to develop patents, drugs, crops, and other products that they sell for enormous profits. NGS software companies charge steep prices for their products and services, most of which they sell to biotechs, hospitals, and clinics. Rather than all of the profits going to the shareholders of such companies, perhaps a small portion should support the researchers and institutions who developed these tools in the first place.

Why Bioinformatics Analysis Is Not Free

The rise of next-generation sequencing has been a boon for the field of genetics. It’s fueled invaluable “big science” projects, like The Cancer Genome Atlas and the 1000 Genomes Project. Now, in 2015, the throughput of the current instruments has a lot of people excited. At the high end, the Illumina X Ten system puts large-scale sequencing efforts (10,000-50,000 whole genomes) within reach. At the same time, it encourages investigators and groups with smaller cohorts to fund sequencing studies of their own.

At both ends of the spectrum, researchers and their funding agencies have answered the siren call of fast, cheap genome sequencing. Yet it seems like a good time to remind everyone that the $1,xxx per genome price tag only covers the consumables and technician labor. For that price, you get raw sequencing data in the form of FASTQ files. Large-scale sequencing centers like ours will align the reads to the human reference sequence and deliver the results in a BAM file, but that’s generally it. Going beyond data generation — to do variant calling, downstream quality control, annotation, etc. — requires analysis. And analysis is not free.

The Cost of NGS Analysis

Even basic analysis of raw sequencing data comes with several costs. Usually, these are not accounted for when people talk about the price of sequencing.

Long-term data storage

A single BAM file for a 30x whole genome takes about 100 gigabytes of disk space, so a modest-sized study of 500 samples will require 50 terabytes of disk just for the BAMs. Even newer compressed formats such as CRAM (with or without lossy quality-score compression) won’t solve the data storage issue. When your sequencing system can crank out 18,000 genomes a year, the disk space requirements are simply enormous.
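
To make the arithmetic concrete, here’s a back-of-the-envelope sketch in Python. The per-BAM size is the rough figure quoted above, and the sample counts are just illustrative:

```python
# Back-of-the-envelope storage estimate, assuming ~100 GB per 30x whole-genome
# BAM (the rough figure cited above). Sample counts are illustrative only.
GB_PER_30X_BAM = 100

def bam_storage_tb(n_samples, gb_per_bam=GB_PER_30X_BAM):
    """Terabytes of disk needed just for the BAM files."""
    return n_samples * gb_per_bam / 1000.0

# A modest study, a large study, and roughly one year of X Ten output:
for n in (500, 5000, 18000):
    print(f"{n:>6} genomes -> {bam_storage_tb(n):>8.1f} TB of BAMs")
```

At 18,000 genomes per year, that’s on the order of 1.8 petabytes annually, before you store anything beyond the BAMs.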

Computational resources

Processing the raw sequence data — mapping it to the reference sequence, marking duplicates, and generating BAM files — requires a lot of computing power. We’re planning studies of 10,000 whole genomes or more, and each sequenced genome comprises roughly 90 billion base pairs of raw sequence (30x coverage of a ~3 Gbp genome). Variant calling, base recalibration, and annotation for large-scale genomic datasets also have high computational demands. Maintaining a computing cluster that can handle this burden is expensive, as are the usage-based costs of cloud computing.
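
As a rough illustration of what that per-sample processing entails, here is a minimal sketch of the core mapping step in Python. This is not our production pipeline; it assumes paired-end FASTQ input, a bwa-indexed reference, and bwa/samtools on the PATH, and the file names and thread count are placeholders:

```python
# A rough sketch of the per-sample mapping step, not a production pipeline.
# Assumes bwa and samtools are installed and the reference is bwa-indexed.
import subprocess

def align_sample(ref_fasta, fastq1, fastq2, out_bam, threads=8):
    """Map reads with BWA-MEM and produce a coordinate-sorted, indexed BAM."""
    cmd = (
        f"bwa mem -t {threads} {ref_fasta} {fastq1} {fastq2} "
        f"| samtools sort -@ {threads} -o {out_bam} -"
    )
    subprocess.run(cmd, shell=True, check=True)
    subprocess.run(f"samtools index {out_bam}", shell=True, check=True)
    # Duplicate marking (e.g. Picard MarkDuplicates) and any recalibration
    # steps would follow here in a real pipeline.

if __name__ == "__main__":
    align_sample("GRCh38.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz",
                 "sample.sorted.bam")
```

Multiply a job like this by 10,000 samples and the cluster requirements become clear.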

Analysis time

The analysis itself must be carried out by someone with bioinformatics expertise. Even with robust, highly automated pipelines in place, an analyst is needed to collect the data as it comes from the production team, configure the right analysis (aligner, reference assembly, variant calling strategy, annotation sources, etc.), monitor progress, and compile results. Analysts are typically salaried employees, and a single project can consume weeks or months of their time.
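
For a sense of what “configuring the right analysis” means in practice, here is a hypothetical, greatly simplified record of the choices an analyst might capture for one project. The field names and values are illustrative only, not drawn from any particular pipeline:

```python
# Hypothetical, simplified record of the analysis choices described above.
# Field names and values are illustrative; they do not describe a specific
# production pipeline.
project_analysis_config = {
    "project": "example_cohort_2015",
    "reference_assembly": "GRCh38",
    "aligner": {"name": "bwa-mem", "version": "0.7.12", "threads": 8},
    "duplicate_marking": "Picard MarkDuplicates",
    "variant_calling": {
        "germline": "varscan mpileup2snp",
        "somatic": "varscan somatic",   # tumor/normal pairs only
    },
    "annotation_sources": ["dbSNP", "1000 Genomes", "ClinVar"],
    "deliverables": ["BAM", "VCF", "annotated variant report"],
}
```

Every one of these choices affects the results, which is exactly why an experienced analyst needs to make (and document) them.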

When Analysis Is Not Included

Here’s the problem with the rapidly decreasing costs of genome sequencing under the current funding climate: no one wants to pay for analysis. Look, I get it. You heard all of this song-and-dance about the $1,xxx genomes, and now you’re being told that the analysis might be just as expensive as (if not more expensive than) the sequencing! It’s the classic razor/razor-blade or printer/ink scam all over again!

Given that gut reaction, it’s tempting to go with one of these alternative strategies:

“No, we don’t need analysis”

If you go this route, you should expect to get BAM files. For some people (e.g. people named Gabor or Goncalo), this is fine: they have analysis expertise and personnel. For others, this means you’ll probably have to pay someone to do the analysis anyway. It might as well be the people who generated your data for you.

Also, I should point out that NIH-funded sequencing data must be submitted to public repositories, often without embargo. Every day that you delay the necessary analysis is more time for your competitors to make use of this public data.

“There will be a separate RFA for analysis”

Certain funding agencies have adopted this model for large-scale projects: they issue one RFA for generating the sequencing data, then another RFA for groups to do the analysis. This de-coupling of data production and analysis serves no one. I’m sorry, but the researchers who generated the sequencing data are probably best equipped to analyze it. They understand the nuances, and they don’t have to go to dbGaP to get permissions (we all know that’s a nightmare).

This arrangement also delays the results-and-interpretation phase of the project. The sequencing centers aren’t funded to analyze the data coming off the instruments, and the analysis centers often can’t get started until data production is complete. It’s an inefficient process, and I don’t understand this model at all.

“It’s OK, we just bought [software name]”

There are commercial software solutions designed to help investigators and small labs analyze large NGS datasets. In my opinion, there are two main issues with these. First, their developers invest a huge amount of time building a pretty, easy-to-use interface… but the underlying algorithms are the ones we were using a year ago. The second and more pressing issue is the high cost of these tools.

“We’ll use grad students or postdocs”

There’s always a temptation to rely on “free” labor in the form of graduate students or postdocs. But these hapless souls probably don’t yet have the required expertise, so they’ll spend two months learning how to run BWA (probably by asking people like us for guidance). Their time might be better spent helping to write grants and papers, leaving the analysis to people who do it every day as their job.

The Benefits of Funding Analysis

There’s sometimes a bit of sticker shock when budgeting for a project’s analysis, but there are also many important benefits. First, it ensures that you’ll get high-quality results delivered in a format you can actually use. Second, it allows the optimal analyses to be selected and performed by people who specialize in them. In-house analysts can also be faster, because they can often begin working before the data would otherwise be handed off to an external group.

Perhaps most importantly, funding the analysis portion of a project makes it far more likely to succeed because it’s a collaborative effort. It lets the sequencing experts contribute their expertise, while the disease/phenotype experts contribute their own. Everyone plays to their strengths, and everyone’s efforts are supported by the funds of the project. This, in my opinion, is the best way to do great science.

Predicted Highlights of ASHG 2015

I’m excited that the looming threats of a government shutdown and a hurricane landfall have receded, so I’ll make it to the American Society of Human Genetics meeting this week in Baltimore. Here are some of the events I’m looking forward to.

Cancer Genetics in the Genomics Era

Wednesday, October 7th, 11:00 a.m. to 1:00 p.m. in Hall F, Level 1

Adam Kibel and I are moderating this invited session, which features four distinguished speakers from the field of cancer genomics:

  • Mike Dyer of St. Jude Children’s Research Hospital, on somatic mutations in pediatric solid tumors and identification of druggable pathways
  • Li Ding of the McDonnell Genome Institute at WashU, on broad and in-depth computational analyses for large-scale cancer genetics
  • Jan Korbel of the European Molecular Biology Laboratory, on the germline studies of the Pan-Cancer Analysis of Whole Genomes Project
  • Josh Stuart of UC Santa Cruz, on the newly uncovered mutation/CNA patterns and integrative subtypes revealed by TCGA’s Pan-Cancer project

Breast and Prostate Cancer Genetics

Wednesday, October 7th, 2:30 p.m. to 4:30 p.m. in Ballroom I, Level 4

This abstract-driven session will offer an update on the genetic basis of susceptibility to these two common cancer types. The 8 presentations span a wide array of approaches — GWAS, genetic modifiers, exome sequencing, family studies — all designed to discover and characterize predisposition alleles.

At 2:45 p.m., I’ll describe our recent work with Paul Goodfellow (formerly of WashU, now at OSU) to uncover the genetic basis of early-onset breast cancer via exome sequencing.

The Art and Science of Science Communication

Thursday, October 8th, 9:00 a.m. to 10:30 a.m. in Hall F, Level 1

Chris Gunter is moderating this symposium on effective communication of scientific processes/findings to multiple audiences:

  • Ed Yong, author of the blog “Not Exactly Rocket Science”
  • Liz Neeley, executive director of The Story Collider
  • Andrea Downing, a BRCA1 mutation carrier and patient advocate

It’s thrilling to see a session like this, which may not be the typical “hard science” of traditional symposia, but addresses the critical task of sharing our methods and findings to win continued grant funding and public support.

The Genetic Basis of Disease

Thursday is jam-packed with good content for gene hunters like me. Here are some of the sessions I’m particularly interested in:

Thursday, 2:30-4:30

  • Session 25: Powering Up Complex Trait Genetics (Ballroom III, Level 4)
  • Session 38: Adult-onset Neuropsychiatric Disease (Room 315, Level 3)
  • Session 29: The Ever-Changing Chromosome (Room 318, Level 3)
  • Session 30: Hard and Soft Tissue Syndromes (Holiday Ballroom 1, Hilton 2nd Floor)
  • Session 31: Genetics/Genomics Education (Holiday Ballroom 4, Hilton 2nd Floor)

Thursday, 5:00-7:00

  • Session 32: Human-wide Association Studies (Ballroom I, Level 4)
  • Session 33: Decoding Variants in Coding Regions (Ballroom III, Level 4)
  • Session 34: Reproductive Genetics (Room 307, Level 3), including a presentation by our collaborator Renee George on the role of rare loss-of-function variants in spermatogenic failure.
  • Session 36: Approaches for Genomic Analysis (Room 316, Level 3)
  • Session 37: Clinical Genetics (Room 318, Level 3)
  • Session 38: Clinical Impact of Genetic Variation (Holiday Ballroom 1, Hilton 2nd Floor)
  • Session 39: Mendel and Beyond (Holiday Ballroom 4, Hilton 2nd Floor)

Going Platinum: Building a Better Genome

Friday, October 9th, 2:15 p.m. to 4:15 p.m. in Room 316, Level 3

Our longtime collaborator Deanna Church is moderating this session, which launches with Karyn Meltz Steinberg’s talk on single-haplotype human genomes generated from long-molecule sequencing. Her talk will cover what we consider to be true platinum human genome assemblies. Spoiler warning: it’s not just high-coverage Illumina WGS data. That’s like, silver (if not bronze).

The Big Finish: Saturday Sessions

Many ASHG attendees travel home on Saturday (if not before), but the agenda this year is tantalizing.

  • Opening up big data (10:30-12:30 in Ballroom III, Level 4), moderated by Joe Pickrell of NYGC and Joanna Mountain of 23andMe.
  • Integrating genomes and transcriptomes to understand disease (1:45-3:45 in Hall F, Level 1), moderated by Michael Clark (Personalis) and Tuuli Lappalainen (NYGC). One of the featured speakers is our co-director Elaine Mardis, who will discuss the correlative power of DNA to RNA in cancer genomics at 1:45 p.m.

Other less formal events that I’m looking forward to include:

  • #ASHG15 Tweetups, the traditional (and slightly awkward) rendezvous of the meeting’s Twitterati.
  • Breakfast and dinner meetings with my collaborators from Texas, Finland, and elsewhere.

My schedule for ASHG is very tight this year, but I’d love the chance to meet MassGenomics readers and fellow bloggers. Please give me a shout on Twitter if you’ll be there.

How to Catch a Virus: Targeted Capture for Viral Sequencing

Metagenomic profiling, also called metagenomic shotgun sequencing (MSS), is a powerful application made possible by the digital nature of next-gen sequencing technologies. In essence, one sequences everything in a sample obtained from somewhere — a shovelful of dirt, a scoop of plankton, or anything else that contains living organisms. MSS has proven particularly useful for studies of the human microbiome, or in layman’s terms, all of the bacteria, viruses, and fungi that live in our bodies.

Many of these microbes are beneficial or simply commensal (living with us without doing harm). Others, like methicillin-resistant Staphylococcus aureus (MRSA), can cause severe disease. Most efforts to chart the human microbiome have focused on bacteria, whose relatively stable genomes make them amenable to assay development. Viruses, in contrast, are somewhat under-studied, partly because of the small size and highly variable nature of viral genomes.

A new study in Genome Research showcases a capture-based enrichment strategy to improve virome sequencing. The ViroCap panel was developed by Todd and Kristine Wylie, who happen to be colleagues of mine at the McDonnell Genome Institute. The panel enriches for nucleic acids from 34 families of DNA or RNA viruses that infect vertebrate hosts, beautifully illustrated in Figure 1 from the paper:

Figure 1: Virome capture panel (Wylie et al, Genome Research 2015)

At the time of the ViroCap design, NCBI GenBank contained the sequenced genomes of around 440 viral species, for a total of about 1 Gbp (billion base pairs) of sequence. Yet the maximum size of a capture reagent (for Nimblegen SeqCap EZ) was 200 million bp. So the authors winnowed down the list by removing:

  • Bacteriophages (only infect bacteria)
  • Human endogenous retroviruses (already in our constitutional genome)
  • Viruses that infect only fungi, archaea, algae, or invertebrate hosts

The resulting targets represent 34 viral families, comprising 190 annotated genera and 337 different species. After considerable bioinformatics effort, the authors produced a ~200 Mbp sequence target and worked with Nimblegen to have the capture reagent designed. A rough sketch of this kind of filtering appears below.
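
Here’s a minimal sketch of the target-selection filtering described above. The record structure and host annotations are hypothetical, and the real ViroCap design involved far more curation than this:

```python
# A minimal sketch of the exclusion rules listed above. The record structure
# and host annotations are hypothetical; the real design involved far more
# curation (see Wylie et al, 2015).
from dataclasses import dataclass

MAX_TARGET_BP = 200_000_000  # approximate capture reagent design limit

@dataclass
class ViralGenome:
    species: str
    family: str
    length_bp: int
    hosts: frozenset      # e.g. frozenset({"human"}), frozenset({"bacteria"})
    is_endogenous: bool   # e.g. human endogenous retroviruses

def keep(g: ViralGenome) -> bool:
    """Apply the three exclusion rules from the list above."""
    if g.hosts and g.hosts <= frozenset({"bacteria"}):
        return False   # bacteriophages infect only bacteria
    if g.is_endogenous:
        return False   # already present in our constitutional genome
    if not (g.hosts & frozenset({"human", "vertebrate"})):
        return False   # fungi-, archaea-, algae-, or invertebrate-only viruses
    return True

def select_targets(genomes):
    """Filter candidate genomes and report the total target size."""
    selected = [g for g in genomes if keep(g)]
    total_bp = sum(g.length_bp for g in selected)
    # Further condensing is needed if total_bp still exceeds MAX_TARGET_BP.
    return selected, total_bp
```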

ViroCap Evaluation in Clinical and Research Samples

To validate the new reagent, the authors leveraged two small cohorts of patient samples that had tested positive for viral infection by molecular or PCR-based detection assays. Illumina sequencing libraries were created for each of the sixteen samples and then sequenced in parallel with and without ViroCap enrichment. The results are pretty striking:

Performance Metric                     Clinical Samples (n=8)    Research Samples (n=8)
Viruses detected (MSS)                 10                        14
Viruses detected (ViroCap)             11                        18
Median coverage breadth (MSS)          2.1%                      2.0%
Median coverage breadth (ViroCap)      83.2%                     75.6%

ViroCap enables better detection and improved overall breadth of coverage for viral genomes. Figure 1 illustrates this very well. Here’s the coverage of norovirus (often the culprit in cruise ship outbreaks) in sample P6:

Norovirus coverage comparison (Wylie et al, Genome Research 2015)

You’re looking at the depth of coverage achieved across the reference by metagenomic shotgun sequencing (top right, in red) compared to the coverage of ViroCap sequencing. The breadth of coverage was 51% higher with ViroCap, and the average depth went from about 3x to 180x by my estimate. Here’s Influenza A (H3N2):

Coverage comparison of Influenza A (H3N2) (Wylie et al, Genome Research 2015)

In this case, the virus went from essentially undetected (2 reads) in MSS to 20x-140x average depth with ViroCap.
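
For the curious, breadth and depth figures like these can be computed from per-base depth values (e.g. the output of samtools depth -a). Here’s a small sketch; the file names and genome length are hypothetical:

```python
# Sketch of breadth/depth calculations from per-base depths, e.g. the third
# column of `samtools depth -a` output. File names and the genome length
# below are hypothetical.
def coverage_stats(depths, genome_length):
    """Return (breadth, mean_depth) for a list of per-base depths."""
    covered = sum(1 for d in depths if d > 0)
    breadth = covered / genome_length          # fraction of reference covered
    mean_depth = sum(depths) / genome_length   # average depth over the genome
    return breadth, mean_depth

def read_depths(path, genome_length):
    """Parse per-base depths from 'chrom pos depth' lines."""
    depths = [0] * genome_length
    with open(path) as fh:
        for line in fh:
            chrom, pos, depth = line.split()[:3]
            depths[int(pos) - 1] = int(depth)
    return depths

# Example comparison for a single viral reference (~7.5 kb genome assumed):
# mss = coverage_stats(read_depths("sample.mss.depth", 7500), 7500)
# vc  = coverage_stats(read_depths("sample.virocap.depth", 7500), 7500)
```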

Variable Virus Genomes

One potential criticism of capture-based assays for viral sequencing is that highly variable genomes might not be well-captured due to substantial divergence from the reference sequence used to design probes. We know that 100% sequence identity isn’t required, or else capture sequencing methods (e.g. exome sequencing) would never have become a mainstay for human genetics. Yet viral genomes are both variable and highly mutable, so it’s important to know how well ViroCap addresses that.

To investigate this, the authors looked at samples positive for anelloviruses, a highly divergent group of single-stranded DNA viruses that share a common core genome but can have up to 50% nucleotide sequence diversity. In those samples, contigs with sequence identity as low as 62% were completely covered by ViroCap sequencing. The most divergent contig observed had 58% identity and was missing about 13% of the target region, suggesting that viral genomes diverging by roughly 40% or more from the reference will begin to lose coverage with ViroCap.
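
To make those identity figures concrete, here’s a toy calculation of percent identity over a contig-to-reference alignment. This is not the authors’ method, just an illustration; the sequences are made up:

```python
# Toy illustration of percent identity, assuming the contig has already been
# aligned to the reference (gaps as '-'). Not the authors' method; the example
# sequences below are invented.
def percent_identity(aligned_contig: str, aligned_ref: str) -> float:
    """Identity over aligned (non-gap) columns of two equal-length strings."""
    assert len(aligned_contig) == len(aligned_ref)
    pairs = [(a, b) for a, b in zip(aligned_contig, aligned_ref)
             if a != "-" and b != "-"]
    matches = sum(1 for a, b in pairs if a.upper() == b.upper())
    return 100.0 * matches / len(pairs)

# Toy example (10 aligned bases, 7 matches -> 70% identity):
print(percent_identity("ACGTACGTAC", "ACGAACTTAG"))
```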

In summary, Wylie et al have developed a valuable resource for viral metagenomic sequencing that should have immediate utility in both research and clinical settings.

Wylie TN, Wylie KM, Herter BN, & Storch GA (2015). Enhanced virome sequencing through solution-based capture enrichment. Genome Research. PMID: 26395152