Clinical Sequencing: Addressing the Dreaded VUS

July 19, 2017 by Dan Koboldt

The adoption of exome and whole genome sequencing as frontline genetic tests has been shown to increase the speed and success rate of molecular diagnosis. These improvements should not only help end the diagnostic odyssey for many patients, but also allow faster intervention and genome‐informed clinical care.

Yet the task of interpreting exome‐ or genome‐wide variants of possible clinical relevance remains daunting for geneticists and genetic counselors. Unbiased sequencing yields a substantial number of variants, many of which are private to an individual or family and whose pathogenicity is difficult to assess. As a result, genetic testing reports increasingly contain variants of unknown significance (VUS).

The Problem with VUS

VUS are a growing problem in this era of widespread clinical exome sequencing. They’re generally included in clinical reports when associated with a relevant patient phenotype. Yet they’re also the most time-consuming class of variants to interpret for both the clinical laboratory personnel and the clinician receiving the report. I imagine they’re difficult to explain to patients as well (“Yes, you have a variant. No, we don’t know what it means.”).

A VUS classification simply means that a variant didn’t meet the ACMG-defined criteria to be classified as pathogenic/likely-pathogenic or benign/likely-benign. Sometimes it’s due to information that’s missing, such as the variant’s mutation status (i.e. is it de novo) or its effect on protein function/splicing. Other times, the information at hand is conflicting: computational algorithms have different estimates of pathogenicity, or the variant has been reported in both disease patients and healthy individuals.
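
To make those two routes to a VUS concrete, here's a minimal Python sketch. It assumes the evidence has already been tallied into a pathogenic-side call and a benign-side call, either of which may be missing. The function name and label strings are hypothetical, and the sketch does not implement the actual ACMG evidence-combining rules; it only shows how "not enough evidence" and "conflicting evidence" both land in the same bucket.

```python
from typing import Optional

VUS = "uncertain significance"


def final_classification(pathogenic_call: Optional[str],
                         benign_call: Optional[str]) -> str:
    """Resolve two already-computed evidence calls into one report category.

    pathogenic_call: "pathogenic", "likely pathogenic", or None when the
        pathogenic criteria were not met (hypothetical labels).
    benign_call: "benign", "likely benign", or None when the benign
        criteria were not met (hypothetical labels).
    """
    if pathogenic_call and benign_call:
        # Conflicting evidence: both sides reached a threshold.
        return VUS
    if pathogenic_call:
        return pathogenic_call
    if benign_call:
        return benign_call
    # Missing evidence: neither side reached a threshold.
    return VUS


print(final_classification(None, None))
print(final_classification("likely pathogenic", "likely benign"))
print(final_classification("likely pathogenic", None))
```

The point of the sketch is that VUS is the fall-through category: a variant ends up there by default, not because anyone found positive evidence of uncertainty.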

The VUS problem is not limited to clinical sequencing, by the way. Researchers conducting studies of rare/Mendelian disorders also struggle with this uncertainty when trying to uncover new disease-causing genes and mutations.

VUS-busters at ASHG

The bad news is that VUS are a significant issue. The good news is that many groups are working on methods, tools, databases, and other resources to shift would-be-VUS into a more useful category on a clinical report. I worked with several of them to propose an invited session on VUS-busting for the American Society of Human Genetics Meeting, which takes place in Orlando in October. I’m pleased to report that the programming committee responded with enthusiasm, and selected our invited session as a clinical spotlight!

Session Title: VUS-busters: Cutting-edge Strategies for Interpreting Variants in Clinical and Research Sequencing

Date: Thursday, October 19, 2017

Time: 4:15-6:15 p.m.

Moderators: Dan Koboldt (Nationwide Children’s Hospital) and Aaron Quinlan (University of Utah)

Description: This session brings together a diverse panel of experts in clinical genomic medicine to address issues of central importance to both researchers and clinicians. They will describe state‐of‐the‐art approaches for alleviating the interpretation bottleneck, including aggregated genome databases, comprehensive variant annotation, and phenotype‐driven analysis. They will also describe best practices for solving difficult‐to‐diagnose cases: disorders arising from unusual inheritance models, somatic and germline mosaicism, pathogenic noncoding variants, structural variants, and multifactorial genetic bases. Together, they will provide cohesive guidance for improving the speed and success of variant interpretation.

I’m pleased to announce these four speakers, each of whom will bring a unique perspective to the session:

Anne O’Donnell-Luria (Boston Children’s Hospital), on using large-scale, diverse reference databases to improve variant interpretation.
Kim McBride (Nationwide Children’s Hospital), on identifying oligogenic causes of congenital heart disease.
Nara Sobreira (Johns Hopkins University), on novel analytic approaches used to solve unsolved whole exome sequencing cases.
Peter Robinson (Jackson Laboratory), on phenotype-driven analysis of exome and genome data.

Special thanks to Anne, Kim, Nara, Peter, and Aaron for helping me prepare a competitive invited session proposal, and to the programming committee for supporting our vision. See you in Orlando this October!

The Real Cost of Sequencing

April 8, 2016 by Dan Koboldt

The real cost of sequencing is as hard to pin down as a sumo wrestler. Working in a large-scale sequencing laboratory offers an interesting perspective on the duality of the so-called “cost per genome.” On one hand, we see certain equipment manufacturers and many people in the media tossing around claims that sequencing a genome now costs under $1,000. On the other, we write grant budgets and estimates based on actual costs, which include things like sample assessment, variant calling, and data storage. With these incorporated, the cost per genome is not that low, even for large projects.

I came across a wonderful opinion piece at Genome Biology, in which the authors discuss the evolution of sequencing and computing technologies over the past 60 years. Admittedly, I found it a bit daunting at first, because theories of computation and “conceptual frameworks” don’t excite me. Once I pushed past the organizing principle stuff, however, I found it contained some shrewd perspectives on the current state and near future of genomics.

Big Data: Large Scale Sequencing

Credit: Muir et al, Genome Biology, 2016

The rise of next-gen sequencing factors significantly in the big data paradigm for genomics. Rather than trot out the sequencing cost versus Moore’s law figure, the authors provided some compelling illustrations of the dramatic increase in the pace and quantity of sequencing. The most striking of these was a pie chart of the sequence data contributed by large-scale projects.

The Cancer Genome Atlas (TCGA) dwarfs everyone else, with 2,300 terabases of sequencing data. This is ten times the amount generated by the 1,000 Genomes Project, and 30 times the amount in the Alzheimer’s Disease Sequencing Project (ADSP).

Costs and Economies of Scale

A key concept highlighted by the authors is the interplay between fixed and variable costs. The sequencing technologies utilized for the Human Genome Project had considerable up-front costs (i.e. instrument purchase) and somewhat fixed per-sample costs. In contrast, next-generation sequencing has a high up-front cost, but a reduced per-sample cost as volume increases. In other words, the more genomes we produce, the less they cost. True, this economy of scale has an upper limit, but the current throughput of an Illumina X Ten system — 18,000 human whole genomes per year — provides enormous capacity.
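
The amortization behind that economy of scale is simple arithmetic. Here's a rough Python sketch of it. The dollar figures are made-up placeholders, not actual instrument or reagent prices, and the function name is mine rather than anything from the paper.

```python
def cost_per_genome(fixed_cost: float, variable_cost: float, n_genomes: int) -> float:
    """Average cost per genome when a fixed up-front cost is spread over n genomes."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return fixed_cost / n_genomes + variable_cost


# Placeholder figures for illustration only (assumptions, not real prices).
FIXED_COST = 10_000_000.0   # instruments, installation, etc.
VARIABLE_COST = 1_000.0     # consumables and labor per genome

for n in (1_000, 10_000, 18_000):
    print(f"{n:>6} genomes: ${cost_per_genome(FIXED_COST, VARIABLE_COST, n):,.0f} each")
```

However many genomes you run, the average never drops below the variable cost per genome, which is one way to see that upper limit on the economy of scale.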

Interestingly, the opposite paradigm-shift is taking place in the computing industry. Until recently, the model for computing mirrored NGS: large up-front cost of buying the servers, but lower variable costs for running them. In some ways, this erected a barrier for smaller labs hoping to tackle complex problems, because they might not be able to afford enough computing equipment to handle the workload. Yet cloud computing and computing-as-a-service platforms have largely removed the need for that up-front investment. Anyone can buy as much computing power as they need on the Amazon or Google clouds. Although the variable cost (per CPU hour) is higher than that of a large data center, there’s no large fixed cost at the front end. As the authors put it:

This new regime, in which costs scale with the amount of computational processing time, places a premium on driving down the average cost by developing efficient algorithms for data processing.

As a bioinformatician, I think this is a good thing, because it forces us to improve our software tools and pipelines to become as efficient as possible.
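
To put rough numbers on that trade-off, here's a small Python sketch comparing the two cost models. Every rate in it is a hypothetical placeholder (not a quote from Amazon, Google, or any data center), and the function names are mine.

```python
from typing import Optional


def owned_cost(fixed_cost: float, rate_per_cpu_hour: float, cpu_hours: float) -> float:
    """Traditional model: large up-front purchase plus a low running cost."""
    return fixed_cost + rate_per_cpu_hour * cpu_hours


def cloud_cost(rate_per_cpu_hour: float, cpu_hours: float) -> float:
    """Cloud model: no up-front cost, but a higher price per CPU hour."""
    return rate_per_cpu_hour * cpu_hours


def break_even_cpu_hours(fixed_cost: float, owned_rate: float,
                         cloud_rate: float) -> Optional[float]:
    """CPU hours at which both models cost the same, or None if owning never catches up."""
    if cloud_rate <= owned_rate:
        return None
    return fixed_cost / (cloud_rate - owned_rate)


# Hypothetical placeholder figures, for illustration only.
hours = break_even_cpu_hours(fixed_cost=100_000.0, owned_rate=0.02, cloud_rate=0.07)
print(f"Break-even at roughly {hours:,.0f} CPU hours")
```

Below the break-even point the cloud is cheaper, which is why it lowers the barrier for smaller labs. It's also why efficient algorithms matter more under the cloud model: every CPU hour saved comes straight off the bill.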

Although cloud computing offers tremendous appeal, it faces some challenges for widespread adoption in our field. Most sequencing takes place in academic settings, where equipment purchases are often exempt from indirect fees (because the university can write off depreciation). Also, many investigators don’t have to pay for the basic utilities required to run computing equipment (e.g. electricity and cooling). These factors encourage us to stick with the traditional computing model rather than shift to cloud computing, which is subject to indirect costs.

Breaking Down the Cost of Sequencing

Credit: Muir et al, Genome Biology, 2016

We tend to measure the cost of sequencing as bases per dollar, or more recently, X dollars per genome. Both funding agencies and sequencing customers like to ask how much an exome or a genome costs. This single-price figure has some disadvantages:

It’s not always clear what that dollar figure includes. Is it purely the sequencing run cost, or does it account for non-free things like sample assessment, handling, and bioinformatics analysis? Notice how they’re not included in the figure at right.
It obscures the true cost breakdown of a sequencing project into its constituent parts, which complicates cost estimates and makes it harder to adapt to changes like the shift to cloud computing.
It can lead to unrealistic expectations. People hear about this $1,000 genome, so they come to us for a whole-genome sequencing quote, and get upset when (1) it’s not that low, and (2) we have to add other costs, like sample handling, to the estimate.

Unrealistic expectations are a source of constant frustration for us. When we provide estimates for a sequencing project, we include analysis time as a recommended (but often not required) line item. Of course, no one wants to pay for analysis — they just want the sequencing. Sometimes this is just fine — we provide sequencing for a number of collaborators who are capable at NGS analysis. Other times, the customer later asks “How do I open this BAM file to see my variants?”

Sorry, but high-quality variant calls require analysis, and as I’ve written before, bioinformatics analysis is not free.

One thing that concerns me about the current state of federal funding (for sequencing) in the United States is that large-scale projects emphasize data production, not data analysis. The RFA for NHGRI’s large-scale sequencing program (CCDG) mandated that 80% of the budget go to data production. Yet as the authors of this opinion piece correctly point out:

As bioinformatics becomes increasingly important in the generation of biological insight from sequencing data, the long-term storage and analysis of sequencing data will represent a larger fraction of project cost.

I couldn’t agree more.

References

Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, & Gerstein M (2016). The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology, 17(1). PMID: 27009100

The Open Source Software Debate in NGS Bioinformatics

November 13, 2015 by Dan Koboldt

The rise of next-generation sequencing technology has been a boon for the field of bioinformatics, since the unprecedented throughputs, along with the diversity of possible applications in research and healthcare, brought forth a new generation of software tools for sequence analysis and interpretation. The fact that the growth in sequencing throughput has outstripped Moore’s Law for several years running has forced incredible innovation in the field of bioinformatics tool development, because we no longer have the luxury of more computing power than we could ever possibly use.

The demand for new and improved analytical tools has only increased as next-gen sequencing technologies have become accessible to the wider research community. NGS bioinformatics has become an industry of its own: researchers can now make a career out of it, and countless private organizations are trying to sell it. Unlike the market for sequencing technology, which is dominated by Illumina, the market for sequence analysis tools and platforms remains wide open.

Open Source Software Innovation

Importantly, many of the most innovative and popular tools, such as the BWA-MEM aligner, are open source software packages developed by academic researchers. The free-to-use, open source license is undoubtedly a huge factor in their success, as it conferred several key advantages:

Rapid adoption by the research community to establish a strong user base
Community-sourced code improvements and support
Incorporation and expansion into other tools and pipelines
Free, fully-featured hosting on sites like SourceForge

There’s also a general sentiment among programmers and bioinformaticians — the people who build, implement, and apply software tools — that free, open-source software is a good and noble thing. Particularly when it provides a cheap alternative to commercial software monopolies, e.g. the rise of Linux as a competitor to Microsoft Windows.

Disadvantages of Free Open Source

Although choosing a free, open-source software model for bioinformatics tools has benefits, it also carries some disadvantages. Just as bioinformatics analysis is not free, neither is software development. Good bioinformatics software must be maintained, supported, and improved to remain competitive and useful in this rapidly-evolving field. This can become a substantial burden for developers, one that takes time away from developing new tools, writing grants, publishing papers, etc. Often, promising bioinformatics tools don’t get this follow-through, which leads many researchers to hesitate before adding a new software component to their pipeline.

The open nature of the code can also be a disadvantage, because it allows one’s competitors to see exactly what was done, and how it was done — information that they can incorporate into their own competing tools. Most open source licenses also permit commercial entities to freely modify and adapt software into a “product” that they can sell for a profit. This is essentially the business model for many commercial NGS software providers.

The Financial Crunch

In theory, bioinformaticians can continue to develop and publish open source software because their work is supported by grant funding. This model works quite well in a world where there’s plenty of grant money to go around. But we don’t live in that world: research budgets are shrinking, and competition for grants is higher than ever. This means that the researchers who develop crucial bioinformatics tools may not be able to find the funding to improve and support them. That leaves two options:

Stop supporting, developing, and improving the software tool, or
Identify new, alternative funding sources that could support the work

A possible solution for door number two would be to find a way to generate revenue from bioinformatics tools. In theory, allowing them to be licensed by other groups and organizations could provide the funds to sustain development. Of course, one loses some of the advantages of free open-source software: this will limit adoption of the tool, and can also be damaging to its reputation.

The Middle Ground: Free for Non-Commercial Use

There is a middle ground, which is developing open source software packages that are free for non-commercial use. In essence, this allows researchers at academic and nonprofit institutions to use the tool without paying for it. Commercial users, however — biotechs, pharmaceutical companies, bioinformatics software/service providers — must negotiate and pay for a license. It’s not a perfect solution, because such a license does not allow a tool to be hosted on SourceForge and may limit adoption of the tool by the private research community.

This is the licensing model that we use for VarScan, our tool for identifying germline variants and somatic mutations in NGS data. The license, as made formal in the publications in Bioinformatics and Genome Research, is “free for non-commercial use.” In other words, the binaries and source code are freely available, and researchers at nonprofit/academic institutions can use them however they’d like to. Commercial users, however, must obtain a license through WashU’s Office of Technology Management (OTM).

We are not alone in this: you’ll notice that other widely-used NGS analysis tools such as SOAP2 and GATK also have a license requirement for commercial users. We’re all motivated by the same thing: the desire to continue supporting our software for the research community, but the inability to support that by grants alone. I’m sorry, but there’s just not enough public funding that supports bioinformatics software maintenance and improvement. There’s not enough public funding in general. Call your elected officials and point this out, please. In the current funding climate, licensing software to commercial entities is often the only way to survive.

And I’ll go out on a limb here to argue that it’s only fair. Biotechs and pharmaceutical companies use next-gen sequencing to develop patents, drugs, crops, and other products that they sell for enormous profits. NGS software companies charge steep prices for their products and services, most of which they sell to biotechs, hospitals, and clinics. Rather than all profits going to the shareholders of such companies, a small portion should perhaps support the researchers and institutions who developed these tools in the first place.

Why Bioinformatics Analysis Is Not Free

October 29, 2015 by Dan Koboldt

The rise of next-generation sequencing has been a boon for the field of genetics. It’s fueled invaluable “big science” projects, like the Cancer Genome Atlas and the 1,000 Genomes Project. Now, in 2015, the throughput of the current instruments has a lot of people excited. At the high end, the Illumina X Ten system puts large-scale sequencing efforts (10,000-50,000 whole genomes) within reach. Yet it also encourages investigators or groups with smaller cohorts to fund sequencing studies of their own.

At both ends of the spectrum, both researchers and their funding agencies have answered the siren call of fast, cheap genome sequencing. Yet it seems like a good time to remind everyone that the $1,xxx per genome price tag only covers the consumables and technician labor. For that price, you get raw sequencing data in the form of FASTQ files. Large-scale sequencing centers like ours will align the reads to the human reference sequence and deliver the results in a BAM file, but that’s generally it. Going beyond data generation — to do variant calling, downstream quality control, annotation, etc. — requires analysis. And analysis is not free.

The Cost of NGS Analysis

Even basic analysis of raw sequencing data comes with several costs. Usually, these are not accounted for when people talk about the price of sequencing.

Long-term data storage

A single BAM file for a 30x whole genome is about 100 gigs of disk space. So a modest-sized study of 500 samples will require 50 terabytes of disk just for the BAMs. Even lossy compressed data formats (e.g. CRAM) won’t solve the data storage issue. When your sequencing system can crank out 18,000 genomes a year, the disk space requirements are simply enormous.
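
Here's that back-of-the-envelope math as a small Python sketch. It assumes roughly 100 GB per 30x BAM (the figure above) and decimal units (1 TB = 1,000 GB). The function name is just illustrative.

```python
GB_PER_TB = 1_000  # decimal units, an assumption for this estimate


def bam_storage_tb(n_samples: int, gb_per_bam: float = 100.0) -> float:
    """Disk space in terabytes needed to hold one BAM file per sample."""
    return n_samples * gb_per_bam / GB_PER_TB


print(bam_storage_tb(500))      # the modest-sized study
print(bam_storage_tb(18_000))   # one year of output at 18,000 genomes
```

At 18,000 genomes a year, the same arithmetic gives 1,800 terabytes, or about 1.8 petabytes of new BAM data annually, before any backups or derived files.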

Computational resources

Processing the raw sequence data — mapping it to the reference sequence, marking duplicates, and generating BAM files — requires a lot of computing power. We’re planning studies of 10,000 whole genomes or more, when each sequenced genome comprises 90 billion base pairs of sequence (30x coverage). Variant calling, base recalibration, and annotation for large-scale genomic datasets also have high computational demands. Maintaining a computing cluster that can handle this burden is expensive, as are the per-cycle costs of cloud computing.
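
For a sense of scale, here's a quick Python sketch of the sequence yield involved. It assumes a human genome of roughly 3 billion base pairs, which is what 90 billion base pairs at 30x implies. The names are illustrative only.

```python
HUMAN_GENOME_BP = 3_000_000_000  # approximate genome size (assumption)


def sequenced_bases(coverage: int, genome_size_bp: int = HUMAN_GENOME_BP) -> int:
    """Total bases of sequence generated for one genome at a given coverage."""
    return coverage * genome_size_bp


per_genome = sequenced_bases(30)
study_total = 10_000 * per_genome

print(f"Per genome: {per_genome / 1e9:.0f} billion bases")
print(f"10,000 genomes: {study_total / 1e12:.0f} terabases")
```

Under those assumptions, a 10,000-genome study comes to about 900 terabases, every one of which has to be aligned, de-duplicated, and scanned for variants.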

Analysis Time

The analysis itself must be carried out by someone with bioinformatics expertise. Even with robust, highly automated pipelines in place, an analyst is necessary to collect the data as it comes from the production team, configure the right analysis (aligner, reference assembly, variant calling strategy, annotation sources, etc.), monitor progress, and compile results. Analysts are typically salaried employees, and a single project could consume weeks or months of their time.

When Analysis Is Not Included

Here’s the problem with the rapidly decreasing costs of genome sequencing under the current funding climate: no one wants to pay for analysis. Look, I get it. You heard all of this song-and-dance about the $1,xxx genomes, and now you’re being told that the analysis might be just as expensive as the sequencing (if not more so)! It’s the classic razor/razor blade or printer/ink scam all over again!

Given that gut reaction, it’s tempting to go with one of these alternative strategies:

“No, we don’t need analysis”

If you go this route, you should expect to get BAM files. For some people (e.g. people named Gabor or Goncalo), this is fine: they have analysis expertise and personnel. For others, this means you’ll probably have to pay someone to do the analysis anyway. It might as well be the people who generated your data for you.

Also, I should point out that NIH-funded sequencing data must be submitted to public repositories, often without embargo. Every day that you delay the necessary analysis is more time for your competitors to make use of this public data.

“There will be a separate RFA for analysis”

Certain funding agencies have adopted this model for large-scale projects: they issue one RFA for generating the sequencing data. Then they issue another RFA for groups to do the analysis. This de-coupling of data production and analysis serves no one. I’m sorry, but the researchers who generated sequencing data are probably best equipped to analyze it. They understand the nuances, and they also don’t have to go to dbGaP to get permissions (we all know this is a nightmare).

This arrangement also delays the results/interpretation phase of the project. The sequencing centers aren’t funded to analyze data coming off the instruments, and the analysis centers often don’t get started until data production is complete. It’s an inefficient process and I don’t understand this model at all.

“It’s OK, we just bought [software name]”

There are commercial software solutions designed to help investigators and small labs analyze large NGS datasets. In my opinion, there are two main issues with these. First of all, their developers invest a huge amount of development time building a pretty, easy-to-use interface… but the algorithms are ones we were using a year ago. The second and more pressing issue is the high cost of these tools.

“We’ll use grad students or postdocs”

There’s always a temptation to rely on “free” labor in the form of graduate students or postdocs. But these hapless souls probably don’t yet have the required expertise, so they’ll spend two months learning how to run BWA (probably by asking people like us for guidance). That time might be better spent helping write grants and papers, and leaving the analysis to people who do it every day as their job.

The Benefits of Funding Analysis

There’s sometimes a bit of sticker shock when writing the budget for a project analysis, but there are also many important benefits. First of all, it ensures that you’ll get high-quality results delivered in a format that you can actually use. Second, it allows the optimal analyses to be selected and performed by people who are specialists at it. The in-house analysts can also be faster, because they can often begin working on it before data might otherwise be passed off to an external group.

Perhaps most importantly, funding the analysis portion of a project makes it far more likely to succeed because it’s a collaborative effort. It lets the sequencing experts contribute their expertise, while the disease/phenotype experts contribute their own. Everyone plays to their strengths, and everyone’s efforts are supported by the funds of the project. This, in my opinion, is the best way to do great science.