Just as the reach of next-generation sequencing has continued to grow, in both research and clinical realms, so too has the community of NGS users. Some have been around since the early days, the days of 454 and Solexa sequencing. Since then, the field has matured at an astonishing pace. Many standards were established to help everyone make sense of this flood of data. The recent democratization of sequencing has made next-gen sequencing available to just about anyone.
And yet, there have been growing pains. With great power comes great responsibility. To help some of the newcomers into the field, I’ve drafted these ten commandments for next-gen sequencing.
1. Thou shalt not reinvent the wheel. In spite of rapid technological advances, NGS is not a new field. Most of the current “workhorse” technologies have been on the market for a couple of years or more. As such, we have a plethora of short read aligners, de novo assemblers, variant callers, and other tools already. Even so, there is a great temptation for bioinformaticians to write their own “custom scripts” to perform these tasks. There’s a new “Applications Note” every day with some tool that claims to do something new or better.
Can you really write an aligner that’s better than BWA? More importantly, do we need one? Unless you have some compelling reason to develop something new (as we did when we developed SomaticSniper and VarScan), take advantage of what’s already out there.
2. Thou shalt not coin any new term ending with “ome” or “omics”. We have enough of these already, to the point where it’s getting ridiculous. Genome, transcriptome, and proteome are obvious applications of this nomenclature. Epigenome, sure. But the metabolome, interactome, and various other “ome” words are starting to detract from the naming system. The ones we need have already been coined. Don’t give in to the temptation.
3. Thou shalt follow thy field’s conventions for jargon. Technical terms, acronyms, and abbreviations are inherent to research. We need them both for precision and brevity. We get into trouble when people feel the need to create their own acronyms even though a suitable one already exists. Is there a significant difference between next-generation sequencing (NGS), high-throughput sequencing (HTS), and massively parallel sequencing (MPS)?
Widely accepted terms provide something of a standard, and they should be used whenever possible. Insertion/deletion variants are indels, not InDels, INDELs, or DIPs. Structural variants are SVs, not SVars or GVs. We don’t need any more acronyms!
These commandments address behaviors that get on my nerves, both as a blogger and a peer reviewer.
4. Thou shalt not publish by press release. This is a disturbing trend that seems to happen more and more frequently in our field: the announcement of “discoveries” before they have been accepted for publication. Peer review is the required vetting process for scientific research. Yes, it takes time and yes, your competitors are probably on the verge of the same discovery. That doesn’t mean you get to skip ahead and claim credit by putting out a press release.
There are already examples of how this can come back to bite you. When the reviewers trash your manuscript, or (gasp) you learn that a mistake was made, it looks bad. It reflects poorly on the researchers and the institution, both in the field and in the eyes of the public.
5. Thou shalt not rely only on simulated data. Often when I read a paper on a new method or algorithm, the authors showcase it using simulated data. This often serves a noble purpose, such as knowing the “correct” answer and demonstrating that your approach can find it. Even so, you’d better apply it to some real data too. Simulations simply can’t replicate the true randomness of nature and the crap-that-can-go-wrong reality of next-gen sequencing. There’s plenty of freely available data out there; go get some of it.
6. Thou shalt obtain enough samples. One consequence of the rapid growth of our field (and accompanying drop in sequencing costs) is that small sample numbers no longer impress anyone. They don’t impress me, and they certainly don’t impress the statisticians upstairs. The novelty of exome or even whole-genome sequencing has long worn off. Now, high-profile studies must back their findings with statistically significant results, and that usually means finding a cohort of hundreds (or thousands) of patients with which to extend your findings.
This new reality may not be entirely bad news, because it surely will foster collaboration between groups that might otherwise not be able to publish individually.
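To put rough numbers on "hundreds (or thousands)", here is a minimal sketch of a textbook sample-size calculation for comparing a variant carrier frequency between cases and controls (a two-sided, two-proportion z-test under the normal approximation). The function name, the example frequencies, and the exome-wide threshold (a Bonferroni correction across roughly 20,000 genes) are illustrative assumptions, not figures from any particular study.

```python
import math
from statistics import NormalDist


def samples_per_group(p_cases, p_controls, alpha=0.05, power=0.8):
    """Approximate samples needed per group for a two-sided two-proportion z-test.

    Hypothetical helper for illustration; uses the standard normal
    approximation, which is rough when carrier frequencies are very low.
    """
    if p_cases == p_controls:
        raise ValueError("the two frequencies must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_cases + p_controls) / 2             # pooled frequency under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_cases * (1 - p_cases) + p_controls * (1 - p_controls))
    ) ** 2
    return math.ceil(numerator / (p_cases - p_controls) ** 2)


# Assumed example: 5% of cases vs. 1% of controls carry a variant in some gene.
nominal = samples_per_group(0.05, 0.01)
# Assumed exome-wide threshold: 0.05 divided across ~20,000 genes.
exome_wide = samples_per_group(0.05, 0.01, alpha=0.05 / 20000)
print(nominal, exome_wide)
```

Even with a fairly large difference in frequency, the nominal calculation lands in the hundreds per group, and correcting for the number of genes tested pushes the requirement higher still.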
Data Sharing and Submissions
7. Thou shalt withhold no data. With some exceptions, sequencing datasets are meant to be shared. Certain institutions, such as large-scale sequencing centers in the U.S., are mandated by their funding agencies to deposit data generated using public funds on a timely basis following its generation. Since the usual deposition site is dbGaP, this means that IRB approvals and dbGaP certification letters must be in hand before sequencing can begin.
Any researchers who plan to publish their findings based on sequencing datasets will have to submit them to public repositories before publication. This is not optional. It is not “something we should do when we get around to it after the paper goes out.” It is required to reproduce the work, so it should really be done before a manuscript is submitted. Consider this excerpt from Nature’s publication guidelines:
Data sets must be made freely available to readers from the date of publication, and must be provided to editors and peer-reviewers at submission, for the purposes of evaluating the manuscript.
For the following types of data set, submission to a community-endorsed, public repository is mandatory. Accession numbers must be provided in the paper.
The policies go on to list various types of sequencing data:
- DNA and RNA sequences
- DNA sequencing data (traces for capillary electrophoresis and short reads for next-generation sequencing)
- Deep sequencing data
- Epitopes, functional domains, genetic markers, or haplotypes.
Every journal should have a similar policy; most top-tier journals already do. Editors and referees need to enforce this submission requirement by rejecting any manuscripts that do not include the submission accession numbers.
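Checking a manuscript for accession numbers is easy to automate as a first pass. The sketch below scans text for a few common accession formats: SRA/ENA/DDBJ identifiers (such as SRP or ERR followed by digits), BioProject IDs, dbGaP study accessions (phs plus six digits, with an optional version suffix), and GEO series. The pattern list is an assumption for illustration and is far from exhaustive, and a match only shows that something shaped like an accession is present, not that the data are actually there.

```python
import re

# Assumed, non-exhaustive patterns for common sequencing-data accessions.
ACCESSION_PATTERN = re.compile(
    r"\b(?:"
    r"[SED]R[PRXSA]\d{6,}"           # SRA / ENA / DDBJ studies, runs, experiments, samples
    r"|PRJ(?:NA|EB|DB)\d+"           # BioProject
    r"|phs\d{6}(?:\.v\d+\.p\d+)?"    # dbGaP study, optionally versioned
    r"|GSE\d+"                       # GEO series
    r")\b"
)


def find_accessions(manuscript_text):
    """Return accession-like strings in the order they appear (hypothetical helper)."""
    return ACCESSION_PATTERN.findall(manuscript_text)


good = "Reads are available from SRA under SRP012345 and dbGaP under phs000178.v1.p1."
bad = "Sequence data will be deposited upon acceptance."
print(find_accessions(good))
print(find_accessions(bad))
```

An empty result is a reasonable cue to go back to the authors before the manuscript goes any further.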
8. Thou shalt not take unfair advantage of submitted data. Many investigators are concerned about data sharing (especially when mandated upon generation, not publication) for fear of being scooped. This is a valid concern. When you submit your data to a public repository, others can find it and (if they meet the requirements) use it. Personally, I think most of these fears are not justified. I mean, have you ever tried to get data out of dbGaP? The time it takes for someone to find, request, obtain, and use submitted data should allow the producers of the data to write it up.
Large-scale efforts to which substantial resources have been devoted — such as the Cancer Genome Atlas — have additional safeguards in place. Their data use policy states that, for a given cancer type, submitted data can’t be used until the “marker paper” has been published. This is a good rule of thumb for the NGS community, and something that journal editors (and referees) haven’t always enforced.
Just because you can scoop someone doesn’t mean that you should. It’s not only bad karma, but bad for your reputation. Scientists have long memories. They will likely review your manuscript or grant proposal sometime in the future. When that happens, you want to be the person who took the high road.
Research Ethics and Cost
9. Thou shalt not discount the cost of analysis. It’s true that since the advent of NGS technology, the cost of sequencing has plummeted. The cost of analysis, however, has not. And making sense of genomic data — alignment, quality control, variant calling, annotation, interpretation — is a daunting task indeed. It takes computational resources as well as expertise. This infrastructure is not free; in fact, it can be more expensive than the sequencing itself.
Without analysis, your sequencing data, your $1,000 genome, is about as useful as a chocolate teapot.
10. Thou shalt honor thy patients and their samples. Earlier this month, I wrote about how supposedly anonymous individuals from the CEPH collection were identified using a combination of genetic markers and online databases. It is a simple fact that we can no longer guarantee a sequenced sample’s anonymity. That simple fact, combined with our growing ability to interpret the possible consequences of an individual genome, means a great deal of risk for study volunteers.
We must safeguard the privacy of study participants — and find ways to protect them from privacy violations and/or discrimination — if we want their continued cooperation.
This means obtaining good consent documents and ensuring that they’re all correct before sequencing begins. It also means adhering to the data use policies those consents specify. As I’ve written before, samples are the new commodity in our field. Anyone can rent time on a sequencer. If you don’t make an effort to treat your samples right, someone else will.