Data sharing is essential in the fields of genetics and genomics. It remains one of the core principles of federally funded “big science” — large consortium efforts to conduct research at incredible scale. The Human Genome Project and its descendants — HapMap, 1000 Genomes, ENCODE, TCGA — are prime examples of such efforts. The resources that they have created (and provided, virtually without restriction) for the research community are priceless.
What makes these resources long-lasting and significant is that they’re open-access for all. Perhaps more importantly, these impressive datasets were made available during the project, not afterward. This is no small thing. Anyone who’s participated in large-scale genomics projects probably understands just how much effort is required to QC and submit data incrementally, make data freezes, and ensure that they’re available to the community.
The Downside of Data Sharing
There are, of course, some disadvantages to sharing data in real time. During the HGP, a certain maverick took advantage of the public genome data as it was generated, using it to scaffold his company’s private, competing draft assembly of the human genome. Today, scooping public data is easier than ever, for a number of reasons:
- Central repositories. Most large-scale projects submit their data to a single place (e.g. dbGaP) where it’s relatively easy to find.
- Rapid access to compressed data. High-speed internet access is everywhere — heck, Google’s floating balloons over New Zealand to bring it to remote areas — and it’s faster than ever. An exome BAM file is about 10 gigabytes; at typical institutional speeds, you can download one in a matter of minutes. Compressed VCFs come even quicker.
- Democratization of sequencing. Most investigators now have access to rapid, inexpensive sequencing of exomes or whole genomes. They can do 20 or 30 exomes, find a gene of interest, and then quickly look in large datasets such as TCGA for recurrence.
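The download-time claim above is easy to sanity-check with a few lines of arithmetic. A minimal sketch — the file size and link speeds here are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope download times for a ~10 GB exome BAM file.
# Size and link speeds are illustrative assumptions.

BAM_SIZE_GB = 10  # approximate size of one exome BAM


def download_minutes(size_gb: float, link_mbps: float) -> float:
    """Transfer time in minutes: gigabytes -> megabits, divided by link speed."""
    size_megabits = size_gb * 8 * 1000  # 1 GB ~= 8,000 megabits (decimal units)
    return size_megabits / link_mbps / 60


for mbps in (10, 100, 1000):
    print(f"{mbps:>5} Mbps: {download_minutes(BAM_SIZE_GB, mbps):6.1f} min")
# At 100 Mbps the 10 GB BAM takes about 13 minutes; at gigabit speeds,
# barely more than a minute. Only on a slow 10 Mbps link does it
# stretch to a couple of hours.
```

This ignores real-world overhead (protocol, disk I/O, repository throttling), but the orders of magnitude hold: for anyone on a decent research network, “big science” data are only minutes away.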
In short, obtaining and utilizing the datasets of “big science” initiatives is easier than ever. This is good news for the research community, so long as everyone plays fair. But let’s be honest, we live in the real world where that doesn’t always happen.
Data Embargo and First Rights
Most of these ambitious, expensive, long-term projects have a data use policy designed to protect the investment of money, samples, time, and other resources. Some data may be under embargo for a certain time, meaning that they’re submitted to public repositories but not available for download. This isn’t really open access, though, so it’s a policy that’s being used less and less. Instead, there’s usually a publication embargo — an understanding that the participants in the project get first rights to publish on their data.
In TCGA, the data use policy seems quite clear: no one gets to publish on a TCGA cancer type’s data until the first major publication, the “marker paper”, has come out. This is understood quite well, at least by TCGA participants. Remember that they’re in the unique position of generating the data, meaning that they see it before anyone else and are usually quite capable of analyzing it on their own. Nevertheless, everyone waits for and collaborates on the marker paper before going off on their own.
The Enforcement Problem
There’s a major problem with this policy, however, and that’s enforcement. Once the data are made available, there’s no way to physically stop outside investigators from using them, or even from writing up manuscripts and submitting them for publication. Unless the editor or peer reviewers are aware of a project’s data embargo status, it’s quite possible for those manuscripts to reach publication before the marker paper. Just this month, there was a paper in a high-profile journal that used (among other things) embargoed TCGA data.
Obviously, the two lines of defense (the data use policy, and the editors and referees who reviewed the manuscript) failed to prevent this from happening. What happens now? I have heard anecdotal evidence that, in the past, such violations ultimately resulted in a paper being withdrawn. This is perhaps what should come to pass, though I don’t know if it will. In either case, the damage is done.
Points Against Publication Embargo
It needs to be said that there are questions about whether these embargo policies are in the best interest of the research community. Data generated using public funding belong to the public, and that includes the project’s competitors. I’m keenly aware of the fact that some people disagree with funding “big science” projects, since it means that many smaller grant proposals must go without funding. There are also those of the opinion that, if the participants in a project can’t get their shit together and publish before someone else does, that’s their problem.
Unfortunately, it’s not quite so simple, as anyone who’s tried to write a marker paper with a consortium understands quite well. With the vast amount of data (and egos) involved, these projects are cumbersome. I would argue, however, that the landmark publications coming out of these studies are worth the wait. Look at the incredible resources that “big science” projects have provided our community:
- The HGP provided the reference genome, enabling us to annotate genes and find variants.
- The HapMap Project yielded a comprehensive genetic map and gave rise to high-throughput genotyping, without which GWAS would not have been possible.
- The 1000 Genomes Project helped spur the development of NGS technologies, algorithms, and file formats, and dramatically expanded the catalogue of human genetic variation.
- The Cancer Genome Atlas has yielded, and continues to yield, critical and comprehensive molecular profiles of common cancer types.
- The ENCODE project has laid the groundwork for understanding the composition and function of elements in the genome.
These resources are priceless, and I would argue that they would not exist without the big science projects behind them. And yet, such efforts are doomed if investigators, journal editors, and peer referees fail to respect and enforce their data use policies.