Data sharing is essential in the fields of genetics and genomics. It remains one of the core principles of federally funded “big science” — large consortium efforts to conduct research at incredible scale. The Human Genome Project and its descendants — HapMap, 1000 Genomes, ENCODE, TCGA — are prime examples of such efforts. The resources that they have created (and provided, virtually without restriction) for the research community are priceless.
What makes these resources long-lasting and significant is that they’re open-access for all. Perhaps more importantly, these impressive datasets were made available during the project, not afterward. This is no small thing. Anyone who’s participated in large-scale genomics projects probably understands just how much effort is required to QC and submit data incrementally, make data freezes, and ensure that they’re available to the community.
The Downside of Data Sharing
There are, of course, some disadvantages to sharing data in real time. During the HGP, a certain maverick took advantage of the public genome data as it was generated, using it to scaffold his company’s private, competing draft assembly of the human genome. Today, scooping public data is easier than ever, for a number of reasons:
- Central repositories. Most large-scale projects submit their data to a single place (e.g. dbGaP) where it’s relatively easy to find.
- Rapid access to compressed data. High-speed internet access is everywhere — heck, Google’s floating balloons over New Zealand to bring it to remote areas — and it’s faster than ever. An exome BAM file is about 10 gigabytes; at typical institutional speeds, you can download one in a matter of minutes. Compressed VCFs come even quicker.
- Democratization of sequencing. Most investigators now have access to rapid, inexpensive sequencing of exomes or whole genomes. They can do 20 or 30 exomes, find a gene of interest, and then quickly look in large datasets such as TCGA for recurrence.
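The download-time claim above is easy to sanity-check with a few lines of arithmetic. A minimal sketch — the file size and link speeds here are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope download times for a ~10 GB exome BAM file.
# Size and link speeds are illustrative assumptions.

BAM_SIZE_GB = 10  # approximate size of one exome BAM


def download_minutes(size_gb: float, link_mbps: float) -> float:
    """Transfer time in minutes: gigabytes -> megabits, divided by link speed."""
    size_megabits = size_gb * 8 * 1000  # 1 GB ~= 8,000 megabits (decimal units)
    return size_megabits / link_mbps / 60


for mbps in (10, 100, 1000):
    print(f"{mbps:>5} Mbps: {download_minutes(BAM_SIZE_GB, mbps):6.1f} min")
# At 100 Mbps the 10 GB BAM takes about 13 minutes; at gigabit speeds,
# barely more than a minute. Only on a slow 10 Mbps link does it
# stretch to a couple of hours.
```

This ignores real-world overhead (protocol, disk I/O, repository throttling), but the orders of magnitude hold: for anyone on a decent research network, “big science” data are only minutes away.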
In short, obtaining and utilizing the datasets of “big science” initiatives is easier than ever. This is good news for the research community, so long as everyone plays fair. But let’s be honest, we live in the real world where that doesn’t always happen.
Data Embargo and First Rights
Most of these ambitious, expensive, long-term projects have a data use policy designed to protect the investment of money, samples, time, and other resources. Some data may be under embargo for a certain time, meaning that they’re submitted to public repositories but not available for download. This isn’t really open access, though, so it’s a policy that’s being used less and less. Instead, there’s usually a publication embargo — an understanding that the participants in the project get first rights to publish on their data.
In TCGA, the data use policy seems quite clear: no one gets to publish on a TCGA cancer type’s data until the first major publication, the “marker paper”, has come out. This is understood quite well, at least by TCGA participants. Remember that they’re in the unique position of generating the data, meaning that they see it before anyone else and are usually quite capable of analyzing it on their own. Nevertheless, everyone waits for and collaborates on the marker paper before going off on their own.
The Enforcement Problem
There’s a major problem with this policy, however, and that’s enforcement. Once the data are made available, there’s no way to physically stop outside investigators from using them, or even from writing up manuscripts and submitting them for publication. Unless the editor or peer reviewers are aware of a project’s data embargo status, it’s quite possible for those manuscripts to reach publication before the marker paper. Just this month, there was a paper in a high-profile journal that used (among other things) embargoed TCGA data.
Obviously, the two lines of defense (the data use policy, and the editors and referees who reviewed the manuscript) failed to prevent this from happening. What happens now? I have heard anecdotal evidence that, in the past, such violations ultimately resulted in a paper being withdrawn. This is perhaps what should come to pass, though I don’t know if it will. In either case, the damage is done.
Points Against Publication Embargo
It needs to be said that there are questions about whether these embargo policies are in the best interest of the research community. Data generated using public funding belong to the public, and that includes the project’s competitors. I’m keenly aware of the fact that some people disagree with funding “big science” projects, since it means that many smaller grant proposals must go without funding. There are also those of the opinion that, if the participants in a project can’t get their shit together and publish before someone else does, that’s their problem.
Unfortunately, it’s not quite so simple, as anyone who’s tried to write a marker paper with a consortium understands quite well. With the vast amount of data (and egos) involved, these projects are cumbersome. I would argue, however, that the landmark publications coming out of these studies are worth the wait. Look at the incredible resources that “big science” projects have provided our community:
- The HGP provided the reference genome, enabling us to annotate genes and find variants.
- The HapMap Project yielded a comprehensive genetic map and gave rise to high-throughput genotyping, without which GWAS would not have been possible.
- The 1000 Genomes Project helped spur the development of NGS technologies, algorithms, and file formats, and dramatically expanded the catalogue of human genetic variation.
- The Cancer Genome Atlas has yielded, and continues to yield, critical and comprehensive molecular profiles of common cancer types.
- The ENCODE project has laid the groundwork for understanding the composition and function of elements in the genome.
These resources are priceless, and I would argue that they would not exist without the big science projects behind them. And yet, such efforts are doomed if investigators, journal editors, and peer referees fail to respect and enforce their data use policies.