Open Sesame! The Risks and Rewards of Open Data for Researchers
Jon Tennant is a PhD student based at Imperial College London, investigating extinction and biodiversity patterns of Mesozoic tetrapods – anything with four legs or flippers. Prior to this, there was a brief interlude where Jon was immersed in the world of science policy and communication, which has greatly shaped his view on the broader role that science can play, and in particular, the current ‘open’ debate. He tweets as @Protohedgehog, and blogs for the EGU.
Open access is a done deal. There is a clear global recognition that free and unrestricted access to research articles is a good thing for everyone. However, the movement towards more transparent scientific communication is not resting on its laurels. There is a new frontier in open science and that is data. Funders are increasingly mandating that researchers make their primary data available so that it can be built upon, though it’s not all plain sailing. Many researchers remain unconvinced that it’s in their own personal interest to share their data. In a post last year in the Scholarly Kitchen, Kent Anderson laid out some of the risks as he perceives them as a publisher.
To help me write this post, Phill Jones of Digital Science put together some objections that he occasionally hears from those researchers who remain unconvinced. Below, I will explore and try to counter some of these frequently stated objections.
1) “If I make my data available, somebody will re-analyse it in order to undermine my conclusions. It’s easy to misrepresent data and I don’t want to make it easy for detractors to undermine my work unfairly.”
Raw data is the fuel of science. Re-analysing data and assessing conclusions forms one of the key pillars of research: reproducibility. Without making data openly available for assessment, research isn’t really research, it’s just anecdote. The debate that comes from the critical reanalysis of data and results is central to how science functions. We should embrace the increased visibility that opening up our data gives to all stakeholders involved in scholarly communications.
Building upon research, whether it produces positive or negative results, is an important way in which research progresses and knowledge accumulates. Furthermore, sharing your research data opens up more doors than it closes. It shows that you have confidence in your own work and have nothing to hide. It encourages scientific debate, scrutiny, and enquiry, and can lead to new innovations. From personal experience, I can also say that opening up my own data has helped me to find new collaborators and to make a greater scientific impact.
It’s not about adding bricks to a wall – sometimes the wall has to be demolished to make way for a building. We shouldn’t be afraid to support researchers whose conclusions have subsequently been demonstrated to be incorrect. It’s OK to be wrong, and we should embrace this as a community.
2) “It takes me a long time to generate a data set. I frequently publish an initial analysis of data and then follow it up with analyses that are more in-depth or from a different perspective. If I publish my data, I lose control of it and somebody else might scoop me.”
This is where data citation comes in. The majority of data sharing platforms assign a Digital Object Identifier (DOI) to each data set, or release data under a Creative Commons attribution license (CC BY), which means that data can be cited on the same level as research articles. This has yet to become widespread practice, but the more people that begin to cite data, the more common and accepted it will become. As a proponent of open science, I hope that one day soon, data citations will be considered just as valid as traditional article citations.
When your data is cited, you get additional credit for more research outputs than just a paper. If you’re into metrics, then the number of citations is a pretty important figure for an individual, which can play a role in funding applications, hiring, and promotion. A study by Heather Piwowar and Todd Vision shows that there is a clear citation advantage to making your data open. In addition to this, they found that this advantage persisted for years beyond the time period in which the researchers had maximally used their data to publish papers.
Some publishers are also now creating ‘data journals’. In these journals, data are published without any discussion or formal results. Taken together, these developments point to a trend towards increasing reward for open data. For open science to move to the next level, the research community and funding bodies must recognise that data are at least as important as traditional research articles.
One possible solution to mitigate the risk of being ‘scooped’ is the creation of data embargoes, similar to the green open access embargoes that many funders and publishers have negotiated. An appropriate system of embargoes, where data is not released alongside a publication, but delayed for a suitable amount of time, would protect authors’ abilities to maximally exploit their own data for personal research and publication.
3) “The data that I have is in a highly specialized format that other people cannot necessarily read, let alone interpret. Aside from the fact that many repositories don’t accept my file, I hear that I have to make my data interpretable. I simply don’t have the time to process my data or convert its file type so that people outside of my field can take advantage of it.”
What I think is becoming clear is that community-wide standards for data sharing simply do not exist. This would be much simpler if guidelines existed to help researchers develop and construct their data in a manner that is interpretable and usable by external parties. Developing best practices for data reuse and sharing, including good metadata, common repositories, and proper citation for appropriate credit, are all ways to try and achieve this.
There is an ever-expanding range of options for scientific data formats. Subject-specific repositories, for instance, often offer niche file support for specific disciplines. Institutional repositories are arguably lagging slightly in terms of file support, but some commercial offerings are highly flexible. Figshare, for example, will extend support for any file type upon request.
There is admittedly still more to discuss, and several issues remain to be explored. Questions regarding the opening up of commercially or medically sensitive data, or of industrially funded data, are examples. Whether or not sensitive data are made openly available should be assessed on a case-by-case basis. There are also open questions surrounding data duplication and negative data, but I’ll leave those for a subsequent post.
Researchers all want to contribute to the global pot of knowledge, that’s why we do what we do. Let us return to and embrace this principle of open research, and make our data open – who knows what might be achieved, and who knows what might be lost by restricting access to it.
Comments
David Crotty
Data citations are useful developments, but they aren't a solution to the scooping problem because they will not provide the same level of career credit as a fully realized piece of original research. Being able to say, "the dataset I created was useful to someone else's discovery" isn't quite the same thing to a funder or tenure committee as saying, "here's what I discovered."
I suspect that there may also linger some negative reputational impact on being the person who collected the data but also the person who was unable to figure out what it means.
I do agree that open data will likely speed the pace of discovery, a good thing for society. But the academic career track is already pretty brutal, and this may make it even tougher.
Alice Casey
This is a great post - raising some really important issues for those beyond the scientific community as well. Who gets credit and can capitalise on others' work is a major ongoing issue. It is particularly important for niche/specialist pursuits where scrutiny takes place amongst smaller groups. Perhaps there are things that can be done around time delays on release of certain aspects of original data (however, there are also good arguments to release rapidly, for example where there is clear common benefit such as medical advances). Not easy questions to answer and worth further debate. Thanks for writing!
Jon Tennant
Hi David and Alice,
Thanks for your comments - certainly much to ponder over.
Regarding the level of career credit, I think data doesn't necessarily need to reach the same level as a final paper product. We're seeing a move towards an enriched system of evaluating the outputs of research, and simply recognising that data is an important product of research is a step forward. I guess it all depends on how each item is used post-publication, and how that is assessed.
That second comment made me a bit depressed when I realised you're probably right. Maybe I'm just optimistic, but I'd like to think my colleagues were more receptive to the idea of open sharing, not because I'm an idiot who can't work something out, but because I believe in the collaborative nature of science and sharing. Also, I don't think the point of opening data is just that you couldn't figure out what to do with it; it's that someone else might be able to combine it with other data and/or use it in novel ways which were simply not originally thought of. I really don't think that anyone's reputation could suffer for that.
The idea of data embargoes is quite interesting, so that the original data collector can maximise the use of their own data (although who 'owns' data is another interesting discussion to have). I think there are certainly issues that need considering, such as how to handle patient data or other sensitive data. Perhaps this is something we could learn from the publishing industry's experience with manuscript embargoes.
I do think that the level of data transparency, and its impact on research assessment and career pathways, is an area that could do with development though.
Jon
Phill Jones
Some great comments here.
I think that what it comes down to is that there are risks associated with sharing data but there are also rewards. It's generally accepted that more openness is good for science, or at least not detrimental, and the objections to it are based around the risks to individual scientists' career progression.
To me, this means that the question shouldn't be whether data ought to be shared, but how do we mitigate those risks and maximize the upside for individual researchers. It's a question of making sure that the balance of risk vs reward favors the latter.
Certainly data embargoes and data citation are ways to do this, but even without them, there seems to be a growing number of scientists, both at an early career stage, like Jon, and at later stages (like Todd Golde, a leading Alzheimer researcher speaking at the Alzheimer disease research summit earlier this month, http://www.alzforum.org/news/conference-coverage/alzheimers-disease-research-summit-2015-expanding-horizon), who say that on balance, sharing their data is good for their careers.
I've not done any in-depth research on it, admittedly, but just anecdotally, it seems that those who embrace data sharing say that the benefits outweigh the risks.