Not the best way to organise data

Over the past month or so, my colleague from Figshare, Mark Hahnel, and I have been working on an article for an upcoming special edition of Against the Grain, guest edited by Andrew Wesolek, Head of Digital Scholarship at Clemson University; Dave Scherer, Assistant to the Dean at Carnegie Mellon; and Burton Callicott, Librarian at College of Charleston. The focus of the special edition is the future of the journal article as the container for scholarship. That is, as the age of digital scholarship unfolds, will academics continue to use articles as the primary way they communicate? Will journals change out of all recognition, or even be replaced? It promises to be an interesting read, so keep an eye out for it. Mark and I were tasked with writing a piece on data management and sharing. Obviously, this is a big topic, so we tried to cover as much ground as we could without being too cursory. I hope that we got the balance right.

In our article, we’ve explored the reasons why researchers are increasingly interested in data sharing, some of the barriers and challenges, and the relationship between traditional publishers and data repositories. One aspect that I found very interesting is the difference between structured and unstructured repositories and the arguments for and against their use. In the interests of full disclosure, Figshare offers several things, including a data management solution for institutions and visualization, linking and data hosting solutions for journals, all of which are built on the foundation of an unstructured repository. That said, I’m about to argue that the choice between the two is a false dichotomy and that we need both.

What’s the difference?

Before listing the arguments, it makes sense to start by defining the terms. Put simply, structured repositories have more rules about the data that goes into them. Very often these types of repositories are intended to catalogue a specific type of data, with the aim of creating a super-data set that has been collectively gathered by many researchers in a certain field.

In many respects, this type of subject-specific, structured repository is related to the idea of industrial scale science that Timo Hannay, former Managing Director of Digital Science, has discussed in the past. Traditionally, science has existed as a sort of cottage industry, in which individual labs, headed by principal investigators, explored topics on their own terms. Today, disciplines like astronomy and the various -omics fields are pursuing unified goals in order to answer larger questions than a single research project could. This new model of industrial scale science inevitably requires standards for information interchange, and so it makes sense for a repository to enforce those standards. A good example of a structured repository is the NIH’s GenBank, which is part of the International Nucleotide Sequence Database Collaboration. The NIH have published an annotated example record, so you can see how carefully the data is curated.

The advantages of having a highly curated database with enforced formats and standards are obvious. Given the volume of data that collaborations such as these generate, the data must be machine readable. The better codified the data formats are, the easier it is to write a computer program to read them and, therefore, the easier it is to make use of the data.
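To make that point concrete, here is a minimal sketch in Python of how little code it takes to read a record once its layout is fixed. The record format, the field names and the `parse_record` function are all invented for illustration: the layout is only loosely modelled on keyword-prefixed flat files, and real GenBank records are considerably richer.

```python
# Hypothetical, simplified record format (an assumption for this sketch):
# one "KEYWORD  value" pair per line, continuation lines start with
# whitespace, and a line containing only "//" ends the record.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01
DEFINITION  Made-up sequence record used only for illustration,
            continued on a second line.
ACCESSION   X00000
//
"""


def parse_record(text):
    """Return a dict mapping each keyword to its (joined) value."""
    fields = {}
    current_key = None
    for line in text.splitlines():
        if line.strip() == "//":  # end-of-record marker
            break
        if not line.strip():  # skip blank lines
            continue
        if line[0].isspace():  # continuation of the previous field
            if current_key is None:
                raise ValueError("continuation line before any keyword")
            fields[current_key] += " " + line.strip()
        else:
            # Split the keyword from its value at the first space.
            key, _, value = line.partition(" ")
            current_key = key
            fields[key] = value.strip()
    return fields


record = parse_record(SAMPLE_RECORD)
print(record["ACCESSION"])
```

Because every record follows the same rules, the same few lines work for one record or for millions. With free-form files, each new layout would need its own reader.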

Unstructured repositories are very different. In this type of data solution, there are no restrictions on the format of the data, and the data is not necessarily curated. In a sense, these solutions think about data differently from their structured counterparts, arguably with a different implicit definition of what data is. There are many definitions of data, but one idea is that it is any digital product of scholarly research. This could mean anything from a video recording of a ballet performance to a spreadsheet of numbers to a computer program.

Essentially, if structured repositories provide a place for data that is part of industrial scale scientific collaborations, unstructured repositories are where everything else goes.

When Unstructured Becomes Structured

Data scientists and many publishers working in the field of data and data linking strongly advocate the use of structured repositories where appropriate. Nature’s Scientific Data, which Figshare has partnered with, is a good example. As Andrew Hufton, Managing Editor at Scientific Data, said to me a few months ago:

‘We would like authors to put their data in the most appropriate place for that data’.

Tellingly, however, the repository most used by authors is Figshare (~30%), with most authors using some form of unstructured repository, whether that be an institutional one or a third-party one (like Figshare or Dryad). This suggests that most authors don’t have data that conforms to an existing standard.

Over time, more and more data standards (and associated structured repositories) are emerging. The Registry of Research Data Repositories and BioSharing both maintain lists of several hundred. As techniques mature and the need for a standard becomes apparent, professional societies and other communities, such as the Open Microscopy Project and Research Data Alliance, work to create standards, which then enable information and data to be stored with a consistent structure and reused more easily.

Where Will The Balance Lie?

Data sharing is still a growing area of scholarly communication. Over time, it is likely that more data types will become codified, with appropriate structured repositories to hold them. On the other hand, it is the nature of academic endeavour that researchers do new things. It takes a long time for any new technique to be widely adopted as a gold standard, and even longer for a standard to be agreed on. Very often, the data that researchers are gathering is of a unique type, specific to the work that they are doing. Nobody knows just how much unstructured data is generated in the countless labs and offices around the world, but there is undoubtedly a lot, and much of it sits on computers under desks or on Dropbox.

With funders and institutions increasingly asking for data to be available for review and reuse, it’s clear that all researchers need appropriate data sharing solutions and so both types of repositories are needed.