Open Data and Open Science: A New Initiative
Science is changing fast. Advances in computing combined with new methods enable scientists to collect, analyse, compare and display vast amounts of data, allowing work in areas that were previously inaccessible to science, including across disciplinary boundaries. In environmental science, new areas are emerging that improve our understanding of hazards, weather and climate, or of how genetics shapes the resilience of ecosystems. Much of this research crosses the natural and social sciences, and also opens up new applications of the science in financial services, including insurance, and in better environmental management.
The science of climate prediction illustrates many of these developments. Global coupled models of atmosphere, ocean, land and ice are run routinely in many global centres using some of the world’s largest supercomputers, resolving many of the world’s weather processes. As well as modelling the physical environment, these models routinely have a coupled ocean and land biosphere. That these models are numerically stable and produce realistic-looking results is a real change in the last twenty-five years. However, the models inevitably smooth the processes modelled, and do not always capture changes accurately. For this reason, they need to be compared carefully with detailed observations, which are also becoming available in ever-increasing detail from satellite platforms and in situ sensors. Surprises have included the rate of decrease of Arctic sea ice – which was not captured by models at the time – and the effect of forest burning in the tropics, as seen from satellites; these are now increasingly being included in models. These developments result in petabytes of data and model output which need to be compared, sifted and analysed. This is well beyond the capabilities of any one team and needs to be organised globally so that observations and different models can be compared. This is a challenge for traditional scientific publishing, which does not include publication of such diverse data along with the results in papers. Similar arguments can be made about understanding the patterns of hazards such as flooding, earthquakes and landslides, or about environmental genomics, which is starting to reanalyse the relationship between species and their evolution in the light of genetic analysis.
These new areas of science both drive and are driven by new areas of computing.
Scientists are generating petabytes of data, and need access to some of the world’s largest supercomputers in areas such as climate prediction. New approaches to science that can exploit complex existing data sets, such as data assimilation, are also emerging. Data assimilation is the process by which observations of a system are compared with and then incorporated into a numerical model of that system.
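To make the idea of data assimilation concrete, the following is a minimal sketch of a single assimilation step for one scalar quantity. It assumes the simplest textbook case, in which the model forecast and the observation each carry a known error variance and are combined by a variance-weighted (Kalman-style) update. The function name `assimilate` and all the numbers are hypothetical and chosen purely for illustration; operational systems work with millions of variables and far more elaborate schemes.

```python
def assimilate(forecast, forecast_var, observation, obs_var):
    """Blend a model forecast with an observation.

    Each value is weighted by how much it is trusted: the smaller
    its error variance, the more it pulls the result towards itself.
    """
    # Gain lies between 0 (trust the model) and 1 (trust the observation).
    gain = forecast_var / (forecast_var + obs_var)
    # Compare the observation with the forecast, then incorporate the difference.
    analysis = forecast + gain * (observation - forecast)
    # The blended estimate is more certain than the forecast alone.
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical example: a model forecasts a sea surface temperature of
# 15.0 degrees C (error variance 4.0); a sensor reports 17.0 (variance 1.0).
analysis, analysis_var = assimilate(15.0, 4.0, 17.0, 1.0)
print(f"analysis = {analysis:.2f}, variance = {analysis_var:.2f}")
```

Because the observation in this example is assumed to be more reliable than the forecast, the blended value sits closer to the observation, and its variance is smaller than that of either input.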
Teams of environmental scientists now routinely need to include computer scientists and mathematicians in addition to more traditional disciplines if they are to take advantage of these new scientific techniques. Peer-review grant panels and peer-reviewed journals also need to change to recognise the importance of this cross-disciplinary work. At this juncture, editorial boards and grant panels have the opportunity to influence the future of science by adopting policies that not only reward innovative research employing these emergent skill sets, but also provide incentives for scientists to make their data open and accessible. In spite of high-level support for open data, for instance from scientific societies such as the American Geophysical Union, scientists still report many barriers to sharing their data, including concerns about lack of recognition and legal constraints (Schmidt et al, 2016).
Reproducibility of results is essential so that publications can be peer reviewed and challenged, but it is difficult with traditional print journals if they do not support data publication – journals themselves therefore also need to change. Repositories for the data associated with publications will increasingly be necessary, and must include the metadata needed to understand and interpret the data. Good practice on open data will overcome any reluctance of peer panels and reviewers to recognise the impact of these new areas, and encourage funding agencies and journal publishers to invest in them. A good early sign is the 2015 impact factor of 8.268 given to one of the first environmental data journals, Earth System Science Data, which places it second in Meteorology and Atmospheric Science and third in Multidisciplinary Geoscience. Together with the encouragement given to data publication in Nature and other high profile journals, this will transform how widely the need for these areas of science is recognised in the more general scientific community.
In 2013, the Belmont Forum, a group of 22 prominent funders of environmental science from 17 different countries, and the European Commission, recognised the changes in this area, and established a Collaborative Research Action on Data and e-Infrastructures for Global Change Research to assess what was needed and, in turn, to develop a coordinated approach. Working groups of scientists from each participating country collaborated with the agencies and the wider scientific community to produce recommendations for improving e-infrastructure and data management in environmental sciences. The planning included issuing a questionnaire that was completed by 1,330 scientists across the world, which informed the development of the plan and policy (Schmidt et al, 2016). A policy, a set of principles, and a set of actions that each agency should adhere to were written. This policy and these principles were adopted by the Belmont Forum in 2015, and are now being implemented. The principles are:
- Widen access to data and promote the long-term preservation of data in global change research;
- Help improve data management and exploitation;
- Coordinate and integrate disparate organisational and technical elements;
- Fill critical global e-Infrastructure gaps;
- Share best practices; and
- Foster new data literacy.
These principles need in turn to be adopted by the scientists and organisations that the agencies fund, and by the journals that publish the science. The activities build on excellent work by other bodies, such as the Research Data Alliance and the Group on Earth Observations, but now also involve major scientific funding agencies.
The implementation plan calls for parallel actions in three areas. The first is to align the data policies of the funding agencies and ensure they adhere to current best practice. This activity will be supported by legal and security advisory groups, so that data can be kept securely and in a way scientists can trust, and so that issues of privacy and confidentiality can be dealt with as needed in the many different global legal systems.
Second, there is a great need for training of scientific practitioners at all levels in best practice and in novel methods, and also for informing computer scientists and mathematicians of the relevant problems in environmental sciences. In the past, new techniques could percolate gradually into a scientific discipline, but the speed of developments in science and also in commercial areas such as cloud computing means that a more proactive approach is required. Awareness-raising and training are needed so that new techniques can be incorporated into environmental science in a timely way. Many of the Belmont Forum agencies already sponsor training, and this action will allow best practice to be promulgated faster and more efficiently.
Third, because this area is evolving fast, we need exemplars to show good practice, to see where there are barriers, and to allow curricula and policies to evolve. There is also coordination to ensure that these actions cross-fertilise, and to ensure that there are strong overall links to the wider scientific community.
There will be many opportunities to influence this work, and there is a web site describing the work in more detail. We encourage you to get involved as the initiative develops.
Robert Gurney, University of Reading, UK
Robert Gurney is Director of the NERC Environmental Systems Science Centre, and a Professor at Reading University. He is a hydrologist, with a BSc from King’s College London and a PhD from Bristol. Prior to joining Reading, he was Head of the Hydrological Sciences Branch at NASA Goddard Space Flight Centre, Maryland. He is particularly interested in the use of remote sensing in studying land surface changes. He advises UK and international bodies extensively, including being Chairman of the Science Advisory Committee for the EPSRC Basic Technology Programme, and a member of the EPSRC eScience Science Advisory Committee. He also represented NERC on the RCUK eScience Advisory Committee. ESSC hosts the Reading eScience Centre, which carries out demonstration projects applying distributed computing in environmental sciences.
Schmidt B., Gemeinholzer B., Treloar A. (2016) Open Data in Global Environmental Research: The Belmont Forum’s Open Data Survey. PLoS ONE 11(1): e0146695. doi:10.1371/journal.pone.0146695