Last month, Nature celebrated its 150th birthday. Founded by Imperial College London scientist Sir Norman Lockyer in 1869, Nature’s remit was to share the cutting-edge research of often liberal-leaning scientists in an easily digestible manner. Content includes those legendary Nature journal papers, but also letters, reviews, and general opinion pieces that often do not feature in other journals. What better way to celebrate such a milestone birthday than to analyse Nature’s content to see how things have changed over its 150-year history?

Working with Dimensions data and some additional information from Nature, Dr Hélène Draux, a data scientist here at Digital Science, collaborated with Nature’s Richard Monastersky and Richard Van Noorden to produce infographics uncovering some of these trends, including:

  • The yearly amount of content published in Nature peaked at around 1960
  • A large amount of this is News, which has certainly dominated in the last twenty years
  • The biological sciences remain the dominant field of research, with biochemistry and cell biology as the biggest subfield
  • The keyword analysis follows science’s big breakthroughs in understanding, with early entries focused on big natural phenomena and more recent ones reflecting advances in physics and molecular biology
  • Research is increasingly a collaborative endeavour, with a growing number of authors on an article
  • The percentage of female authors has grown over time
  • The largest contributors by country are diversifying over time
  • Collaborations are becoming more multinational

What the infographics don’t tell us is the data science story behind the trends. A quick chat with Hélène revealed a whole host of data curation steps that were taken to ensure that the stories best represented research over time. For a start, the team had to identify what they were going to count as a piece of research output. A quick dive into Dimensions shows that the most prolific Nature author is David Cyranoski, with 441 publications, largely down to the fact that he is an Asia-Pacific Correspondent for Nature, rather than a researcher publishing novel findings.

A review of the past 150 years allows us to hold a mirror up to the face of research, and see how it has changed. Knowledge of the history of research culture helps make sense of some of the trends we see. For example, Nature grew more selective in the later decades of the 20th century, in part because of changes in editorial leadership. As Hélène puts it, earlier contributions were “letters to the world”, a chance for scientists and natural philosophers to showcase their latest ideas freely.

The comprehensive nature of Dimensions metadata allows for many of these trends to be teased out of this vast pool of information. For example, article-level categorisation of papers allows us to see trends in publication volumes across different fields of research. Indeed, the detailed article metadata adds further context to trends seen over time, such as the prevalence of water in the titles and abstracts of the research articles published in the first few decades of Nature. Dimensions fields of research show that water was a fundamental focal point of study across many fields of natural and physical science. As our understanding of the behaviour of water developed, these fields became more specialised.

The areas of interest also track technological advances in research. While telescopes opened up research on a cosmological scale, the development of microscopes shrank the topics of interest down to human-sized problems, and further still to the cells, proteins and cell contents that make us who we are, with the quantum realm becoming truly popular from the 2000s onwards.

While investigating the data behind these keywords, Hélène recalls that the team was surprised to find the term “gene” in use before it had been invented. Nature may be at the forefront of research, but this would still have been an impressive feat. Thanks to the diversity of knowledge within the team, the observation was quickly debunked on closer inspection. The definition of gene that we use today was first used in the early 1960s, so why were there references to genes before then? Was the word being used to describe something else? Was the city of Gene a hotbed for research breakthroughs? No; there was a much simpler explanation. In many journals, formatting requires longer words to be hyphenated and split across two lines. The appearances of ‘gene’ picked up in the analysis were actually down to many a ‘generally’ that had been chopped in two to become ‘gene-’ and ‘rally’. This is one example of why it is important to have a range of people with different experiences working to solve a problem: without this knowledge, making sense of the data would have taken even longer. It also reminds us to remain critical when we analyse data, and to be mindful of the logical context within which data falls.
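For readers curious how such a false positive can arise and be removed, here is a minimal Python sketch. It is not the pipeline the team used: the function names, the regular expression and the sample sentence are all illustrative assumptions. It simply rejoins words split by a hyphen at a line break before counting a keyword.

```python
import re
from collections import Counter

# Assumed pattern: a word fragment, a hyphen, a line break, then the rest of the word.
_LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


def count_term(text: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of `term`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)[term.lower()]


# Invented sample text: 'generally' is split over two lines, as in old typesetting.
sample = "Such results are gene-\nrally accepted by naturalists."

raw_hits = count_term(sample, "gene")                  # counts the stray 'gene-' fragment
clean_hits = count_term(dehyphenate(sample), "gene")   # fragment rejoined into 'generally'
print(raw_hits, clean_hits)
```

A rule this naive also glues together genuinely hyphenated compounds that happen to break at a line end (“well-” and “known” would become “wellknown”), so a real clean-up would likely check rejoined words against a dictionary. That is another reason to keep a human eye on the results.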

While the results of this retrospective look at Nature’s content show us how research is changing over time, often for the better, the study also gives a flavour of the level of analysis that is possible when high-quality data is available. Better metadata and more open research give us higher quality data. The study is also a reminder to remain sceptical of trends in data until they can be understood in context, and of the benefits of having a diverse range of skills and experiences within a team to make sense of the trends being identified.