NLP Series: NLP and Digital Science
Continuing our blog series on Natural Language Processing, today’s article is from Steve Scott, Director of Portfolio Development at Digital Science. As a member of the founding management team, Steve has been involved in the majority of Digital Science’s early-stage portfolio investments, taking founders through product and business model validation to launch and growth. He has given out 32 Catalyst Grant awards since their inception, with five recipients going on to become Digital Science portfolio companies. An entrepreneur himself, Steve has founded, or been involved in setting up, three of his own companies. In his spare time, Steve enjoys building and riding his own bikes.
The value of NLP
In 1950 Alan Turing wrote a paper, “Computing Machinery and Intelligence”, in which he outlined what we now know as the Turing Test. In it he says, “A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” While examples of his test portrayed in films usually use speech as the communication mechanism, text, the main ingredient for NLP, is equally suitable for a Turing Test. If a computer could write a prize-winning novel, or analyse a researcher’s writing and act as their editor, it would pass a form of Turing Test. Computer vision and speech recognition have improved dramatically over the last 10 years, and NLP is now widely seen as the next key challenge in deep learning: enabling computers to make sense of human language in ways that are valuable.
From a layman’s perspective, NLP allows non-programmers to extract useful information from computer systems. Think of the way Gmail automatically sorts your inbox into different categories and controls your spam folder, or how Alexa or Google Home can translate your voice into commands that play music, answer questions, or switch on a light in your home. Smart homes of the future will also be more energy-efficient as they learn their inhabitants’ patterns and behaviours.
NLP attempts to make sense of unstructured data such as text, and that data comes in an almost endless variety of forms: papers, emails, abstracts, grant applications, and more. Our challenge is to find real-world problems and apply NLP to help overcome them.
From a Digital Science perspective, the two companies that best highlight the application of NLP to research challenges, Dimensions and Ripeta, share a number of features that capitalise on NLP for the benefit of their customers.
Over the last 10 years, Digital Science has funded and supported solutions to address the rapid growth in data generated by scientific research. The application of AI and machine learning to this data, in the form of unstructured textual data, has become a key focus for us. Our solutions allow for, among other things, better job-matching, improved conference identification, improved written English in papers, and automated reports evaluating reproducibility. I want to focus on two examples of the application of NLP in action.
Dimensions is a scholarly search database that focuses on the broader set of use cases that academics now face. By including awarded grants, patents, and clinical trials alongside publication and Altmetric attention data, Dimensions goes beyond the standard publication-citation ecosystem to give the user a much greater sense of context of a piece of research. All entities in the knowledge graph may be linked to all other entities. Thus, a patent may be linked to a grant, if an appropriate reference is made. Books, book chapters, and conference proceedings are included in the publication index. All entities are treated as first-class objects and are mapped to a database of research institutions and a standard set of research classifications via machine-learning techniques.
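To make the idea of "all entities linked to all other entities" concrete, here is a minimal sketch of such a knowledge graph. The class and identifiers are illustrative assumptions, not Dimensions' actual schema:

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy graph of research entities (grants, patents, publications, ...).

    Illustrative only; Dimensions' real data model is not public.
    """

    def __init__(self):
        self.entities = {}             # id -> (entity_type, title)
        self.links = defaultdict(set)  # id -> ids of linked entities

    def add(self, entity_id, entity_type, title):
        # Every entity type is a first-class object in the same store.
        self.entities[entity_id] = (entity_type, title)

    def link(self, a, b):
        # Links are stored in both directions, so context can be
        # traversed from either side (e.g. patent -> grant -> patent).
        self.links[a].add(b)
        self.links[b].add(a)

    def neighbours(self, entity_id, entity_type=None):
        """Entities linked to entity_id, optionally filtered by type."""
        return [e for e in self.links[entity_id]
                if entity_type is None or self.entities[e][0] == entity_type]
```

With this shape, asking "which patents reference this grant?" is just a filtered neighbour lookup, which is what gives the user the broader context described above.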
One of the challenges faced by the Dimensions development team was how to classify publications, grants, policy papers, clinical trials, and patents using a common approach across types. This is key to allowing cross-referencing between multiple content types. In Dimensions, standardized and reproducible subject categorization is achieved algorithmically using an NLP approach. The team started by giving a subject expert the capacity to build a classification based on a set of search terms. Starting with a general search term, or a longer constructed search string, the expert starts to amass an inclusive set of objects that fall into the presumptive category. Concepts are extracted from the corpus that has been returned and the expert can then boost particular keywords, re-ranking the search results to produce a different relevance score, or they can exclude objects that include particular terms. After repeating this process the expert (who is an expert in the subject but not an expert in computer coding) can define a field in a way that a computer can understand. This approach allows the computer to codify a set of rules that can be applied reproducibly to any content.
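The expert-in-the-loop process above (search, boost keywords, exclude terms, re-rank) can be sketched as a simple scoring loop. This is an assumption-laden toy using raw term frequency, not Dimensions' actual algorithm:

```python
def relevance(text, terms, boosts, excluded):
    """Score one document against a category definition.

    terms    - search terms defining the presumptive category
    boosts   - {term: weight} multipliers chosen by the subject expert
    excluded - terms whose presence disqualifies the document

    Toy term-frequency scoring; the real system's relevance model
    is far more sophisticated.
    """
    words = text.lower().split()
    if any(term in words for term in excluded):
        return 0.0
    return float(sum(words.count(term) * boosts.get(term, 1.0)
                     for term in terms))

def classify(corpus, terms, boosts=None, excluded=(), threshold=1.0):
    """Return documents that clear the relevance threshold, best first."""
    boosts = boosts or {}
    scored = sorted(((relevance(doc, terms, boosts, excluded), doc)
                     for doc in corpus), reverse=True)
    return [doc for score, doc in scored if score >= threshold]
```

Each round of expert feedback simply changes the `boosts` and `excluded` arguments and re-runs `classify`; once the expert is satisfied, those arguments are the frozen, reproducible definition of the category.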
One of the problems in categorizing articles is the human factor. We are constantly learning and changing our opinion, so in order to have a standardized basis for analysis of categories we need to remove the vagaries of the human classifier. Using NLP, we can build a useful, reproducible definition of an arbitrary categorization system that will automatically be applied to any new content that is brought into Dimensions.
In a similar fashion to Dimensions, Ripeta has been trained on research outputs. Ripeta aims to improve the transparent and responsible reporting of research, allowing stakeholders to take stock of their reproducibility and responsible reporting programmes and enhance their practices. Analysing more than 100 variables within a text that relate to reproducibility, Ripeta gives the user an assessment of how likely it is that the paper’s results can be reproduced. Looking for elements such as the study purpose, code acknowledgements, data availability statements, and the software used (along with version numbers), it produces what is in effect a credit score for the paper. Publishers and grant funding bodies can now analyse their archives, and future grants and publications, to ensure that funding is being used to conduct transparent and reproducible science.
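The flavour of such a check can be sketched with a few toy pattern matches and a naive aggregate score. The signal names and regular expressions below are illustrative assumptions, not Ripeta's actual (and far more sophisticated) models:

```python
import re

# Toy detectors for three of the reproducibility signals mentioned above.
# Patterns are illustrative guesses, not Ripeta's methodology.
CHECKS = {
    "data_availability": re.compile(
        r"data (are|is) available|data availability statement", re.I),
    "code_availability": re.compile(
        r"code (is|are) available|github\.com", re.I),
    "software_version": re.compile(
        r"\b(Python|SPSS|MATLAB|Stata)\s+v?\d+(\.\d+)*", re.I),
}

def reproducibility_report(text):
    """Flag which signals appear in the text, plus a naive 0-1 score."""
    hits = {name: bool(pattern.search(text)) for name, pattern in CHECKS.items()}
    # Fraction of signals detected: a crude stand-in for the "credit score".
    hits["score"] = sum(hits.values()) / len(CHECKS)
    return hits
```

A real system would weight signals by importance and use trained models rather than regular expressions, but the output shape, per-signal flags rolled up into a single score, is the same idea as the credit score described above.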
What these companies offer are ways to increase efficiency, reduce costs, and ultimately support better research. In the case of Dimensions, that means giving users a much greater sense of the context of a piece of research. For Ripeta, it means shining a light on funded research to ensure that efforts around reproducibility are improving.
In the next ten years, we will see NLP capabilities expand and be embedded in new products and services, helping researchers navigate ever-expanding data outputs and allowing new ways to extract and interpret meaningful analysis from past and present papers.