NLP Series: Natural Language Processing and Paper Digest

20th May 2020

Suze Kundu

This latest article in our blog series on Natural Language Processing comes from the co-founders of 2019 Catalyst Grant winners Paper Digest. Dr Yasutomo Takano is a project researcher at the University of Tokyo. Dr Cristian Mejia is a specially appointed assistant professor at TokyoTech. Nobuko Miyairi is a strategic advisor at Paper Digest and a scholarly communications consultant.

What is Paper Digest?

Paper Digest is an automated summarisation service specialising in academic literature. It aims to help non-native English-speaking researchers by reducing the burden of reading the ever-increasing pile of research articles written in English. As ‘English as a second language’ (ESL) researchers ourselves, we keenly felt this disadvantage and decided to develop this tool. To our surprise, it has been well received by native English speakers as well, because everyone can benefit from such a time-saving tool.

At its core, Paper Digest helps users assess whether a given academic article is worth their time for more careful reading. It does this by offering a list of sentences picked verbatim from the document, which are expected to provide more information than the abstract does. In NLP parlance, this is known as extractive summarisation.

A pile of papers — Paper Digest is a tool that summarises the key points of an academic paper using natural language processing and extractive-based summarisation

How Paper Digest works

To find the key pieces of information, instead of reading the article in a linear manner from introduction to conclusion, we use the analogy of networks. Imagine that we decompose the article into sentences and mix them together in a box. As we draw sentences from the box, we use string to tie each one to any previously drawn sentence that it resembles. By scrambling the sentences we have lost contextual information. However, we can still assess whether two sentences are similar by looking at their vocabulary: whether they use the same keywords or synonyms, or refer to the same concepts. The more similar they are, the shorter the string we use to tie them. Once the box is empty, we end up with what resembles a network of sentences. Here and there we may find groups of tightly connected sentences, in which at least one sentence plays a central role in keeping the bundle together. What Paper Digest presents, as a result, is a list of those central sentences.
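The box-and-string analogy can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not Paper Digest's actual algorithm: the sentence splitter is deliberately naive, the stopword list is a made-up sample, similarity is plain cosine similarity over word counts (so no synonyms or concepts), and "central" is simplified to the sentence with the highest total similarity to all others. A higher similarity corresponds to a shorter string in the analogy.

```python
import re
from collections import Counter
from math import sqrt

# Tiny illustrative stopword list (an assumption, not taken from Paper Digest).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and",
             "is", "are", "we", "that", "this", "for", "with", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def bag_of_words(sentence):
    """Lowercased word counts with stopwords removed."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def central_sentences(text, k=2):
    """Return the k sentences most strongly tied to the rest, in document order."""
    sentences = split_sentences(text)
    bags = [bag_of_words(s) for s in sentences]
    n = len(sentences)
    # Weighted degree: the sum of a sentence's similarities to every other sentence.
    scores = [sum(cosine(bags[i], bags[j]) for j in range(n) if j != i)
              for i in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = ("Graphs model sentences as nodes. "
              "Similar sentences share keywords in graphs. "
              "The weather was sunny yesterday. "
              "Keywords link sentences in graphs.")
    for sentence in central_sentences(sample, k=2):
        print(sentence)
```

In the sample text, the sentence about the weather shares no vocabulary with the others, so it ends up unconnected and is never selected. A fuller treatment would first detect the tightly connected groups and then pick a central sentence from each.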

As a baseline, the above approach works, but a lot of effort has gone into optimising our methodology, from how to better split the document into sentences to better definitions of what being “similar” means. Typical NLP evaluation methods such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU) show that our algorithm performs well. However, we refrain from using those evaluation scores, because in the end there is still a gap between what a machine and what a human understands to be a good summary.
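For readers unfamiliar with these metrics, here is a minimal sketch of the simplest ROUGE variant, ROUGE-1, which counts the single words a candidate summary shares with a human-written reference. The article does not say which ROUGE variant or tooling was used, so the function names and the tokenisation below are our own assumptions; real evaluations normally rely on an established package that also handles stemming and longer n-grams.

```python
import re
from collections import Counter


def unigrams(text):
    """Lowercased word counts; a deliberately simple tokenisation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge1(candidate, reference):
    """Return ROUGE-1 recall, precision and F1 for a candidate summary."""
    cand, ref = unigrams(candidate), unigrams(reference)
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


if __name__ == "__main__":
    print(rouge1("the cat sat", "the cat sat on the mat"))
```

A score like this only measures word overlap with a reference, which is one reason a high value does not guarantee that a human reader would find the summary useful.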

The future of Paper Digest and NLP

Our current focus is to better understand our users’ needs and optimise the algorithm accordingly. For instance, a Ph.D. student who is interested in writing a review article might be looking for methods and statistical significance tests, while someone writing a science communication piece might be interested in other things. The two may well disagree about which sentences in the same document are the ‘most important’. To capture these nuances, we have put in place a feedback system in our interface, so that users can give a ‘like’ to each of the extracted sentences to indicate their agreement. As this feedback accumulates over time, our users play a huge role in helping us improve the algorithm and make this application of NLP useful for as broad a range of people as possible.