NLP Series: Natural Language Processing to speed up academic writing
In this week’s edition of our blog series on Natural Language Processing, we hear from two members of the team at Writefull, the academic writing support tool. Dr Hilde van Zeeland is Chief Applied Linguist at Writefull. After completing an MSc and PhD in Applied Linguistics at the University of Nottingham, UK, she worked for several years as a language testing consultant and scientific information specialist before joining Writefull. Dr Juan Castro is one of the founders of Writefull. He completed his PhD in Artificial Intelligence at the University of Nottingham, UK, and held several postdoctoral positions at the same university before founding Writefull.
Writing is key to science. Whether it is journal articles, book chapters, reports or conference proceedings, most research is communicated through written texts. For most researchers, however, writing takes up more time and effort than they would like. Fortunately, we now have Writefull: a tool that uses the latest Natural Language Processing (NLP) techniques to speed up the writing process.
NLP is a strand of Artificial Intelligence that refers to the automatic understanding and generation of human language. It can be applied to many purposes, such as predictive text, automatic translation, and text categorisation. Whatever the application, NLP techniques often rely on training models on vast amounts of data. As these models process batches of data, they acquire the knowledge needed for the task at hand. For predictive text, for example, they learn recurrent linguistic strings.
NLP models and Writefull
To help with academic writing, we need models to do three things:
1) to learn the recurrent patterns of academic texts;
2) to recognise when an author’s language does not follow these patterns; and
3) to change such language so that it follows the expected patterns.
At Writefull we have spent the last few years developing and training models that do just that. We offer an editor in which researchers can write their text. They then get automatic feedback on their writing, and can accept or reject Writefull’s suggestions. The models that Writefull uses to give feedback have been trained on millions of journal articles. Thanks to this, they can spot when the author’s writing deviates from the norm – that is, from the expected language patterns as acquired from our dataset. In many cases, such deviations will be grammatical errors, but they can also include things like awkward wording or unnecessary commas.
Writefull can suggest changes to academic writing based on the likelihood that a word or sentence is correct.
Traditional language-checking software uses grammar rules to check for fixed elements in a sentence. For example, it might ensure that the right prepositions precede certain nouns by coding rules such as: correct ‘at progress’ into ‘in progress’.
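To make the contrast concrete, here is a minimal sketch of how such rule-based correction works, using a small hypothetical rule set (the rules and function names are illustrative, not Writefull’s actual implementation):

```python
import re

# Hypothetical rule set: each entry maps a fixed incorrect phrase
# to its correction, applied regardless of wider context.
RULES = {
    r"\bat progress\b": "in progress",
    r"\bin regard of\b": "with regard to",
}

def apply_rules(text: str) -> str:
    """Apply each hard-coded correction rule to the text in turn."""
    for pattern, replacement in RULES.items():
        text = re.sub(pattern, replacement, text)
    return text

print(apply_rules("The experiment is at progress."))
# → The experiment is in progress.
```

Note that each rule fires whenever its pattern matches, with no awareness of context — the limitation discussed below.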
Writing rules is definitely easier than training models. However, once models work well, they are much more powerful. Rules are limited: even thousands of rules wouldn’t cover all of the mistakes that authors can make, whereas models can cope with any input, because their knowledge generalises to any sentence. To give an example, Writefull recently corrected ‘time of the day and day of the week’ into ‘time of day and day of the week’. Writefull knew that, in this context, ‘the’ precedes ‘week’ but not ‘day’. There are many of these usage-based norms, and it is impossible to cover them all in a rule set, but a sufficiently trained model will eventually learn them.
Another downside of rules is their black-or-white nature. If an author’s sentence triggers a rule, it will be corrected regardless of the context. This may lead to false corrections. Models, on the other hand, look at the context to judge what suggestions are needed and, based on this, can give nuanced feedback. When Writefull spots that something is off in a text, it often gives the author the probability of their phrase and compares this to alternatives. For example, when writing “He is sitting on the sun” in the Writefull editor, Writefull shows that “He is sitting in the sun” is a more probable alternative, with 82% likelihood for the latter versus 18% for the former. In cases like this, Writefull does not give a harsh correction, but an insight into the likelihood of the author’s wording versus alternatives. Language correctness is, after all, not always black-or-white. Messiness and ambiguity, both inherent to language, are two key challenges in the field of NLP.
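The idea of comparing the likelihood of two phrasings can be sketched with a toy bigram language model. This is a deliberate simplification — Writefull’s actual models are far more sophisticated — and the miniature corpus below merely stands in for the millions of journal articles mentioned above:

```python
from collections import Counter, defaultdict

# A tiny illustrative corpus standing in for a large training set.
corpus = (
    "he is sitting in the sun . "
    "she was sitting in the sun . "
    "they sat in the garden ."
).split()

# Count bigrams to estimate P(word | previous word).
bigrams = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[prev][word] += 1

def sentence_prob(sentence: str) -> float:
    """Chain of bigram probabilities, with a small floor for unseen pairs."""
    words = sentence.lower().split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        total = sum(bigrams[prev].values())
        count = bigrams[prev][word]
        prob *= (count + 0.01) / (total + 0.01) if total else 0.01
    return prob

a = sentence_prob("he is sitting in the sun")
b = sentence_prob("he is sitting on the sun")

# Normalise the two candidates against each other, analogous to the
# percentage comparison shown in the editor.
share = a / (a + b)
print(f"'in the sun': {share:.1%} vs 'on the sun': {1 - share:.1%}")
```

Because ‘sitting in’ occurs in the toy corpus and ‘sitting on’ does not, the first phrasing receives nearly all of the probability mass — the same mechanism, at a much larger scale, that lets a trained model rank an author’s wording against alternatives.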
A challenge to Writefull – and to any NLP application – is noisy input. If an author writes sentences that are very different from the language that Writefull’s models know from training (i.e., from the journal articles), Writefull may fail to give accurate feedback. Think of an author mixing up word order or making several serious grammar mistakes in one sentence. The challenge is therefore to identify those cases where it is best not to suggest anything, since a suggestion might turn out to be incorrect.
At Writefull, we’re continuously exploring avenues to make our feedback even more accurate and complete. While Writefull currently gives feedback on many language features, including the use of punctuation, prepositions, subject-verb agreement, etc., there are still plenty of science-specific features to cover. Academic writing may use virtually the same grammar as other genres, but it is highly specific in other respects, such as word use. We now have the technology in-house to expand – and in doing so, we’re keeping a close eye on developments in the NLP field.