Mining patents in the Cloud (part 1): the SureChem data processing pipeline
May 28, 2012
This is the first part of a three-part blog series on how the SureChem team have rebuilt their data processing pipeline using AWS tools.
Digital Science recently launched SureChemOpen, a free service that helps research chemists find interesting chemistry in patents. This article is about the text mining infrastructure that makes SureChemOpen possible.
To start with, we’ll talk about what SureChemOpen is, what it’s used for, how to use it, and the data required to enable patent chemistry searching. We’ll then describe the text mining process: i.e. how to go from a textual patent document, through annotation and chemistry detection, to building a database of patent chemistry. The SureChemOpen pipeline is built using Amazon Web Services technologies, and a future article will describe how we implemented the text mining pipeline in the cloud using EC2, SQS, S3, and other technologies. Also to come is a design discussion covering the scalability, reliability, complexity, data integrity, and performance of our cloud-based data mining implementation.
What is SureChemOpen?
SureChemOpen is a search engine for chemists interested in patent chemistry, such as researchers at institutions working on drug discovery. Typical uses for SureChemOpen are to check if particular compounds have been protected (and thus may or may not be patentable), or to identify new or unexplored types of compound which may be candidates for research projects. SureChemOpen is the free member of the SureChem product portfolio, allowing low-intensity searches at no cost. Premium offerings (SureChemDirect and SureChemPro) are planned for release later this year.
At the core of SureChemOpen is a text mining pipeline, which we’ll describe in detail in the next section. But fundamentally, we start with a corpus of patent documents, run them all through a cloud-based data processing pipeline, and in the process build up a collection of chemical name annotations, chemicals found in images, and where possible, chemical structure data for names.
Every chemical we find is added to a searchable database, which allows chemists to find “interesting” chemistry. A typical chemistry search might involve entering one compound, and searching for any compounds with a 95%+ similarity level. Or a chemist may enter one part of a chemical compound, and search for all other compounds that contain that fragment as a “substructure”. After finding one or more interesting compounds, the chemist will naturally want to view the documents that contain them. Clicking one or more structures from a SureChemOpen structure results list shows a list of matching documents; these can then be opened, and the matching names or image annotations shown.
The Text Mining Process
So what is really involved in generating data for SureChemOpen? The data mining process can be broken down into the following discrete tasks.
1. Text annotation
The first step in the pipeline is text annotation. Here, we take the raw text of the document (typically HTML, provided by patent offices), and run it through a machine-learning based named-entity recognition tool, referred to as the SureChem Entity Extractor (EE). The tool is used to find “systematic” chemical names.
The SureChem EE identifies chemical names in text by first tokenizing around white space and other significant separators, then calculating a probability for whether each token is chemical.
The probability is calculated based on which “n-grams” the token contains, where the presence of certain n-grams is a strong indicator that the full token is actually a systematic chemical. An n-gram is a sequence of characters of length n; so for example the word “example” has the following 4-grams: exam, xamp, ampl, and mple. We identify the 4-grams of each potential chemical name, then combine the “chemical” probability for each of these 4-grams to get the overall likelihood that the name is chemical. A finely tuned threshold ensures a high “F-measure” score, meaning that we find the vast majority of chemical names, and very few false-positives.
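The idea above can be sketched in a few lines of Python. This is a minimal illustration only: the per-4-gram probabilities and the combination rule (a geometric mean here) are stand-in assumptions, not the SureChem EE’s actual model, which learns its probabilities from training data.

```python
from math import prod

def ngrams(token: str, n: int = 4) -> list[str]:
    """Extract all character n-grams from a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# Hypothetical probabilities that a 4-gram appears in a systematic
# chemical name (a real model learns these from labelled data).
CHEM_PROB = {"meth": 0.95, "ethy": 0.92, "thyl": 0.93, "hylb": 0.90,
             "ylbe": 0.88, "lben": 0.91, "benz": 0.97, "enze": 0.96,
             "nzen": 0.96, "zene": 0.94}

def chemical_score(token: str, default: float = 0.05) -> float:
    """Combine per-4-gram probabilities into a token-level score
    (geometric mean; an illustrative combination rule)."""
    probs = [CHEM_PROB.get(g, default) for g in ngrams(token.lower())]
    return prod(probs) ** (1 / len(probs)) if probs else 0.0

print(ngrams("example"))  # ['exam', 'xamp', 'ampl', 'mple']
print(chemical_score("methylbenzene") > chemical_score("example"))  # True
```

A token whose 4-grams are common in chemical names scores far above the tuned threshold; a prose word like “example” does not.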
A machine-learning model isn’t all there is, however. We use dictionaries to find well known drug names, as well as heuristics and certain post-processing steps to improve the quality of our annotations.
There are two outputs of the annotation task: annotations and names. Annotations are simply start and end positions for the chemicals in the document; these are stored in a database and can be extracted later for rendering the document with chemistry. The names are sent on to the next “downstream” task for further processing.
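An annotation, then, is little more than a pair of character offsets into the document text. The record below is a sketch of that idea; the field names and document ID are illustrative, not SureChem’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """A chemical-name annotation: character offsets into a document.
    (Field names here are illustrative, not SureChem's schema.)"""
    doc_id: str
    start: int
    end: int

text = "A solution of 2-methylpropane in hexane was stirred."
ann = Annotation(doc_id="US1234567", start=14, end=29)

# The annotation is stored; the name itself is sent downstream.
name = text[ann.start:ann.end]
print(name)  # 2-methylpropane
```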
2. Convert names to structures
Next, we try to generate chemical structures for every name detected in the previous step. There are several third-party tools (both commercial and open source) that take one or more names, then provide the chemical structure data that the name corresponds to. The most common structure format output by the tools is the .mol file.
We try our best to convert the name by passing it to five different tools. This can result in more than one chemical compound being generated for a given name; we capture everything and ensure that searching and exporting handle these cases appropriately.
Names that can be converted are sent, with all generated chemistry, for standardization and storage (see step 4, below). Non-converting names, however, are sent for OCR correction...
3. Optical Character Recognition Correction
Unfortunately, not every name can be converted to a structure. Sometimes this is because the name just isn’t known to the tools, and occasionally because we’ve falsely identified some text as a chemical. But often, names don’t convert because they contain errors introduced through Optical Character Recognition (OCR).
Many patents (even newly published ones) are digitized using OCR, which can mean slight mistakes in chemical names because OCR classifiers are typically trained to recognize prose, rather than systematic names. Common OCR errors include spurious spaces being inserted into chemical names (often around commas, as is typical in prose), or certain numbers being changed to similar looking letters (the number 1 changed to the letter l, for example).
The next step in our pipeline is to try to correct these errors. We use a combination of heuristics, dictionary lookups, and third-party tools to create correction candidates. Every correction candidate is sent back to the previous step; if it converts to a structure, it is treated like a non-corrected name and sent for structure standardization and storage.
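Two of the common OCR errors described above can be attacked with very simple heuristics, sketched below. These two rules are only an illustration; the real pipeline combines many heuristics with dictionaries and third-party tools.

```python
def correction_candidates(name: str) -> set[str]:
    """Generate candidate fixes for two common OCR errors:
    spurious spaces after commas, and 'l' misread for '1'.
    (Illustrative heuristics only.)"""
    candidates = set()
    # OCR inserts spaces after commas, as is typical in prose.
    despaced = name.replace(", ", ",")
    if despaced != name:
        candidates.add(despaced)
    # Try swapping each letter 'l' back to the digit '1'.
    for i, ch in enumerate(despaced):
        if ch == "l":
            candidates.add(despaced[:i] + "1" + despaced[i + 1:])
    candidates.discard(name)
    return candidates

# 'l,2-dichloroethane' was probably '1,2-dichloroethane':
print(correction_candidates("l,2-dichloroethane"))
```

Note that the heuristics over-generate (swapping a genuine ‘l’ inside “dichloro” produces a nonsense candidate); that’s fine, because only candidates that actually convert to a structure survive the round trip through the previous step.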
4. Structure Standardization and Storage
Every structure generated in the SureChemOpen data processing pipeline is ultimately processed by what we typically call our “Structure Handler”. The Structure Handler is responsible for processing every chemical generated by earlier steps. This means standardization, error checking, chemical property calculation and storage.
We use a third-party standardizer and error checker provided by ChemAxon, which (using a custom configuration) ensures the output is a valid chemical in a consistent form. Automated chemistry extraction can generate spurious chemicals, so by running a series of careful checks (such as checking size or atomic makeup) we ensure that only meaningful chemistry is stored. Similarly, standardization steps such as de-aromatization ensure that all chemicals are in a consistent form, making chemists’ lives easier and reducing duplication in our database (see below).
Chemicals that pass standardization and error checking are added to a searchable database, along with a number of derived properties. After being added to this database, the chemicals will appear in search results on SureChemOpen. Often, different names will generate the same chemical structure (think “water” and “dihydrogen monoxide”); in these cases we detect the duplicate chemistry and store only one searchable chemical.
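Duplicate detection falls out naturally once every chemical has been standardized: keying storage by the canonical form means two names that standardize to the same structure share one record. A minimal sketch of that idea (the in-memory dict stands in for the real database):

```python
# Canonical (standardized) structure -> chemical ID.
structures: dict[str, int] = {}
next_id = 1

def store(canonical: str) -> int:
    """Return the existing ID for a duplicate structure,
    or assign a fresh ID for a new one."""
    global next_id
    if canonical not in structures:
        structures[canonical] = next_id
        next_id += 1
    return structures[canonical]

# "water" and "dihydrogen monoxide" standardize to the same structure:
id_a = store("O")  # from the name "water"
id_b = store("O")  # from the name "dihydrogen monoxide"
print(id_a == id_b)  # True -> only one searchable chemical is stored
```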
Each chemical successfully processed by the Structure Handler (whether stored as a new chemical or recognized as a duplicate) will now have a unique ID, which is sent on (with the originating name) to the Entity Mapper task.
5. Entity Mapping
The final step in the SureChemOpen text annotation pipeline is entity mapping. So far, we’ve seen that documents have been annotated with chemistry, names have been converted to chemical structures, and chemical structures have been stored in a searchable database. But what’s missing is a link between annotations in documents and the chemicals generated from them. Without this information, it’s impossible to find documents that match results from chemical searches; it also makes it hard to show chemical structures for annotations in documents.
The Entity Mapper, therefore, is passed pairs of names and chemical IDs, and updates the database of annotations to ensure that the relationship from annotation to chemistry is recorded.
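In sketch form, the Entity Mapper is a straightforward join: for each (name, chemical ID) pair it receives, it attaches the ID to every stored annotation carrying that name. The list-of-dicts “database” and its field names below are illustrative only.

```python
# Stand-in for the annotation database (see the annotation step above).
annotations = [
    {"doc_id": "US1234567", "start": 14, "end": 21, "name": "benzene", "chem_id": None},
    {"doc_id": "US7654321", "start": 102, "end": 109, "name": "benzene", "chem_id": None},
]

def map_entities(pairs: list[tuple[str, int]]) -> None:
    """Record the annotation-to-chemistry link for each (name, ID) pair."""
    ids = dict(pairs)
    for ann in annotations:
        if ann["name"] in ids:
            ann["chem_id"] = ids[ann["name"]]

map_entities([("benzene", 42)])
print([a["chem_id"] for a in annotations])  # [42, 42]
```

With this link in place, a structure hit in the chemistry database can be followed back to every document, and every annotation, that produced it.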
Another aspect of the data processing pipeline not mentioned above is the extraction of chemistry from images. In the SureChemOpen pipeline, this is done in a similar way to name extraction. Documents are sent to a task that retrieves clipped images from patents, and processes them using CLiDE (a third party tool for detecting chemical compounds in images).
The resulting image annotations are stored in the database, chemistry is standardized and stored by the Structure Handler, and image annotations are associated with detected chemistry. The only significant difference from text processing is that chemistry from images is aggressively filtered: it is very difficult to prevent non-chemical images from being processed, but the resulting false positives can easily be detected.
Part two of this series will focus on how we've utilised AWS technologies to build the data processing pipeline. Stay tuned.