Mining Patents in the Cloud (part 2): Amazon Web Services
June 15, 2012
In part one of this series of articles on SureChem's text mining process, we described the different processing steps required to extract searchable chemistry from patent documents. In part two below, we describe how the pipeline is implemented using different cloud services provided by Amazon.
To implement the text mining process described previously, we use Java-based processes integrated using Amazon Web Services (AWS) technologies. AWS provides compute power via the Elastic Compute Cloud (EC2), message queuing via the Simple Queue Service (SQS), data storage via the Simple Storage Service (S3), and metrics via CloudWatch. We also use graylog2 for low-level logging. SureChemOpen code is Java-based primarily because many of our third-party dependencies are only available as Java libraries. One consequence of this is that interaction with AWS services is easily achieved using the AWS Java SDK.
The pipeline itself consists of many independent tasks running on several EC2 nodes that communicate via SQS and S3, with monitoring provided by CloudWatch and graylog2. Below, we’ll describe each of the major cloud technologies and the role they play in the pipeline.
Elastic Compute Cloud (EC2) Nodes
EC2 nodes provide the computing power for the pipeline. Each node hosts one or more Java Virtual Machine instances, each of which runs one or more tasks such as text annotation, name to structure conversion, etc. The number of JVM instances and the tasks they host is controlled via configuration files, which can be modified depending on workload. We use a combination of foreman and an in-house task management framework written in Java to manage which tasks are running in which processes.
Currently we have just one standard configuration, which has been fine-tuned for performance on a single type of processing node to make optimal use of a particular quantity of memory and CPU resource. We’re working on an automated configuration mechanism that will add or remove task instances or EC2 nodes depending on workload; the Chef automation framework helps with this.
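Our configuration format isn’t shown here, but conceptually each node’s task mix boils down to a mapping like the following (the file format, key names, and values below are purely illustrative, not our actual configuration):

```properties
# Hypothetical node configuration: which tasks each JVM hosts, and how
# many instances of each. All names and values are illustrative only.
jvm1.tasks=annotator
jvm1.annotator.instances=4
jvm1.heap=4g

jvm2.tasks=name-to-structure,structure-handler
jvm2.name-to-structure.instances=4
jvm2.structure-handler.instances=10
jvm2.heap=8g
```

Keeping this mapping in a file, rather than in code, is what lets the task mix be changed per node without a redeploy.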
Simple Queue Service
SQS provides the queuing mechanism for units of work. Each task in our pipeline is associated with a particular queue, from which it reads messages that describe units of work to perform. For example, a message may indicate that a particular document should be annotated, a name converted into structures, or a chemical structure standardized and stored in a central database.
The result of performing a particular unit of work is some combination of database writes, S3 writes, and new messages on “downstream” queues. Once a task has read a message, SQS hides it from other readers for a set period (the visibility timeout), giving the task time to perform the unit of work - this ensures that other tasks reading from the same queue won’t read the same message and perform the same work.
Once the unit of work is complete, the task deletes the message from the input queue. If, however, there is some problem in processing the message - perhaps a timeout caused by hanging code, or an exception or segmentation fault causing the process to restart - then the message won’t be deleted, and after a preset time period will be made available again. This can have some unfortunate consequences such as churning or blocked processes, so it’s important to add safeguards such as task restart delays, or error queues for messages that repeatedly fail to be processed successfully - we employ both.
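To make the receive / process / delete cycle concrete, here is a minimal, self-contained sketch of the semantics described above - visibility timeout, redelivery, and an error queue for repeatedly failing messages - using an in-memory queue rather than the real SQS client. The class and method names are ours, purely for illustration:

```java
import java.util.*;

// In-memory sketch of the SQS semantics described above: a received
// message becomes invisible for a "visibility timeout"; if it is not
// deleted in time it is redelivered, and after too many attempts it is
// moved to an error ("dead letter") queue. Illustration only - not
// SureChem's code and not the AWS SDK.
class VisibilityQueue {
    private static class Entry {
        final String body;
        int attempts = 0;
        long invisibleUntil = 0;          // epoch millis; 0 = visible now
        Entry(String body) { this.body = body; }
    }

    private final Deque<Entry> queue = new ArrayDeque<>();
    private final Map<String, Entry> inFlight = new HashMap<>();
    private final List<String> errorQueue = new ArrayList<>();
    private final long visibilityMillis;
    private final int maxAttempts;

    VisibilityQueue(long visibilityMillis, int maxAttempts) {
        this.visibilityMillis = visibilityMillis;
        this.maxAttempts = maxAttempts;
    }

    void send(String body) { queue.add(new Entry(body)); }

    /** Receive one visible message as {receiptHandle, body}, or null. */
    String[] receive() {
        long now = System.currentTimeMillis();
        // Expired in-flight messages are redelivered - or, after too
        // many failed attempts, diverted to the error queue.
        for (Iterator<Entry> it = inFlight.values().iterator(); it.hasNext(); ) {
            Entry e = it.next();
            if (e.invisibleUntil <= now) {
                it.remove();
                if (e.attempts >= maxAttempts) errorQueue.add(e.body);
                else queue.add(e);
            }
        }
        Entry e = queue.poll();
        if (e == null) return null;
        e.attempts++;
        e.invisibleUntil = now + visibilityMillis;
        String receipt = UUID.randomUUID().toString();
        inFlight.put(receipt, e);
        return new String[] { receipt, e.body };
    }

    /** Called by a task once its unit of work has completed successfully. */
    void delete(String receipt) { inFlight.remove(receipt); }

    List<String> errors() { return errorQueue; }
}
```

A task that crashes simply never calls `delete()`, so its message reappears after the timeout - which is exactly why the safeguards mentioned above matter.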
The combination of EC2 nodes and SQS (or another queuing system) with self-contained tasks that operate on individual messages provides the primary means of scaling the pipeline. To increase throughput of a particular part of the pipeline, we just add more instances of a particular task, potentially on new EC2 nodes depending on resource utilization. In our current pipeline we have four annotators running, four name-to-structure converters, and ten structure handlers - among other tasks.
Our message formatting needs are fairly simple, requiring the transmission of easily encoded strings and numbers. So for message encoding and decoding, we use Google Gson. This provides a simple, easy to use encoding / decoding mechanism allowing messages to be defined using POJOs (Plain Old Java Objects).
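As a concrete (and hypothetical) example of such a message: with Gson, a POJO like the one below encodes via `new Gson().toJson(msg)` and decodes via `gson.fromJson(json, AnnotateMessage.class)`. To keep this sketch free of third-party dependencies, `toJson()` here builds the equivalent flat JSON by hand:

```java
// Hypothetical unit-of-work message: "annotate this document".
// With Gson this class would serialize via new Gson().toJson(msg);
// toJson() hand-builds the same flat JSON purely for illustration.
class AnnotateMessage {
    String documentId;   // identifier of the patent document to annotate
    int priority;        // simple numeric field

    AnnotateMessage(String documentId, int priority) {
        this.documentId = documentId;
        this.priority = priority;
    }

    String toJson() {
        return String.format("{\"documentId\":\"%s\",\"priority\":%d}",
                             documentId, priority);
    }
}
```

Because the fields are plain strings and numbers, the wire format stays small and human-readable, which helps when inspecting queues by hand.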
Simple Storage Service
While EC2 and SQS can be combined to produce a working pipeline, one important missing ingredient is temporary storage space. SQS messages are (at the time of writing) limited in size to 64 KB, which we’ve found is not enough for every type of intermediate processing artifact we generate. Chemical structure data, for example, can exceed 64 KB, so tasks that need to transmit such data (e.g. Name-to-Structure tasks feeding Structure Handler tasks) cannot rely on the messages themselves to hold it.
As such, we use S3 for temporary storage. Essentially, S3 plays the role of a distributed file system for our pipeline. Any files required for processing are written into S3 buckets, from which they can easily be read by downstream tasks for further processing. In addition to providing the means of transmitting more than 64 KB between tasks, S3 also retains these intermediate processing artifacts, making it easier to diagnose processing errors and to re-process downstream work.
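The resulting pattern - payload inline when it fits, a pointer to S3 when it doesn’t - can be sketched as follows. This is an illustration only: the class, the message prefixes, and the Map standing in for an S3 bucket are all hypothetical.

```java
import java.util.*;

// Sketch of the pattern described above: payloads under the SQS size
// limit travel inline in the message; larger ones are written to shared
// storage (S3 in the real pipeline; a Map stands in here) and the
// message carries only a key. All names are illustrative.
class PayloadStore {
    static final int MESSAGE_LIMIT = 64 * 1024;  // SQS limit at the time: 64 KB

    private final Map<String, String> bucket = new HashMap<>();  // stand-in for S3

    /** Returns the message body to enqueue for this payload. */
    String prepareMessage(String payload) {
        if (payload.length() <= MESSAGE_LIMIT) {
            return "inline:" + payload;
        }
        String key = "artifacts/" + UUID.randomUUID();  // would be an S3 object key
        bucket.put(key, payload);                       // s3.putObject(...) in reality
        return "s3:" + key;
    }

    /** Resolves a message body back to the payload, reading storage if needed. */
    String resolve(String messageBody) {
        if (messageBody.startsWith("inline:")) return messageBody.substring(7);
        return bucket.get(messageBody.substring(3));    // s3.getObject(...) in reality
    }
}
```

A nice side effect of always writing large artifacts through a helper like this is that every oversized intermediate result is automatically retained for later inspection.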
CloudWatch
CloudWatch is a metrics-gathering service, which is automatically updated by many different AWS services. Any use of SQS, for example, is automatically recorded by CloudWatch, and numerous free metrics are made available for monitoring purposes. A simple command-line tool is available, in addition to graphs via the AWS web console.
There are seven pre-defined CloudWatch metrics captured for SQS, all of which are useful for day-to-day monitoring. One in particular has proved invaluable: NumberOfMessagesDeleted. We’ve found that a simple summation of this metric, on an hourly basis over several days, provides an excellent overview of throughput for each of our tasks. It’s also a great way of seeing how work has progressed down the pipeline, and where the bottlenecks are - simply overlay the message deletion counts for your different queues to see what happened when for a particular batch of input messages.
Using this approach, we’ve been able to determine when text annotation for a batch of documents finishes, when new names from those documents are converted to structures, and when the resulting structures are standardized and stored. All of which is useful information for tweaking pipeline configuration and estimating processing times for large datasets.
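The summation itself is simple bucketing. CloudWatch will return hourly sums directly if you request the metric with a 3600-second period; the sketch below instead aggregates raw per-queue deletion events ourselves, purely to illustrate the overlay idea (the class and event format are hypothetical):

```java
import java.util.*;

// Bucket message-deletion events into hourly totals per queue, so the
// per-queue series can be laid side by side to spot bottlenecks.
// Illustration of the overlay described above, not SureChem's code.
class ThroughputOverlay {
    /** queue name -> (hour index -> messages deleted in that hour) */
    static Map<String, SortedMap<Long, Integer>> hourlyTotals(List<String[]> events) {
        Map<String, SortedMap<Long, Integer>> totals = new TreeMap<>();
        for (String[] e : events) {                   // e = {queueName, epochSeconds}
            String queue = e[0];
            long hour = Long.parseLong(e[1]) / 3600;  // collapse to hour buckets
            totals.computeIfAbsent(queue, q -> new TreeMap<>())
                  .merge(hour, 1, Integer::sum);      // count one deletion
        }
        return totals;
    }
}
```

Plotting each queue’s series on the same axis then shows, hour by hour, when a batch moved from annotation to name-to-structure conversion to structure handling.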
graylog2
So far we’ve seen how AWS technologies provide infrastructure and monitoring for a distributed, scalable pipeline. The final piece of the puzzle is low-level logging, which is essential for any non-trivial pipeline. We use the open source graylog2 to capture and collate low-level logs generated by our JVM processes.
Like many Java-based software systems, we instrument our code using Log4J, which provides a uniform interface for capturing log messages and transmitting them to various destinations. A number of Log4J “appenders” are available for graylog2, which transmit log messages to predefined graylog2 servers via UDP. While delivery of UDP datagrams is not guaranteed (by design), it is typically sufficient for monitoring purposes. That said, we also record log entries in local files, providing a system-local copy that can easily be processed using command-line tools and also acts as a backup.
Because the SureChemOpen pipeline is complex and has many moving parts, numerous different aspects of the system require monitoring on a daily basis. Separating these aspects out of the raw stream of messages can be difficult, so we use multiple Log4J appender definitions in combination with graylog2 streams, allowing different processing aspects to be filtered via the graylog2 user interface. For example, we send all error messages to one stream, all document processing messages to another, all task startup messages to another, and so on. This allows for fast detection of errors, and makes regular status checking easy.
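A hypothetical log4j.properties fragment shows the shape of such a setup - a GELF appender for graylog2 alongside a local rolling file as the backup copy. The host name, facility, and file paths below are placeholders, and GelfAppender refers to the graylog2 Log4J appender:

```properties
# Hypothetical log4j.properties fragment; host, facility, and paths
# are placeholders. GelfAppender is the graylog2 Log4J appender.
log4j.rootLogger=INFO, gelf, file

log4j.appender.gelf=org.graylog2.log.GelfAppender
log4j.appender.gelf.graylogHost=graylog.internal.example.com
log4j.appender.gelf.facility=pipeline

log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/pipeline/tasks.log
log4j.appender.file.MaxFileSize=100MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d %p [%c] %m%n
```

Defining additional appenders per concern (errors, document processing, task startup) is then a matter of repeating the GELF stanza with a different facility.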
That's all for now, but tune in for the final part of this series, which will discuss the design of the pipeline: specifically, what worked well, what didn’t work quite so well, and a few pitfalls we encountered. In the meantime, here's an inquisitive husky enjoying a singing wine glass.