Once you have built and trained your ML model, you have to maintain it in production. This great post provides a detailed overview of the options and challenges involved in doing this effectively.

In general, one crucial task is validating your incoming data, i.e., checking and ensuring that your input data is clean before using it to train your model.

You can always perform this task manually by coding these checks yourself. However, if you are using Apache Spark, you should consider using the Deequ library.

In this post I give a quick overview of why you should monitor your ML pipeline, and I provide some example Java code that implements input data validation using the Deequ library.

Monitoring ML in production overview

Generally speaking, there are three important things to do when monitoring an ML pipeline.

  1. Validate your input data before training your model

    You perform basic checks on:

    • Data format: schema, file format, etc.
    • Data and feature distributions: null values, duplicate values, minimum and maximum values, categorical values, etc.
  2. Validate your model before deployment

    By keeping a hold-out dataset, you can evaluate the performance of the new model on this set. If this check fails, you should break the pipeline, i.e., not deploy the model (see the sketch after this list).

  3. Rollback easily with CI/CD

    By setting up a continuous integration/continuous deployment (CI/CD) pipeline, whenever a check fails, you can easily break the deployment and roll back to the previous working model.
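For step 2, a minimal sketch of such a validation gate in Java might look as follows. It uses Spark MLlib's BinaryClassificationEvaluator; the paths, the metric, and the 0.80 threshold are illustrative assumptions, not prescriptions:

```java
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ModelValidationGate {

    // Hypothetical quality bar: tune it to your own use case.
    private static final double MIN_AUC = 0.80;

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ModelValidationGate")
                .getOrCreate();

        // The hold-out set and candidate model paths are placeholders.
        Dataset<Row> holdOut = spark.read().parquet("s3://my-bucket/hold-out.parquet");
        PipelineModel candidate = PipelineModel.load("s3://my-bucket/candidate-model");

        // Evaluate the candidate model on data it was not trained on.
        double auc = new BinaryClassificationEvaluator()
                .setMetricName("areaUnderROC")
                .evaluate(candidate.transform(holdOut));

        if (auc < MIN_AUC) {
            // A non-zero exit code breaks the CI/CD pipeline,
            // so the previous working model stays in production.
            System.err.println("Candidate AUC " + auc + " is below " + MIN_AUC);
            System.exit(1);
        }

        spark.stop();
    }
}
```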

Validate your input data

I will describe here how you can implement the first step (“validate your input data”) of the monitoring pipeline described above in Apache Spark.

Deequ

Deequ is a library developed by AWS Labs. It is built for Apache Spark and implemented in Scala.

According to the project homepage:

Deequ’s purpose is to “unit-test” data to find errors early, before the data gets fed to consuming systems or machine learning algorithms.

The library lets you define data checks in a declarative way, and it generates and runs the Spark code that performs these checks for you automatically.

Example Java code

You can find the full list of examples here on GitHub.

All examples are implemented in Scala. If you are creating Spark jobs using Java, you may find the following example code useful.

This is a super-easy “Hello, world” example, just to help you set up your Java code and start experimenting with Deequ.
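To use Deequ, you first need it on your classpath. With Maven, the dependency looks roughly like this (the version shown is illustrative; pick the release that matches your Spark and Scala versions):

```xml
<dependency>
    <groupId>com.amazon.deequ</groupId>
    <artifactId>deequ</artifactId>
    <!-- Illustrative version: choose the release matching your Spark build -->
    <version>2.0.7-spark-3.5</version>
</dependency>
```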

To summarize:

  • Create a standard Spark job. To run it, you have to spark-submit it.
  • Create a new SparkSession and use it to read some dataset (a Parquet in this example).
  • Create an instance of VerificationSuite. You will add all the checks you need to this object. In this example, I just check that some column named data_id is complete, i.e., it contains no null values.
  • By running the VerificationSuite you get back a VerificationResult that contains the results of your checks.
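Putting these steps together, here is a minimal sketch of such a job (the input path is a placeholder, and the Scala interop details, such as the explicit empty constraint list and the Option hint, may need small adjustments depending on your Deequ version):

```java
import java.util.ArrayList;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import com.amazon.deequ.VerificationResult;
import com.amazon.deequ.VerificationSuite;
import com.amazon.deequ.checks.Check;
import com.amazon.deequ.checks.CheckLevel;
import com.amazon.deequ.checks.CheckStatus;
import com.amazon.deequ.constraints.Constraint;

import scala.Option;
import scala.collection.JavaConverters;

public class DeequHelloWorld {

    public static void main(String[] args) {
        // A standard Spark job: create the session, then run it with spark-submit.
        SparkSession spark = SparkSession.builder()
                .appName("DeequHelloWorld")
                .getOrCreate();

        // Read the input dataset (a Parquet file; the path is a placeholder).
        Dataset<Row> data = spark.read().parquet("s3://my-bucket/input-data.parquet");

        // Declare the check. Deequ is written in Scala, so from Java we pass
        // explicitly what Scala callers get as default arguments: the empty
        // constraint list and the Option hint.
        Check check = new Check(
                CheckLevel.Error(),
                "data_id completeness check",
                JavaConverters.asScalaBuffer(new ArrayList<Constraint>()).toList())
                .isComplete("data_id", Option.empty());

        // Run the suite: Deequ generates and runs the Spark code for the checks.
        VerificationResult result = new VerificationSuite()
                .onData(data)
                .addCheck(check)
                .run();

        if (result.status() == CheckStatus.Success()) {
            System.out.println("The input data passed all checks.");
        } else {
            // Break the pipeline here: do not train on invalid data.
            System.err.println("The input data failed validation.");
            System.exit(1);
        }

        spark.stop();
    }
}
```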

Conclusions

In this post I have explained the basic steps to monitor your ML pipeline. I have also described how you can create an Apache Spark job to inspect and check your input data using the Amazon Deequ library.