Machine Learning (ML) models are optimized for specific problems, yet their performance can fluctuate over time. ML models used in an organization's operational processes require continuous monitoring to revalidate their performance on an ongoing basis: is the model's performance still acceptable, and does the model still accurately capture the patterns in new incoming data? One of the main reasons the performance of ML models decreases over time is data drift. In this article we dive into data drift and how ML Ops contributes to handling it. This article is part of a series of blog posts on ML Ops; see a previous post for a general introduction to ML Ops.
What is data drift?
Data drift is a change in the distribution of data over time. For operational machine learning models, it is the divergence between the distribution of the baseline data set on which the model was trained and the distribution of the current real-time production data. Production data can drift away from the baseline data set over time due to changes in the real world or changes in how the data is measured. We can broadly classify three types of drift: concept drift, data drift, and upstream data changes.
- Concept drift happens when the statistical properties of the target variable itself change. The meaning of what you are trying to predict changes and therefore the model will not work well for this updated definition. For example, the definition of what is considered a fraudulent transaction could change over time.
- Data drift happens when the statistical properties of the underlying variables that predict an outcome change. A classic example is the natural drift in data due to seasonality.
- Upstream data change happens when a change in the upstream data pipeline impacts the model's performance. For example, cameras being replaced and, as a consequence, the units of measurement changing.
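To make the notion of drift concrete, here is a minimal sketch of how a shift between a baseline and a production feature distribution could be flagged with a two-sample Kolmogorov-Smirnov test. The synthetic data and the 0.01 significance threshold are illustrative assumptions, not part of any particular ML Ops tooling.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical feature values at training time (the baseline data set)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)
# Hypothetical production values whose mean has drifted upward
production = rng.normal(loc=0.5, scale=1.0, size=5000)

# The KS test compares the two empirical distributions directly
statistic, p_value = ks_2samp(baseline, production)
drift_detected = p_value < 0.01  # significance threshold chosen for illustration

print(f"KS statistic={statistic:.3f}, drift detected: {drift_detected}")
```

A statistical test like this only tells you that the distributions differ; deciding whether the difference is operationally meaningful still requires a threshold tuned to your use case.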
ML Ops as an enabler to detect data drift and easily retrain ML models
Given that the performance of an ML model can change over time once it is deployed into production, the best course of action is to monitor for changes in performance and retrain models when needed. By implementing the ML Ops principles of automation, reusability, reproducibility and manageability, you can protect your models from undesired performance degradation. ML Ops (as discussed in our ML-Ops solution) draws on DevOps principles and practices. It consists of best practices for the delivery of ML models and enables you to address the types of drift mentioned above.
Using ML Ops pipelines and monitoring tools, both concept drift (the distribution of predictions) and data drift (the distributions of the data and of feature contributions) can be monitored. When drift in a model is detected, the next step is identifying which features are causing it. Several features may have drifted without causing a meaningful drift in the model, because those features have low importance to the model. Identifying the features that both cause the drift and are important to the model is crucial: these features should receive particular attention when retraining your model.
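The idea of weighing per-feature drift against feature importance can be sketched as follows. The feature names, importance scores and the 0.1 importance cut-off are hypothetical; in practice the importances would come from your model (e.g. permutation importance) and the thresholds from your monitoring policy.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_report(baseline, production, importances, p_threshold=0.01, imp_threshold=0.1):
    """Flag features that have both drifted and matter to the model.

    baseline / production: dicts mapping feature name -> 1-D array of values.
    importances: dict mapping feature name -> importance score (assumed given).
    """
    report = {}
    for name in baseline:
        _, p = ks_2samp(baseline[name], production[name])
        drifted = p < p_threshold
        report[name] = {
            "drifted": drifted,
            "importance": importances[name],
            # Only drift in important features should trigger action
            "needs_attention": drifted and importances[name] >= imp_threshold,
        }
    return report

# Illustrative data: "age" drifts and is important, "clicks" is unimportant
rng = np.random.default_rng(0)
baseline = {"age": rng.normal(40, 10, 2000), "clicks": rng.poisson(3, 2000)}
production = {"age": rng.normal(45, 10, 2000), "clicks": rng.poisson(3, 2000)}
importances = {"age": 0.7, "clicks": 0.05}

report = drift_report(baseline, production, importances)
```

Separating "drifted" from "needs attention" keeps the monitoring signal actionable: an alert fires only when a drifting feature can plausibly hurt model performance.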
When ML model development, deployment and monitoring have been established with the ML Ops principles in mind, retraining your model becomes a piece of cake. Configuring the newly observed data (combined with your baseline data set) as your new training data automatically triggers the training pipeline, which trains, evaluates and validates the new models. Trained models with improved performance are automatically deployed via the continuous deployment pipeline, ensuring the best-performing model is in production!
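The retrain-on-drift, deploy-on-improvement logic described above can be sketched as a small trigger function. The `train` and `evaluate` callables stand in for your actual training and validation pipelines, and the scores are made-up numbers for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RetrainDecision:
    retrained: bool
    deployed: bool
    new_score: Optional[float] = None

def maybe_retrain(drift_detected: bool,
                  current_score: float,
                  train: Callable[[], object],
                  evaluate: Callable[[object], float]) -> RetrainDecision:
    """Retrain when drift is detected; deploy only if the new model is better."""
    if not drift_detected:
        return RetrainDecision(retrained=False, deployed=False)
    model = train()            # stands in for the automated training pipeline
    score = evaluate(model)    # stands in for validation on a held-out set
    return RetrainDecision(retrained=True,
                           deployed=score > current_score,
                           new_score=score)

# Illustrative run: drift was detected and the retrained model scores higher
decision = maybe_retrain(
    drift_detected=True,
    current_score=0.82,
    train=lambda: "new_model",   # placeholder for the real training step
    evaluate=lambda m: 0.88,     # placeholder for the real evaluation step
)
```

Gating deployment on a validated performance improvement is what keeps the continuous deployment pipeline from replacing a good model with a worse one.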
VIQTOR DAVIS is hosting an ML Ops-in-a-day workshop where you will learn everything there is to know about ML Ops. Interested in finally generating the promised business value from your ML initiatives? Get in touch with our experts and join the ML Ops-in-a-day workshop.