Tutorial: Automate Data Validation and Model Monitoring Pipelines with DVC and Evidently
Imagine you're in charge of weekly batch scoring jobs in a retail setting, where accurately predicting customer behavior is crucial. The challenge is keeping your machine learning models accurate over time and verifying that your data still reflects real-world conditions. This tutorial shows you how to use DVC and Evidently to automate data validation and model monitoring pipelines. Written for Data Scientists, ML Engineers, MLOps professionals, and Team Leads, it offers a streamlined approach to maintaining model performance as business conditions change.
- Mikhail Rozhkov
- January 19, 2024 • 10 min read
Evidently + DVC integration example
Feel free to clone the repository provided. It's more than a learning tool; it's a flexible reference architecture that you can adapt to fit your unique use cases.
In the realm of Machine Learning Operations (MLOps), ensuring the robustness and reliability of models is paramount. Using the right tools can significantly enhance your MLOps practices.
DVC is an open-source tool that brings agility and reproducibility to data science projects by treating data and model training pipelines as software. It connects versioned data sources and code with pipelines, tracks experiments, and registers models, all based on GitOps principles.
Evidently is an open-source Python library for evaluating, testing, and monitoring ML models. It has 100+ built-in metrics and tests for data quality, data drift, and model performance, and it helps you visualize them interactively.
When used together, DVC and Evidently offer a comprehensive solution for training, predicting with, and monitoring ML models.
💡 Want to learn more about DVC and Evidently?
- Iterative Tools for Data Scientists & Analysts course with DVC
- Open-source ML observability course with Evidently
This tutorial teaches you how to build DVC pipelines for training and monitoring jobs, parse Evidently reports, and version reference datasets.
By the end of this tutorial, you will know how to implement an ML monitoring architecture using:
- Evidently to perform data quality, data drift, and model quality checks.
- DVC to run monitoring jobs and version monitoring artifacts.
- DVCLive to save monitoring metrics from Python scripts and visualize them in VS Code.
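To make the "parse Evidently reports" step more concrete, here is a minimal sketch in plain Python. It assumes the report has already been exported to a dictionary with a top-level `metrics` list whose items carry `metric` and `result` keys. That layout, the sample values, and the helper name `extract_numeric_metrics` are illustrative assumptions; the actual structure depends on the Evidently version you install.

```python
from numbers import Number


def extract_numeric_metrics(report: dict) -> dict:
    """Flatten numeric results from a report-like dictionary.

    Assumed layout (illustrative, not guaranteed by any Evidently version):
    {"metrics": [{"metric": "<name>", "result": {"<key>": <value>, ...}}]}
    """
    flat = {}
    for item in report.get("metrics", []):
        name = item.get("metric", "unknown")
        for key, value in item.get("result", {}).items():
            # Keep only plain numbers; skip booleans, nested dicts, and strings.
            if isinstance(value, Number) and not isinstance(value, bool):
                flat[f"{name}/{key}"] = float(value)
    return flat


# Hypothetical report content, made up for illustration only.
sample_report = {
    "metrics": [
        {
            "metric": "DatasetDriftMetric",
            "result": {
                "number_of_columns": 9,
                "number_of_drifted_columns": 2,
                "share_of_drifted_columns": 2 / 9,
                "dataset_drift": False,
            },
        }
    ]
}

metrics = extract_numeric_metrics(sample_report)
for metric_name, metric_value in sorted(metrics.items()):
    print(f"{metric_name}: {metric_value:.3f}")
```

A flat name-to-value mapping like this is the kind of output a metrics logger such as DVCLive could then record, one entry per metric.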
Using a Python virtual environment, you can run the example on a local machine.
Dataset. You will be diving into a Kaggle dataset focused on Bike Sharing Demand. The goal is to predict hourly bike rental volumes.
ML Application. You will use historical usage and weather data to predict bike rental demand, which is essential for operational efficiency and customer service.
- Applicable in sectors like retail, transportation, and energy for demand prediction.
- Ensures models stay relevant and effective despite changing data patterns.
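As a rough illustration of what a model quality check on changing data can look like, the sketch below compares the mean absolute error of a demand model on a reference period against a newer batch. The rental counts, the 1.5 tolerance, and the function names are invented for illustration. Evidently provides its own regression metrics, which this hand-rolled check does not reproduce.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def quality_degraded(reference_mae, current_mae, tolerance=1.5):
    """Flag the model when current error exceeds tolerance x reference error."""
    return current_mae > tolerance * reference_mae


# Made-up hourly rental counts (actual vs. predicted), for illustration only.
reference_actual = [120, 95, 150, 80]
reference_predicted = [110, 100, 140, 85]
current_actual = [200, 60, 180, 40]
current_predicted = [130, 95, 145, 80]

reference_mae = mean_absolute_error(reference_actual, reference_predicted)
current_mae = mean_absolute_error(current_actual, current_predicted)

print(f"reference MAE: {reference_mae:.1f}")
print(f"current MAE:   {current_mae:.1f}")
print("degraded:", quality_degraded(reference_mae, current_mae))
```

A fixed multiplicative tolerance is only one possible rule; a real monitoring job would pick a threshold that matches the business cost of prediction errors.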
We expect that you:
- Have learned the basics of DVC by following the Get Started with DVC guide.
- Have gone through the Evidently Get Started Tutorial and can generate visual and JSON Reports with Metrics.
To follow this tutorial, you'll need the following tools installed on your local machine:
- Python version 3.11 or above
- VS Code and DVC Extension for VS Code
💡 Note: we tested this example on macOS/Linux.
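If you want to confirm the interpreter requirement before creating the virtual environment, a small check like the following can help. It is a convenience sketch rather than part of the example repository, and the names `MINIMUM_PYTHON` and `meets_minimum` are hypothetical.

```python
import sys

MINIMUM_PYTHON = (3, 11)  # taken from the requirement listed above


def meets_minimum(version, minimum=MINIMUM_PYTHON):
    """Return True when the (major, minor, ...) version tuple is new enough."""
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = sys.version_info
    status = "OK" if meets_minimum(current) else "too old for this tutorial"
    print(f"Python {current.major}.{current.minor}: {status}")
```

Run it with the interpreter you plan to use for the virtual environment; the comparison only looks at the major and minor version numbers.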