December '22 Heartbeat
This month you will find:
🦮 MLOps Guide,
🧪 DVC extension for VS Code Experimentation,
🌌 A Fable about MLOps,
❣️ Unstructured Data Query Language coming,
📈 DVC Live Experiment Tracking,
🚀 GTO GitOps model registry tutorial,
👀 New CML Commands, and more!
- Jeny De Figueiredo
- December 16, 2022 • 6 min read
Unlike most of the text you've read over the past two weeks, this Heartbeat was 100% human generated. 😉
Welcome to December! Wow, what a year! We introduced an online course, added five new tools (TPI, GTO, MLEM, DVC Extension for VS Code, and a Model Registry in Iterative Studio) plus tons of new features to DVC, CML, and Iterative Studio. We were also thrilled to emerge from the pandemic and meet so many of you in person at conferences around the world. We are excited about what's in store for 2023, and we thank you all for being such fantastic community members. While there are still challenging events happening around the globe, there is much to be thankful for and victories to celebrate! Bring on 2023!
For their engineering final project at Insper, Arthur Olga, Gabriel Monteiro, Guilherme Leite, and Vinicius Lima created the MLOps Guide, which walks through a complete MLOps development cycle using DVC, CML, and IBM Watson. The multi-page guide covers the principles of MLOps as well as a full tutorial for building an MLOps environment. It covers data and model versioning, feature management and storage, automation of pipelines and processes, CI/CD for machine learning, and continuous monitoring of models. The guide includes videos outlining the project and much of the coding, as well as a project repository that you can work through.
MLOps Guide (Source link)
Eryk Lewinson wrote a fabulous, in-depth tutorial on experiment tracking using our new DVC Extension for VS Code. He starts off with, “One of the biggest threats to productivity in recent times is context switching.” As a Community Manager, I can so relate! 😅 He posits that the extension is a great way to both code our experiments and evaluate and compare them happily in our IDE, without having to jump back and forth between platforms.
Eryk uses a credit card risk dataset and project to show most of the capabilities of the DVC Extension for VS Code, taking us step by step through the entire workflow and the resulting project structure. He notes that the best points of the extension are its experiment bookkeeping, with an emphasis on reproducibility, and its extended plotting capabilities, including live plotting to visualize model performance while the model is still being trained. He goes over some tricks and functionality of the extension as well.
A Fable About MLOps… and Broken Dreams (Source link)
Alex Burlacu tells a great story and shares many tips from his experience in MLOps in this piece on his blog called A Fable About MLOps… and Broken Dreams. The tale is likely all too familiar to many of you in our Community, and it is both validating and entertaining to read. He offers some great prerequisites for beginning your MLOps journey, including quickly finding and accessing your data, seeding your model training code, and recording your experiment configuration. For the last of these he recommends MLflow, but as the previous summary from Eryk points out, this can be done very effectively with the new DVC extension AND be truly fully reproducible. 🤗
Generally, he recommends starting early and starting small with MLOps. More technically, he recommends a simple data collection and discovery system, data versioning with DVC, replicable experiments, experiment tracking, ML serving, testing, and CI/CD. It's all great advice and fun to read!
ML Pipeline Decoupled - I managed to write a framework-agnostic ml pipeline with DVC, Rust, and Python
Sheikh Samsuzzhan Alam, aka Mr. Data Psycho, writes this great piece that reminds us that DVC is language agnostic! While Python is the most popular language used in Data Science and with DVC, there are some instances where you may want to use a language such as Rust to improve memory efficiency and speed up parts of your project. The good news is you can! Mr. Data Psycho extols the virtues of DVC’s pipelining feature and shows how to use Rust (Polars) as a pre-processing framework, scikit-learn for model training, and Python for the rest. Using the YAML files, each stage can be put together with dependencies written in whatever language your heart desires! You can find the repo for the project here. R users may be interested in this related content here, here, and here.
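To make the "stages connected by dependencies" idea concrete, here is a minimal sketch of how a mixed Rust and Python pipeline can be ordered purely from what each stage reads and writes. It is not DVC's implementation and not the code from the article: the stage names, commands, and file paths are hypothetical, and the `cmd`/`deps`/`outs` keys simply mirror the fields a stage declares in the pipeline YAML.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: names, commands, and paths are illustrative only.
STAGES = {
    "evaluate": {
        "cmd": "python evaluate.py",
        "deps": ["model.pkl", "data/clean.parquet"],
        "outs": ["metrics.json"],
    },
    "train": {
        "cmd": "python train.py",
        "deps": ["data/clean.parquet"],
        "outs": ["model.pkl"],
    },
    "preprocess": {
        "cmd": "cargo run --release --bin preprocess",  # Rust stage
        "deps": ["data/raw.csv"],
        "outs": ["data/clean.parquet"],
    },
}


def execution_order(stages):
    """Order stages so that every file is produced before it is consumed."""
    # Map each output file to the stage that writes it.
    producer = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    # A stage depends on whichever stages produce its input files.
    # Inputs nobody produces (like the raw data) are external and ignored.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


def describe(stages):
    """Return one line per stage, in the order the stages would run."""
    return [f"{name}: {stages[name]['cmd']}" for name in execution_order(stages)]
```

Calling `describe(STAGES)` lists the Rust pre-processing step first, then training, then evaluation, even though the dictionary declares them in a different order. The ordering only looks at files, so the language behind each command never matters.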
If you’d like an online CheatSheet for DVC you can find one here created by Igor Chubin. Pick a command from the drop-down menu and bam 💥, you’ve got the info you need! It’s very cool, but do always remember to check our docs here, here, and here; we are always updating them!
DVC Cheat Sheet (Source link)
Aleksandr Dudko, Anatoly Bolshakov, Denis Nosov, and Vladimir Krestov, of Akvelon, wrote this great tutorial on using MLEM to make the process of integrating, packaging, and deploying machine learning models much easier. In the tutorial, they show how to do this with Akvelon’s .NET and Java clients for use in existing or new Web (ASP.Net, Java Spring), Mobile (Xamarin, Android), and Desktop (WPF, WinForms, Java Spring) applications. Explore the project directory here.
Akvelon enables non-Python apps to integrate machine learning models with MLEM (Source link)
We’ve been listening to the greater Community and know you’d like to see easier experiment tracking from DVC, and we’re on it! The latest release of DVCLive helps bring that goal to fruition. Now you can track your experiments with only a couple of lines of code directly from your notebook or your .py file. You can start with just a repo with Git and DVC initialized and keep using your existing tools, with no need for a hosted solution, a server, or a database. Keep track of all the metadata related to the experiment in your Git provider of choice (GitHub/GitLab) and your cloud storage, and share with your team when you are ready. In addition, you can use Iterative Studio to share the results of your experiments with teammates.
Ariel Biller's Experiment Tracking meme (Source link)
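For a feel of what "a couple of lines of code" can look like, here is a minimal sketch of logging from a training loop with DVCLive. It assumes a DVCLive version that exposes `Live` as a context manager with `log_param`, `log_metric`, and `next_step` (method names have changed between releases, so check the docs for the version you have installed). The loop, the metric name, and the loss values are placeholders, not a real model.

```python
def train(live, epochs=3):
    """Toy training loop.

    `live` is any object offering log_param, log_metric, and next_step,
    which is the subset of the DVCLive `Live` interface assumed here.
    """
    live.log_param("epochs", epochs)
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        live.log_metric("train/loss", loss)
        live.next_step()  # marks the end of this step


def main():
    # Imported here so the sketch can be read without DVCLive installed.
    from dvclive import Live

    # Run inside a repo where Git and DVC are already initialized.
    with Live() as live:
        train(live)
```

Calling `main()` in your project is the only DVCLive-specific part. The tracking calls inside `train` are the "couple of lines" added to an otherwise ordinary loop.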
Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a new solution for finding and managing your datasets of unstructured data like images, audio files, and PDFs! Extend your DVC environment with the first unstructured data query language (think SQL -> DQL) for machine learning. We are looking for beta customers for this new tool.
Schedule a meeting with us if that's what you're needing!
A model registry is a tool to catalog ML models and their versions. Models from your data science projects can be discovered, tested, shared, deployed, and audited from there. Learn how to build a model registry in a DVC Git repo without involving any extra services, integrations, and APIs in this new post from Alex Guschin!
On January 11th, Francesco Calcavecchia will be joining us to talk about his recent contribution to MLEM through his work on GTO, and how this helps him create a Git-based model registry in his work at E.On Energie Deutschland.
Francesco Calcavecchia on Designing a model Registry with Legacy Systems using DVC and GTO
Our global, all-remote team works hard, but we also have fun! We have a weekly All-Hands meeting where our teams report progress via pre-recorded video so that everyone can be prepared to discuss the topic during the meeting.
As we all level up our video production skills, the videos have started to get more fun! Jesper Svendsen inserted this FlappyDeeVee video in the middle of our Iterative Studio update! Try the game here! Confession: I can’t get past the first pipe! 😆
Stay tuned to our Newsletter for more content from the Community and what we will be up to conference-wise in 2023!
The CML team recently updated their commands to make them more intuitive. If you are used to the old ones, do not fret: when you use an old command, the CLI will remind you what the new one is. In the meantime, you can get up to date on the changes here.
Our post Notebooks to DVC Pipeline for Reproducible Experiments, by Rob de Wit, was featured in Deep Learning Weekly.
🤖 Issue #276 is now live! This week in deep learning: AI with the right dose of curiosity, notebooks to DVC pipelines for reproducible experiments, generating human-level text with contrastive search, an open-source data exploration tool, and more. https://t.co/JXUkrOEYzC - Deep Learning Weekly (@dl_weekly) November 16, 2022
Do you have any use case questions or need support? Join us in Discord!
Head to the DVC Forum to discuss your ideas and best practices.