August ’20 Heartbeat
Catch our monthly updates, featuring the CML release, a DVC meetup recap, a new video tutorial series, and the best reading about pipelines and DataOps.
- Elle O'Brien
- August 10, 2020 • 5 min read
DeeVee avoids the summer sun at Mount Rainier National Park.
Welcome to our August roundup of cool news, new releases, and recommended reading in the MLOps world!
At the beginning of July, we went live with a new project: Continuous Machine Learning, or CML for short. If you haven't heard, CML is an open-source toolkit for adapting popular continuous integration systems like GitHub Actions and GitLab CI for machine learning and data science. This release marks a new stage for our organization: while CML can work with DVC, and both are built around Git, CML is designed for standalone use. That means we're supporting TWO projects now!
Luckily, we received plenty of encouraging and helpful feedback following the CML release. CML was on the front page of Hacker News for most of release day! We also got covered on Heise, a popular German IT news source. I (Elle, a proud part of the CML team!) also gave a talk presenting our approach as part of the MLOps World meeting, which is now available for online viewing.
Of course, we're fielding lots of questions too! We've compiled some of the most common questions (and their answers!) in our last Community Gems post, and CML developer David G. Ortega has written a tutorial for a much-asked-for use case: doing continuous integration with on-demand GPUs.
If you have comments, questions, or feature requests about CML, we really want to hear from you. A few ways to be in touch:
Last week, we had another meetup! DVC Ambassador Marcel kicked us off with a short talk about how he's using DVC as part of his causal modeling approach to bioinformatics. It's cool stuff. Then, I talked a bit about CML and did some live-coding. The beauty of live-coding is getting to answer questions in real time, and if you're totally new to the idea of continuous integration (or want to understand how CML works with GitHub Actions/GitLab CI), seeing a project in action is one of the best ways to learn.
You can watch a recording of the meetup online now (it's lightly edited to remove some pesky Zoom trolls), and join our Meetup group to get updates for the next one. In future meetups, we'd love to support community members sharing their work, so get in touch if you'd like to present.
We're starting up some new YouTube features! If you haven't seen our channel, check it out and consider subscribing for hands-on tutorials and demos. Our first video introduced continuous integration and GitHub Actions, and the second showed how to use DVC and free Google Drive storage to add external data storage to a GitHub project.
In the coming weeks, we'll be covering:
- Using CML and GitHub Actions with hardware for deep learning, like on-premise GPUs
- Understanding Vega plots and making data viz part of your CI system
- Some DVC basics to supplement our docs
We're huge fans of a recent Python Bytes episode featuring Ines Montani, founder of Explosion and one of the makers of the incredible SpaCy library for NLP (seriously, I have the highest recommendations for SpaCy).
My @PythonBytes episode is out now!— Ines Montani 〰️ (@_inesmontani) July 23, 2020
🎙️ Listen here: https://t.co/fHLF2hR4cM
My picks of the week are:
🐙 TextAttack by @jxmorris12: https://t.co/jySYrtzzp8
🦉 Data Version Control (DVC) @DVCorg: https://t.co/3610F6kv8v
🐍 Built-in generic types in 3.9
Ines' episode discussed DVC, and DVC is going to be integrated with SpaCy in their 3.0 release. SpaCy + DVC is going to be a powerhouse and we can't wait.
Another cool software project: Casper da Costa-Luis, DVC contributor and creator of the popular tqdm library, has published a tab-completion script generator for Python applications! shtab, as it's called, was originally designed for DVC, but Casper developed it into a generic tool that can be used for virtually any Python CLI application. Check out shtab on GitHub and read the release:
(Tab) Complete Any Python Application in 1 Minute or Less
Our friends at DAGsHub have released a script to help DVC users upgrade their pipelines to the new DVC 1.0 format! Says Simon, a DAGsHub engineer, in his tutorial:
In this post, I'll walk you through the process of migrating your existing project from DVC ≤ 0.94 to DVC 1.X using a single automated script, and then demonstrate a way to check that your migration was successful.
Read the blog and get migrating (but don't worry if you can't; DVC 1.0 is backwards compatible).
Automatically migrate your project from DVC≤ 0.94 to DVC 1.x
Here are some of our favorite blogs from around the internet 🌏.
- Déborah Mesquita, data scientist (and an excellent writer to follow), published a tutorial about DVC pipelines that is truly deserving of the moniker "ultimate guide". It's a start-to-finish case study about a typical machine learning project, with DVC pipelines to automate everything from grabbing the data to training and evaluating a model. Also, it comes with a video tutorial if you prefer to watch instead of read!
The ultimate guide to building maintainable Machine Learning pipelines using DVC
- Software engineer Vaithy Narayanan created the first ever ☝️ CML user blog! Vaithy created a pipeline that covers data collection to model training and testing, and used CML to automate the pipeline execution whenever the project's GitHub repository is updated. He ends with some insightful discussion about the strengths and weaknesses of the approach.
Using Continuous Machine Learning to Run Your ML Pipeline
- Ryan Gross, a VP at Pariveda Solutions, blogged about the future of data governance and the lessons from DevOps that might save the day. Honestly, you should probably start reading for this cover image alone.
The Rise of DataOps (from the ashes of Data Governance)
Thanks everyone, that's it for this month. We hope you're staying safe and making cool things!