October ’20 Heartbeat
This month, hear about our international talks, new video docs on our YouTube channel, and the best tutorials from our community.
- Elle O'Brien
- October 12, 2020 • 3 min read
Double DeeVee! One of these birds is on a layover before heading to Germany.
DVC developer Paweł Redzyński (he's written a lot of the code behind
dvc plots) is giving a talk at the Data Science Summit
in Poland! The virtual meeting is on October 16, but talks are available for
streaming on demand up to a week before. Paweł's talk is part of the DataOps &
Development track, where he'll be sharing about CML and GitHub Actions (note
that it'll be delivered in English).
CEO Dmitry Petrov dropped into the Data Engineering Melbourne meetup to talk about Data Versioning and DataOps! He spoke about the differences between end-to-end platforms and ecosystems of tools, and how this distinction informs the development of software like DVC and CML (hint: we picked tools over platforms).
Keep an eye on this meetup, which is now accessible to folks on all continents thanks to the magic of the internet :)
Data Engineering Melbourne
DevOps for science: using continuous integration for rigorous and reproducible analysis
PyData Global has a fantastic lineup of talks spanning science and engineering, so please consider joining!
DVC Ambassador Mikhail Rozhkov co-hosted the Machine Learning REPA (Reproducibility, Experiments and Pipelines Automation) track of DataFest 2020, and DVC showed up in full force! There were talks from Dmitry, ambassador Marcel Ribeiro-Dantas, and myself about all aspects of MLOps and automation.
DataFest is over (until next year, anyway), but visit the ML-REPA community for ongoing content and opportunities for networking.
Since the summer, we've been building our YouTube channel. It's going great: we've gotten more than 18,000 views in the last few months and 1,500 subscribers!
Our latest video in the MLOps Tutorials series introduced using GitHub Actions for model testing. Instead of training a model in continuous integration, the idea is to train locally and "check in" your favorite model for testing in a standardized environment. This approach lets you completely control the environment, infrastructure, and code used to evaluate your model, and save the run in a place that's easy to share (GitHub!).
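To make the "check in a model, then test it" idea concrete, here is a minimal sketch of what such a test could look like. It is not the code from the video: the JSON model format, the function names, and the 0.9 accuracy threshold are all assumptions made up for illustration, and the "model" is a tiny linear classifier so the example runs with only the Python standard library.

```python
import json
from pathlib import Path


def load_model(path):
    """Load a checked-in model. Assumption: a JSON file with linear weights and a bias."""
    params = json.loads(Path(path).read_text())
    return params["weights"], params["bias"]


def predict(model, row):
    """Return class 1 if the linear score is positive, else class 0."""
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if score > 0 else 0


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(predict(model, row) == label for row, label in zip(rows, labels))
    return correct / len(labels)


def check_model(model_path, rows, labels, min_accuracy=0.9):
    """Raise AssertionError if the checked-in model scores below min_accuracy."""
    score = accuracy(load_model(model_path), rows, labels)
    assert score >= min_accuracy, f"accuracy {score:.2f} is below {min_accuracy:.2f}"
    return score
```

In a setup like the one described above, a continuous integration job would call something like `check_model` on the committed model file and a fixed held-out test set, so a failing assertion fails the run and the result is recorded alongside the commit.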
We'll be going deeper into the art and craft of testing ML models in the next few weeks, so stay tuned. Another big initiative is adding videos to our docs: since video seems like a popular format for a lot of learners, we're working to supplement our official docs with embedded videos. Check out our first installment, on Getting Started with Data Versioning.
Our community makes some amazing tutorials. Here are a few on our radar:
Data scientist and full-stack developer Ashutosh Hathidara shared an end-to-end machine learning project made with DVC and CML… and released it in video form! It's a neat setup and a nice model for folks to study.
Another detailed and easy-to-follow tutorial, with a similarly impressive scope, appeared on Heise Online. This project puts together DVC, Cortex, and ONNX to develop and deploy a model trained on the Fashion MNIST dataset (note: the article is in German, and I read it with Chrome's English translation).
Managing and commissioning ML models
You'll also want to check out anno.ai's tutorial about managing large datasets with DVC and S3 storage. It's detailed, but also a quick-start guide informed by the team's practical experience.
MLOps and Data: Managing Large ML Datasets with DVC and S3 (Part 1)
Data scientist and mathematician Khuyen Tran blogged about why and how to start using DVC, and her tutorial includes Google Drive remote storage, a feature we're especially excited about. Check it out and follow along with her code examples!
Introduction to DVC: Data Version Control Tool for Machine Learning Projects
And to end on a thoughtful note… have you seen this thread by ML Engineer Shreya Shankar? She beautifully summarizes many of the ideas and technical challenges our community thinks about every day. Read and reflect!
In good software practices, you version code. Use Git. Track changes. Code in master is ground truth.
In ML, code alone isn't ground truth. I can run the same SQL query today and tomorrow and get different results. How do you replicate this good software practice for ML? (1/7)
- Shreya Shankar (@sh_reya), October 8, 2020