May ’20 DVC❤️Heartbeat
Every month we share news, findings, interesting reads, community takeaways, and everything else along the way. Look here for updates about DVC, our journey as a startup, projects by our users and big ideas about best practices in ML and data science.
- Elle O'Brien
- May 14, 2020 • 4 min read
A big hello from DVC mascot DeeVee.
DVC turns 3. On May 4th, we celebrated DVC's third birthday! Fearless leader Dmitry Petrov wrote a retrospective about how the team has grown and what we've learned from our users, contributors, and colleagues. Thanks to everyone who celebrated with us!
Ambassador program launched. DVC has just kicked off our ambassador program with the help of our first ambassador, Marcel Ribeiro-Dantas. Marcel is an early-stage researcher at the Institut Curie, a veteran ambassador of the Fedora Project, and a data science blogger. Becoming an ambassador is a way for folks who are passionate about contributing to the DVC community to get recognized for their efforts. It's also a way for us to help volunteers with financial support for meetups and travel, as well as chances to work more closely with our team. The program is ideal for anyone who already likes blogging about DVC, contributing code, and hosting get-togethers (virtual or otherwise), but especially advanced students and early career data scientists and engineers! Learn more about it here.
DVC is part of 2020 Google Season of Docs. Another way to get involved with DVC is through Google Season of Docs, a program we're participating in for the second year in a row. This program is for technical writers to get paid experience working with the DVC team in fall 2020. Right now, we're accepting proposals from interested writers. Find out more here.
5000 GitHub Stars. It finally happened: we passed 5,000 stars on our GitHub repo!
Coinciding with DVC's 3rd birthday, we shared a pre-release of DVC 1.0. The release is expected in a few weeks, but you can experiment with 1.0 now (and open a ticket in our project repo if you find a bug 🐛). Some major new features include:
- Run cache, a cache of pipelines you've reproduced on your local workspace. If you re-run dvc repro on a pipeline version that's already been executed, the run cache will save you compute time by returning the cached result.
- Multi-stage DVC files. Users reported that their DVC pipelines changed a lot, so we've made pipeline .dvc files more human-readable and editable for fast redesigns.
- Plots. We've got plots powered by Vega-Lite for making beautiful visualizations comparing model performance across commits! Developer Paweł Redzyński is hard at work:
Visual aids come to DVC 1.0, with my little help. pic.twitter.com/Fd1qVr7rHb— Pablito (@Paffciu1) May 12, 2020
You can read more about the big updates coming in DVC 1.0 in our birthday blog.
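To make the run cache idea above a little more concrete, here is a minimal conceptual sketch in Python. It is not DVC's implementation: the `RunCache` class, the `stage_key` helper, and the toy `count_lines` stage are all hypothetical names invented for illustration. The only assumption is that a stage can be identified by its command plus the contents of its dependencies, so a repeated run with identical inputs can return a stored result instead of recomputing.

```python
import hashlib


def stage_key(command, dep_contents):
    """Build a key from the command and the bytes of every dependency.

    Hypothetical helper: real tools may key their caches differently.
    """
    h = hashlib.sha256()
    h.update(command.encode() + b"\0")
    for name in sorted(dep_contents):
        h.update(name.encode() + b"\0")
        h.update(dep_contents[name] + b"\0")
    return h.hexdigest()


class RunCache:
    """Toy in-memory cache of stage results, keyed by stage_key()."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times a stage was actually run

    def reproduce(self, command, dep_contents, run):
        key = stage_key(command, dep_contents)
        if key not in self._results:
            # Cache miss: execute the stage and remember its result.
            self._results[key] = run(dep_contents)
            self.executions += 1
        return self._results[key]


def count_lines(deps):
    """Stand-in for an expensive pipeline stage."""
    return sum(len(data.splitlines()) for data in deps.values())


cache = RunCache()
deps = {"data.csv": b"a,b\n1,2\n3,4\n"}

first = cache.reproduce("count-lines", deps, count_lines)
second = cache.reproduce("count-lines", deps, count_lines)  # served from cache
changed = cache.reproduce("count-lines", {"data.csv": b"a,b\n"}, count_lines)

print(first, second, changed, cache.executions)
```

In this sketch the second call never runs `count_lines`, because the command and inputs are unchanged; only a change to the data triggers a fresh execution.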
Developers weren't the only ones hustling this month…
First ever virtual DVC Meetup. Marcel, our new ambassador, led an initiative to organize a virtual meetup! Marcel shared his latest scientific work about creating a new comprehensive dataset about mobility during the COVID-19 pandemic and then passed off the mic to our two guest speakers. Data scientist Elizabeth Hutton spoke about how she was building a workflow for her NLP team with DVC, and DAGsHub co-founder Dean Pleban shared his custom remote file system setup for modeling Reddit post popularity. It was quite well-attended for our first ever virtual hangout: we logged 40 individual logins to the meetup with more than 30 people staying the whole time! A video of the meetup is on the event page, so you can still check out the talks and discussion we enjoyed.
Some blogs we like. As usual, there's a lot of share-worthy writing in the data science and MLOps space:
- Tania Allard wrote an intensely readable, extremely sharp guide to practical steps anyone can take to improve the reproducibility of their ML projects. She really nails the complexity of the workflow and the importance of decoupling code and data (which we obviously agree with very much 😏). The graphics are also 💯. Tania is a developer advocate to follow.
10 top tips for reproducible Machine Learning
- Vimarsh Karbhari blogged about how teams that work with data can strategize better about versioning their data and analysis pipelines. Rather than giving very practical recommendations, Vimarsh stresses a deliberate and careful approach. He emphasizes how the team's choices should depend on factors like project maturity and how much flexibility is going to be needed. It's a solid overview of how to begin thinking about MLOps at a high level.
ML Ops: Data Science Version Control
- Over at AutoRegresed, Jack Pitts shared a thorough tutorial about using Pipenv, DVC and Git together. Together, the three tools manage dependencies and version the working environment, source code, dataset and trained models. It's not only a cool use case, but a very clear step-by-step explanation that should be easy to try at home. Stay till the end for a neat trick about deploying a model as a web service with Pipenv and DVC.
Pipenv and DVC: Reproducibility in Data Science
Last, here are some of our favorite tweets from this past month:
Data version control from @DVCorg is one of the best new tools I've used in a while. Moving data via the cloud is just a push or pull command away. Recommend for anyone who works on multiple machines or shares data with collaborators— Liam Brannigan (@braaannigan) May 6, 2020
Getting around to learning @DVCorg, and loving it so far. Versioning data with git-style semantics gives you a lot of functionality with surprisingly little cognitive overhead.— Tim Garvin (@tcgarvin) May 8, 2020
Thank you, thank you very much.