January ’20 DVC❤️Heartbeat

Every month we share news, findings, interesting reads, community takeaways, and everything else along the way. Some of those are related to our brainchild DVC and its journey. The others are a collection of exciting stories and ideas centered around ML best practices and workflow.

Elle O'Brien
January 17, 2020 • 3 min read

We spread the joys of version control and donuts at PyData LA.

Welcome to the New Year! Time for a recap of the last few weeks of activity in the DVC community.

News

We were honored to be named a Project of the Year by Open Data Science, Russia's largest community of data scientists and machine learning practitioners. Check out our ⭐️incredibly shiny trophy⭐️!

DVC is the "project of the year" according to @odsai_en!
😱🏆🎉
Open Data Science the largest DS community we know, with over 40K active members, great courses and it's own conf Data Fest.
Many thanks to the organizers and voters!
This is the best surprize gift for the team!!🥳 pic.twitter.com/LZgewjM582
— 🦉DVC (@DVCorg) December 24, 2019

DVC hit 100 individual contributors on GitHub! To celebrate our 100th contributor, Vera Sativa, we sent her $500 to use on any educational opportunity and her own DeeVee (that's our rainbow owl). We also awarded educational mini-grants to two of DVC's biggest contributors, Vít Novotný and David Příhoda.

Vera (center, flashing a peace sign) thanked us with this lovely picture of DeeVee and her team, Odd Industries. They are making some extremely neat tools for construction teams using computer vision.

We were at PyData LA! Our fearless leader Dmitry gave a talk and we set up a busy booth to meet with the Pythonistas of Los Angeles. It was a cold and blustery day, but visitors kept showing up to our semi-outdoor booth. We're sure they came for the open source version control and not the donuts.

The DVC team and PyData volunteers who heroically staffed our booth in the rain.

Our engineer and technical writer Jorge reported:

We were super happy to meet all kinds of data professionals and enthusiasts in several fields who are learning and adopting DVC with their teams – including several working with privacy-sensitive medical records, very cool!

From the community

Here are some rumblings from the machine learning (ML) and data science community that got us talking.

A machine learning software wishlist. Computer scientist and writer Chip Huyen tweeted about her ML software wishlist and kicked off a big community discussion.

I've been thinking about the software stack for machine learning. Tools I'd love to see.

1. Pip for pretrained models.
2. Version control for datasets.
3. GPU-friendly CI. Travis CI, Circe CI don't support GPUs. Jenkins is a pain.
4. Fast dataframes. Why is Pandas so slow?
— Chip Huyen (@chipro) December 6, 2019

Her tweet resonated with a lot of practitioners, who were eager to discuss the solutions they'd tried. Among the many thoughtful replies and recommendations, we were thrilled to see DVC mentioned.

We're using @DVCorg for 2) and it works great. 🙂
— Kristijan (6/100 videos) (@kristijan_moves) December 6, 2019

If you haven't already, definitely check out Chip's thread, and follow her on Twitter for more excellent, accessible content about ML engineering. We're thinking hard about these ideas and hope the discussion continues on- and offline.

A gentle intro to DVC for data scientists. Scientist Elle O'Brien published a code walkthrough about using DVC to make an image classification project more reproducible. Specifically, the blog is a case study about version control when a dataset grows over time. If you're looking for a DVC tutorial geared toward data scientists, this might be up your alley.

Start Version Controlling your Machine Learning Datasets

Make your machine learning and data science projects reproducible with open source tools.

medium.com

Ideas for data scientists to level up their code. Machine learning engineer Andrew Greatorex posted a blog called “Down with technical debt! Clean Python for data scientists.” Andrew highlights something we can easily relate to: the “science” part of data science, which encourages experimentation and flexibility, sometimes means less emphasis on readable, shareable code. Andrew writes:

“I’m hoping to shed light on some of the ways that more fledgling data scientists can write cleaner Python code and better structure small scale projects, with the important side effect of reducing the amount of technical debt you inadvertently burden on yourself and your team.”

In this blog, DVC gets a shout-out as Andrew’s preferred data versioning tool, used in conjunction with Git for versioning Python code. Thanks!

Down with technical debt! Clean Python for data scientists.

medium.com

An introduction to MLOps. Engineer Sharif Elfouly wrote an approachable guide to thinking about MLOps, the growing field around making ML projects run efficiently from experimentation to production. He summarizes why managing ML projects can be fundamentally different from traditional software development:

“The main difference between traditional software and ML is that you don’t only have the code. You also have data, models, and experiments. Writing traditional software is relatively straightforward but in ML you need to try out a lot of different things to find the best and fastest model for your use-case. You have a lot of different model types to choose from and every single one of them has its specific hyperparameters. Even if you work alone this can get out of hand pretty quickly.”

Sharif gives some recommendations for tools that work especially well for ML, and he writes that DVC is the “perfect combination for versioning your code and data.” Thanks, Sharif! We think you’re perfect, too.