September ’20 Heartbeat

This month, catch us on the Software Engineering Daily Podcast, check out our favorite DVC and CML tutorials and projects, and celebrate 1,000 YouTube subscribers!

Elle O'Brien
September 09, 2020 • 3 min read

News

Dmitry on Software Engineering Daily

Our CEO Dmitry Petrov was interviewed on the much-beloved Software Engineering Daily podcast! Host Jeff Meyerson kicked off the discussion:

Code is version controlled through Git, the version control system originally built to manage the Linux codebase. For decades, software has been developed using git for version control. More recently, data engineering has become an unavoidable facet of software development. It is reasonable to ask–why are we not version controlling our data?

For the rest of the episode, listen here!

Data Version Control with Dmitry Petrov

softwareengineeringdaily.com

Contributors' meetup

Last week, we held a meetup for contributors to DVC! Core maintainer Ruslan Kuprieiev hosted a get-together for folks who contribute new features, bug fixes, and more to the community. If you missed it, you can watch it on YouTube.

New videos

We've released several new videos to our growing YouTube channel, and in cool news, we passed 1,000 subscribers! The support has been surprising in the best way possible. We're seeing a lot of repeat commenters and folks from the DVC meetups! It's been so rewarding to get positive feedback from the community, and we're planning to build our YouTube presence even more.

Even Skeletor finds joy in this.

We now have 4 tutorials in our MLOps series. In the latest, we cover how to use your own GPU (on-premise or in the cloud) to run GitHub Actions workflows. Check it out and give it a try; the code examples are freely available :)

We also made our first ever "explainer" video to talk through how DVC works in five minutes.

As always, video requests are welcome! Reach out and let us know what topics and tutorials you want to see covered. And we appreciate any likes, shares, and subscribes on our growing YouTube channel.

From the community

A three-part CML series (featuring R!)

DVC ambassador Marcel Ribeiro-Dantas has published two of three tutorial blogs in a series on CML! Marcel's use case is especially cool because he's using R, plus some causal modeling related to his work in bioinformatics, with GitHub Actions.

In Part I, Marcel introduces his project and how he uses DVC, CML and GitHub Actions together (with his custom R library).

Continuous Machine Learning - Part I

by Marcel Ribeiro-Dantas

mribeirodantas.xyz

In Part II, Marcel takes a deeper dive into Docker. He explains how to create your own Docker image and test it. This case should be helpful for folks who want to include the CML library in their own Docker container.

Continuous Machine Learning - Part II

by Marcel Ribeiro-Dantas

mribeirodantas.xyz

Real Python talks DVC

Kristijan Ivancic of Real Python, a library of online Python tutorials and lessons, created a seriously impressive DVC tutorial (this thing is a beast 🐺: it has a table of contents!).

How cool is this artwork?

And, the Real Python podcast discussed their DVC tutorial (plus the joys of version control for data!) on a recent episode.

Episode 25: Data Version Control in Python and Real Python Video Transcripts

The Real Python Podcast

realpython.com

Recommended reading

There's a lot of cool stuff happening out there in the data science world 🌏!

Fabiana Clemente, Chief Data Officer of YData, published a blog for The Startup about four reasons to start using data version control. With her expertise in data privacy, she's especially well-qualified to explain the role of DVC in compliance and auditing! Check out her blog (it comes with a quick-start tutorial, too).

4 reasons why data scientists should version data

How to start data versioning using DVC

medium.com

Ryzal Kamis at the AI Singapore Makerspace shared a blog (the first of two!) about creating end-to-end CI/CD workflows for machine learning. In his first blog, Ryzal gives a high-level overview of the need for data version control and compares several tools in the space. Then he gives a walkthrough (quite easy to follow!) of how DVC fits in his workflow. We're eagerly awaiting the second installment of this series, which promises to bring more advanced automation scenarios and a CI/CD pipeline.

Data Versioning for CD4ML

Part 1

makerspace.aisingapore.org

Isaac Sacolick, contributing editor at InfoWorld, penned an article about the growing field of MLOps and its role in data-driven businesses. He writes:

Too many data and technology implementations start with poor or no problem statements and with inadequate time, tools, and subject matter expertise to ensure adequate data quality. Organizations must first start with asking smart questions about big data, investing in dataops, and then using agile methodologies in data science to iterate toward solutions.

Read the rest here: