March '22 Community Gems

A roundup of technical Q&As from the DVC and CML community. This month: CML updates, working with multiple datasets, using DVC stages, and more.

Milecia McGregor
March 30, 2022 • 2 min read

What is the difference between using `dvc exp run` and `dvc repro`?

This is a really good question from @v2.03.99!

When you use `dvc exp run`, DVC automatically tracks each experiment run. With `dvc repro`, tracking is left to you: you decide when and how to record each run, for example by committing the results to Git.

You can learn how `dvc exp run` uses custom Git refs to track experiments in this blog post, and you can see a quick technical overview in the docs here.
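As a rough sketch of the two workflows, the shell helpers below wrap each one. The function names `tracked_run` and `manual_run` are made up for illustration, and the sketch assumes a project that already has a `dvc.yaml` pipeline inside a Git repository.

```shell
# Hypothetical helpers; the function names are illustrative, not part of DVC.

# Workflow 1: DVC records the run as an experiment for you.
# Any arguments are passed through to `dvc exp run`,
# e.g. -S train.lr=0.01 (train.lr is an assumed parameter name).
tracked_run() {
    dvc exp run "$@" &&
    dvc exp show
}

# Workflow 2: reproduce the pipeline, then record the result yourself in Git.
# $1 is the commit message.
manual_run() {
    dvc repro &&
    git add dvc.lock &&
    git commit -a -m "$1"
}
```

For example, `tracked_run -S train.lr=0.01` would run one experiment with an overridden parameter and then show the experiments table, while `manual_run "try lr=0.01"` leaves a regular Git commit as the only record of the run.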

What is a good way to debug DVC stages in VSCode?

A great question here from @quarkquark!

You can debug in VSCode by following the steps below:

1. Install the `debugpy` package.
2. In VSCode, navigate to "Run and Debug" > "Remote Attach" > localhost > someport.
3. In a terminal in VSCode, run `python -m debugpy --listen someport --wait-for-client -m dvc mycommand`.

This should help you debug the stages in your pipeline in the IDE and you can find more details here.
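To avoid retyping the long command from the last step, it could be wrapped in a small shell function. This is only a sketch: `dvc_debug` is a made-up name, and whatever port you pass has to match the one in your Remote Attach configuration.

```shell
# Hypothetical helper: start a DVC command under debugpy and wait for the
# debugger client (VSCode) to attach before running it.
# $1 is the port to listen on; the remaining arguments are the DVC command.
dvc_debug() {
    port="$1"
    shift
    python -m debugpy --listen "$port" --wait-for-client -m dvc "$@"
}
```

With this in place, `dvc_debug 5678 repro` (5678 is an arbitrary port choice) would wait for VSCode to attach and then run `dvc repro` under the debugger.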

Is there a way to list what files (and ideally additional info like location, MD5, etc) are within a directory tracked by DVC?

Thanks for asking @CarsonM!

You can use `dvc list` to list the contents of a directory in a DVC repository, including files tracked by DVC, without cloning the repo or pulling the data. Here's an example of the command you can run:

$ dvc list https://github.com/iterative/dataset-registry/ fashion-mnist/raw
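If you only care about the DVC-tracked files and want to descend into subdirectories, `dvc list` also accepts `-R` (recursive) and `--dvc-only`. A small sketch, with `list_tracked` as a made-up wrapper name:

```shell
# Hypothetical wrapper: recursively list only the DVC-tracked files under a path.
# $1 is the Git URL (or local path) of the DVC repository, $2 the path inside it.
list_tracked() {
    dvc list -R --dvc-only "$1" "$2"
}
```

For the dataset above, that would be `list_tracked https://github.com/iterative/dataset-registry fashion-mnist/raw`.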

If we have multiple datasets, is it recommended to have 1 remote per dataset or to have 1 remote and let DVC handle the paths?

This is a really interesting question from @BrownZ!

It really depends on your use case. Separate remotes might be useful if you want granular control over permissions for each dataset.

In general, we would suggest a single remote and setting up a data registry to handle the different datasets through DVC.
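The two setups could look roughly like the helpers below. The helper names are invented for this sketch, and the remote name and bucket URL in the usage note are placeholders rather than values from the question.

```shell
# Hypothetical helpers; the function names are illustrative, not part of DVC.

# Single remote: register one default remote for the whole data registry repo.
# $1 is the remote name, $2 its storage URL.
add_default_remote() {
    dvc remote add -d "$1" "$2"
}

# Separate remotes: push one dataset to a specific, non-default remote.
# $1 is the remote name, $2 the tracked dataset (for example its .dvc file).
push_dataset_to() {
    dvc push -r "$1" "$2"
}

# Consumer side: import one dataset from the registry into another project.
# $1 is the registry's Git URL, $2 the dataset path inside it.
import_dataset() {
    dvc import "$1" "$2"
}
```

For instance, you might run `add_default_remote storage s3://example-bucket/dvc-store` once in the registry repo, and then `import_dataset https://github.com/iterative/dataset-registry fashion-mnist/raw` in any project that needs that dataset.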

Is there a mailing list for subscribing to CML releases?

It's awesome that community members like @pria want to keep up with our releases!

You can follow all of our releases via GitHub notifications. Release notes are available at https://github.com/iterative/cml/releases. To subscribe to release updates, click the Watch button in the top-right corner of the repository page, select Custom, and check the Releases option.