March '21 Community Gems

A roundup of technical Q&As from the DVC community. This month: remote storage integration, hyperparameter tuning, best practices for managing experiments, and more.

Jeny De Figueiredo
March 31, 2021 • 3 min read

Q: Will DVC work with <my remote cloud storage of choice>?

We recently had questions about this, specifically regarding Huawei Cloud and Backblaze B2 Storage. Any cloud storage that has an S3-compatible interface will work with DVC, and both of these do! In addition, DVC works with Azure, Google Drive, Google Cloud Storage (GS), Aliyun OSS, and SSH. Learn more about S3 compatibility integrations and all available remote storage capabilities here.
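As a rough sketch of what pointing DVC at an S3-compatible service can look like, the function below adds a default remote and sets a custom endpoint. The function name, remote name, bucket URL, and endpoint URL are all placeholders (assumptions, not values from the question), and credentials still need to be configured separately.

```shell
# Sketch: register an S3-compatible bucket as the default DVC remote.
# Arguments (all placeholders): remote name, s3:// URL, service endpoint URL.
add_s3_compatible_remote() {
  name="$1"; url="$2"; endpoint="$3"
  # -d marks this remote as the project's default
  dvc remote add -d "$name" "$url" || return 1
  # endpointurl points DVC at a non-AWS, S3-compatible endpoint
  dvc remote modify "$name" endpointurl "$endpoint"
}
```

For example, `add_s3_compatible_remote myremote s3://my-bucket/dvc-store https://s3.example.com`, where the endpoint is whatever URL your storage provider documents.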

Thanks to @luke and @Samuel H from Discord for asking these questions that led to this Gem! 💎

Q: I had understood previously that DVC was not suitable for hyperparameter tuning. Has that changed?

Yes indeed! With DVC 2.0, the capabilities have evolved quite a bit! We have introduced experiments and metrics, which enable you to track and compare different runs of your models with various hyperparameters. You can check out the docs here and here to see all the details.
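To give a feel for it, here is a minimal sketch of a small hyperparameter sweep with DVC 2.0 experiments. It assumes your project already has a `dvc.yaml` pipeline that reads its parameters from `params.yaml`; the function name and the `train.learning_rate` key are hypothetical.

```shell
# Sketch: run one DVC experiment per learning rate, then list the results.
# "train.learning_rate" is a hypothetical key in params.yaml.
sweep_learning_rates() {
  for lr in "$@"; do
    # --set-param sets a params.yaml value for this experiment
    dvc exp run --set-param "train.learning_rate=$lr" || return 1
  done
  # Table of experiments with their params and metrics
  dvc exp show
}
```

Calling `sweep_learning_rates 0.001 0.01 0.1` would run three experiments and then show them side by side.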

Thanks to @saif3r for helping us highlight the new features in DVC!

Q: Is it possible to set up a DVC repo with pipelines which have all the data (cache, input, output) on another (local) location outside the repo?

Thanks for the question @EEisbrenner!

One solution would be to keep your DVC cache on your mount and use the symlink cache type. All of your data would remain on that mount, while DVC would only deal with files that are "inside" your repo (via symlinks). Note that your data on that mount would be stored in DVC's content-addressable cache format, and not as path/to/mount/foo.nc. Check out the docs on how to keep the DVC cache on your mount here.

To actually work with foo.nc, you'd end up with a symlink foo.nc inside your Git/DVC repo that points to some object in your DVC cache.
See these docs for info on how the cache link types work. For doing the initial dvc add step for your data without needing to copy it into the DVC repo first, check out these docs.
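The cache setup described above can be sketched roughly as follows. The function name and the cache path are placeholders; pick a directory on your own mount.

```shell
# Sketch: keep the DVC cache on an external mount and link files into the repo.
# The cache path argument is a placeholder for a directory on your mount.
use_cache_on_mount() {
  cache_dir="$1"
  # Store the content-addressable cache outside the repo
  dvc cache dir "$cache_dir" || return 1
  # Link workspace files to the cache via symlinks instead of copying them
  dvc config cache.type symlink || return 1
  # Recreate links for already-tracked files using the new link type
  dvc checkout --relink
}
```

For example, `use_cache_on_mount /path/to/mount/dvc-cache`, run from inside the repo.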

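Q: My peers and I share a repo where we have a folder that is versioned with DVC. I'm getting an error message when trying to pull data from the cloud. What could be causing it?

I see you are getting the following error: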

$ dvc pull

Everything is up to date.
ERROR: failed to pull data from the cloud - 'data\rhinoceros.dvc' format error: extra keys not allowed @ data['outs'][0]['size']

$ dvc doctor

DVC version: 1.9.1 (exe)
---------------------------------
Platform: Python 3.7.9 on Windows-10-10.0.19041-SP0
Supports: All remotes
Cache types: hardlink
Cache directory: NTFS on C:\
Workspace directory: NTFS on C:\
Repo: dvc, git

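Your colleague is likely running a newer version of DVC. The size key that the error complains about was written by a newer release, and your version (1.9.1) doesn't recognize it. Upgrade so that everyone is on the same version and you will be good to go!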
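If you want a quick way to compare versions across the team, a small helper like the one below can pull out just the version number. It is only a sketch: the function name is made up, and it assumes the first line of `dvc version` looks like the `DVC version: 1.9.1 (exe)` line shown above.

```shell
# Sketch: print only the version number from `dvc version`,
# assuming its first line looks like "DVC version: 1.9.1 (exe)".
dvc_version_number() {
  dvc version | sed -n 's/^DVC version: \([^ ]*\).*/\1/p'
}
```

Each teammate can run `dvc_version_number` and compare the results before deciding who needs to upgrade.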

Thanks @ojon for this important gem! 💎

Q: How do I create multiple pipeline (`dvc.yaml`) files for different experiments?

You could create a separate directory for each experiment and keep your pipelines organized with a dvc.yaml file in each one. You can find more information on organization patterns for experiments here. We are currently working on a way to compare metrics across different paths when experiments are kept in separate directories like this. You can follow that issue here!
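As an illustration of the one-directory-per-experiment layout, the sketch below reproduces every pipeline found under a parent directory. The function name, the `experiments` directory, and the experiment names in the comments are hypothetical.

```shell
# Sketch: reproduce every pipeline kept under a parent directory.
# Assumed (hypothetical) layout:
#   experiments/baseline/dvc.yaml
#   experiments/augmented/dvc.yaml
repro_all_experiments() {
  root="${1:-experiments}"
  for pipeline in "$root"/*/dvc.yaml; do
    # Skip the unexpanded glob when no pipeline files exist
    [ -f "$pipeline" ] || continue
    # dvc repro accepts the path to a specific dvc.yaml as its target
    dvc repro "$pipeline" || return 1
  done
}
```

To run a single experiment instead, you can pass its file directly, for example `dvc repro experiments/baseline/dvc.yaml`.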

Thanks @tijoseymathew for your question in Discord!

Q: Is there a way to run "git checkout" and "dvc checkout" in one command?

Yep! There's a way! We offer a Git hook for post-checkout, which automates DVC checkout right after git checkout. You can use dvc install to install that hook.
Check out these docs for all the info on installing Git hooks and here for a specific example!
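In practice this can be as small as the sketch below. The function name and the branch argument are placeholders; `dvc install` only needs to be run once per clone.

```shell
# Sketch: install DVC's Git hooks once, then switch branches as usual.
# The branch name argument is a placeholder.
install_hooks_and_switch() {
  branch="$1"
  # One-time setup per clone: adds DVC's Git hooks, including post-checkout
  dvc install || return 1
  # With the hook in place, this also triggers `dvc checkout`
  git checkout "$branch"
}
```

After the hooks are installed, a plain `git checkout <branch>` is all you need from then on.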

Many thanks to @Thyrix for this question!

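Q: How do I set a remote in Google Drive and share with someone else?

These docs will show you how to get a Google Drive remote set up! Be sure to set up the remote folder's permissions so the people you share with can access it! For more information on sharing permissions in Google Drive see these docs.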
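A minimal sketch of the DVC side of that setup is shown below. The function name, remote name, and folder ID are placeholders; the folder ID is the last part of the Google Drive folder's URL, and sharing permissions are still managed in Google Drive itself.

```shell
# Sketch: use a Google Drive folder as the default DVC remote and upload data.
# The folder ID argument is a placeholder: the last part of the folder's URL.
add_gdrive_remote() {
  name="$1"; folder_id="$2"
  dvc remote add -d "$name" "gdrive://$folder_id" || return 1
  # First use prompts for Google authentication in the browser
  dvc push
}
```

Once the remote settings are committed to Git, a collaborator with access to the folder can clone the repo and run `dvc pull`.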

Thanks @Carlos Lopez H for this important gem! 💎

At our April Office Hours Meetup we will be demo-ing pipelines as well as CML. RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!

Join us in Discord to get all your DVC and CML questions answered!
