March '21 Community Gems
A roundup of technical Q&As from the DVC community. This month: remote storage integration, hyperparameter tuning, best practices for managing experiments, and more.
- Jeny De Figueiredo
- March 31, 2021 • 3 min read
We recently had questions about remote storage support, specifically regarding Huawei Cloud and Backblaze B2 Storage. The answer is that any cloud storage with an S3 interface will work with DVC, and both of the aforementioned have one! In addition, DVC works with Azure, Google Drive, GS, OSS, and SSH. Learn more about S3 compatibility integrations and all available remote storage capabilities here.
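As a rough sketch of what pointing DVC at an S3-compatible provider can look like: the helper name, the remote name, the bucket path, and the endpoint URL below are all placeholders and not values from the questions above. It assumes credentials are already configured separately, and it only relies on `dvc remote add` and the `endpointurl` option of `dvc remote modify`.

```shell
# Hypothetical helper: register an S3-compatible bucket as the default DVC remote.
# Usage: add_s3_compatible_remote <remote-name> <s3-url> <endpoint-url>
add_s3_compatible_remote() {
    name="$1"
    url="$2"
    endpoint="$3"
    # -d makes this the default remote for dvc push / dvc pull
    dvc remote add -d "$name" "$url" &&
    # send S3 requests to the provider's endpoint instead of AWS
    dvc remote modify "$name" endpointurl "$endpoint"
}

# Example call (placeholder values, replace with your provider's details):
# add_s3_compatible_remote myremote s3://my-bucket/dvc-store https://s3.example.com
```

After that, `dvc push` and `dvc pull` should talk to the configured bucket like any other S3 remote.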
Thanks to @luke and @Samuel H from Discord for asking these questions that led to this Gem! 💎
Q: I had understood previously that DVC was not suitable for hyperparameter tuning. Has that changed?
Yes indeed! With DVC 2.0, the capabilities have evolved quite a bit! We have introduced experiments and metrics, which enable you to track and compare the different runs of your models with various hyperparameters. You can check out the docs here and here to see all the details.
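Here is a minimal sketch of a small hyperparameter sweep with DVC 2.0 experiments. The helper name and the parameter `train.lr` in the example call are hypothetical; it assumes your repo already has a `dvc.yaml` pipeline that reads its parameters from `params.yaml`.

```shell
# Hypothetical helper: run one experiment per value of a parameter, then compare.
# Usage: sweep_param <param-name> <value> [<value> ...]
sweep_param() {
    param="$1"
    shift
    for value in "$@"; do
        # --set-param sets the value in params.yaml before running the experiment
        dvc exp run --set-param "$param=$value" || return 1
    done
    # print a table of the experiments with their params and metrics
    dvc exp show
}

# Example call (train.lr is a placeholder parameter name):
# sweep_param train.lr 0.1 0.01 0.001
```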
Thanks to @saif3r for helping us highlight the new features in DVC!
Q: Is it possible to set up a DVC repo with pipelines which have all the data (cache, input, output) on another (local) location outside the repo?
Thanks for the question @EEisbrenner!
One solution to this would be to keep your DVC cache on your mount and use the symlink cache type, so all of your data would remain on that mount, while DVC itself would only deal with files that are "inside" your repo (via symlinks). Note that your data on that mount would be stored in DVC's content-addressable cache format, and not as path/to/mount/foo.nc. Check out the docs on how to keep the DVC cache on your mount here.
To actually work with foo.nc, you'd end up with a symlink foo.nc inside your Git/DVC repo that points to some object in your DVC cache.
See these docs for info on how the cache link types work. For doing the initial dvc add step for your data without needing to copy it into the DVC repo first, check out these docs.
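The cache setup described above can be sketched roughly as follows. The helper name and the cache path in the example call are placeholders; the sketch only uses `dvc cache dir` and the `cache.type` config option, and assumes it is run from inside the repo.

```shell
# Hypothetical helper: keep the DVC cache on an external mount and
# link workspace files to it with symlinks.
# Usage (from inside the repo): use_external_cache <cache-dir-on-mount>
use_external_cache() {
    cache_dir="$1"
    # store cache objects outside the repo, on the mount
    dvc cache dir "$cache_dir" &&
    # link files in the workspace to cache objects instead of copying them
    dvc config cache.type symlink
}

# Example call (placeholder path):
# use_external_cache /path/to/mount/dvc-cache
```

Once this is configured, a tracked file such as foo.nc shows up in the repo as a symlink into the cache on the mount.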
Q: My peers and I share a repo where we have a folder that is versioned with DVC. I'm getting an error message when trying to pull data from the cloud. What could be causing it?
I see you are having the following error:
$ dvc pull
Everything is up to date.
ERROR: failed to pull data from the cloud - 'data\rhinoceros.dvc' format error: extra keys not allowed @ data['outs']['size']
$ dvc doctor
DVC version: 1.9.1 (exe)
Platform: Python 3.7.9 on Windows-10-10.0.19041-SP0
Supports: All remotes
Cache types: hardlink
Cache directory: NTFS on C:\
Workspace directory: NTFS on C:\
Repo: dvc, git
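Your colleague is likely running a newer version of DVC: the .dvc file contains a size field that your DVC 1.9.1 does not recognize, hence the "extra keys not allowed" error. Upgrade so that everyone is on the same version and you will be good to go!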
Thanks @ojon for this important gem! 💎
You could create separate directories for each experiment and keep your pipelines organized with separate dvc.yaml files. You can find more organization patterns for experiments here.
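One way this layout could look in practice is sketched below. The directory names and the helper are hypothetical; the sketch assumes each experiment directory holds its own dvc.yaml and passes that file to `dvc repro` as the target.

```shell
# Hypothetical layout (directory names are placeholders):
#   experiments/baseline/dvc.yaml
#   experiments/bigger-model/dvc.yaml

# Hypothetical helper: reproduce only the pipeline of one experiment directory.
# Usage: repro_experiment <experiment-dir>
repro_experiment() {
    exp_dir="$1"
    if [ ! -f "$exp_dir/dvc.yaml" ]; then
        echo "no dvc.yaml in $exp_dir" >&2
        return 1
    fi
    # run the stages defined in that directory's dvc.yaml
    dvc repro "$exp_dir/dvc.yaml"
}

# Example call:
# repro_experiment experiments/baseline
```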
Currently we are working on a way to compare metrics between different paths if using this method of keeping experiments in different directories. You can follow that issue here!
Thanks @tijoseymathew for your question in Discord!
Yep! There's a way! We offer a Git hook for post-checkout, which automates dvc checkout right after git checkout. You can use dvc install to install that hook.
Check out these docs for all the info on installing Git hooks and here for a specific example!
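As a small sketch of the hook setup: the helper below is hypothetical and assumes Git's default hooks location (.git/hooks) in the current repo. It runs `dvc install` and then checks that the post-checkout hook file is present.

```shell
# Hypothetical helper: install DVC's Git hooks and confirm post-checkout exists.
# Assumes it is run from the repo root and that hooks live in .git/hooks.
install_dvc_hooks() {
    dvc install || return 1
    if [ -f .git/hooks/post-checkout ]; then
        echo "post-checkout hook installed"
    else
        echo "post-checkout hook not found" >&2
        return 1
    fi
}

# Example call:
# install_dvc_hooks
```

With the hook in place, switching branches with git checkout should also update the DVC-tracked files in your workspace.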
Many thanks to @Thyrix for this question!
Thanks @Carlos Lopez H for this important gem! 💎
At our April Office Hours Meetup we will be demo-ing pipelines as well as CML. RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!
Join us in Discord to get all your DVC and CML questions answered!