October '20 Community Gems
A roundup of technical Q&As from the DVC community. This month, learn how DVC files work, how to use DVC plots for multi-class classification problems, and how to deal with some spooky error messages 👻.
- Elle O'Brien
- October 26, 2020 • 4 min read
Happy Halloween from Pirate DeeVee!
DVC creates lightweight metafiles (.dvc files) that correspond to large artifacts in your project. These .dvc files contain pointers to your artifacts in remote storage (we use a simple content-based storage scheme). Because we use content-based storage, the remote storage itself isn't designed for browsing (there are some discussions about how to make stored files more "discoverable", and you can always identify them manually by their contents and meta-information like timestamps).
.dvc files establish meaningful links between human-readable filenames and file contents in remote storage, and they let you use Git versioning on your stored datasets and models. You can think of your DVC remote storage as a complement to your Git repository, not a replacement.
In other words… if you're not Git versioning your .dvc files, you're not versioning anything in DVC remote storage!
Yep, by default, DVC data transfer operations use a number of threads proportional to the number of CPUs detected. But there's a handy flag for dvc pull and dvc push that lets you override the defaults:
-j <number>, --jobs <number> - number of threads to run
simultaneously to handle the downloading of files from
the remote. The default value is 4 * cpu_count(). For
SSH remotes, the default is just 4. Using more jobs may
improve the total download speed if a combination of small
and large files are being fetched.
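To make the arithmetic in that help text concrete, here is a minimal Python sketch. The helper names `default_jobs` and `pull_command` are hypothetical (they are not part of DVC), and the fallback to one CPU when the count cannot be detected is an assumption of this sketch.

```python
import os


def default_jobs(ssh_remote: bool = False) -> int:
    """Default thread count as described in the help text quoted above."""
    if ssh_remote:
        return 4
    # os.cpu_count() can return None when the count is undetermined;
    # falling back to 1 is this sketch's assumption, not DVC's behavior.
    return 4 * (os.cpu_count() or 1)


def pull_command(jobs: int) -> list:
    """Argument list for `dvc pull` with an explicit --jobs override."""
    if jobs < 1:
        raise ValueError("jobs must be a positive integer")
    return ["dvc", "pull", "--jobs", str(jobs)]


if __name__ == "__main__":
    print("default threads on this machine:", default_jobs())
    print("override example:", " ".join(pull_command(2)))
```

The list returned by `pull_command` can be handed to `subprocess.run` in a script, or you can simply type the equivalent `dvc pull --jobs 2` in a terminal.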
Q: I'm working on a multi-class classification task. Can dvc plots show multiple precision recall curves, one for each class?
dvc plots doesn't support multiple linear curves on a single plot (except with dvc plots diff, of course!). But you could make one precision recall curve per class and display them side-by-side.
To do this, you'd want to write the precision recall curve values to separate files for each class (prc-0.json, prc-1.json, etc.). Then you would run:
$ dvc plots show prc-0.json prc-1.json
And you'll see two plots side-by-side! A benefit of this approach is that when you run dvc plots diff to compare precision recall curves across Git commits, you'll get a comparison plotted for each class.
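If you're wondering how to produce those per-class files, here is one possible sketch in plain Python. It treats each class one-vs-rest and writes an array of precision/recall points under a "prc" key. The function names, the "prc" key, and the exact JSON layout are assumptions made for illustration, so check the dvc plots docs for the format your DVC version expects.

```python
import json
from pathlib import Path


def precision_recall_points(y_true, scores, positive_class):
    """One-vs-rest precision and recall at each distinct score threshold."""
    is_pos = [label == positive_class for label in y_true]
    total_pos = sum(is_pos)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and t for p, t in zip(predicted, is_pos))
        fp = sum(p and not t for p, t in zip(predicted, is_pos))
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / total_pos if total_pos else 0.0
        points.append(
            {"precision": precision, "recall": recall, "threshold": threshold}
        )
    return points


def write_prc_files(y_true, class_scores, out_dir="."):
    """Write prc-<class>.json per class; class_scores maps class -> scores."""
    paths = []
    for cls, scores in class_scores.items():
        path = Path(out_dir) / f"prc-{cls}.json"
        payload = {"prc": precision_recall_points(y_true, scores, cls)}
        path.write_text(json.dumps(payload, indent=2))
        paths.append(path)
    return paths
```

Calling `write_prc_files(labels, {0: scores_for_class_0, 1: scores_for_class_1})` at the end of an evaluation script would leave prc-0.json and prc-1.json in the working directory, ready for the dvc plots show command above.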
Q: Are you sure I should commit my .dvc/config file? It contains my login credentials for storage, and I'm nervous about adding it to a shared Git repository.
This is a common scenario: you don't necessarily want to broadcast your remote storage credentials to everyone on your team, but you still want to check in your DVC setup (meaning, your .dvc/config file). In this case, you want to use a local config file!
You can use the command
$ dvc config --local
to set up remote credentials that will be stored in .dvc/config.local. By default, this file is in your .gitignore so you don't have to worry about accidentally committing secrets to your Git repository.
Check out the docs for more, including the --system and --global options for setting your configuration for multiple users and projects, respectively.
cml publish is a service for hosting files that are embedded in CML reports, like images, audio files, and GIFs. By default, we have a 2 MB size limit.
If your files are larger than this (which can happen, depending on the machine learning problem you're working on!) we recommend using GitLab's artifact storage.
Based on discussions in the community, we recently implemented a CML flag (--gitlab-uploads) to streamline this:
$ cml publish movie.mov --md --gitlab-uploads > report.md
Note that we don't currently have an analogous solution for GitHub, because GitHub artifacts expire after 90 days (whereas they're permanent in GitLab).
Q: I'm getting a mysterious error message, Failed guessing mime type of file, when I try to use cml publish. What's going on?
This error message usually means that the target of cml publish, that is, <target file> in
$ cml publish <target file>
is not found. Check for typos in the target filename and ensure that the file was in fact generated during the run (if it isn't part of your Git repository). We've opened an issue to add a more informative error message in the future.
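Until a clearer message lands, a small pre-flight check in your own script can catch a missing target before cml publish runs. The sketch below is only an illustration: `check_publish_target` is a hypothetical helper, not part of CML, and listing the neighbouring files is just one way to make typos easier to spot.

```python
from pathlib import Path


def check_publish_target(target: str) -> Path:
    """Fail early, with a readable message, if the file to publish is missing."""
    path = Path(target)
    if not path.is_file():
        parent = path.parent
        # List what is actually in the directory so a typo stands out.
        nearby = sorted(p.name for p in parent.iterdir()) if parent.is_dir() else []
        raise FileNotFoundError(
            f"{target!r} does not exist; files in {str(parent)!r}: {nearby}"
        )
    return path
```

Calling `check_publish_target("plot.png")` at the end of the step that is supposed to generate the file makes the workflow fail at the point where the file went missing, rather than later inside cml publish.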
Q: In my GitHub Actions workflow, I use dvc metrics diff to compare metrics generated during the run to metrics on the main branch and print a table, but the table isn't showing any of the metrics from main. What could be happening?
When a continuous integration runner won't report metrics from previous versions of your project (or other branches), that's usually a sign that the runner doesn't have access to the full Git history of your project or your metrics themselves. Here are a few things to check for:
- Did you fetch your Git working tree in the runner? Commands like dvc metrics diff require the Git history to be accessible, so make sure that in your workflow, before you run this command, you've done a git fetch. We recommend:
$ git fetch --prune --unshallow
- Are your metrics in your DVC remote? If your metrics are cached (which they are by default when you create a DVC pipeline), your DVC remote should be accessible to your runner. That means you need to add any credentials as repository secrets (or variables, in GitLab), and do dvc pull in your workflow before attempting dvc metrics diff.
- Are your metrics in your local workspace? If you are not using a DVC remote, your metric files must be uncached and committed to your Git repository. To explore an example, say you have a pipeline stage that creates metric.json:
$ dvc run -n mystage -m metric.json train.py
metric.json is cached and ignored by Git, which means that if you aren't using a DVC remote in your CI workflow, metric.json will effectively be abandoned on your local machine! You can avoid this by using the -M flag in dvc run, or by manually adding the field cache: false to your metric in dvc.yaml. Be sure to remove your metrics from any .gitignore files, and commit and push them to your Git repository.
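When a metric file was cached at some point, a leftover ignore entry can keep Git from tracking it even after you switch it to uncached. As a rough aid, the sketch below lists the .gitignore files under a project that mention a given metric by name. `gitignores_hiding` is a hypothetical helper, and it only does exact name matching (with or without a leading slash), not full gitignore pattern semantics.

```python
from pathlib import Path


def gitignores_hiding(metric_name: str, repo_root: str = ".") -> list:
    """Return the .gitignore files under repo_root that list metric_name."""
    hits = []
    for gitignore in Path(repo_root).rglob(".gitignore"):
        # Compare entries by name, ignoring a leading "/" anchor.
        entries = {
            line.strip().lstrip("/")
            for line in gitignore.read_text().splitlines()
        }
        if metric_name in entries:
            hits.append(gitignore)
    return sorted(hits)
```

If `gitignores_hiding("metric.json")` returns an empty list, nothing matched by name and the file should be free to `git add`. Otherwise, remove the entry from each file it reports before committing.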
That's all for this month- Happy Halloween! Watch out for scary bugs. 🐛