September '20 Community Gems
A roundup of technical Q&A's from the DVC community. This month, we discuss customizing your DVC plots, the difference between external dependencies and outputs, and how to save models and data in CI.
- Elle O'Brien
- September 28, 2020 • 5 min read
If you're using DVC with an SSH-protected remote, DVC uses a Python library
paramiko to create a connection to your remote.
paramiko expects RSA keys in OpenSSH key format, and can throw an error
if the keys are in an alternative format (such as default PuTTY formatted keys).
If this is the case, you'll likely see:
ERROR: unexpected error - ('... ssh-rsa ...=', Error('Incorrect padding',))
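One quick way to confirm this diagnosis is to look at the first line of your private key file. The sketch below is a minimal, illustrative heuristic, not part of DVC or paramiko: the function names and returned labels are our own. It relies on the fact that PuTTY `.ppk` files begin with a `PuTTY-User-Key-File-` header, while OpenSSH and PEM private keys begin with a `-----BEGIN ... PRIVATE KEY-----` line.

```python
from pathlib import Path


def guess_key_format(first_line: str) -> str:
    """Classify a private key by its first line (a heuristic, not a parser)."""
    line = first_line.strip()
    if line.startswith("PuTTY-User-Key-File-"):
        return "putty"
    if line.startswith("-----BEGIN") and line.endswith("PRIVATE KEY-----"):
        return "openssh-or-pem"
    return "unknown"


def guess_key_file_format(path: str) -> str:
    """Read only the first line of the key file and classify it."""
    with Path(path).expanduser().open("r", errors="replace") as handle:
        return guess_key_format(handle.readline())
```

If the check reports `putty`, exporting the key in OpenSSH format (for instance with PuTTYgen's OpenSSH export option) and pointing your remote at the exported file is the likely fix.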
Yes, you can have as many separate parameter files as you'd like. It's only important that they are correctly specified in your DVC pipeline stages.
For example, if you have files
params_model.yaml in your project (perhaps to store hyperparameters of your
data processing and model fitting stages, respectively), you'll want to call the
right file at each stage. For example:
$ dvc run -n preprocess \
Q: Is there a way to automatically produce SVG plots from
dvc plot? I don't like having to click through the Vega-Lite GUI to get an SVG, and my plots look so small when I access them in the browser.
If your DVC plots (and by DVC plots, we mean Vega-Lite plots 😉) look small in
your browser, you can modify this programmatically! DVC generates Vega-Lite
plots by way of a few templates that come pre-loaded. The templates are in
.dvc/plots (assuming you're in a DVC directory).
Find the template that corresponds to your plot (if you didn't specify a plot
type in your CLI command, it's probably
default.json) and modify the
width parameters. Then save your changes.
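If you'd rather not edit the template by hand, the same change can be scripted. This is a minimal sketch that assumes the template is a JSON Vega-Lite specification with top-level `width` and `height` properties. The function names are illustrative, and setting `height` alongside `width` is our assumption, so check your own template before relying on it.

```python
import json
from pathlib import Path


def resize_spec(spec: dict, width: int, height: int) -> dict:
    """Return a copy of a Vega-Lite spec with new top-level dimensions."""
    resized = dict(spec)  # shallow copy; the caller's dict is left untouched
    resized["width"] = width
    resized["height"] = height
    return resized


def resize_template(path: str, width: int, height: int) -> None:
    """Rewrite a template file in place with the new dimensions."""
    template = Path(path)
    spec = json.loads(template.read_text())
    template.write_text(json.dumps(resize_spec(spec, width, height), indent=4))
```

For example, calling `resize_template(".dvc/plots/default.json", width=600, height=400)` from the root of your project would enlarge the default template. The sizes here are arbitrary.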
One last tip: did you know about the
Vega-Lite CLI? It provides
functions for converting Vega-Lite plots to
(Vega) formats. To use this approach with DVC, you'll want to use the
--show-vega flag to print your plot specification to a file:
$ dvc plots --show-vega > vega.json
$ vl2svg vega.json
In short, external outputs and dependencies are files or directories that are tracked by DVC, but physically reside outside of the local workspace. This could happen for a few reasons:
- You want to version a dataset in cloud storage that is too large to transfer to your local workspace efficiently
- Your DVC pipeline writes directly to cloud storage
- Your DVC pipeline depends on a dataset or other file in cloud storage
An external output can be declared in two ways. For example, if you have a file
data.csv in S3 storage, you can use
dvc add --external s3://mybucket/data.csv to begin DVC tracking the file
(there are plenty more details and tips about managing external data in our docs).
You can also declare
data.csv as an output of a DVC pipeline with
dvc run -o s3://mybucket/data.csv.
An external dependency is a dependency of a DVC pipeline that resides in
cloud storage. It's declared with the syntax
dvc run -d s3://mybucket/data.csv.
One other difference to note: DVC doesn't cache external dependencies; it merely
checks if they have changed when you run
dvc repro. On the other hand, DVC
does cache external outputs. You'll want to set up an external cache
in the same remote location where your files are stored. This is because the
default cache location (in your local workspace) no longer makes sense when the
dataset never "visits" your local workspace! An external cache works largely the
same as a typical cache in your workspace.
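To make the "check, don't cache" behavior for external dependencies concrete, here is a conceptual sketch: keep only a recorded checksum and compare it with a freshly computed one. This illustrates the idea only and is not DVC's implementation. The function names are hypothetical, and the checksum DVC actually records depends on the storage type.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Hex SHA-256 digest, used here purely as a change-detection fingerprint."""
    return hashlib.sha256(content).hexdigest()


def dependency_changed(recorded: str, current_content: bytes) -> bool:
    """True if the dependency no longer matches the recorded checksum."""
    return checksum(current_content) != recorded


# Record a fingerprint once, then compare on each later run.
recorded = checksum(b"id,value\n1,10\n")
print(dependency_changed(recorded, b"id,value\n1,10\n"))  # False: unchanged
print(dependency_changed(recorded, b"id,value\n1,11\n"))  # True: stage should rerun
```

Only the fingerprint is stored, never the data itself, which is why no cache is needed for a dependency. An output, by contrast, has to be stored somewhere so an earlier version can be restored.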
In many of our CML docs and videos, we've shown how to get CML on your CI (continuous integration) runner via a Docker container that comes with everything installed. But this is not the only way to use CML, especially if you want workflows to run in your own Docker container.
You can install CML via
npm, either in your own Docker container or in your CI
workflow (i.e., in your GitHub Actions
.yaml or GitLab CI configuration file).
To install CML as a package, you'll want to run:
$ npm i -g @dvcorg/cml
Note that you may need to install additional dependencies if you want to use DVC plots and Vega-Lite commands:
$ sudo apt-get install -y libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev \
$ npm install -g vega-cli vega-lite
If you're installing CML as part of your workflow, you may need to install Node first; check out our docs for how to do this in GitHub Actions and GitLab CI.
Q: After running a GitHub Action workflow that runs a DVC pipeline, I want to save the output of the pipeline. Why doesn't CML automatically save the output?
By design, artifacts generated in a CI workflow aren't saved anywhere; they disappear as soon as the runner shuts down. So a DVC pipeline executed in your CI system might produce outputs, like transformed datasets and model files, that will be lost at the end of the run. If you want to save them, there are a few methods.
One approach is with auto-commits: a
git commit at the end of your CI workflow
to commit any new artifacts to your Git repository. However, auto-commits have a
lot of downsides: they don't make sense for a lot of users, and generally, it's
better to re-create outputs as needed than to save them forever in your Git repo.
We created the DVC
run-cache in part
to solve this issue.
Here's how it works: you'll set up a DVC remote with access credentials passed to
your GitHub Action/GitLab CI via CML.
Then you'll use the following protocol in your CI workflow (your workflow config
file in GitHub/GitLab):
$ dvc pull --run-cache
$ dvc repro
$ dvc push --run-cache
When you use this design, any artifacts of
dvc repro, such as models or
transformed datasets, will be saved in DVC storage and indexed by the pipeline
version that generated them. You can access them in your local workspace by running:
$ dvc pull --run-cache
$ dvc repro
While we think this is ideal for typical data science and machine learning workflows, there are other approaches too. If you want to go deeper exploring auto-commits, check out the Add & Commit GitHub Action.
To be clear, CML isn't a competitor to Circle CI. Circle CI is more analogous to GitHub Actions or GitLab CI; it's a continuous integration system.
CML is a toolkit that works with a continuous integration system to 1) provide big data management (via DVC & cloud storage), 2) help you write model metrics and data viz to comments in GitHub/Lab, and 3) orchestrate cloud resources for model training and testing. Currently, CML is only available for GitHub Actions and GitLab CI.
So to sum it up: CML is not a standalone continuous integration system! It's a toolkit that works with existing systems, which in the future could include Circle CI, Jenkins, Bamboo, Azure DevOps Pipelines, and Travis CI. Feel free to open a feature request ticket, or leave a 👍 on open requests, to "vote" for the integrations you'd like to see most.