February '21 Community Gems
A roundup of technical Q&As from the DVC community. This month: best practices for config files, pipeline dependency management, and caching data for CI/CD. Plus a new CML feature to launch cloud compute with Terraform!
- Elle O'Brien
- February 26, 2021 • 5 min read
Q: I noticed I have a DVC
config file and a
config.local file. What's best practice for committing these to my Git repository?
DVC uses the config and config.local files to link your remote data
repository to your project.
config is intended to be committed to Git, while
config.local is not - it's a file that you use to store sensitive information
(e.g. your personal credentials - username, password, access keys, etc. for
remote storage) or settings that are specific to your local environment.
Usually, you don't have to worry about ensuring your config.local file is
being ignored by Git: the only way to create a config.local file is by passing
the --local flag explicitly to commands like dvc remote, so you'll know you've
made one! And your config.local file is .gitignored by default. If you're
concerned, take a look and make sure there are no settings in your
config.local file that you actually want in your config file.
To learn more, read up in our docs.
Q: What's the best way to install the new version of DVC in a Conda environment? I'm concerned about the
When you install DVC via
conda, it will come with dependencies like
The only exception when installing DVC as a Python library is with pip: you
might want to specify the kind of remote storage you need to make sure all
dependencies are present (like
boto for S3). You can run
pip install "dvc[<option>]", with supported options like
[ssh]. Or, use
[all] to include them all.
For more about installing DVC and its dependencies, check out our docs.
Q: How do I keep track of changes in modules that my DVC pipeline depends on? For example, I have a pipeline stage that runs a script
prepare.py, which imports a module module.py. If
module.py changes, how will DVC know to rerun the pipeline stage?
If your DVC pipeline only lists
prepare.py as a dependency, then changing code
in module files won't trigger a re-run of the pipeline. Meaning that if you run
dvc repro after updating
module.py, DVC will simply return the result of
your last pipeline run and a message that nothing has changed.
To explain further why this happens:
DVC is platform agnostic and it doesn't know whether your command's executable
is python, some other script interpreter, or a compiled binary for that
matter. E.g. this is a valid stage:
dvc run -o hello.txt 'echo "Hello!" > hello.txt' (where the executable is echo).
DVC also doesn't know what's going on inside the command's source code.
Therefore, any file that your code requires internally should be explicitly
specified as a pipeline stage dependency (in CLI,
dvc run -d, or in YAML,
deps:) for DVC to track it.
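If listing modules by hand feels error-prone, the list can be generated. Here's a minimal sketch, assuming your modules are plain .py files sitting next to the script. The helper names local_imports and dep_flags are hypothetical (they are not part of DVC), and the sketch only follows direct imports, not imports of imports.

```python
import ast
from pathlib import Path
from typing import List


def local_imports(script: Path) -> List[Path]:
    """Return the sibling .py files that `script` imports directly."""
    tree = ast.parse(script.read_text(), filename=str(script))
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    # Keep only names that resolve to a file next to the script;
    # installed packages and the standard library are skipped.
    candidates = (script.parent / f"{name}.py" for name in names)
    return sorted(p for p in candidates if p.is_file())


def dep_flags(script: Path) -> str:
    """Build `-d` arguments for the script and its local modules."""
    deps = [script, *local_imports(script)]
    return " ".join(f"-d {p.as_posix()}" for p in deps)
```

Calling dep_flags(Path("prepare.py")) from the project directory gives you a string of -d arguments to paste into your dvc run command, or a list of paths to copy under deps: in YAML.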
If you're not interested in adding modules as explicit dependencies, there are a few other approaches:
- Make your
requirements.txt file a stage dependency (if the loaded module comes from a package).
- Manually rebuild the pipeline (with
dvc repro --force <stage>.dvc) when you know an unmarked dependency has changed – although this is prone to human error.
- Have a version/build number comment in the main script that always gets updated when an unmarked dependency changes – this could be automated.
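The last option could be automated along these lines. This is a rough sketch, not an official recipe: the `# deps-hash:` comment format and the function names deps_digest and stamp_script are made up for illustration. Because the comment lives inside the main script, any change to it changes the script's contents, which DVC already tracks as a dependency.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical marker line; pick any comment format you like.
STAMP = re.compile(r"^# deps-hash: .*$", re.MULTILINE)


def deps_digest(paths):
    """Hash the contents of the unmarked dependencies in a stable order."""
    h = hashlib.sha256()
    for p in sorted(Path(x) for x in paths):
        h.update(p.name.encode())
        h.update(p.read_bytes())
    return h.hexdigest()[:12]


def stamp_script(script, paths):
    """Write or refresh the marker comment; return True if the file changed."""
    script = Path(script)
    text = script.read_text()
    line = f"# deps-hash: {deps_digest(paths)}"
    if STAMP.search(text):
        new_text = STAMP.sub(line, text, count=1)
    else:
        # No marker yet: put it on the first line of the script.
        new_text = f"{line}\n{text}"
    if new_text != text:
        script.write_text(new_text)
        return True
    return False
```

You might run stamp_script("prepare.py", ["module.py"]) from a pre-commit hook or just before dvc repro. Note that the sketch puts a new marker on the very first line, so a script that starts with a shebang would need the marker added by hand once.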
We also have an ongoing discussion about this issue on our GitHub repository, and we'd love your input. If you can, please join the conversation here!
Q: My DVC pipeline has a lot of dependencies, and I don't want to manually write them all out in my
dvc.yaml file. Are there any ways to use wildcards (like
*) or specify directories as dependencies?
Yes, you can set a directory to be a dependency or an output of a DVC pipeline stage. This means you can have tens, hundreds, thousands or millions of dependency files in one directory, and all you have to declare in the pipeline is the address of that directory.
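For some intuition about why one directory entry is enough, here's a toy sketch of how a whole directory can be reduced to a single fingerprint. This is an illustration only and not DVC's actual hashing scheme; dir_digest is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def dir_digest(directory):
    """Combine every file's relative path and contents into one digest."""
    root = Path(directory)
    h = hashlib.sha256()
    # Sort so the result does not depend on filesystem listing order.
    for p in sorted(q for q in root.rglob("*") if q.is_file()):
        h.update(p.relative_to(root).as_posix().encode())
        h.update(hashlib.sha256(p.read_bytes()).digest())
    return h.hexdigest()
```

Adding, removing, renaming or editing any file under the directory produces a different digest, so a tool only has to compare one value to decide whether a stage that depends on the directory is out of date.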
You're in luck, because we just shared this feature as part of the CML 0.3.0
pre-release! The pre-release introduced a new function,
cml runner, which replaces the
previous method for launching instances in the cloud from a CI workflow using Docker Machine.
In the new
cml runner function built on Terraform, you can deploy instances in
AWS and Azure with a single command (it used to take about 30 lines of code!).
For example, to launch a
t2.micro instance on AWS from your GitHub Actions or
GitLab CI workflow, you'll run:
cml runner \
--cloud aws \
--cloud-region us-west \
Q: My CI workflow creates a
report.md document that gets published to my pull request by CML. I want to save the
report.md file to my repository, too. Is this possible?
By default, files that are created in a GitHub Actions or GitLab CI workflow
only exist on the runner: as soon as the runner turns off, they vanish.
cml publish and
cml send-comment create persistent links to
data visualizations, tables, and other outputs of your workflow so you can view
them long after your run ends. However, by design, CML doesn't commit files to
your repository (not all users want this!).
What you're likely looking for is an auto-commit, to essentially
git add and
git commit files generated by the workflow to your repository. You can
manually write this code into your workflow file, or you can use a GitHub Action
tool like the
Auto Commit or
Add & Commit Actions.
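If you'd rather write it yourself, the manual route might look something like this sketch, run as a script step at the end of your workflow. The function names and the ci-bot identity are placeholders, and it assumes the runner has push access to the branch it checked out (pull request checkouts often need extra setup for that).

```python
import subprocess


def commit_commands(paths, message, name="ci-bot", email="ci-bot@example.com"):
    """Return the git invocations that commit `paths` and push the result."""
    return [
        ["git", "config", "user.name", name],
        ["git", "config", "user.email", email],
        ["git", "add", "--", *paths],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def auto_commit(paths, message, run=subprocess.run):
    """Commit and push `paths` if the workflow changed them."""
    status = run(
        ["git", "status", "--porcelain", "--", *paths],
        capture_output=True, text=True, check=True,
    )
    # Nothing new or modified: skip, since an empty commit would fail.
    if not status.stdout.strip():
        return False
    for cmd in commit_commands(paths, message):
        run(cmd, check=True)
    return True
```

A call like auto_commit(["report.md"], "Add CML report") would go after the step that generates the report. Keep in mind that a push from CI can trigger the workflow again, so you may want to guard against loops.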
Q: Do you have any suggested caching strategies with CML and DVC? My DVC pipeline runs in a CI workflow, and it depends on ~15 GB of data. I don't want to download this dataset to my runner every time the workflow runs.
Downloading data to a runner on every CI workflow run can be needlessly time-consuming, particularly when the data rarely changes.
While we don't have a CML-specific mechanism in the works for this use case, there are two main approaches we see as viable:
- Attach an EBS volume to the instance that runs your workflow. If you're using DVC, DVC needs to run in that volume (at the very least, your DVC cache must be there). A user recently let us know that this approach is working well for them and prevents unnecessary re-downloads of their DVC cache. They also recommended this article for setup guidelines.
- Use a shared DVC cache. Currently, many DVC users configure their cache in shared NFS. A similar setup that might help here is using a single shared development server; check out our docs for a use case.