July '22 Community Gems
A roundup of technical Q&As from the DVC community. This month: deploying models with MLEM, DVC data and remotes, DVC stages and plots, and more.
- Milecia McGregor
- July 26, 2022 • 4 min read
How can I track a new file added to my data folder if the data folder is already tracked by DVC, yet ignored by Git?
Great question on how DVC handles data tracking from @NgHoangDat!
Since you already track the data folder, when you add a new file into it, all you need to do is update your DVC history. You can use either dvc add data or dvc commit to start tracking the new file.
DVC also recalculates hashes only for the files that changed. If you add or modify a small number of files in that folder, the update will not take very long.
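One rough way to picture why a small change stays cheap: a tracked folder can be thought of as a list of per-file hashes, so after an update only the entries with a new hash represent new data. The sketch below is a simplified illustration of that idea, not DVC's actual implementation; the helper names and the sample files are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to the MD5 hash of its contents."""
    return {
        path.relative_to(folder).as_posix(): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return paths that are new or whose hash differs from the old manifest."""
    return sorted(p for p, digest in new.items() if old.get(p) != digest)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "a.csv").write_text("1,2,3\n")
    before = md5_manifest(data)  # state when the folder was last tracked

    (data / "b.csv").write_text("4,5,6\n")  # the newly added file
    after = md5_manifest(data)

    new_or_changed = changed_files(before, after)
    print(new_or_changed)  # only b.csv differs from the earlier manifest
```

In a real project you would not write this yourself; running dvc add data or dvc commit does the bookkeeping for you.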
Wonderful question from @come_arvis!
You can use the get_url method of the DVC Python API to do this. Here's an example of a script you might run to get the remote URL.
resource_url = dvc.api.get_url(
This URL is built with the remote URL from the project configuration file, .dvc/config, and the md5 file hashes stored in the .dvc file corresponding to the data file or directory you want the storage location of.
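To make the "remote URL plus md5 hash" idea concrete, here is a minimal sketch of how the two pieces can be combined. It assumes the DVC 2.x storage layout, where the first two characters of the hash become a folder name; other DVC versions may use a different prefix. The storage_url helper, the bucket name, and the hash value are placeholders for illustration and are not part of the DVC API.

```python
def storage_url(remote_url: str, md5: str) -> str:
    """Join a remote URL and an MD5 hash, assuming the DVC 2.x layout:
    <remote>/<first two hash characters>/<remaining hash characters>."""
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Placeholder values: in a real project the remote URL comes from .dvc/config
# and the hash from the .dvc file of the tracked data.
url = storage_url("s3://example-bucket/dvc-store", "a304afb96060aad90176268345e10355")
print(url)
```

In practice, dvc.api.get_url does this lookup for you, so a helper like this is only useful for understanding what the returned URL means.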
I'm excited about MLEM helping expose API endpoints to our model, but heard it was experimental. Where can I learn more about how to deploy models with this tool?
Great question from @raveman^2!
There are a few ways you can expose API endpoints for your model:
- Use mlem serve to generate a FastAPI endpoint with your model.
- Export the model as a Python package for your own custom-built API.
- Use the experimental deployment to Heroku.
You can find more details here in the MLEM docs: https://mlem.ai/doc/get-started
You can also see an example of deploying a model with MLEM in this blog post tutorial.
This is a good question from @Nwoke!
If you have accidentally added the wrong directory or files for DVC to track, you can easily remove them with the dvc remove command. It removes the .dvc file and ensures that the original data file is no longer being tracked. Here's an example of this command being used:
$ dvc remove data.csv.dvc
Sometimes when you stop tracking data, you also want to remove it from your cache. You can do this with the dvc gc command, which cleans up all unused data in the cache, not just the target of dvc remove. If you want to remove that data and its previous versions from the cache, you can do that with the following command:
$ dvc gc -w
The -w option keeps only the files and directories referenced in the workspace, so once you have removed the data you don't want to track, this is how DVC knows what to keep and what to discard.
You can learn more about removing tracked data in the docs here.
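The keep-or-discard rule behind dvc gc -w can be sketched as a simple set operation: anything in the cache that the current workspace still references is kept, and everything else is removed. This is only a conceptual illustration under that assumption; the collect_garbage function and the short hash strings are hypothetical and do not come from DVC.

```python
def collect_garbage(cache: set[str], workspace_refs: set[str]) -> tuple[set[str], set[str]]:
    """Split cached hashes into (kept, removed) based on workspace references."""
    kept = cache & workspace_refs      # still referenced by .dvc or dvc.lock files
    removed = cache - workspace_refs   # no longer referenced, so safe to discard
    return kept, removed


# Hypothetical hashes: "aaa" and "bbb" are still in use, while "ccc" belonged
# to the file that was untracked with dvc remove.
cache = {"aaa", "bbb", "ccc"}
workspace_refs = {"aaa", "bbb"}

kept, removed = collect_garbage(cache, workspace_refs)
print(sorted(kept), sorted(removed))
```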
Fantastic question from @Nish!
This is the expected behavior of DVC. It removes the outs of a stage unless persist: true is set for that output. You can learn more about how this works in our docs here.
Here's an example of a stage with the persist value set.
cmd: date > data/external/date
Even if you don't persist your outs, you can still check out an older version of the pipeline to get older outs with dvc checkout. This is based on what's in your dvc.lock and .dvc files, and it will update your workspace to match the experiment you check out. This is usually run after checking out a different Git branch. So the flow might look like:
$ git checkout experiment-branch
$ dvc checkout
The first command restores the dvc.lock and .dvc files for the experiment you want to go back to from your Git history. The second uses DVC to bring your data to the matching version, so you can reproduce your entire experiment. You can learn more about these details in the dvc checkout docs here.
Wonderful question from @shortcipher3!
If you update DVC to version 2.12.1 or higher, you should be able to define multiple y-axes in your DVC pipeline. Here's an example of how this may look in your dvc.yaml:
y: [col1, col2, col3]
# alternative 1:
some_file.csv: [col1, col2, col3]
# in case of multiple files:
file1.csv: [col1, col2]
A quick note: make sure that plots is on the same level as stages in your dvc.yaml.
Awesome question from @srb302!
You would need to set up outputs and dependencies for each stage. A stage that runs first would generate an output, and the stage that is supposed to run second would use the first stage's output as a dependency.
Otherwise, DVC does not guarantee any particular execution order for stages which are independent of each other. DVC determines the structure of your DAG based on file outputs and dependencies and there isn't another way to enforce order of stage execution in DVC.
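Here is a small sketch of how an execution order can be derived purely from outputs and dependencies. It uses Python's standard graphlib module rather than anything from DVC, and the stage names and file paths are hypothetical. Notice that the stages are listed out of order on purpose, yet the file links alone determine which one runs first.

```python
from graphlib import TopologicalSorter

# Hypothetical stages, deliberately listed out of order.
stages = {
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that each one runs after the stages producing its deps."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the stages it depends on through shared files.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)  # ['prepare', 'train', 'evaluate']
```

If two stages shared no files at all, a sort like this could return them in either order, which mirrors why DVC makes no promise about the order of independent stages.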
This is a really good question from @vadim.sukhov!
Let's take a look at an example
In this scenario, the roc.json files are not being tracked by DVC because of the cache: false value. Since these files aren't tracked by DVC, they aren't saved to a remote storage location outside of Git, like data files are. So if you have cache: false on a file that you want to keep track of, you'll need to commit it to Git in your project.
Keep an eye out for our next Office Hours Meetup! Join our group to stay up to date with the specifics as we get closer to the event!
Check out our docs to get all your DVC, CML, and MLEM questions answered!
Join us in Discord to chat with the community!