Syncing Data to Azure Blob Storage
Setting up a remote to make data versioning easier with DVC is a common need, so this tutorial walks through how to do it with Azure.
- Milecia McGregor
- June 13, 2022 • 4 min read
Using Azure Blob Storage in DVC
When you’re working on a data science project that has huge datasets, it’s common to store them in cloud storage. You’ll also be working with different versions of the same datasets to train a model, so it’s crucial to have a tool that enables you to switch between datasets quickly and easily. That’s why we’re going to do a quick walkthrough of how to set up a remote with Azure Blob Storage and handle data versioning with DVC.
We’ll start by creating a new blob storage container in our Azure account, then we’ll show how you can add DVC to your project. We’ll be working with this repo if you want an example to play with.
By the time you finish, you should be able to create this setup for any machine learning project using an Azure remote.
Make sure that you already have a Microsoft Azure account and log in to the Azure portal. Type storage accounts in the search bar and click Storage accounts under Services. Make sure you don't click the "classic" option.
This will bring you to the Storage accounts page where you'll need to click the Create storage account button.
Now you need to enter a Resource group and a name for the account. You can create a new resource group right here, like we do, and call it BicycleProject. We'll name this storage account bicycleproject. Then you can leave all the default settings in place and click Review + create.
Azure will run validation on the account. Once validation passes, click Create and Azure will generate the storage account.
You'll get redirected to a new page and you should click the Go to resource button. Now you should see all of the details for your storage account. In the left sidebar, go to Data storage > Containers.
Then click the + Container button at the top of the new page and you'll see a right sidebar open. In the name field, type bikedata and then click Create.
Now we have everything set up for the blob storage to work.
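If you prefer the command line to the portal, the same three resources can also be created with the Azure CLI. This is a minimal sketch rather than the steps we used above: the eastus location and the Standard_LRS SKU are assumptions, and the run helper with its DRY_RUN switch is just a convenience of this script that prints each command instead of executing it.

```shell
#!/usr/bin/env bash
# Sketch: create the resource group, storage account, and container from this
# tutorial with the Azure CLI instead of the portal.
set -eu

RESOURCE_GROUP="BicycleProject"
STORAGE_ACCOUNT="bicycleproject"   # 3-24 characters, lowercase letters and digits only
CONTAINER="bikedata"
LOCATION="${LOCATION:-eastus}"     # assumption: pick the region closest to you
SKU="${SKU:-Standard_LRS}"         # assumption: the portal default may differ

# Helper for this sketch: print commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

provision() {
  run az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
  run az storage account create --name "$STORAGE_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" --location "$LOCATION" --sku "$SKU"
  run az storage container create --name "$CONTAINER" \
    --account-name "$STORAGE_ACCOUNT"
}

provision
```

Run it once as-is to review the commands, then again with DRY_RUN=0 after az login to actually create the resources.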
You'll need the right roles on your storage account and your container in order to connect this remote storage to your machine learning project.
On the page for your bicycleproject storage account, go to Access Control (IAM) in the left sidebar.
On this page, you'll click Add role assignment and get directed to the page with all of the roles. Select the Storage Blob Data Contributor role and click Next. Then you can click + Select members to add this role to your user.
You'll also need to go through this exact flow for your bikedata container, so make sure you do this immediately after your storage account is updated.
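The same two role assignments can be sketched with the Azure CLI. The scope strings below follow Azure's resource ID layout for a storage account and for a container inside it. The SUBSCRIPTION_ID and ASSIGNEE values are placeholders you would have to supply yourself, and the run helper is a convenience of this sketch that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: assign Storage Blob Data Contributor on the account and the container.
set -eu

RESOURCE_GROUP="BicycleProject"
STORAGE_ACCOUNT="bicycleproject"
CONTAINER="bikedata"
ROLE="Storage Blob Data Contributor"
# Placeholders: supply real values through the environment before a real run.
SUBSCRIPTION_ID="${SUBSCRIPTION_ID:-<subscription-id>}"
ASSIGNEE="${ASSIGNEE:-<user-email-or-object-id>}"

# Helper for this sketch: print commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Resource ID of the storage account for the subscription ID passed as $1.
account_scope() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Storage/storageAccounts/%s' \
    "$1" "$RESOURCE_GROUP" "$STORAGE_ACCOUNT"
}

# The container's resource ID nests under the account's default blob service.
container_scope() {
  printf '%s/blobServices/default/containers/%s' "$(account_scope "$1")" "$CONTAINER"
}

assign_roles() {
  run az role assignment create --assignee "$ASSIGNEE" --role "$ROLE" \
    --scope "$(account_scope "$SUBSCRIPTION_ID")"
  run az role assignment create --assignee "$ASSIGNEE" --role "$ROLE" \
    --scope "$(container_scope "$SUBSCRIPTION_ID")"
}

assign_roles
```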
Since our Azure storage account and container have the correct roles now, let's set up the project!
First, add DVC as a requirement to your project with the following installation command:
$ pip install 'dvc[azure]'
Then you can initialize DVC in your own project with the following command:
$ dvc init
This will add all of the DVC internals needed to start versioning your data and tracking experiments. Now we need to set up the remote to connect our project data stored in Azure to the DVC repo.
We can add a default remote to the project with the following command:
$ dvc remote add -d bikes azure://bikedata
This creates a default remote called bikes that connects to the bikedata container we made earlier, which is where the training data for the model will be stored.
In order for DVC to be able to push and pull data from the remote, you need to have valid Azure credentials.
By default, DVC authenticates using your Azure CLI configuration.
Run the following command to authenticate with Azure.
$ az login
A web browser has been opened at https://login.microsoftonline.com/organizations/oauth2/v2.0/authorize. Please continue the login in the web browser. If no web browser is available or if the web browser fails to open, use device code flow with `az login --use-device-code`.
"name": "Azure subscription 1",
"name": "[email protected]",
This should open a browser window where you can enter your login credentials.
You will also need to manually define the storage account name with the following command:
$ dvc remote modify bikes account_name 'bicycleproject'
Now you can push data from your local machine to the Azure remote! First, add the data you want DVC to track with the following command:
$ dvc add data
This will allow DVC to track the entire data directory so it will note when any changes are made. Then you can push that data to your Azure remote with this command:
$ dvc push
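Once the push finishes, you can open the bikedata container in the Azure portal to confirm the data is there. DVC stores files under hashed names, so they won't match the file names in your data directory.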
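For someone else to pull this data, the small metadata files DVC wrote also need to be in Git. Here is a minimal sketch, assuming data sits at the root of the repo so that dvc add data produced data.dvc and updated .gitignore there. The commit message is only an example, and the run helper is a convenience of this sketch that prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: commit the files DVC generated so collaborators can run `dvc pull`.
set -eu

# Helper for this sketch: print commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

commit_dvc_metadata() {
  # data.dvc points at the pushed data; .dvc/config holds the remote settings.
  run git add data.dvc .gitignore .dvc/config
  run git commit -m "Track data with DVC and add Azure remote"
  run git push
}

commit_dvc_metadata
```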
Then if you move to a different machine or someone else needs to use that data, it can be accessed by cloning or forking the project repo and running:
$ dvc pull
This will get any data from your remote and download it to your local machine.
Authentication has to be set up locally on any machine you need to push data from or pull data to. That means running the az login command on any other machine. You don't need to go through the DVC setup again.
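Put together, getting started on a second machine could look like the sketch below. The repo URL and directory name are hypothetical placeholders, and the run helper is a convenience of this sketch that prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: get the project and its data onto a second machine.
set -eu

REPO_URL="${REPO_URL:-https://example.com/your-org/your-project.git}"  # hypothetical URL
PROJECT_DIR="${PROJECT_DIR:-your-project}"                             # hypothetical directory

# Helper for this sketch: print commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_machine() {
  run git clone "$REPO_URL" "$PROJECT_DIR"
  run cd "$PROJECT_DIR"
  run pip install 'dvc[azure]'
  run az login   # authentication is per machine
  run dvc pull   # remote settings come from the cloned .dvc/config
}

setup_machine
```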
That’s it! Now you can connect any DVC project to an Azure Blob Storage container. If you run into any issues, make sure to check that your credentials are valid, check if your user has MFA enabled, and check that the user has the right level of permissions.