Syncing Data to AWS S3
Setting up a remote to make data versioning easier with DVC is a common need so we're going to go through a tutorial for doing this with AWS.
- Milecia McGregor
- May 31, 2022 • 3 min read
Using AWS Remotes in DVC
When you’re working on a data science project that has huge datasets, it’s common to store them in cloud storage. You’ll also be working with different versions of the same datasets to train a model, so it’s crucial to have a tool that enables you to switch between datasets quickly and easily. That’s why we’re going to do a quick walkthrough of how to set up a remote in an AWS S3 bucket and handle data versioning with DVC.
We’ll start by creating a new S3 bucket in our AWS account, then we’ll show how you can add DVC to your project. We’ll be working with this repo if you want an example to play with.
By the time you finish, you should be able to create this setup for any machine learning project using an AWS remote.
Make sure that you already have an AWS account and
log in. Search for
S3 and it should be the first service that appears.
Once you’re on the S3 page, click the
Create Bucket button and it will take
you to a page that looks like this. The bucket in this example is called
updatedbikedata because that is the data our demo repo works with. You can
leave the default settings in place or you can update them to fit the
functionality you need.
Once you’ve created the bucket, you should be redirected to the S3 dashboard and see the success message and your new bucket.
Now that the S3 bucket is ready, we need the
secret_access_key from AWS in order to connect to our project. You can create
these keys in your Identity and Access Management settings. Go to your security
credentials and select the
Access keys section. Then click the
Create New Access Key button. This will generate a new set of keys for you so
make sure you download this file to get your secret access key.
Once you’ve downloaded the credentials, you should see the access key ID in the table. Note that you won’t be able to access your secret key again at this point. You would need to make a new set of credentials if you don’t have it.
That’s it for setting up your bucket and getting the credentials you need! Now let’s add DVC to our demo repo and set up the remote.
First, add DVC as a requirement to your project with the following installation command:
$ pip install 'dvc[s3]'
Then you can initialize DVC in your own project with the following command:
$ dvc init
This will add all of the DVC internals needed to start versioning your data and tracking experiments. Now we need to set up the remote to connect our project data stored in AWS to the DVC repo.
Now we can add a default to the project with the following command:
$ dvc remote add -d bikes s3://updatedbikedata
This creates a default remote called
bikes that connects to the
updatedbikedata bucket we made earlier which is where the training data for
the model will be stored.
In order for DVC to be able to push and pull data from the remote, you need to have valid AWS credentials.
By default, DVC authenticates using your AWS CLI configuration, if it has been
set. You can do that with the
aws configure command like in this example:
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]:
Default output format [None]:
You can check out more details on this command here in the AWS docs.
If you want to use a different authentication method or if you run into issues with the credentials, you can manually add them with the following commands:
$ dvc remote modify --local bikes access_key_id 'mykey'
$ dvc remote modify --local bikes secret_access_key 'mysecret'
Now you can push data from your local machine to the AWS remote! First, add the data you want DVC to track with the following command:
$ dvc add data
This will allow DVC to track the entire
data directory so it will note when
any changes are made. Then you can push that data to your AWS remote with this
$ dvc push
Here's what the data might look like in your AWS bucket.
Then if you move to a different machine or someone else needs to use that data, it can be accessed by cloning or forking the project repo and running:
$ dvc pull
This will get any data from your remote and download it to your local machine.
Authentication has to be setup locally on any machine you need to pull or push data from.
That’s it! Now you can connect any DVC project to an AWS S3 bucket. If you run into any issues, makes sure to check that your credentials are valid, check if your user has MFA enabled, and check that the user has the right level of permissions.