Dataset Tracking With DVC: A Practical Guide for MLOps
Efficiently Managing Machine Learning Datasets Using DVC and S3-Compatible Storage
In the rapidly evolving field of machine learning, managing and tracking datasets is crucial for reproducibility, collaboration, and efficient workflow. Data Version Control (DVC) has emerged as a powerful tool in the MLOps toolkit, allowing data scientists and ML engineers to version their datasets alongside their code. This blog post will guide you through setting up DVC with a DigitalOcean Spaces backend, an S3-compatible storage solution, to track your datasets effectively.
Why Dataset Tracking Matters
Before we dive into the setup, let’s briefly discuss why dataset tracking is essential in machine learning workflows:
- Reproducibility: Ensures that experiments can be replicated with the exact same data.
- Collaboration: Enables team members to share and sync datasets effortlessly.
- Versioning: Allows tracking of data changes over time, similar to code versioning.
- Storage Efficiency: Stores each unique file once, identified by its content hash, so unchanged files are not duplicated across dataset versions.
- Integration: Seamlessly integrates with existing Git workflows for code.
Now, let’s get started with setting up DVC in your project.
Set Up DVC
Step 1: Initialize DVC
Before you start, make sure you have DVC installed:
pip install dvc dvc-s3
Begin by initializing DVC in your project root:
dvc init
This command creates a .dvc directory to store DVC-specific files and configurations.
Step 2: Configure the Remote Storage
Next, we’ll configure DVC to use DigitalOcean Spaces as our remote storage. Add the following to the .dvc/config file:
[core]
analytics = false
remote = mapllm
['remote "mapllm"']
url = s3://mapllm/data
endpointurl = https://ams3.digitaloceanspaces.com
profile = mapllm
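This configuration tells DVC:
- To turn off anonymized usage analytics
- To use the remote named “mapllm” by default
- Which bucket and path the data will be stored under (s3://mapllm/data)
- Which endpoint URL to use for DigitalOcean Spaces instead of AWS S3
- Which AWS credential profile to use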
Step 3: Set Up Credentials
For security, we’ll store our DigitalOcean Spaces credentials in the AWS credentials file. Add the following to your ~/.aws/credentials file:
[mapllm]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
Replace YOUR_ACCESS_KEY_ID and YOUR_SECRET_ACCESS_KEY with your actual DigitalOcean Spaces credentials.
Note: Keep these credentials secure and never commit them to your repository.
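Before running any DVC command against the remote, it can help to confirm that the profile named in .dvc/config actually exists in the credentials file. The following is a minimal sketch using only Python's standard library. check_remote_profile is a hypothetical helper, not part of DVC, and it assumes both files are simple enough for configparser to read; it is not how DVC itself loads its configuration.

```python
import configparser


def check_remote_profile(dvc_config_text, credentials_text):
    """Return (profile, missing_keys) for the default remote.

    Hypothetical helper for illustration; not part of DVC.
    """
    dvc_cfg = configparser.ConfigParser(interpolation=None)
    dvc_cfg.read_string(dvc_config_text)

    # The default remote is named under [core]. Its own section header is
    # written as ['remote "<name>"'], so the quotes are part of the name.
    remote_name = dvc_cfg["core"]["remote"]
    remote_section = f"'remote \"{remote_name}\"'"
    profile = dvc_cfg[remote_section]["profile"]

    creds = configparser.ConfigParser(interpolation=None)
    creds.read_string(credentials_text)

    required = ("aws_access_key_id", "aws_secret_access_key")
    if not creds.has_section(profile):
        return profile, list(required)
    missing = [k for k in required if not creds.get(profile, k, fallback="")]
    return profile, missing


# Sample inputs copied from the configuration shown in this post.
DVC_CONFIG = """
[core]
analytics = false
remote = mapllm
['remote "mapllm"']
url = s3://mapllm/data
endpointurl = https://ams3.digitaloceanspaces.com
profile = mapllm
"""

CREDENTIALS = """
[mapllm]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""

print(check_remote_profile(DVC_CONFIG, CREDENTIALS))  # -> ('mapllm', [])
```

In a real project you would pass in the contents of .dvc/config and ~/.aws/credentials instead of the sample strings. An empty list of missing keys means the profile is at least structurally complete; it does not prove the keys are valid.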
Tracking Data with DVC
Now that we’ve set up DVC and configured our remote storage, let’s start tracking some data.
Adding Data to DVC
To track a data file with DVC, use the dvc add command:
dvc add data/cities.parquet
This command will:
- Create a .dvc file (data/cities.parquet.dvc) that contains a reference to the data file.
- Add the original data file to .gitignore to prevent it from being tracked by Git.
You should see output similar to this:
100% Adding...|████████████████████████████████████████████████████████████████|1/1 [00:00, 2.01file/s]
To track the changes with git, run:
git add data/.gitignore data/cities.parquet.dvc
To enable auto staging, run:
dvc config core.autostage true
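The .dvc file that dvc add creates is a small text file recording, among other things, a content hash, the size, and the path of the tracked data. To make that "reference" concrete, here is a rough sketch of computing the same kind of metadata by hand. describe_file is a hypothetical helper, and the exact fields and hashing details DVC uses can differ between versions, so treat this as an illustration rather than a reimplementation.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=1024 * 1024):
    """Return the kind of metadata a .dvc pointer records for a file."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


# Demo on a throwaway file standing in for data/cities.parquet.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "cities.parquet")
    with open(sample, "wb") as fh:
        fh.write(b"not a real parquet file")
    print(describe_file(sample))
```

Because the pointer is only a few lines of text, Git can version it cheaply while the data itself stays out of the repository.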
Committing and Pushing Data
After adding your data to DVC, you need to:
- Commit the DVC tracking file (and the updated .gitignore) to Git:
git add data/.gitignore data/cities.parquet.dvc
git commit -m "Add cities data"
- Push the actual data to your remote storage:
dvc push
Pulling Data
Later, when cloning the repo on a new machine, all you have to do to fetch the data from the remote storage, assuming your AWS credentials are in place, is run:
dvc pull
This process separates the concerns of version control: Git tracks the metadata and DVC handles the large data files.
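The push and pull round trip can be pictured as a content-addressed store: data is filed under a name derived from its hash, and the hash recorded in Git is enough to find it again. The toy model below uses a local directory in place of the Spaces bucket. The push and pull functions are hypothetical stand-ins, not DVC's implementation, and the directory layout is only similar in spirit to what DVC uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path):
    """Hash a file's contents (fine for a small demo file)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def push(path, remote_dir):
    """Copy a file into remote_dir under a name derived from its hash."""
    digest = file_md5(path)
    target = Path(remote_dir) / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    return digest  # the value a pointer file would record


def pull(digest, remote_dir, dest):
    """Restore the file with the given hash from remote_dir to dest."""
    source = Path(remote_dir) / digest[:2] / digest[2:]
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, dest)
    return dest


# Demo: "push" from one working copy, "pull" into a fresh one.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    original = tmp / "cities.parquet"
    original.write_bytes(b"demo rows")
    recorded_hash = push(original, tmp / "remote")
    restored = pull(recorded_hash, tmp / "remote", tmp / "clone" / "cities.parquet")
    print(restored.read_bytes() == original.read_bytes())
```

The design point is that only the short hash has to travel through Git; whoever holds it, plus access to the storage, can recover the exact bytes.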
A few tips
- Regular Updates: Whenever you update your dataset, re-run dvc add, commit the updated .dvc file to Git, and run dvc push to keep everything in sync.
- Branching: Use Git branches for different versions of your data, just like you would with code.
- Pulling Data: When collaborating, run dvc pull after a git pull to fetch the latest version of the data.
- Checking Status: Use dvc status to see whether your local data has changed since it was last tracked, and dvc status --cloud to compare your local cache with the remote storage.
- Data Exploration: DVC lets you switch between versions of your data (git checkout followed by dvc checkout), which helps with A/B testing and model comparison.
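Comparing dataset versions does not always require changing your working copy. DVC also ships a Python API, and dvc.api.read can return a tracked file's contents at a given Git revision. The sketch below wraps it in a hypothetical compare_revisions helper. It assumes the script runs inside the DVC repository, that the file is small enough to load into memory, and that the revision names you pass in exist.

```python
def _dvc_read(path, rev):
    # Imported lazily so this module loads even where DVC is not installed.
    import dvc.api

    # mode="rb" asks for the raw bytes of the file at that revision.
    return dvc.api.read(path, rev=rev, mode="rb")


def compare_revisions(path, rev_a, rev_b, reader=_dvc_read):
    """Summarize how a tracked file differs between two Git revisions.

    `reader` is injectable so the logic can be exercised without a repo.
    """
    data_a = reader(path, rev_a)
    data_b = reader(path, rev_b)
    return {
        "size_a": len(data_a),
        "size_b": len(data_b),
        "identical": data_a == data_b,
    }
```

A call such as compare_revisions("data/cities.parquet", "main", "experiment") would then report whether the two branches point at the same data; the branch names here are placeholders. For large files, streaming with dvc.api.open is likely a better fit than reading everything at once.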
Setting up DVC with a DigitalOcean Spaces backend provides a robust solution for dataset tracking in your MLOps workflow. By versioning your data alongside your code, you ensure reproducibility, enhance collaboration, and maintain a clear history of your dataset evolution.