Dataset Tracking With DVC: A Practical Guide for MLOps

Efficiently Managing Machine Learning Datasets Using DVC and S3-Compatible Storage

In the rapidly evolving field of machine learning, managing and tracking datasets is crucial for reproducibility, collaboration, and efficient workflows. Data Version Control (DVC) has emerged as a powerful tool in the MLOps toolkit, allowing data scientists and ML engineers to version their datasets alongside their code. This blog post will guide you through setting up DVC with a DigitalOcean Spaces backend, an S3-compatible storage solution, to track your datasets effectively.

Why Dataset Tracking Matters

Before we dive into the setup, let’s briefly discuss why dataset tracking is essential in machine learning workflows:

  1. Reproducibility: Ensures that experiments can be replicated with the exact same data.
  2. Collaboration: Enables team members to share and sync datasets effortlessly.
  3. Versioning: Allows tracking of data changes over time, similar to code versioning.
  4. Storage Efficiency: Stores only the changes, not duplicate copies of entire datasets.
  5. Integration: Seamlessly integrates with existing Git workflows for code.

Now, let’s get started with setting up DVC in your project.

Setting Up DVC

Step 1: Initialize DVC

Before you start, make sure you have DVC installed:

pip install dvc dvc-s3

Begin by initializing DVC in your project root:

dvc init

This command creates a .dvc directory to store DVC-specific files and configurations.
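
DVC also creates a .dvcignore file and stages its own files with Git, so a typical follow-up (a sketch, assuming your project is already a Git repository) is to commit them right away:

git add .dvc .dvcignore
git commit -m "Initialize DVC"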

Step 2: Configure the Remote Storage

Next, we’ll configure DVC to use DigitalOcean Spaces as our remote storage. Add the following to the .dvc/config file:

[core]
 analytics = false
 remote = mapllm
['remote "mapllm"']
 url = s3://mapllm/data
 endpointurl = https://ams3.digitaloceanspaces.com
 profile = mapllm

This configuration tells DVC:

  • To use a remote named “mapllm”
  • The S3 bucket URL where data will be stored
  • The endpoint URL for DigitalOcean Spaces
  • The AWS credential profile to use
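
If you prefer not to edit .dvc/config by hand, the same setup can be sketched with DVC's remote commands, using the same remote name, bucket, and endpoint as above:

dvc remote add -d mapllm s3://mapllm/data
dvc remote modify mapllm endpointurl https://ams3.digitaloceanspaces.com
dvc remote modify mapllm profile mapllm
dvc config core.analytics false

The -d flag marks the remote as the default, which is what the remote = mapllm line in the [core] section expresses.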

Step 3: Set Up Credentials

For security, we’ll store our DigitalOcean Spaces credentials in the AWS credentials file. Add the following to your ~/.aws/credentials file:

[mapllm]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Replace YOUR_ACCESS_KEY_ID and YOUR_SECRET_ACCESS_KEY with your actual DigitalOcean Spaces credentials.

Note: Keep these credentials secure and never commit them to your repository.
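
Alternatively, if you would rather not touch ~/.aws/credentials, DVC can keep the keys in its local, git-ignored config file (.dvc/config.local); a sketch with placeholder keys:

dvc remote modify --local mapllm access_key_id YOUR_ACCESS_KEY_ID
dvc remote modify --local mapllm secret_access_key YOUR_SECRET_ACCESS_KEY

Because --local writes to .dvc/config.local rather than .dvc/config, the keys stay out of the repository.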

Tracking Data with DVC

Now that we’ve set up DVC and configured our remote storage, let’s start tracking some data.

Adding Data to DVC

To track a data file with DVC, use the dvc add command:

dvc add data/cities.parquet

This command will:

  1. Create a .dvc file (data/cities.parquet.dvc) that contains a reference to the data file.
  2. Add the original data file to .gitignore to prevent it from being tracked by Git.

You should see output similar to this:

100% Adding...|████████████████████████████████████████████████████████████████|1/1 [00:00, 2.01file/s]
To track the changes with git, run:
 git add data/.gitignore data/cities.parquet.dvc
To enable auto staging, run:
 dvc config core.autostage true
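
The generated data/cities.parquet.dvc file is just a small YAML pointer that Git can version; it looks roughly like this (the hash and size below are placeholders):

outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  size: 1048576
  path: cities.parquet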

Committing and Pushing Data

After adding your data to DVC, you need to:

  1. Commit the DVC tracking file to Git:

    git add data/cities.parquet.dvc
    git commit -m "Add cities data"
    
  2. Push the actual data to your remote storage:

    dvc push
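
To check whether your local cache and the remote are in sync after a push, you can compare them with the cloud status flag:

dvc status -c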
    

Pulling Data

Later, when cloning the repository on a new machine, assuming your AWS credentials are in place, all you have to do to pull the data from remote storage is run:

dvc pull

This process separates the concerns of version control: Git tracks the metadata and DVC handles the large data files.
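
Put together, bootstrapping a fresh machine looks roughly like this (the repository URL is a placeholder):

git clone https://example.com/your-repo.git
cd your-repo
pip install dvc dvc-s3
dvc pull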

A Few Tips

  1. Regular Updates: Whenever you update your dataset, re-run dvc add and dvc push to keep everything in sync.

  2. Branching: Use Git branches for different versions of your data, just like you would with code.

  3. Pulling Data: When collaborating, use dvc pull to fetch the latest version of the data after a git pull.

  4. Checking Status: Use dvc status to see if your local data is in sync with the remote storage.

  5. Data Exploration: DVC allows you to switch between different versions of your data easily, facilitating A/B testing and model comparison.
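
For example, restoring the dataset as it was at an earlier commit is a two-step checkout (the commit hash below is a placeholder):

git checkout abc1234 -- data/cities.parquet.dvc
dvc checkout data/cities.parquet

Git restores the pointer file, and dvc checkout then swaps the actual data file to match it.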

Setting up DVC with a DigitalOcean Spaces backend provides a robust solution for dataset tracking in your MLOps workflow. By versioning your data alongside your code, you ensure reproducibility, enhance collaboration, and maintain a clear history of your dataset evolution.
