MLOps Basic Open-Source Tool Series #3: DVC for Data Versioning

The third post in the MLOps Basic Open-Source Tool Series for your machine learning needs. This edition discusses versioning tools.

Apr 09, 2024

This newsletter is part of a series introducing open-source tools used in Machine Learning Operations (MLOps). Each edition introduces new tools for a different part of the process. At the end of the series, we will combine everything into a cohesive MLOps project.

A machine learning project requires continual updates to its model. These ongoing updates ensure that the model keeps providing sufficient value to the business.

However, not every update means a better model. There are times when we want to go back to a previous iteration. This is where versioning helps.

Versioning tracks changes by assigning unique numbers or identifiers to different states or updates. You can version almost anything, such as datasets, models, and parameters.

Versioning has many benefits, including the ability to roll back to an older version and support for collaborative development, where multiple individuals or teams work on the same project.

When you think about versioning tools, Git usually comes to mind first. However, a machine learning project differs from a software development project, mainly because it involves large data and model files that Git does not handle well. That’s why data science requires more specialized tools. One open-source versioning tool for machine learning projects is DVC.

So, how could DVC help data scientists and machine learning engineers with versioning tasks? Let’s learn about it!

If you missed the previous post in this series, you can read it below!

MLOps Basic Open-Source Tool Series #2: MLFlow for Model Registry

Cornellius Yudha Wijaya

March 24, 2024


DVC For Data Versioning

DVC stands for Data Version Control and is often described as Git for machine learning projects. It’s a tool that lets us track changes in our ML project, especially to large files such as datasets and models. It is designed to work with remote data storage and to complement Git, not to replace it.

Let’s try it out. First, we need to install DVC. The easiest way to install it is with pip.

pip install dvc

If you want to use it with Visual Studio Code, you can install the DVC extension from the marketplace.

For this example, we will track changes in a dataset with DVC. I will use the email spam data from Kaggle as the sample dataset.

Next, we will use DVC for data versioning. As DVC complements Git, we must initialize a Git repository before using DVC.

git init

Then, we can create a DVC project within our directory with the dvc init command.

When we create the DVC project, a few files are created. Let’s check the status to see what the new files are.

git status

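The entries shown in green are the new files created by DVC, typically .dvc/config, .dvc/.gitignore, and .dvcignore. These are DVC’s internal configuration files rather than dataset tracking files, and they should be committed to Git.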

With all preparations ready, we can start tracking the data with DVC. To do that, we can use the following command.

dvc add email_spam.csv

The command above tells DVC which file to track and creates metadata for the dataset. If you check your folder, you will find a new email_spam.csv.dvc file. This small text file records information about the dataset, such as its hash, size, and path.

The .dvc file acts as a lightweight pointer to your dataset, and this is how DVC can track even a large dataset: Git only versions the pointer, while a copy of the actual data is kept in the DVC cache. You can find the cached file in the .dvc folder.
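To make the idea of content-based tracking more concrete, here is a minimal Python sketch. It is not DVC’s actual implementation. It assumes that DVC identifies a file by the MD5 hash of its contents and stores the cached copy under a two-level path derived from that hash. The helper names file_md5 and cache_relpath are hypothetical, the parent folders inside .dvc/cache differ between DVC versions, and some older DVC versions normalize line endings of text files before hashing, so the digest may not always match what DVC records.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex):
    """Split a digest into '<first 2 chars>/<remaining chars>'.

    Assumption: this mirrors the two-level layout used inside DVC's cache
    folder; the surrounding directories depend on the DVC version.
    """
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


if __name__ == "__main__":
    # Demo on a throwaway file so the sketch runs without the Kaggle dataset.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.csv"
        sample.write_bytes(b"text,label\nhello,0\n")
        digest = file_md5(sample)
        print(digest, "->", cache_relpath(digest))
```

Because the identifier depends only on the file contents, two identical copies of a dataset map to the same cache entry, and any change to the data produces a new identifier.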

Next, we version the metadata with Git while ignoring the actual dataset. DVC has already added the dataset to .gitignore for us.

git add email_spam.csv.dvc .gitignore
git commit -m "Add raw data"

Now, we want to send the dataset somewhere. Before doing that, we need to set up a remote. For the first example, we will use a local folder as the remote storage.

# On Windows Command Prompt, use: mkdir TEMP\dvcstore
mkdir -p TEMP/dvcstore
dvc remote add -d myremote TEMP/dvcstore

The commands above create a folder called TEMP with a dvcstore folder inside, then register that folder as the default remote (the -d flag), which is where dvc push will send the data. A remote can also point to cloud storage such as Amazon S3 or Google Drive.

If all goes well, you can now push the data to the remote.

dvc push

Once the data is stored in the remote, we can download it elsewhere with the pull command. You can move your work to another location, bring the Git repository with you, and pull the data, as long as the remote is reachable from there.

dvc pull
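The push and pull steps can be pictured with a toy model. The sketch below is only an illustration under the assumption that a remote behaves like a content-addressed folder, where each file is stored under a name derived from its hash. The functions store_path, push, and pull are hypothetical helpers, not DVC’s API, and real DVC remotes handle caching, directories, and cloud backends that this sketch ignores.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_path(root, md5_hex):
    """Location of an object inside a content-addressed folder."""
    return Path(root) / md5_hex[:2] / md5_hex[2:]


def push(data_file, remote_dir):
    """Copy a file into the remote under its content hash and return the hash."""
    md5_hex = hashlib.md5(Path(data_file).read_bytes()).hexdigest()
    target = store_path(remote_dir, md5_hex)
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(data_file, target)
    return md5_hex


def pull(md5_hex, remote_dir, destination):
    """Restore the content identified by md5_hex to destination."""
    shutil.copyfile(store_path(remote_dir, md5_hex), destination)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "work"
        remote = Path(tmp) / "dvcstore"
        workspace.mkdir()
        data = workspace / "data.csv"
        data.write_text("a,b\n1,2\n")

        pointer = push(data, remote)  # plays the role of the hash in the .dvc file
        data.unlink()                 # simulate a fresh checkout without the data
        pull(pointer, remote, data)   # the pointer is enough to get the data back
        print(data.read_text())
```

The design point to notice is that only the small pointer needs to travel with the Git repository; the data itself can live wherever the remote is.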


Dataset Change Tracking

The point of using DVC is to track changes to our dataset so that we can return to an earlier version when needed. That’s why we should learn how to use DVC to track data changes.

First, I will simulate a change in the dataset. For this example, I will reduce the dataset to a random sample of only 10 rows.

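import pandas as pd
# Load the tracked dataset.
df = pd.read_csv('email_spam.csv')
# Overwrite the file with a random sample of 10 rows (to_csv returns None here, so df is not reassigned).
df.sample(10).to_csv('email_spam.csv', index=False)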

Then, I will track the data once more with DVC.

dvc add email_spam.csv

Lastly, we push the data again with DVC and commit the updated .dvc file with Git.

dvc push
git commit email_spam.csv.dvc -m "Update change to samples"

Now, we will use Git to go back to the previous version of the .dvc file.

git checkout HEAD~1 email_spam.csv.dvc
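At this point, the .dvc file describes the older dataset while email_spam.csv in the workspace still holds the 10-row sample. The sketch below illustrates that mismatch by comparing the hash recorded in a .dvc file with the hash of the data file. It assumes a single-file .dvc file containing a line of the form "md5: <32 hex characters>" and that the recorded value is a plain MD5 of the file contents, which may not hold for every DVC version. The helpers pointer_md5 and workspace_matches_pointer are hypothetical and are not part of DVC.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Matches a YAML line such as "- md5: 5d41..." or "  md5: 5d41...".
MD5_LINE = re.compile(r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", re.MULTILINE)


def pointer_md5(dvc_file):
    """Extract the first md5 value recorded in a .dvc file, or None."""
    match = MD5_LINE.search(Path(dvc_file).read_text())
    return match.group(1) if match else None


def workspace_matches_pointer(data_file, dvc_file):
    """True when the data file's content hash equals the recorded hash."""
    actual = hashlib.md5(Path(data_file).read_bytes()).hexdigest()
    return actual == pointer_md5(dvc_file)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.csv"
        pointer = Path(tmp) / "data.csv.dvc"

        # The pointer was written for the original contents...
        data.write_bytes(b"original rows\n")
        recorded = hashlib.md5(data.read_bytes()).hexdigest()
        pointer.write_text(f"outs:\n- md5: {recorded}\n  path: data.csv\n")
        print("in sync:", workspace_matches_pointer(data, pointer))

        # ...but the workspace file has since been replaced.
        data.write_bytes(b"sampled rows\n")
        print("in sync:", workspace_matches_pointer(data, pointer))
```

When the two hashes disagree, the workspace file has to be replaced with the content the pointer refers to, which is what the next step does.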

Then, we can use dvc pull to restore the dataset for that version.

dvc pull
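If you only need to read an older version from Python, without changing the files in your workspace, DVC’s Python API offers an alternative. The sketch below wraps dvc.api.read. The wrapper name read_dataset_version and its default arguments are hypothetical, and it assumes the dvc package is installed, the code runs inside the project, the given Git revision can be resolved, and the data for that revision is available in the cache or the remote.

```python
def read_dataset_version(path="email_spam.csv", rev="HEAD~1", repo=None):
    """Return the text content of `path` as tracked at Git revision `rev`.

    Assumptions: the `dvc` package is installed, `repo` is None for the
    current project (or a path/URL to a DVC project), and the requested
    data can be found in the cache or the configured remote.
    """
    import dvc.api  # imported lazily so this file loads even without DVC

    return dvc.api.read(path, repo=repo, rev=rev, mode="r")
```

As a usage example, calling read_dataset_version() inside the project would return the CSV text of the earlier commit, which could then be passed to pandas through io.StringIO. This keeps the current workspace file untouched, which can be convenient when comparing two dataset versions side by side.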

That’s it. By combining Git and DVC with remote storage, we can control how our data is versioned and use it whenever we need to return to our previous dataset version.