
DVC

Information

DVC (Data Version Control) is a command-line tool for tracking large files, datasets, models, and machine learning pipelines together with normal source code repositories.

In practice, it gives developers a Git-like workflow for data and for reproducible processing steps.

Main functionalities and features:

Good use cases:

Installation

DVC is a Python-based CLI. The most common and reliable ways to install it are pip, pipx, or a dedicated Python virtual environment. Some operating systems offer community packages, but those can lag behind the upstream release.

Before installing, verify that Python and pip are available (on many Linux and macOS systems the commands are python3 and pip3):

python --version
pip --version

Recommended general installation options:

pip install --user dvc

or, as an isolated CLI installation:

pipx install dvc

If you need a storage backend plugin, install the matching extra, for example:

pip install --user "dvc[s3]"
pip install --user "dvc[ssh]"
pip install --user "dvc[gs]"
pip install --user "dvc[azure]"
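As a rough illustration of how the extras above line up with remote URLs, the sketch below maps a URL scheme to the pip requirement to install. The table and the function name pip_requirement_for are hypothetical helpers written for this page, not part of DVC, and only the four extras listed above are covered.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: remote URL scheme -> pip requirement with the
# matching extra. Only the four extras shown above are included.
EXTRA_BY_SCHEME = {
    "s3": "dvc[s3]",
    "ssh": "dvc[ssh]",
    "gs": "dvc[gs]",
    "azure": "dvc[azure]",
}


def pip_requirement_for(remote_url: str) -> str:
    """Return the pip requirement that matches a remote URL (illustrative only)."""
    scheme = urlparse(remote_url).scheme.lower()
    # Local paths and schemes missing from the table fall back to plain "dvc".
    return EXTRA_BY_SCHEME.get(scheme, "dvc")


if __name__ == "__main__":
    for url in ("s3://bucket/data", "ssh://host/srv/dvc", "/mnt/dvc-storage"):
        print(url, "->", pip_requirement_for(url))
```

A local directory remote needs no extra, which is why unknown or missing schemes fall back to the plain package name in this sketch.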

Check installation:

dvc version
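The same check can be scripted, for example in a CI bootstrap step. The sketch below is one possible approach using only the Python standard library; the function name dvc_version_report is an assumption made for this example.

```python
import shutil
import subprocess


def dvc_version_report():
    """Return the text printed by `dvc version`, or None if dvc is unavailable."""
    exe = shutil.which("dvc")  # look up the dvc executable on PATH
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "version"], capture_output=True, text=True, check=False
    )
    # Treat a non-zero exit code the same as "not installed".
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    report = dvc_version_report()
    print(report if report else "dvc was not found on PATH or failed to run")
```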

CentOS, Rocky Linux

Install Python tooling first:

sudo dnf install -y python3 python3-pip
python3 -m pip install --user dvc

If your team prefers isolated CLI tools:

sudo dnf install -y python3 python3-pip pipx
pipx install dvc

Developer note:

Fedora

sudo dnf install -y python3 python3-pip pipx
pipx install dvc

If pipx is not preferred:

python3 -m pip install --user dvc

macOS

Common options:

brew install pipx
pipx install dvc

or:

python3 -m pip install --user dvc

FreeBSD

The typical approach is a Python-based installation:

pkg install -y python3 py39-pip
python3 -m pip install --user dvc

The pip package name is tied to a Python version (py39-pip is the package for Python 3.9). If your system ships a different Python version, use the matching Python and pip package names from the repository.

OpenIndiana

Package availability can vary, so the practical path is usually Python + pip:

pkg install runtime/python-311
python3 -m ensurepip --upgrade
python3 -m pip install --user dvc

If the Python package name differs in your image/repository, use the available Python 3 runtime and then install DVC with pip.

Configuration

Typical developer setup flow:

  1. Initialize normal source control in the repository.
  2. Initialize DVC in that repository.
  3. Add large files or directories with dvc add.
  4. Commit the generated .dvc files, .gitignore changes, dvc.yaml, and dvc.lock files to VCS.
  5. Configure one or more remotes.
  6. Push data to the remote so teammates and CI can restore it.

Basic repository initialization:

git init
dvc init
git add .dvc .dvcignore
git commit -m "Initialize DVC"

Useful configuration notes:

Examples:

dvc remote add -d origin C:\data\tank\dvc
dvc remote list
dvc doctor

If the cache should live on another drive:

dvc cache dir D:\dvc-cache

Typical remote types developers use:

Usage, tips and tricks

Continuous Example Script: How DVC Works End-to-End

The following script shows the normal developer flow from an empty repository to tracked data, a remote push, a modification, and restoring an older state. It uses Windows path separators and the type command; on Linux or macOS, use / and cat instead.

mkdir dvc-demo
cd dvc-demo

git init
dvc init
git add .dvc .dvcignore
git commit -m "Initialize repository with DVC"

mkdir data
mkdir remote-storage

echo "raw line 1" > data\dataset.txt
echo "raw line 2" >> data\dataset.txt

dvc add data\dataset.txt
git add data\.gitignore data\dataset.txt.dvc
git commit -m "Track dataset with DVC"

dvc remote add -d localremote .\remote-storage
git add .dvc\config
git commit -m "Configure local DVC remote"

dvc push

echo "raw line 3" >> data\dataset.txt
dvc status

dvc add data\dataset.txt
git add data\dataset.txt.dvc
git commit -m "Update tracked dataset"

dvc push

git log --oneline -n 2

git checkout HEAD~1
dvc checkout

type data\dataset.txt

git checkout -
dvc checkout

type data\dataset.txt
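To make the restore step less magical, here is a toy model of the idea behind the script: the data file is stored in a cache under a hash of its content, and a small pointer (the hash) is what the VCS revision remembers. This is a simplified sketch under that assumption, not DVC's actual implementation; toy_add, toy_checkout, and demo are hypothetical names.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def toy_add(data_file: Path, cache_dir: Path) -> str:
    """Copy the file into the cache under its MD5 hash and return that hash."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(data_file, target)
    return digest


def toy_checkout(digest: str, data_file: Path, cache_dir: Path) -> None:
    """Overwrite the workspace file with the cached content named by the hash."""
    shutil.copyfile(cache_dir / digest[:2] / digest[2:], data_file)


def demo():
    """Add two versions of a file, then restore the older and the newer one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "dataset.txt"

        data.write_text("raw line 1\nraw line 2\n")
        v1 = toy_add(data, cache)  # pointer stored with the first revision

        data.write_text("raw line 1\nraw line 2\nraw line 3\n")
        v2 = toy_add(data, cache)  # pointer stored with the second revision

        toy_checkout(v1, data, cache)  # plays the role of the older checkout
        old = data.read_text()
        toy_checkout(v2, data, cache)  # plays the role of returning to the tip
        new = data.read_text()
        return old, new


if __name__ == "__main__":
    older, newer = demo()
    print("older version:", older.splitlines())
    print("newer version:", newer.splitlines())
```

In the real workflow, git checkout switches the small .dvc pointer file and dvc checkout then makes the workspace data match it.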

What this demonstrates:

Developer notes:

Working with DVC Remotes

Useful remote commands:

dvc remote add mylocalremote C:\data\tank\dvc
dvc remote add -d origin C:\data\tank\dvc
dvc remote default origin
dvc remote modify origin url C:\new-data-location\dvc
dvc remote list
dvc remote remove mylocalremote
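These commands edit the repository's .dvc\config file. The sketch below reads remote definitions from such a file with Python's configparser. The embedded sample is an assumption about what the file looks like after the second command above, so compare it with your own .dvc\config before relying on it; read_remotes is a hypothetical helper, not a DVC API.

```python
import configparser

# Assumed sample of .dvc/config after adding "origin" as the default remote.
SAMPLE_CONFIG = r"""
[core]
    remote = origin
['remote "origin"']
    url = C:\data\tank\dvc
"""


def read_remotes(config_text: str):
    """Return (default_remote_name, {remote_name: url}) from config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    default = parser.get("core", "remote", fallback=None)
    remotes = {}
    for section in parser.sections():
        # Section names in the sample look like 'remote "origin"', quotes included.
        name = section.strip("'")
        if name.startswith('remote "') and name.endswith('"'):
            remotes[name[len('remote "'):-1]] = parser.get(section, "url", fallback="")
    return default, remotes


if __name__ == "__main__":
    print(read_remotes(SAMPLE_CONFIG))
```

For everyday use, dvc remote list remains the simpler way to inspect remotes; a parser like this is only useful in tooling that cannot call the CLI.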

Typical daily workflow in an existing repository:

git pull
dvc pull

# work with data

dvc status
dvc add data
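# stage only the metadata files that exist; drop dvc.yaml and dvc.lock if the repository has no pipeline
git add *.dvc .gitignore dvc.yaml dvc.lock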
git commit -m "Update data metadata"
dvc push
git push
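One practical detail in the workflow above: git add stops with a pathspec error when a named file does not exist, which happens with dvc.yaml and dvc.lock in a repository that has no pipeline yet. A wrapper script could build the argument list from the files that are actually present. The sketch below shows one way to do that; metadata_files_to_stage is a hypothetical helper and only looks at the top level of the repository.

```python
from pathlib import Path

# Files that may or may not exist, depending on how the repository uses DVC.
OPTIONAL_FILES = (".gitignore", "dvc.yaml", "dvc.lock")


def metadata_files_to_stage(repo_root: Path) -> list[str]:
    """List top-level DVC metadata files that exist in repo_root."""
    # is_file() excludes the .dvc directory itself, which also matches "*.dvc".
    found = sorted(p.name for p in repo_root.glob("*.dvc") if p.is_file())
    found += [name for name in OPTIONAL_FILES if (repo_root / name).is_file()]
    return found


if __name__ == "__main__":
    files = metadata_files_to_stage(Path.cwd())
    print("git add " + " ".join(files) if files else "nothing to stage")
```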

Pipeline Notes

One important strength of DVC is pipeline reproducibility. The example uses POSIX shell line continuations; on Windows, write the command on a single line:

dvc stage add -n prepare \
  -d src/prepare.py \
  -d data/dataset.txt \
  -o data/prepared.txt \
  python src/prepare.py

This creates dvc.yaml and, once the stage has run (for example with dvc repro), dvc.lock. Together, those files describe how outputs depend on code and inputs.
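The dependency tracking can be pictured as comparing recorded hashes with current ones: if a dependency's content no longer matches what was recorded at the last run, the stage is out of date. The sketch below models only that idea with a plain dictionary standing in for the lock file. DVC's real lock format and change detection are more involved, and all function names here are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def snapshot(deps: list[Path]) -> dict[str, str]:
    """Record the current hash of every dependency (a stand-in for a lock file)."""
    return {str(p): file_md5(p) for p in deps}


def changed_deps(deps: list[Path], lock: dict[str, str]) -> list[str]:
    """Return the dependencies whose current hash differs from the recorded one."""
    return [str(p) for p in deps if lock.get(str(p)) != file_md5(p)]


def demo():
    """Show that editing an input marks the stage as needing a rerun."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "prepare.py"
        dataset = Path(tmp) / "dataset.txt"
        script.write_text("print('prepare')\n")
        dataset.write_text("raw line 1\n")
        deps = [script, dataset]

        lock = snapshot(deps)              # taken right after the stage ran
        before = changed_deps(deps, lock)  # nothing changed yet

        dataset.write_text("raw line 1\nraw line 2\n")
        after = [Path(p).name for p in changed_deps(deps, lock)]
        return before, after


if __name__ == "__main__":
    print(demo())
```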

Common follow-up commands:

dvc repro
dvc dag
dvc metrics show

Using DVC with Sapling (sl)

DVC is usually paired with Git, but it also works when a different version control system manages the repository. The main idea remains the same: commit DVC metadata files to your chosen VCS, and store the real data in the DVC cache and remotes.

The example below keeps DVC in --no-scm mode while Sapling tracks the metadata files.

sl config --user ui.username "John Doe <john.doe@example.com>"

mkdir sapling-dvc-experiment
cd sapling-dvc-experiment

sl init
dvc init --no-scm

mkdir huge_files

echo "# Huge files" > .gitignore
echo "huge_files" >> .gitignore
echo "root-huge-file.txt" >> .gitignore

sl add .gitignore .dvcignore

echo "Huge file A" > .\huge_files\A.txt
echo "Huge file B" > .\huge_files\B.txt
echo "Root huge file" > .\root-huge-file.txt

dvc add .\huge_files
dvc add .\root-huge-file.txt

sl add .

dvc remote add -d fake-remote C:\pub\setmy.info\data\dvc

sl commit -m "First huge files added"
dvc push

echo "Huge file A VERSION 2" >> .\huge_files\A.txt
echo "Huge file B VERSION 2" >> .\huge_files\B.txt
echo "Root huge file VERSION 2" >> .\root-huge-file.txt

dvc status
dvc add .\huge_files
dvc add .\root-huge-file.txt

sl status
sl commit -m "Version 2"
dvc push

sl log -l 2

# move to the previous Sapling revision, then restore the data that matches its metadata
sl goto .^
dvc checkout

Important note:

Example .gitignore

When you track raw files with DVC, the actual data path is often added to .gitignore while the small .dvc metadata file is committed.

# Huge files
huge_files
root-huge-file.txt

Coding tips and tricks

Alternatives

Possible alternatives:

Short comparison notes:

See also

DVC

Get Started: Data Management

Data pipelines

DVC YAML

Experiments

Issue reported

Issue turned into discussion