Data Version Control.
Initialize dvc - needs to be inside a git repo:
dvc init
Add remote storage:
dvc remote add -d remote_storage path/to/your/dvc_remote
Test adding remote storage using /tmp/:
dvc remote add -d remote_storage /tmp/local_dvc_storage/
Turn off analytics collection:
dvc config core.analytics false
Track data files or folder with dvc:
dvc add data/raw/train
It will do the following operations:
- Add data/raw/train to .gitignore
- Create a file data/raw/train.dvc containing md5, nfiles and other information about the tracked data
- Copy data/raw/train to a staging area - a cache located in .dvc/cache
Push data to the remote storage:
dvc push
Restore the data in the workspace from the cache (run dvc fetch or dvc pull first if the cache is empty):
dvc checkout data/raw/train.dvc
On a freshly cloned repo, one can fetch all data from the remote into the cache:
dvc fetch
Or fetch only some files:
dvc fetch data/raw/val.dvc
To combine fetching and checking out:
dvc pull
Modify content:
- Unlink the file/folder with dvc unprotect:
  dvc unprotect data/raw/train
- Update the data or download new data
- Add the file/folder back to dvc:
  dvc add data/raw/train
Note: It is often better to treat the raw data as immutable and to track a change as a new version, rather than editing the existing data in place.
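Putting the commands so far together, a full round trip (track, push, wipe, pull, update) can be sketched as follows. This is a minimal sketch that assumes git and dvc are installed. The function name dvc_roundtrip_demo, the scratch directory and the file sample.txt are hypothetical, and the remote is just a local folder.

```shell
#!/bin/sh
# Sketch: track, push, wipe, pull, then update a folder in a scratch repo.
# Everything lives under a temporary directory; no real project is touched.
dvc_roundtrip_demo() {
    for tool in git dvc; do
        command -v "$tool" >/dev/null 2>&1 || { echo "skipping: $tool not found"; return 0; }
    done
    work=$(mktemp -d) || return 1
    (
        cd "$work" || exit 1
        mkdir repo remote && cd repo || exit 1
        git init -q && dvc init -q || exit 1
        dvc remote add -d remote_storage "$work/remote" || exit 1

        mkdir -p data/raw/train
        echo "v1" > data/raw/train/sample.txt
        dvc add data/raw/train || exit 1
        dvc push || exit 1

        # Simulate a fresh clone: drop the workspace copy and the local cache.
        rm -rf data/raw/train .dvc/cache
        dvc pull || exit 1    # fetch + checkout in one step
        [ "$(cat data/raw/train/sample.txt)" = "v1" ] || exit 1

        # Update cycle: unprotect, change the data, add it back, push again.
        dvc unprotect data/raw/train || exit 1
        echo "v2" > data/raw/train/sample.txt
        dvc add data/raw/train || exit 1
        dvc push || exit 1
    )
    status=$?
    rm -rf "$work"
    return "$status"
}
dvc_roundtrip_demo
```

In a real project the data/raw/train.dvc file would also be committed to git after each dvc add, since that file is what ties a git revision to a data version.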
Commit new data - record changes to files or directories tracked by DVC:
dvc commit
DVC makes it possible to execute data pipelines and track generated artifacts.
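dvc stage add \
  -n prepare \
  -d src/prepare.py \
  -d data/raw \
  -o data/prepared/train.csv \
  -o data/prepared/test.csv \
  python src/prepare.py
Here -n sets the stage name, each -d declares a dependency, each -o declares an output, and the trailing command is what the stage runs. Comments cannot follow the backslashes, because they would break the line continuation. The command is equivalent to this section in the dvc.yaml file: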
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw
      - src/prepare.py
    outs:
      - data/prepared/test.csv
      - data/prepared/train.csv
The configuration file is located in .dvc/config.
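Returning to the prepare stage: once a stage is defined, it is executed with dvc repro, which also writes a dvc.lock file recording the hashes of the dependencies and outputs. The sketch below tries this in a scratch repo, assuming git, dvc and python3 are installed. The function name run_prepare_demo, the file rows.csv and the stand-in prepare.py (which only copies a file) are hypothetical.

```shell
#!/bin/sh
# Sketch: define the prepare stage in a scratch repo and run it with dvc repro.
run_prepare_demo() {
    for tool in git dvc python3; do
        command -v "$tool" >/dev/null 2>&1 || { echo "skipping: $tool not found"; return 0; }
    done
    work=$(mktemp -d) || return 1
    (
        cd "$work" || exit 1
        git init -q && dvc init -q || exit 1
        mkdir -p src data/raw
        printf 'a,b\n1,2\n3,4\n' > data/raw/rows.csv

        # Stand-in prepare script: copies the raw file to both outputs.
        printf '%s\n' \
            'import os, shutil' \
            'os.makedirs("data/prepared", exist_ok=True)' \
            'for name in ("train.csv", "test.csv"):' \
            '    shutil.copy("data/raw/rows.csv", os.path.join("data/prepared", name))' \
            > src/prepare.py

        dvc stage add -n prepare \
            -d src/prepare.py -d data/raw \
            -o data/prepared/train.csv -o data/prepared/test.csv \
            python3 src/prepare.py || exit 1

        dvc repro || exit 1    # runs the stage and writes dvc.lock
        dvc repro || exit 1    # nothing changed, so the stage should be skipped
        [ -f dvc.lock ] && [ -f data/prepared/train.csv ] && [ -f data/prepared/test.csv ]
    )
    status=$?
    rm -rf "$work"
    return "$status"
}
run_prepare_demo
```

Editing src/prepare.py or anything under data/raw between the two dvc repro calls would cause the stage to run again, since both are declared as dependencies.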
Use a shared cache when multiple repos point to the same data files:
dvc cache dir path/to/shared_cache
mv .dvc/cache/* path/to/shared_cache
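The mv step assumes that the destination already exists and that the local cache is not empty. A slightly more defensive version can be sketched as below. The helper name move_cache_contents and the demo directories are hypothetical, and the helper only wraps mkdir and mv without calling dvc.

```shell
#!/bin/sh
# move_cache_contents SRC DST: move every entry of SRC into DST.
move_cache_contents() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "no cache at $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    # Skip the move when the cache is empty, since an unmatched glob
    # would be passed to mv literally and make it fail.
    if [ -n "$(ls -A "$src")" ]; then
        mv "$src"/* "$dst"/ || return 1
    fi
}

# Demo on throwaway directories that mimic a cache holding one object.
demo=$(mktemp -d)
mkdir -p "$demo/repo/.dvc/cache/b1"
echo "payload" > "$demo/repo/.dvc/cache/b1/946ac9"
move_cache_contents "$demo/repo/.dvc/cache" "$demo/shared_cache"
ls "$demo/shared_cache"
```

If the shared cache already holds objects under the same prefix directories, a plain mv will refuse to overwrite them, and a merging copy would be needed instead.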