Description
architecture and requirements
user-stories
reproduce results from a paper
- see paper published by sdil's associate
- connect to sdil
- check out the commit from git (repository & commit-id are linked in the paper)
- execute
- results are exactly the same as the ones in the paper
creating reproducible results
- validate input data
- --> make sure the hash is correct
- run experiment
- update hash on data
- check in soft-links to new data
this use-case will be automated in some sort of CI script
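the steps above could be wrapped in a small driver for such a CI script; a minimal sketch, assuming the script names from the scripts section of this document, plus a hypothetical run.py experiment entry point and commit message:

```python
import subprocess
import sys

def ci_pipeline(project_dir, steps=None):
    """Hypothetical CI driver for the reproducible-results steps.

    Script names follow the scripts section of this document; run.py
    (the experiment entry point) and the commit message are assumptions.
    """
    if steps is None:
        steps = [
            [sys.executable, "validate.py"],        # validate input data / hash
            [sys.executable, "create_work_copy.py"],
            [sys.executable, "run.py"],             # run the experiment
            [sys.executable, "hash_work_copy.py"],  # update hash on data
            ["git", "add", "data"],                 # check in soft-links to new data
            ["git", "commit", "-m", "update data hash"],
        ]
    for cmd in steps:
        # check=True aborts the pipeline on the first failing step
        subprocess.run(cmd, cwd=project_dir, check=True)
```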
converting regular project into reproducible project
- create hash on source-dir
- move all content from source-dir to target-dir/<hash>/
- replace all links in project-dir to source-dir with links to target-dir/<hash>/
this will be 1 script: create_hashed_data.py (required parameters: source-dir, target-dir and project-dir)
note: the source-dir is a subfolder of project-dir which links to the actual source-dir (softlink or hardlink).
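the conversion could look roughly like this; a minimal sketch that inlines a simplified directory hash and assumes the data link in the project is named "data":

```python
import hashlib
import os
import shutil

def dir_hash(path):
    # simplified stand-in for the directory hash (relative paths + content)
    h = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # deterministic traversal order
        for name in sorted(files):
            full = os.path.join(root, name)
            h.update(os.path.relpath(full, path).encode())
            with open(full, "rb") as f:
                h.update(f.read())
    return "v1:sha256:128M:" + h.hexdigest()

def create_hashed_data(source_dir, target_dir, project_dir, link_name="data"):
    """Sketch of create_hashed_data.py: hash source-dir, move its content
    to target-dir/<hash>/, and point the project's data link (the assumed
    link_name) at the new location."""
    hashed = os.path.join(target_dir, dir_hash(source_dir))
    shutil.move(source_dir, hashed)       # move all content under the hash
    link = os.path.join(project_dir, link_name)
    if os.path.lexists(link):
        os.remove(link)                   # drop the old link to source-dir
    os.symlink(hashed, link)              # re-link project to target-dir/<hash>/
    return hashed
```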
life-cycle
once you have converted the project into a "reproducible project" by running the script create_hashed_data.py on the data directory, you manually add a softlink in your code-directory to the hashed data and check the result into git.
when you start your experiment you run validate.py. It will raise an error if the data folder's hash doesn't match its content. It will also raise an error if any file or subfolder in the data dir is writeable.
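validate.py could look roughly like this; a minimal sketch, where hash_fn is an assumed shared helper that recomputes the directory hash, and the expected hash is read from the hashed folder's own name:

```python
import os
import stat

def validate(data_link, hash_fn):
    """Sketch of validate.py.

    data_link is the project's data softlink; hash_fn recomputes the
    directory hash (an assumed helper).  Raises ValueError if the hash
    doesn't match the folder name or if any entry is writeable.
    """
    target = os.path.realpath(data_link)
    expected = os.path.basename(target)   # hash is the folder's name
    actual = hash_fn(target)
    if actual != expected:
        raise ValueError(f"hash mismatch: {actual} != {expected}")
    for root, dirs, files in os.walk(target):
        for name in files + dirs:
            entry = os.path.join(root, name)
            mode = os.stat(entry).st_mode
            # reject any write permission (owner, group, or other)
            if mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH):
                raise ValueError(f"writeable entry: {entry}")
```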
then you commit everything into git. (this commit will need to be checked out to reproduce the experiment)
then you run create_work_copy.py, which will create a hardlinked copy of the data-dir and update the softlinks in the code-dir. This allows your experiment to create new files without changing the hash of existing folders.
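the hardlinked copy can be sketched with shutil.copytree and os.link; the work_<timestamp> naming follows the folder-structure section below, the timestamp format itself is an assumption:

```python
import os
import shutil

def create_work_copy(hashed_dir, timestamp):
    """Sketch of create_work_copy.py: hard-link every file of the hashed
    directory into a sibling work_<timestamp> directory, so the
    experiment can add files without touching the hashed originals."""
    base = os.path.dirname(hashed_dir)
    work = os.path.join(base, f"work_{timestamp}")
    # copy_function=os.link makes copytree create hardlinks, not byte copies
    shutil.copytree(hashed_dir, work, copy_function=os.link)
    return work
```

after this, the data softlink in the code-dir would be re-pointed at the returned work directory (e.g. with os.symlink).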
then you run your experiment.
then you run hash_work_copy.py. It will create a hash of your modified data-dir, create a new hash-folder, hard-copy all files of your working copy into this folder, and change the softlinks in the code-directory to point to the new hashed directory.
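a sketch of this last step; hash_fn is an assumed shared helper for the directory hash, and the files are hard-linked into the new hash folder (an implementation choice that keeps unchanged files deduplicated, not necessarily the author's exact meaning of "hardcopy"):

```python
import os
import shutil

def hash_work_copy(work_dir, code_link, hash_fn):
    """Sketch of hash_work_copy.py.

    Hashes the modified work directory, hard-links its files into a new
    <base>/<hash>/ folder, and re-points the code directory's data link
    (code_link) at the new hashed directory.
    """
    base = os.path.dirname(work_dir)
    hashed = os.path.join(base, hash_fn(work_dir))  # new hash-folder
    shutil.copytree(work_dir, hashed, copy_function=os.link)
    if os.path.lexists(code_link):
        os.remove(code_link)                        # drop old softlink
    os.symlink(hashed, code_link)                   # point at new hash dir
    return hashed
```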
then you commit the new softlinks into git.
scripts
create_hashed_data.py
validate.py
create_work_copy.py
hash_work_copy.py
code folder structure
the code directory has a sub-dir called data, which is a softlink to a hashed data directory (e.g. /smartdata/proj_iris/v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9)
data folder structure
variables:
- <base>: basefolder for project-data
  - e.g. /smartdata/proj_iris/ or /smartdata/ugfam/data/iris
- <hash>: hash-value of the entire directory. the value depends on the content, the filenames and the sub-directories; it does not depend on the name of the parent folder
  - e.g. v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9
general folder structure:
- <base>/<hash>/ - actual data
- <base>/work_<timestamp>/ - working copy that only exists during execution
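the hash format above could be produced along these lines; a minimal sketch, assuming "128M" denotes the read-chunk size used while hashing, and ignoring empty sub-directories:

```python
import hashlib
import os

def hash_directory(path, version="v1", algo="sha256", block="128M"):
    """Sketch of the directory hash described above.

    The digest covers each file's path relative to `path` plus its
    content, so renaming the parent folder does not change the hash,
    but renaming or editing any file inside does.  Empty
    sub-directories are ignored in this sketch.
    """
    h = hashlib.new(algo)
    for root, dirs, files in os.walk(path):
        dirs.sort()                       # deterministic traversal order
        for name in sorted(files):
            full = os.path.join(root, name)
            rel = os.path.relpath(full, path)
            h.update(rel.encode())        # filenames are part of the hash
            with open(full, "rb") as f:
                # read in 128 MiB chunks (the assumed meaning of "128M")
                for chunk in iter(lambda: f.read(128 * 1024 * 1024), b""):
                    h.update(chunk)
    return f"{version}:{algo}:{block}:{h.hexdigest()}"
```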
example folder structure:
/smartdata/proj_iris/
    v1:sha256:128M:aa669dcefba57e01bd7ff0526a0001d2118f06adc8106d265b5743b0ee90084f/
        iris.csv
    v1:sha256:128M:280de49cd7cd754b71759bc5da30c31a7be3350bcde2548aebab702272ec1c51/
        iris.csv --> ../v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9/iris.csv (hardlink)
        number.csv