Description
architecture and requirements
user-stories
reproduce results from a paper
- see paper published by sdil's associate
- connect to sdil
- check out the commit from git (repository & commit-id are linked in the paper)
- execute
- results are exactly the same as the ones in the paper
creating reproducible results
- validate input data
- --> make sure the hash is correct
- run experiment
- update hash on data
- check in soft-links to new data
this use-case will be automated in some sort of CI script
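the steps above could be wrapped in a small driver for such a CI script; a minimal sketch, assuming the script names from the scripts section of this document, plus a hypothetical run.py experiment entry point and commit message:

```python
import subprocess
import sys

def ci_pipeline(project_dir, steps=None):
    """Hypothetical CI driver for the reproducible-results steps.

    Script names follow the scripts section of this document; run.py
    (the experiment entry point) and the commit message are assumptions.
    """
    if steps is None:
        steps = [
            [sys.executable, "validate.py"],        # validate input data / hash
            [sys.executable, "create_work_copy.py"],
            [sys.executable, "run.py"],             # run the experiment
            [sys.executable, "hash_work_copy.py"],  # update hash on data
            ["git", "add", "data"],                 # check in soft-links to new data
            ["git", "commit", "-m", "update data hash"],
        ]
    for cmd in steps:
        # check=True aborts the pipeline on the first failing step
        subprocess.run(cmd, cwd=project_dir, check=True)
```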
converting regular project into reproducible project
- create hash on source-dir
- move all content from source-dir to target-dir/<hash>/
- replace all links in project-dir to source-dir with links to target-dir/<hash>/
this will be 1 script: create_hashed_data.py (required parameters: source-dir, target-dir and project-dir)
note: the source-dir is a subfolder of project-dir which links to the actual source-dir (softlink or hardlink).
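the conversion could look roughly like this; a minimal sketch that inlines a simplified directory hash and assumes the data link in the project is named "data":

```python
import hashlib
import os
import shutil

def dir_hash(path):
    # simplified stand-in for the directory hash (relative paths + content)
    h = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # deterministic traversal order
        for name in sorted(files):
            full = os.path.join(root, name)
            h.update(os.path.relpath(full, path).encode())
            with open(full, "rb") as f:
                h.update(f.read())
    return "v1:sha256:128M:" + h.hexdigest()

def create_hashed_data(source_dir, target_dir, project_dir, link_name="data"):
    """Sketch of create_hashed_data.py: hash source-dir, move its content
    to target-dir/<hash>/, and point the project's data link (the assumed
    link_name) at the new location."""
    hashed = os.path.join(target_dir, dir_hash(source_dir))
    shutil.move(source_dir, hashed)       # move all content under the hash
    link = os.path.join(project_dir, link_name)
    if os.path.lexists(link):
        os.remove(link)                   # drop the old link to source-dir
    os.symlink(hashed, link)              # re-link project to target-dir/<hash>/
    return hashed
```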
life-cycle
once you have converted the project into a "reproducible project" by running the script create_hashed_data.py on the data directory, you manually add a softlink in your code-directory to the hashed data and check the result into git.
when you start your experiment you run validate.py. It will raise an error if the data folder's hash doesn't match its content. It will also raise an error if any file or subfolder in the data dir is writeable.
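validate.py could look roughly like this; a minimal sketch, where hash_fn is an assumed shared helper that recomputes the directory hash, and the expected hash is read from the hashed folder's own name:

```python
import os
import stat

def validate(data_link, hash_fn):
    """Sketch of validate.py.

    data_link is the project's data softlink; hash_fn recomputes the
    directory hash (an assumed helper).  Raises ValueError if the hash
    doesn't match the folder name or if any entry is writeable.
    """
    target = os.path.realpath(data_link)
    expected = os.path.basename(target)   # hash is the folder's name
    actual = hash_fn(target)
    if actual != expected:
        raise ValueError(f"hash mismatch: {actual} != {expected}")
    for root, dirs, files in os.walk(target):
        for name in files + dirs:
            entry = os.path.join(root, name)
            mode = os.stat(entry).st_mode
            # reject any write permission (owner, group, or other)
            if mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH):
                raise ValueError(f"writeable entry: {entry}")
```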
then you commit everything into git. (this commit will need to be checked out to reproduce the experiment)
then you run create_work_copy.py, which will create a hardlinked copy of the data-dir and update the softlinks in the code-dir. This allows your experiment to create new files without changing the hash of existing folders.
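the hardlinked copy can be sketched with shutil.copytree and os.link; the work_<timestamp> naming follows the folder-structure section below, the timestamp format itself is an assumption:

```python
import os
import shutil

def create_work_copy(hashed_dir, timestamp):
    """Sketch of create_work_copy.py: hard-link every file of the hashed
    directory into a sibling work_<timestamp> directory, so the
    experiment can add files without touching the hashed originals."""
    base = os.path.dirname(hashed_dir)
    work = os.path.join(base, f"work_{timestamp}")
    # copy_function=os.link makes copytree create hardlinks, not byte copies
    shutil.copytree(hashed_dir, work, copy_function=os.link)
    return work
```

after this, the data softlink in the code-dir would be re-pointed at the returned work directory (e.g. with os.symlink).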
then you run your experiment.
then you run hash_work_copy.py. It will create a hash of your modified data-dir, create a new hash-folder, hard-copy all files of your working copy into this folder, and change the softlinks in the code-directory to point to the new hashed directory.
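a sketch of this last step; hash_fn is an assumed shared helper for the directory hash, and the files are hard-linked into the new hash folder (an implementation choice that keeps unchanged files deduplicated, not necessarily the author's exact meaning of "hardcopy"):

```python
import os
import shutil

def hash_work_copy(work_dir, code_link, hash_fn):
    """Sketch of hash_work_copy.py.

    Hashes the modified work directory, hard-links its files into a new
    <base>/<hash>/ folder, and re-points the code directory's data link
    (code_link) at the new hashed directory.
    """
    base = os.path.dirname(work_dir)
    hashed = os.path.join(base, hash_fn(work_dir))  # new hash-folder
    shutil.copytree(work_dir, hashed, copy_function=os.link)
    if os.path.lexists(code_link):
        os.remove(code_link)                        # drop old softlink
    os.symlink(hashed, code_link)                   # point at new hash dir
    return hashed
```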
then you commit the new softlinks into git.
scripts
create_hashed_data.py
validate.py
create_work_copy.py
hash_work_copy.py
code folder structure
the code directory has a sub-dir called data, which is a softlink to a hashed data directory (e.g. /smartdata/proj_iris/v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9)
data folder structure
variables:
- <base>: basefolder for project-data
  - e.g. /smartdata/proj_iris/ or /smartdata/ugfam/data/iris
- <hash>: hash-value of the entire directory. the value depends on the content, the filenames and the sub-directories; it does not depend on the name of the parent folder
  - e.g. v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9
general folder structure:
- <base>/<hash>/ - actual data
- <base>/work_<timestamp>/ - working copy that only exists during execution
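the hash format above could be produced along these lines; a minimal sketch, assuming "128M" denotes the read-chunk size used while hashing, and ignoring empty sub-directories:

```python
import hashlib
import os

def hash_directory(path, version="v1", algo="sha256", block="128M"):
    """Sketch of the directory hash described above.

    The digest covers each file's path relative to `path` plus its
    content, so renaming the parent folder does not change the hash,
    but renaming or editing any file inside does.  Empty
    sub-directories are ignored in this sketch.
    """
    h = hashlib.new(algo)
    for root, dirs, files in os.walk(path):
        dirs.sort()                       # deterministic traversal order
        for name in sorted(files):
            full = os.path.join(root, name)
            rel = os.path.relpath(full, path)
            h.update(rel.encode())        # filenames are part of the hash
            with open(full, "rb") as f:
                # read in 128 MiB chunks (the assumed meaning of "128M")
                for chunk in iter(lambda: f.read(128 * 1024 * 1024), b""):
                    h.update(chunk)
    return f"{version}:{algo}:{block}:{h.hexdigest()}"
```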
example folder structure:
/smartdata/proj_iris/
    v1:sha256:128M:aa669dcefba57e01bd7ff0526a0001d2118f06adc8106d265b5743b0ee90084f/
        iris.csv
    v1:sha256:128M:280de49cd7cd754b71759bc5da30c31a7be3350bcde2548aebab702272ec1c51/
        iris.csv --> ../v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9/iris.csv (hardlink)
        number.csv