add working files #1

Open
rearustagi wants to merge 1 commit into GoldenPlanetaryHealthLab:main from rearustagi:add-files

Conversation

@rearustagi
Collaborator

Adds my initial visualization and data files to project

Member

@TinasheMTapera left a comment

Hey @rearustagi, this is AWESOME so far 🥳. You're using the cluster just as intended and tracking your work with git!

A couple of changes I'd like to make before approving:

  1. Please remove the source data from git tracking

Tracking individual data files in git is not a recommended pattern because it quickly slows down git and overwhelms memory. Git should be used only for plain text files (and notebooks). So, in your repo, add the "raw data" folder to your .gitignore.
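
One caveat on the .gitignore step: ignore rules only apply to files git is not already tracking, so a folder that has already been committed also needs to be removed from the index (for example with `git rm -r --cached "raw data"`, which leaves the files on disk). For the ignore entry itself, here is a minimal sketch in Python. `ensure_ignored` is a hypothetical helper name, and "raw data/" is only my assumption about how the folder is spelled in your repo.

```python
from pathlib import Path


def ensure_ignored(repo_root, patterns):
    """Append each pattern to .gitignore unless it is already listed.

    Returns the list of patterns that were actually added.
    """
    gitignore = Path(repo_root) / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    listed = {line.strip() for line in existing}

    missing = []
    for pattern in patterns:
        if pattern not in listed:
            missing.append(pattern)
            listed.add(pattern)  # guards against duplicates within `patterns`

    if missing:
        # Rewrite the file with the new patterns at the end, one per line.
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return missing
```

Calling `ensure_ignored(".", ["raw data/"])` from the repo root adds the entry once; running it again changes nothing.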

  2. Delineate between source and intermediate data

I notice a proc_data folder with data that I'm not aware of. First, the same principle as point 1 applies: do not track data in git repositories. It will slow down git and you may accidentally expose PII or PHI 😬. Second, make sure that the code that generates these intermediate objects is similarly tracked and reproducible. I can't immediately tell from the notebooks, but it looks like this is the case. My rule of thumb is that if I were to remove the intermediate data, I should still be able to reproduce it without modifying any code. So just make sure of that.
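
One way to check that rule of thumb, sketched under assumptions: record a checksum of every file in the intermediate folder, delete the folder, rerun the notebooks, and compare. `snapshot` and `compare` are hypothetical helper names, not existing lab tooling, and a byte-for-byte comparison may flag files whose format embeds timestamps even when the data itself is unchanged.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare(before, after):
    """Return (missing, changed): files not regenerated, and files that differ."""
    missing = sorted(set(before) - set(after))
    changed = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return missing, changed
```

For example, take `before = snapshot("proc_data")`, remove the folder, rerun the notebooks, then call `compare(before, snapshot("proc_data"))`. Two empty lists mean everything was regenerated identically.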

  3. Delineate, cont'd.

If proc_data does in fact contain RAW data (i.e., not generated; original files you got from Chris/Meghnath etc.), please help me out by adding it to our data catalog (https://docs.google.com/forms/d/e/1FAIpQLSdzeBquqe_4ghFDu7QN-ChzXgCBsnHLty3is8yR1VOMADet3w/viewform?usp=sharing&ouid=106438662307402236405). This is how I make sure to track and document all of the data that goes through the lab.

  4. Organisation

This is less urgent, but I would recommend putting a little bit of organization into your repo with, at minimum, folders for scripts, notebooks (or nbs), and outputs, under which you can put outputs/figures etc., just so I know what I'm looking at.
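
If it helps, a layout like that can be set up in one go. This is only a sketch: `create_layout` is a hypothetical helper, the folder names are the ones suggested above, and the `.gitkeep` placeholder is a common convention (git does not track empty directories), not a git feature.

```python
from pathlib import Path

# Folder names taken from the suggestion above; rename to taste.
LAYOUT = ("scripts", "notebooks", "outputs/figures")


def create_layout(repo_root, folders=LAYOUT):
    """Create any missing folders and return the names that were created."""
    created = []
    for name in folders:
        folder = Path(repo_root) / name
        if not folder.exists():
            folder.mkdir(parents=True)
            # Placeholder file so the otherwise empty folder can be committed.
            (folder / ".gitkeep").touch()
            created.append(name)
    return created
```

Running `create_layout(".")` from the repo root is safe to repeat, since existing folders are left untouched.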

Great work again!

ETA: 5. You can use the gdrive-mapped source data

The source data for the files I sent is actually already mapped; it is in /n/holylabs/LABS/cgolden_lab/Lab/data_freeze/golden_googledrive_rclone/Climate-Smart Public Health - Nepal/4. Datasets/snake_bites

2 participants