Hey @rearustagi this is AWESOME so far 🥳. You're using the cluster just as intended and tracking your work with git!
A couple of changes I'd like to make before approving:
- Please remove the source data from git tracking
Tracking individual data files in git is not a recommended pattern because it quickly slows down git and overwhelms memory. Git should be used only for plain text files (and notebooks). So, in your repo, add the "raw data" folder to your .gitignore.
- Delineate between source and intermediate data
I notice a proc_data repo with data that I'm not aware of. First, the same principle as point 1 applies: do not track data in git repositories. It will slow down git and you may accidentally expose PII or PHI 😬. Second, make sure that the code that generates these intermediate objects is similarly tracked and reproducible. I can't immediately tell from the notebooks, but it looks like this is the case. My rule of thumb is that if I were to remove the intermediate data, I should still be able to reproduce it without modifying any code. So just make sure of that.
- Delineate cont'd.
If the proc_data does in fact contain RAW data (i.e., original files you got from Chris/Meghnath etc., not generated ones), please help me out by adding it to our data catalog: https://docs.google.com/forms/d/e/1FAIpQLSdzeBquqe_4ghFDu7QN-ChzXgCBsnHLty3is8yR1VOMADet3w/viewform?usp=sharing&ouid=106438662307402236405 This is how I make sure to track and document all of the data that goes through the lab.
- Organisation
This is less urgent, but I would recommend putting a little bit of organization into your repo with, at minimum, folders for scripts, notebooks (or nbs), and outputs, under which you can put outputs/figures etc., just so I know what I'm looking at.
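In case it helps, here is a minimal Python sketch of the first and last points above: listing a data folder in .gitignore, untracking it without deleting the files, and creating the suggested folders. The function names are hypothetical, and the folder names you pass in (for example `proc_data/` or your raw data folder) are assumptions; adjust them to whatever your repo actually uses.

```python
from pathlib import Path
import subprocess


def ignore_pattern(repo_root: Path, pattern: str) -> bool:
    """Append `pattern` to .gitignore unless it is already listed.

    Returns True if the file was changed, False otherwise.
    """
    gitignore = repo_root / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if pattern in lines:
        return False
    lines.append(pattern)
    gitignore.write_text("\n".join(lines) + "\n")
    return True


def untrack(repo_root: Path, folder: str) -> None:
    """Stop tracking `folder` in git while leaving the files on disk."""
    subprocess.run(
        ["git", "rm", "-r", "--cached", "--", folder],
        cwd=repo_root,
        check=True,
    )


def scaffold_repo(repo_root: Path) -> list[Path]:
    """Create the minimal layout suggested above; existing folders are kept."""
    folders = [
        repo_root / "scripts",
        repo_root / "notebooks",
        repo_root / "outputs" / "figures",
    ]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

From the repo root you could call `ignore_pattern(Path("."), "proc_data/")`, then `untrack(Path("."), "proc_data")`, and commit the result. Note that `untrack` only removes the files from future commits; anything already committed stays in the git history.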
Great work again!
ETA: 5. You can use the gdrive-mapped source data
The source data for the files I sent is actually already mapped; it lives in /n/holylabs/LABS/cgolden_lab/Lab/data_freeze/golden_googledrive_rclone/Climate-Smart Public Health - Nepal/4. Datasets/snake_bites
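A rough sketch of pointing a notebook at the mapped folder instead of a copy inside the repo. The path is the one given above; the helper name `list_source_files` is hypothetical, and I'm not assuming anything about the file names or formats inside the folder, so this only lists what is there.

```python
from pathlib import Path

# Mapped source data location on the cluster, as given above.
SOURCE_DIR = Path(
    "/n/holylabs/LABS/cgolden_lab/Lab/data_freeze/golden_googledrive_rclone/"
    "Climate-Smart Public Health - Nepal/4. Datasets/snake_bites"
)


def list_source_files(data_dir: Path = SOURCE_DIR) -> list[Path]:
    """Return every file under `data_dir`, sorted, without copying anything."""
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Source data folder not found: {data_dir}")
    return sorted(p for p in data_dir.rglob("*") if p.is_file())
```

Reading straight from this folder means the repo never needs its own copy of the raw files, which also keeps them out of git.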
Adds my initial visualization and data files to project