Fix multi GPU training #469
base: dev
Conversation
Instead of having to specify `train_from_scratch` in the config file, training will proceed from an existing model weights file if this is given as an argument to `casanovo train`. Fixes #263.
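For illustration, a minimal sketch of that behavior in Python (`Spec2Pep` is Casanovo's model class; the helper function and its wiring are assumptions, not the actual implementation):

```python
from casanovo.denovo.model import Spec2Pep


def initialize_model(ckpt_path=None):
    """Resume from existing weights when given, else train from scratch."""
    if ckpt_path is not None:
        # `load_from_checkpoint` is the standard PyTorch Lightning
        # classmethod for restoring a LightningModule from a checkpoint.
        return Spec2Pep.load_from_checkpoint(ckpt_path)
    # No weights file passed to `casanovo train`: start from scratch.
    return Spec2Pep()
```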
* Add epsilon to index zero
* Fix typo
* Use base PyTorch for repeating along the vocabulary size
* Combine masking steps
* Lint with updated black version
* Lint test files
* Add topk unit test
* Fix lint
* Add fixme comment for future
* Update changelog
* Generate new screengrabs with rich-codex

Co-authored-by: Wout Bittremieux <wout@bittremieux.be>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Rename max_iters to cosine_schedule_period_iters
* Add deprecated config option unit test
* Fix missed rename
* Proper linting
* Remove unnecessary logging
* Test that checkpoints with deprecated config options can be loaded
* Minor change
* Add test for fine-tuning with deprecated config options
* Remove deprecated hyperparameters during model loading
* Include deprecated hyperparameter warning
* Test whether the warning is issued
* Verify that the deprecated option is removed
* Fix comments
* Avoid defining deprecated options twice
* Remap previously renamed config option `every_n_train_steps`
* Update changelog

Co-authored-by: melihyilmaz <yilmazmelih97@gmail.com>

* Test different beams with identical scores
* Randomly break ties for beams with identical peptide score
* Update changelog
* Don't remove unit test

* Add 9-species model weights link to FAQ (#303)
  * Add model weights link
  * Generate new screengrabs with rich-codex
  * Clarify that these weights should only be used for benchmarking

  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Wout Bittremieux <wout@bittremieux.be>
* Add FAQ entry about antibody sequencing (#304)
  * Add FAQ entry about antibody sequencing
  * Generate new screengrabs with rich-codex

  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Melih Yilmaz <32707537+melihyilmaz@users.noreply.github.com>
* Allow csv to handle all newlines: the `csv` module tries to handle newlines itself. On Windows, this leads to line endings of `\r\r\n` instead of `\r\n`. Setting `newline=''` produces the intended output on both platforms.
* Update CHANGELOG.md
* Fix linting issue
* Delete docs/images/help.svg

Co-authored-by: Melih Yilmaz <32707537+melihyilmaz@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <wout@bittremieux.be>
Co-authored-by: William Stafford Noble <wnoble@uw.edu>
Co-authored-by: Wout Bittremieux <bittremieux@users.noreply.github.com>

* stable product score
* numerically stable peptide scores
* renaming
* eliminate magic num

* add stop token
* stop token test

* Improved beam search efficiency based on the latest model.py version
* Speed up inference time
* Fixed version
* fixed version
* test
* test2
* test3
* new model
* Update model.py and unit test

* db merge n term score
* use base class on predict batch end
I just did a test run using 4 GPUs and everything looks good.
bittremieux left a comment:
question: So when doing inference this will produce multiple mzTab output files, one for each GPU? This is not super user-friendly imo, because then the user won't know where specific results can be found.
I'm also not sure what exactly the problem is.
- Multi-GPU training should work correctly, or at least it has for a long time.
- Multi-GPU inference indeed doesn't work properly. That's also why we set the number of devices to 1 and only increase it in training mode. (This is also not changed in this PR, so multi-GPU inference still wouldn't happen, I think.) But inference can be trivially parallelized by executing separate commands on different files (see the sketch below), so I don't think this is a major issue either. At least, any solution should clearly improve upon this status quo.
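To make the "separate commands on different files" workaround concrete, a hedged sketch (the file names and GPU count are made up, and the `--output` flag spelling may differ by version; check `casanovo sequence --help`):

```python
import os
import subprocess

# One `casanovo sequence` process per peak file, each pinned to its own GPU
# via CUDA_VISIBLE_DEVICES, so the runs proceed in parallel.
mgf_files = ["run1.mgf", "run2.mgf", "run3.mgf", "run4.mgf"]
procs = [
    subprocess.Popen(
        ["casanovo", "sequence", mgf, "--output", f"results_{gpu}"],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)},
    )
    for gpu, mgf in enumerate(mgf_files)
]
for proc in procs:
    proc.wait()
```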
Multi-GPU training was broken by the changes to the output file names/locations in #372. Since each sub-process will create its own log file, if an …
Yep, correct. I initially drafted this PR several months ago before revisiting it, so I misremembered what exactly the scope was. I think I may have intended to fix multi-GPU inference when I first drafted this, but I'm not sure.
```diff
 curr_filename = prefix + "{epoch}-{step}"
 best_filename = prefix + "best"
-if overwrite_ckpt_check:
+if overwrite_ckpt_check and utils.get_local_rank() == 0:
```
suggestion: Change `overwrite_ckpt_check` one level up, where `ModelRunner` is being called. That brings it into the same scope as the changes to `_setup_output` and avoids having to make any changes in this file.
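A sketch of what that suggestion could look like at the call site (the keyword name comes from the diff above; the surrounding arguments and variables are assumptions, not the actual code):

```python
from casanovo import utils
from casanovo.denovo.model_runner import ModelRunner

# `config` and `model_path` stand in for whatever the CLI entry point has
# already parsed. Only local rank 0 performs the checkpoint-overwrite check;
# the other DDP sub-processes skip it, so ModelRunner itself needs no changes.
runner = ModelRunner(
    config,
    model_filename=model_path,
    overwrite_ckpt_check=(utils.get_local_rank() == 0),
)
```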
The new helper added in `utils`:

```diff
 )
+def get_local_rank() -> int:
```
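The function body isn't visible in this snippet; a minimal sketch of one way to implement it, relying on the `LOCAL_RANK` environment variable that PyTorch Lightning and `torchrun` export for each DDP worker (the actual implementation may differ):

```python
import os


def get_local_rank() -> int:
    """Rank of this process on its own node; 0 for single-process runs."""
    # LOCAL_RANK is set for every DDP worker by the launcher and is unset
    # when running without distributed training.
    return int(os.environ.get("LOCAL_RANK", 0))
```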
question: Would it make sense to use the global rank (e.g. `torch.distributed.get_global_rank` (better) or the `RANK` environment variable) instead of the local rank to avoid issues when doing multi-node, multi-GPU training?
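For comparison, a hedged sketch of the global-rank variant the question refers to (`torch.distributed.get_rank` returns the rank within the default process group; the `RANK` fallback is an assumption about the launcher):

```python
import os

import torch.distributed as dist


def get_global_rank() -> int:
    """Rank unique across all nodes, unlike LOCAL_RANK, which restarts
    at 0 on every node."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    # torchrun also exports RANK; default to 0 outside distributed runs.
    return int(os.environ.get("RANK", 0))
```

With the global rank, only one process across the whole cluster performs the overwrite check, which matters once training spans multiple nodes.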