Fix multi GPU training #469
base: dev
Conversation
Instead of having to specify `train_from_scratch` in the config file, training will proceed from an existing model weights file if this is given as an argument to `casanovo train`. Fixes #263.
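For illustration, a minimal sketch of that behavior in Python (`Spec2Pep` is Casanovo's model class; the helper function and its wiring are assumptions, not the actual implementation):

```python
from casanovo.denovo.model import Spec2Pep


def initialize_model(ckpt_path=None):
    """Resume from existing weights when given, else train from scratch."""
    if ckpt_path is not None:
        # `load_from_checkpoint` is the standard PyTorch Lightning
        # classmethod for restoring a LightningModule from a checkpoint.
        return Spec2Pep.load_from_checkpoint(ckpt_path)
    # No weights file passed to `casanovo train`: start from scratch.
    return Spec2Pep()
```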
* Add epsilon to index zero
* Fix typo
* Use base PyTorch for repeating along the vocabulary size
* Combine masking steps
* Lint with updated black version
* Lint test files
* Add topk unit test
* Fix lint
* Add fixme comment for future
* Update changelog
* Generate new screengrabs with rich-codex

Co-authored-by: Wout Bittremieux <wout@bittremieux.be>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Rename max_iters to cosine_schedule_period_iters
* Add deprecated config option unit test
* Fix missed rename
* Proper linting
* Remove unnecessary logging
* Test that checkpoints with deprecated config options can be loaded
* Minor change
* Add test for fine-tuning with deprecated config options
* Remove deprecated hyperparameters during model loading
* Include deprecated hyperparameter warning
* Test whether the warning is issued
* Verify that the deprecated option is removed
* Fix comments
* Avoid defining deprecated options twice
* Remap previously renamed config option `every_n_train_steps`
* Update changelog

Co-authored-by: melihyilmaz <yilmazmelih97@gmail.com>

* Test different beams with identical scores
* Randomly break ties for beams with identical peptide score
* Update changelog
* Don't remove unit test

* Add 9-species model weights link to FAQ (#303)
  * Add model weights link
  * Generate new screengrabs with rich-codex
  * Clarify that these weights should only be used for benchmarking

  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Wout Bittremieux <wout@bittremieux.be>
* Add FAQ entry about antibody sequencing (#304)
  * Add FAQ entry about antibody sequencing
  * Generate new screengrabs with rich-codex

  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Melih Yilmaz <32707537+melihyilmaz@users.noreply.github.com>
* Allow csv to handle all newlines: the `csv` module tries to handle newlines itself. On Windows, this leads to line endings of `\r\r\n` instead of `\r\n`. Setting `newline=''` produces the intended output on both platforms.
* Update CHANGELOG.md
* Fix linting issue
* Delete docs/images/help.svg

Co-authored-by: Melih Yilmaz <32707537+melihyilmaz@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <wout@bittremieux.be>
Co-authored-by: William Stafford Noble <wnoble@uw.edu>
Co-authored-by: Wout Bittremieux <bittremieux@users.noreply.github.com>

* stable product score
* numerically stable peptide scores
* renaming
* eliminate magic num

* add stop token
* stop token test

* Improved beam search efficiency based on the latest model.py version
* Speed up inference time
* Fixed version
* fixed version
* test
* test2
* test3
* new model
* Update model.py and unit test

* db merge n term score
* use base class on predict batch end
I just did a test run using 4 GPUs and everything looks good.
bittremieux left a comment:
question: So when doing inference this will produce multiple mzTab output files, one for each GPU? This is not super user-friendly imo, because then the user won't know where specific results can be found.
I'm also not sure what exactly the problem is.
- Multi-GPU training should work correctly, or at least it has for a long time.
- Multi-GPU inference indeed doesn't work properly. That's also why we set the number of devices to 1 and only increase it in training mode. (This is also not changed in this PR, so multi-GPU inference still wouldn't happen, I think.) But inference can be trivially parallelized by executing separate commands on different files (see the sketch below), so I don't think this is a major issue either. At least, any solution should clearly improve upon this status quo.
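To make the "separate commands on different files" workaround concrete, a hedged sketch (the file names and GPU count are made up, and the `--output` flag spelling may differ by version; check `casanovo sequence --help`):

```python
import os
import subprocess

# One `casanovo sequence` process per peak file, each pinned to its own GPU
# via CUDA_VISIBLE_DEVICES, so the runs proceed in parallel.
mgf_files = ["run1.mgf", "run2.mgf", "run3.mgf", "run4.mgf"]
procs = [
    subprocess.Popen(
        ["casanovo", "sequence", mgf, "--output", f"results_{gpu}"],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)},
    )
    for gpu, mgf in enumerate(mgf_files)
]
for proc in procs:
    proc.wait()
```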
Multi-GPU training was broken by the changes to the output file names/locations in #372. Since each sub-process will create its own log file, if an …
Yep, correct. I initially drafted this PR several months ago before revisiting it, so I misremembered what exactly the scope was. I think I may have intended to fix multi-GPU inference when I first drafted this, but I'm not sure.
```diff
 curr_filename = prefix + "{epoch}-{step}"
 best_filename = prefix + "best"
-if overwrite_ckpt_check:
+if overwrite_ckpt_check and utils.get_local_rank() == 0:
```
suggestion: Change `overwrite_ckpt_check` one level up, where `ModelRunner` is being called. That brings it into the same scope as the changes to `_setup_output` and avoids having to make any changes in this file.
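A sketch of what that suggestion could look like at the call site (the keyword name comes from the diff above; the surrounding arguments and variables are assumptions, not the actual code):

```python
from casanovo import utils
from casanovo.denovo.model_runner import ModelRunner

# `config` and `model_path` stand in for whatever the CLI entry point has
# already parsed. Only local rank 0 performs the checkpoint-overwrite check;
# the other DDP sub-processes skip it, so ModelRunner itself needs no changes.
runner = ModelRunner(
    config,
    model_filename=model_path,
    overwrite_ckpt_check=(utils.get_local_rank() == 0),
)
```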
The new helper added in `utils`:

```diff
 )
+def get_local_rank() -> int:
```
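The function body isn't visible in this snippet; a minimal sketch of one way to implement it, relying on the `LOCAL_RANK` environment variable that PyTorch Lightning and `torchrun` export for each DDP worker (the actual implementation may differ):

```python
import os


def get_local_rank() -> int:
    """Rank of this process on its own node; 0 for single-process runs."""
    # LOCAL_RANK is set for every DDP worker by the launcher and is unset
    # when running without distributed training.
    return int(os.environ.get("LOCAL_RANK", 0))
```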
question: Would it make sense to use the global rank (e.g. `torch.distributed.get_global_rank` (better) or the `RANK` environment variable) instead of the local rank to avoid issues when doing multi-node, multi-GPU training?
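For comparison, a hedged sketch of the global-rank variant the question refers to (`torch.distributed.get_rank` returns the rank within the default process group; the `RANK` fallback is an assumption about the launcher):

```python
import os

import torch.distributed as dist


def get_global_rank() -> int:
    """Rank unique across all nodes, unlike LOCAL_RANK, which restarts
    at 0 on every node."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    # torchrun also exports RANK; default to 0 outside distributed runs.
    return int(os.environ.get("RANK", 0))
```

With the global rank, only one process across the whole cluster performs the overwrite check, which matters once training spans multiple nodes.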