erjel commented Sep 29, 2025

Hi,

for a project of mine I needed to scale cellpose on a SLURM cluster. To make things a little more interesting, the cluster I have access to has only AMD GPUs. The documentation on distributed cellpose gives hints on how to run on LSF clusters, and there is already some documentation on how to run cellpose on AMD GPUs.

The first contribution of this PR is a conda environment file (environment-rocm.yaml) that works for inference on AMD GPUs. I am happy to update the install documentation accordingly.
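A minimal sketch for checking that such an environment actually sees the AMD GPU. It assumes PyTorch is the backend and relies on the fact that ROCm builds of PyTorch expose the GPU through the `torch.cuda` API and set `torch.version.hip`. The function name `describe_gpu_backend` is hypothetical and not part of this PR.

```python
def describe_gpu_backend():
    """Report whether PyTorch is installed and which GPU backend it was built for."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so the environment cannot run GPU inference.
        return {"torch_installed": False, "rocm_build": False, "gpu_available": False}

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(torch.version, "hip", None)
    # On ROCm builds the AMD GPU is reported through the torch.cuda namespace.
    gpu_available = bool(torch.cuda.is_available())
    info = {
        "torch_installed": True,
        "rocm_build": hip_version is not None,
        "gpu_available": gpu_available,
    }
    if gpu_available:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_gpu_backend())
```

Running this on a compute node (not the login node) is the more meaningful check, since login nodes often have no GPU attached.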

The second contribution is a medium-sized test case for a SLURM cluster (cellpose/contrib/test_slurm.py, cellpose/contrib/cluster_script.py). The example data is not special by any means and does not work particularly well with cellposeSAM. If someone can point me to a nice dataset (1024 x 1024 x 1024 px) that is worth highlighting in the distributed cellpose documentation, I am open to suggestions. My hope is that the test can serve as a reference for checking distributed cellpose on different clusters before users try to run cellpose on their own data.
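To give a feeling for the size of such a test case, here is a rough sketch of how a 1024 x 1024 x 1024 px volume splits into overlapping blocks. The block size of 256 px and the overlap of 32 px are assumptions chosen for illustration, and `block_slices` is a hypothetical helper; the tiling in distributed_segmentation.py may differ.

```python
import itertools
import math


def block_slices(shape, blocksize, overlap):
    """Return one tuple of slices per block, each padded by `overlap` and clipped to the volume."""
    counts = [math.ceil(size / block) for size, block in zip(shape, blocksize)]
    blocks = []
    for index in itertools.product(*(range(count) for count in counts)):
        slices = []
        for i, size, block in zip(index, shape, blocksize):
            start = max(i * block - overlap, 0)          # clip at the lower edge
            stop = min((i + 1) * block + overlap, size)  # clip at the upper edge
            slices.append(slice(start, stop))
        blocks.append(tuple(slices))
    return blocks


# Assumed values: 256 px blocks with 32 px overlap on every side.
blocks = block_slices((1024, 1024, 1024), (256, 256, 256), overlap=32)
print(len(blocks))  # 64 blocks, i.e. 4 per axis
print(blocks[0])    # first block spans 0..288 along each axis
```

With these assumed values each worker processes at most a 320 x 320 x 320 px crop, and 64 independent blocks are available to spread across the cluster.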

Lastly, I modified cellpose/contrib/distributed_segmentation.py so that it now works in my setup. Note that there are three things left to be done:

1. The code still needs some clean-up after my initial tests with cropping/transposing.
2. In its current form, the PR will break the janeliaLSFCluster class because distributed_eval lacks an abstraction over the mem, cores, and ncpus settings.
3. Scaling the cluster to 0 workers, changing the worker config, and then rescaling did not work for me. I am happy to run further tests, but I would need some assistance with dask debugging.
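One possible shape for the missing abstraction in point 2 is a scheduler-independent worker description that is translated into scheduler-specific keyword arguments. This is only a sketch and not the code in this PR: `WorkerSpec`, `slurm_kwargs`, and `lsf_kwargs` are hypothetical names, and the keywords are assumed to match those accepted by dask_jobqueue's SLURMCluster (`cores`, `memory`, `processes`) and LSFCluster (additionally `ncpus` and `mem` in bytes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSpec:
    """Scheduler-independent description of a single dask worker job."""
    cores: int
    memory_gb: int


def slurm_kwargs(spec):
    """Keyword arguments intended for dask_jobqueue.SLURMCluster (assumed mapping)."""
    return {
        "cores": spec.cores,
        "processes": 1,
        "memory": f"{spec.memory_gb}GB",
    }


def lsf_kwargs(spec):
    """Keyword arguments intended for dask_jobqueue.LSFCluster (assumed mapping)."""
    return {
        "cores": spec.cores,
        "processes": 1,
        "memory": f"{spec.memory_gb}GB",
        "ncpus": spec.cores,              # CPUs requested from LSF
        "mem": spec.memory_gb * 10**9,    # LSF memory request in bytes
    }


spec = WorkerSpec(cores=4, memory_gb=32)
print(slurm_kwargs(spec))
print(lsf_kwargs(spec))
```

With a mapping like this, distributed_eval would only need to know about the generic spec, and each cluster class would own its translation.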

I am happy to polish the code and documentation over the next few days. Since I am not really a dask expert, I would very much appreciate feedback on my dask usage.

Best wishes,
Eric

fixes #1111

codecov bot commented Sep 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.29%. Comparing base (bf958cb) to head (3af4df3).
⚠️ Report is 32 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   42.19%   42.29%   +0.09%     
==========================================
  Files          16       16              
  Lines        3773     3783      +10     
==========================================
+ Hits         1592     1600       +8     
- Misses       2181     2183       +2     

erjel commented Oct 1, 2025

In its current form, python cellpose/contrib/cluster_script.py runs end-to-end (including test data download, segmentation, merging, and saving) in approximately 8 minutes, provided the requested compute resources are immediately available.

Looking forward to feedback!

erjel marked this pull request as ready for review October 1, 2025 11:05

Development

Successfully merging this pull request may close these issues.

[FEATURE] Running cellpose distribute on a SLURM cluster