Proof-of-concept: Cellpose distributed on Slurm cluster with AMD GPUs #1334
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   42.19%   42.29%   +0.09%
==========================================
  Files          16       16
  Lines        3773     3783      +10
==========================================
+ Hits         1592     1600       +8
- Misses       2181     2183       +2
In the current form, Looking forward to feedback!
Hi,
For a project of mine, I needed to scale cellpose on a Slurm cluster. To make the topic a little more interesting, the cluster I have at hand has only AMD GPUs. The documentation on distributed cellpose gives hints on how to run on LSF clusters, and there is already some documentation on how to run cellpose on AMD GPUs.
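Since the cluster has only AMD GPUs, it helps to confirm that PyTorch actually sees a ROCm device before starting any workers. The following is a minimal sketch, not part of this PR: the helper name describe_gpu_backend is hypothetical, and it relies only on the fact that ROCm builds of PyTorch reuse the torch.cuda API and set torch.version.hip.

```python
def describe_gpu_backend():
    """Return 'rocm', 'cuda', 'cpu', or 'torch-missing' for this environment.

    Hypothetical helper for illustration; it is not part of cellpose.
    """
    try:
        import torch  # only importable once the conda environment is set up
    except ImportError:
        return "torch-missing"

    # On ROCm builds, torch.cuda.is_available() reports the HIP device.
    if not torch.cuda.is_available():
        return "cpu"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


if __name__ == "__main__":
    print(describe_gpu_backend())
```

Running this once inside the job environment on a compute node should be enough to tell a CPU-only fallback apart from a working GPU setup.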
The first contribution of this PR is a working conda environment file (environment-rocm.yaml) for inference on AMD GPUs. I am happy to update the install documentation accordingly.
The second contribution is a medium-sized test case for a Slurm cluster (cellpose/contrib/test_slurm.py, cellpose/contrib/cluster_script.py). The example data is not special by any means and does not work particularly well with cellposeSAM. If someone has a hint on a nice (1024 x 1024 x 1024 px) dataset that is worth highlighting in the cellpose distributed documentation, I am open to suggestions. My hope is that the test can serve as a reference for checking distributed cellpose on different clusters before users try to run cellpose with their own data.
Lastly, I modified cellpose/contrib/distributed_segmentation.py so that it now works for my circumstances. Note that there are three things left to be done:
1. The code still needs some clean-up after my initial tests with cropping/transposing.
2. The PR will in its current form break the functionality of the janeliaLSFCluster class due to missing abstraction in distributed_eval with respect to mem, cores, and ncpus.
3. Scaling the cluster to 0 workers, changing the worker config, and rescaling did not work for me. I am happy to run further tests, but I would need some assistance with dask debugging.
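One possible direction for the missing abstraction around mem, cores, and ncpus is sketched below. This is only an illustration under stated assumptions, not the code in this PR: WorkerResources and cluster_kwargs are hypothetical names, and the sketch assumes the keyword arguments end up in dask_jobqueue's LSFCluster (which accepts ncpus and mem in bytes in addition to cores and memory) or SLURMCluster (which is given only cores and memory here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerResources:
    """Scheduler-agnostic description of one dask worker (hypothetical helper)."""

    cores: int
    memory_gb: int

    def lsf_kwargs(self):
        # Assumption: LSF jobs request ncpus and mem (bytes) on top of cores/memory.
        return {
            "cores": self.cores,
            "memory": f"{self.memory_gb}GB",
            "ncpus": self.cores,
            "mem": self.memory_gb * 10**9,
        }

    def slurm_kwargs(self):
        # Assumption: Slurm jobs only need cores and memory in this sketch.
        return {"cores": self.cores, "memory": f"{self.memory_gb}GB"}


def cluster_kwargs(resources, scheduler):
    """Translate generic worker resources into scheduler-specific kwargs."""
    builders = {"lsf": resources.lsf_kwargs, "slurm": resources.slurm_kwargs}
    try:
        return builders[scheduler]()
    except KeyError:
        raise ValueError(f"unsupported scheduler: {scheduler!r}") from None


if __name__ == "__main__":
    resources = WorkerResources(cores=4, memory_gb=60)
    print(cluster_kwargs(resources, "slurm"))
    print(cluster_kwargs(resources, "lsf"))
```

With something along these lines, distributed_eval would only need to know the generic worker resources, and each cluster class could translate them into its own scheduler options.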
I am happy to polish the code and documentation over the next few days. Since I am not really a dask expert, I would very much welcome feedback on my dask usage.
Best wishes,
Eric
fixes #1111