Fix issues #41, #42, #43: image path, exec proxy, pending reason by powderluv · Pull Request #44 · ROCm/spur

powderluv · 2026-04-03T06:39:11Z

Summary

#41 (Container image not found on agent despite successful import and listing): sbatch now resolves --container-image to an absolute squashfs path at submit time, checking SPUR_IMAGE_DIR, /var/spool/spur/images, and ~/.spur/images. On shared-filesystem clusters the agent receives a full path and finds the image directly.
#42 (spur exec fails with "failed to connect to agent" when executed from login node): exec now connects to the controller instead of directly to the agent. Added ExecInJob to SlurmController; the controller looks up which node is running the job and proxies the request to the correct agent automatically.
#43 (Jobs remain in PENDING state with Reason=Priority despite idle nodes): two changes:
1. submit_job assigns the default partition when none is specified, so the scheduler's partition filter matches correctly.
2. After each scheduling cycle, update_pending_reasons() sets pending_reason to Resources (no capable node), NodeDown (all nodes down/drained), or Priority (nodes occupied). Users now see an accurate reason instead of always seeing Priority.
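
The two #43 changes can be illustrated with a minimal Rust sketch. This is not the code from crates/spurctld/src/cluster.rs: the types NodeState, Node, Partition, and PendingReason, the cpus field, and the function names default_partition and pending_reason are hypothetical. The order of the checks, and the choice to test only capable nodes for the NodeDown case, are assumptions as well.

```rust
// Hypothetical types for illustration only; the real spurctld structs may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Idle,
    Allocated,
    Down,
    Drained,
}

#[derive(Debug, PartialEq, Eq)]
enum PendingReason {
    Resources,
    NodeDown,
    Priority,
}

struct Node {
    state: NodeState,
    cpus: u32,
}

struct Partition {
    name: String,
    is_default: bool,
}

/// Pick the partition flagged as default, falling back to the first one.
/// Returns None only when no partitions are configured.
fn default_partition(partitions: &[Partition]) -> Option<&str> {
    partitions
        .iter()
        .find(|p| p.is_default)
        .or_else(|| partitions.first())
        .map(|p| p.name.as_str())
}

/// Classify why a job needing `cpus_needed` CPUs is still pending.
fn pending_reason(nodes: &[Node], cpus_needed: u32) -> PendingReason {
    // Nodes that could ever run the job, regardless of their current state.
    let capable: Vec<&Node> = nodes.iter().filter(|n| n.cpus >= cpus_needed).collect();
    if capable.is_empty() {
        return PendingReason::Resources;
    }
    // Assumption: NodeDown applies when every capable node is down or drained.
    let all_unavailable = capable
        .iter()
        .all(|n| matches!(n.state, NodeState::Down | NodeState::Drained));
    if all_unavailable {
        PendingReason::NodeDown
    } else {
        // Capable nodes exist and at least one is up; the job is waiting its turn.
        PendingReason::Priority
    }
}
```

With this shape, a job that asks for more CPUs than any node offers reports Resources, while a job that fits on a busy node reports Priority.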

Test plan

cargo build clean
cargo test — 726 tests, 0 failures (up from 706)
cargo fmt --check clean
CI green

🤖 Generated with Claude Code

Issue #41 (Container image not found on agent): sbatch now resolves --container-image to an absolute .sqsh path at submission time by searching SPUR_IMAGE_DIR, /var/spool/spur/images, and ~/.spur/images in order. On shared-filesystem clusters the compute node agent receives the absolute path and finds the file directly.

Issue #42 (spur exec fails from login node): exec now connects to the controller (default http://localhost:6817 via SPUR_CONTROLLER_ADDR) instead of directly to the agent. Added ExecInJob RPC to SlurmController service; the controller looks up which node is running the job and proxies the request to the correct agent automatically.

Issue #43 (Jobs remain PENDING with Reason=Priority): Two fixes:
1. When no --partition is specified, submit_job now assigns the default partition (is_default=true) or first partition, ensuring the scheduler's partition filter matches correctly.
2. After each scheduling cycle, update_pending_reasons() sets pending_reason to Resources (no capable node), NodeDown (all nodes down/drained), or Priority (nodes exist but currently occupied). Users now see an accurate reason instead of always seeing "Priority".

Changes:
- proto/slurm.proto: add ExecInJob to SlurmController service
- crates/spur-cli/src/exec.rs: use controller, not agent directly
- crates/spur-cli/src/sbatch.rs: resolve_container_image() at submit time
- crates/spurctld/src/server.rs: implement controller exec_in_job proxy
- crates/spurctld/src/cluster.rs: default partition assignment + update_pending_reasons()
- crates/spurctld/src/scheduler_loop.rs: call update_pending_reasons after each cycle
- crates/spur-tests/src/t50_core.rs: 3 new tests (t50_89–91), 726 total

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
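
The submit-time image lookup for issue #41 might look roughly like the sketch below. It is not the actual resolve_container_image() from crates/spur-cli/src/sbatch.rs: the helper names image_search_dirs and resolve_image are hypothetical, and mapping an image name to `<dir>/<name>.sqsh` is an assumption about the on-disk naming. Only the three search locations and their order come from the description above.

```rust
use std::env;
use std::fs;
use std::path::PathBuf;

/// Candidate image directories, in the search order given in the description:
/// SPUR_IMAGE_DIR (if set), /var/spool/spur/images, then ~/.spur/images.
fn image_search_dirs() -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    if let Some(dir) = env::var_os("SPUR_IMAGE_DIR") {
        dirs.push(PathBuf::from(dir));
    }
    dirs.push(PathBuf::from("/var/spool/spur/images"));
    if let Some(home) = env::var_os("HOME") {
        dirs.push(PathBuf::from(home).join(".spur/images"));
    }
    dirs
}

/// Return the absolute path of the first existing `<dir>/<name>.sqsh`.
/// Assumption: images are stored under their bare name with a .sqsh suffix.
fn resolve_image(name: &str, dirs: &[PathBuf]) -> Option<PathBuf> {
    dirs.iter()
        .map(|dir| dir.join(format!("{}.sqsh", name)))
        .filter(|candidate| candidate.is_file())
        // canonicalize yields an absolute path even if a search dir was relative.
        .find_map(|candidate| fs::canonicalize(candidate).ok())
}
```

A caller would pass `&image_search_dirs()` as the second argument and send the returned path to the agent in place of the bare image name. Because the path is resolved on the submitting host, this only helps where the compute nodes see the same filesystem, which matches the shared-filesystem caveat in the description.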

powderluv merged commit b3469dc into main Apr 3, 2026
4 checks passed

powderluv deleted the fix/issues-41-42-43 branch April 3, 2026 06:48

powderluv merged 1 commit into main from fix/issues-41-42-43
