Fix issues #51-#56: scheduler, images, attach, CLI, K8s by powderluv · Pull Request #57 · ROCm/spur

powderluv · 2026-04-08T18:55:56Z

Summary

Fixes 6 open issues (3 reopened from PR #49, 3 new):

#56 (reopens #47), Jobs still remain in PENDING state with Reason=Priority despite idle nodes: Scheduler crash recovery via catch_unwind; safety check for num_nodes=0; update_pending_reasons now checks constraint/exclusive/fully-consumed to match the scheduler's actual filtering
#55 (reopens #44), Container image still not found on agent despite successful import and listing: Agent image_dir() uses a 3-tier fallback matching the CLI (env → system dir if it exists → ~/.spur/images)
#54 (reopens #45), spur attach to the job hangs and does not provide interactive I/O: sattach uses per-byte reads instead of line-buffered reads; channel buffer raised 32→256 to prevent deadlock; graceful task shutdown
#53 ([Feature]: CLI requires spur show show node X; docs show spur show node X): spur show node X now inserts an implicit show subcommand and dispatches as scontrol show node X
#52 (No retry/reconnect on controller connection failure): K8s operator wraps background tasks in retry loops with exponential backoff (1s→60s)
#51 (Labeled worker node showing as down*): K8s operator adds an --address flag with POD_IP env var fallback instead of the unroutable Pod hostname
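
The 3-tier image directory fallback described for #55 can be sketched as follows. This is a minimal illustration under stated assumptions, not the agent's actual `image_dir()`: the function name `resolve_image_dir`, its parameters, and the injected `dir_exists` check are hypothetical, and the system path `/var/spool/spur/images` is the one named in the commit message.

```rust
use std::path::{Path, PathBuf};

/// System-wide image directory (the path the commit message says was
/// previously hardcoded in the agent).
const SYSTEM_IMAGE_DIR: &str = "/var/spool/spur/images";

/// Hypothetical helper: pick the image directory using the order
/// env override → system dir if it exists → ~/.spur/images.
/// `env_override` is the value of an environment variable (its name is not
/// given in this PR), `home` is the user's home directory, and `dir_exists`
/// is injected so the logic can be tested without touching the filesystem.
fn resolve_image_dir(
    env_override: Option<&str>,
    home: Option<&Path>,
    dir_exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    // Tier 1: a non-empty explicit override always wins.
    if let Some(dir) = env_override.filter(|d| !d.is_empty()) {
        return Some(PathBuf::from(dir));
    }
    // Tier 2: use the system directory only when it is actually present.
    let system = Path::new(SYSTEM_IMAGE_DIR);
    if dir_exists(system) {
        return Some(system.to_path_buf());
    }
    // Tier 3: fall back to a per-user directory under the home directory.
    home.map(|h| h.join(".spur").join("images"))
}
```

In a real binary the first argument would likely come from `std::env::var` and the last from a check such as `Path::is_dir`. The point of sharing one resolution order between agent and CLI is that an image imported by one is looked up in the same place by the other.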

Test plan

743 tests pass, 0 failures (+13 new)
cargo fmt --check clean
New tests: scheduler edge cases (num_nodes=0, constraint mismatch, exclusive, single idle node), CLI show dispatch, K8s address resolution, retry backoff, image fallback, attach raw bytes
CI: fmt, clippy, build-and-test, cluster tests
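
As an illustration of what the CLI show dispatch test might exercise, here is a small sketch of the argument rewrite for #53. The function name `rewrite_show_args` and its return shape are hypothetical, and tolerating the legacy doubled form `show show` is an assumption made here for backward compatibility, not something the PR states.

```rust
/// Hypothetical helper: given the arguments that follow `spur`, return the
/// equivalent `scontrol` argument vector when the first argument is `show`.
/// Returns `None` for any other subcommand.
fn rewrite_show_args(args: &[&str]) -> Option<Vec<String>> {
    let (first, rest) = args.split_first()?;
    if *first != "show" {
        return None;
    }
    // Assumption: if the user still types the old `spur show show node X`,
    // drop the redundant `show` instead of emitting it twice.
    let rest = match rest.split_first() {
        Some((head, tail)) if *head == "show" => tail,
        _ => rest,
    };
    // Insert the implicit `show` subcommand in front of the remaining args.
    let mut out = vec!["scontrol".to_string(), "show".to_string()];
    out.extend(rest.iter().map(|s| s.to_string()));
    Some(out)
}
```

With this sketch, both `spur show node X` and the older `spur show show node X` map to `scontrol show node X`.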

🤖 Generated with Claude Code

…h hang, CLI show, K8s retry/address

Fixes:
- #56: Scheduler crash recovery via catch_unwind; safety check for num_nodes=0; update_pending_reasons now checks constraint/exclusive/fully-consumed (matching find_suitable_nodes) so Reason accurately reflects why a job can't be scheduled
- #55: Agent image_dir() now uses 3-tier fallback matching CLI (env → system dir if exists → ~/.spur/images) instead of hardcoding /var/spool/spur/images
- #54: sattach uses per-byte reads instead of line-buffered; channel buffer increased 32→256 to prevent deadlock; graceful task shutdown instead of abort
- #53: `spur show node X` now dispatches as `scontrol show node X` by inserting implicit show subcommand (docs said `spur show node` but required `spur show show node`)
- #52: K8s operator wraps background tasks (node watcher, job controller, health) in retry loops with exponential backoff (1s→60s cap)
- #51: K8s operator adds --address flag with POD_IP env var fallback; Pod hostname is no longer used as default (unroutable from spurctld)

Tests: 743 passed, 0 failed (+13 new tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
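
The retry behaviour described for #52 can be sketched as a capped exponential backoff. This is a minimal, synchronous illustration rather than the operator's code: the names `backoff_delay` and `retry_with_backoff` are hypothetical, the doubling factor is an assumption (the PR only states the 1s→60s range), and the `max_attempts` bound and injected `sleep` callback exist only to keep the sketch testable.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at 60s.
/// Doubling per attempt is an assumption; only the 1s start and 60s cap
/// come from the PR description.
fn backoff_delay(attempt: u32) -> Duration {
    const BASE_SECS: u64 = 1;
    const CAP_SECS: u64 = 60;
    // checked_shl yields None once the shift would exceed the width of u64,
    // so very large attempt counts saturate instead of overflowing.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_secs(BASE_SECS.saturating_mul(factor).min(CAP_SECS))
}

/// Hypothetical helper: run `task` until it succeeds or `max_attempts`
/// tries have failed, calling `sleep` with the backoff delay between tries.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut task: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match task() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```

A long-running background task such as a node watcher would more plausibly loop without an attempt bound and use an async sleep; passing `std::thread::sleep` as the `sleep` argument gives the blocking equivalent here.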

powderluv merged commit 2bf4b9e into main Apr 8, 2026
4 checks passed

powderluv merged 1 commit into main from fix/issues-51-56
