Fix issues #41, #42, #43: image path, exec proxy, pending reason #44

Merged
powderluv merged 1 commit into main from fix/issues-41-42-43
Apr 3, 2026

@powderluv
Collaborator

Summary

Test plan

  • cargo build clean
  • cargo test — 726 tests, 0 failures (up from 706)
  • cargo fmt --check clean
  • CI green

🤖 Generated with Claude Code

Issue #41 (Container image not found on agent): sbatch now resolves
--container-image to an absolute .sqsh path at submission time by
searching SPUR_IMAGE_DIR, /var/spool/spur/images, and ~/.spur/images
in order. On shared-filesystem clusters the compute node agent receives
the absolute path and finds the file directly.
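
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code in crates/spur-cli/src/sbatch.rs: the helper names `image_search_dirs` and `resolve_in_dirs`, the rule that a bare image name maps to `<name>.sqsh`, and the handling of already-absolute paths are all assumptions made for the example.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Ordered search path: SPUR_IMAGE_DIR, /var/spool/spur/images, ~/.spur/images.
/// The env values are passed in as arguments (hypothetical design) so the
/// ordering can be tested without touching the process environment.
fn image_search_dirs(spur_image_dir: Option<&str>, home: Option<&str>) -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    if let Some(d) = spur_image_dir.filter(|d| !d.is_empty()) {
        dirs.push(PathBuf::from(d));
    }
    dirs.push(PathBuf::from("/var/spool/spur/images"));
    if let Some(h) = home.filter(|h| !h.is_empty()) {
        dirs.push(Path::new(h).join(".spur").join("images"));
    }
    dirs
}

/// Return the first existing `<dir>/<image>.sqsh` in search order.
/// Assumption: a name without the `.sqsh` suffix gets it appended.
fn resolve_in_dirs(image: &str, dirs: &[PathBuf]) -> Option<PathBuf> {
    let as_path = Path::new(image);
    if as_path.is_absolute() {
        // Assumption: an absolute path is accepted as-is if the file exists.
        return as_path.is_file().then(|| as_path.to_path_buf());
    }
    let file_name = if image.ends_with(".sqsh") {
        image.to_string()
    } else {
        format!("{image}.sqsh")
    };
    dirs.iter().map(|d| d.join(&file_name)).find(|p| p.is_file())
}

/// Submit-time entry point: read the environment, then search.
fn resolve_container_image(image: &str) -> Option<PathBuf> {
    let dirs = image_search_dirs(
        env::var("SPUR_IMAGE_DIR").ok().as_deref(),
        env::var("HOME").ok().as_deref(),
    );
    resolve_in_dirs(image, &dirs)
}
```

Resolving at submission time means the job record carries a path the agent can open directly, which only helps when the image directory is visible on the compute node, as on the shared-filesystem clusters mentioned above.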

Issue #42 (spur exec fails from login node): exec now connects to the
controller (SPUR_CONTROLLER_ADDR, default http://localhost:6817)
instead of directly to the agent. Added ExecInJob RPC to SlurmController
service; the controller looks up which node is running the job and
proxies the request to the correct agent automatically.
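
The routing step can be pictured with the sketch below. It is an illustration under assumptions, not the implementation in crates/spurctld/src/server.rs: the `Controller` struct, its two maps, the `ExecRouteError` variants, and `route_exec` are hypothetical, and the real ExecInJob handler forwards a gRPC request rather than returning an address.

```rust
use std::collections::HashMap;

const DEFAULT_CONTROLLER_ADDR: &str = "http://localhost:6817";

/// CLI side: use SPUR_CONTROLLER_ADDR when set and non-empty, else the default.
/// The caller passes `std::env::var("SPUR_CONTROLLER_ADDR").ok().as_deref()`.
fn controller_addr(env_value: Option<&str>) -> String {
    match env_value {
        Some(v) if !v.trim().is_empty() => v.trim().to_string(),
        _ => DEFAULT_CONTROLLER_ADDR.to_string(),
    }
}

/// Hypothetical failure modes for routing an exec request.
#[derive(Debug, PartialEq)]
enum ExecRouteError {
    UnknownJob(u32),
    NotRunning(u32),
    UnknownNode(String),
}

/// Hypothetical, simplified view of the controller state.
struct Controller {
    /// job id -> node running it (None while the job is still pending)
    job_to_node: HashMap<u32, Option<String>>,
    /// node name -> agent address
    node_to_agent: HashMap<String, String>,
}

impl Controller {
    /// Controller side: find the agent that should receive the exec request.
    fn route_exec(&self, job_id: u32) -> Result<&str, ExecRouteError> {
        let node = self
            .job_to_node
            .get(&job_id)
            .ok_or(ExecRouteError::UnknownJob(job_id))?
            .as_ref()
            .ok_or(ExecRouteError::NotRunning(job_id))?;
        self.node_to_agent
            .get(node)
            .map(String::as_str)
            .ok_or_else(|| ExecRouteError::UnknownNode(node.clone()))
    }
}
```

With the lookup on the controller, the CLI only needs one address, which is why exec can work from a login node that has no direct knowledge of where the job landed.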

Issue #43 (Jobs remain PENDING with Reason=Priority): Two fixes:
1. When no --partition is specified, submit_job now assigns the default
   partition (is_default=true) or first partition, ensuring the
   scheduler's partition filter matches correctly.
2. After each scheduling cycle, update_pending_reasons() sets
   pending_reason to Resources (no capable node), NodeDown (all nodes
   down/drained), or Priority (nodes exist but currently occupied).
   Users now see an accurate reason instead of always seeing "Priority".
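
As a rough sketch of the two fixes, assuming simplified types: `Partition`, `Node`, `NodeState`, `pick_partition`, and `pending_reason` are hypothetical names for this example, and "capable" is reduced to a CPU count check. The actual logic lives in crates/spurctld/src/cluster.rs and may order the checks or scope the down/drained test differently.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Partition {
    name: String,
    is_default: bool,
}

/// Fix 1: an explicit --partition wins; otherwise use the partition marked
/// is_default, falling back to the first configured partition.
fn pick_partition<'a>(requested: Option<&'a str>, partitions: &'a [Partition]) -> Option<&'a str> {
    requested.or_else(|| {
        partitions
            .iter()
            .find(|p| p.is_default)
            .or_else(|| partitions.first())
            .map(|p| p.name.as_str())
    })
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState {
    Idle,
    Allocated,
    Down,
    Drained,
}

struct Node {
    cpus: u32,
    state: NodeState,
}

#[derive(Debug, PartialEq)]
enum PendingReason {
    Resources,
    NodeDown,
    Priority,
}

/// Fix 2: classify why a job is still pending after a scheduling cycle.
fn pending_reason(required_cpus: u32, nodes: &[Node]) -> PendingReason {
    // Assumption: a node is "capable" if it has enough CPUs in total.
    let capable: Vec<&Node> = nodes.iter().filter(|n| n.cpus >= required_cpus).collect();
    if capable.is_empty() {
        return PendingReason::Resources; // no node could ever run the job
    }
    // Assumption: only capable nodes are considered for the down/drained check.
    if capable
        .iter()
        .all(|n| matches!(n.state, NodeState::Down | NodeState::Drained))
    {
        return PendingReason::NodeDown;
    }
    PendingReason::Priority // capable nodes exist but are currently occupied
}
```

Checking for capability first keeps the reasons mutually exclusive: a job that no node could ever run reports Resources even when some nodes are also down.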

Changes:
- proto/slurm.proto: add ExecInJob to SlurmController service
- crates/spur-cli/src/exec.rs: use controller, not agent directly
- crates/spur-cli/src/sbatch.rs: resolve_container_image() at submit time
- crates/spurctld/src/server.rs: implement controller exec_in_job proxy
- crates/spurctld/src/cluster.rs: default partition assignment + update_pending_reasons()
- crates/spurctld/src/scheduler_loop.rs: call update_pending_reasons after each cycle
- crates/spur-tests/src/t50_core.rs: 3 new tests (t50_89–91), 726 total

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>