Fix issues #41, #42, #43: image path, exec proxy, pending reason #44
Merged
Issue #41 (Container image not found on agent): sbatch now resolves --container-image to an absolute .sqsh path at submission time by searching SPUR_IMAGE_DIR, /var/spool/spur/images, and ~/.spur/images in order. On shared-filesystem clusters the compute node agent receives the absolute path and finds the file directly.

Issue #42 (spur exec fails from login node): exec now connects to the controller (default http://localhost:6817 via SPUR_CONTROLLER_ADDR) instead of directly to the agent. Added ExecInJob RPC to SlurmController service; the controller looks up which node is running the job and proxies the request to the correct agent automatically.

Issue #43 (Jobs remain PENDING with Reason=Priority): Two fixes:

1. When no --partition is specified, submit_job now assigns the default partition (is_default=true) or first partition, ensuring the scheduler's partition filter matches correctly.
2. After each scheduling cycle, update_pending_reasons() sets pending_reason to Resources (no capable node), NodeDown (all nodes down/drained), or Priority (nodes exist but currently occupied). Users now see an accurate reason instead of always seeing "Priority".

Changes:
- proto/slurm.proto: add ExecInJob to SlurmController service
- crates/spur-cli/src/exec.rs: use controller, not agent directly
- crates/spur-cli/src/sbatch.rs: resolve_container_image() at submit time
- crates/spurctld/src/server.rs: implement controller exec_in_job proxy
- crates/spurctld/src/cluster.rs: default partition assignment + update_pending_reasons()
- crates/spurctld/src/scheduler_loop.rs: call update_pending_reasons after each cycle
- crates/spur-tests/src/t50_core.rs: 3 new tests (t50_89–91), 726 total

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
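The submit-time image lookup described for issue #41 can be illustrated with a minimal sketch. This is not the code from crates/spur-cli/src/sbatch.rs. The helper names `image_search_dirs` and `resolve_in_dirs`, the signature of `resolve_container_image`, the use of `HOME` to expand `~`, and the handling of absolute paths and of names that already end in `.sqsh` are all assumptions made for illustration. Only the three search locations and their order come from the description above.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Build the ordered search list: SPUR_IMAGE_DIR (if set), then
/// /var/spool/spur/images, then ~/.spur/images (if a home dir is known).
/// Both inputs are parameters so the ordering can be tested without
/// touching the real environment (hypothetical helper).
fn image_search_dirs(image_dir: Option<&str>, home: Option<&str>) -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    if let Some(d) = image_dir {
        dirs.push(PathBuf::from(d));
    }
    dirs.push(PathBuf::from("/var/spool/spur/images"));
    if let Some(h) = home {
        dirs.push(Path::new(h).join(".spur").join("images"));
    }
    dirs
}

/// Return the first existing `<dir>/<name>.sqsh` in search order, or None.
/// Assumption: an absolute path is checked as given, and a name that
/// already ends in `.sqsh` is not given a second extension.
fn resolve_in_dirs(name: &str, dirs: &[PathBuf]) -> Option<PathBuf> {
    let as_path = Path::new(name);
    if as_path.is_absolute() {
        return as_path.is_file().then(|| as_path.to_path_buf());
    }
    let file_name = if name.ends_with(".sqsh") {
        name.to_string()
    } else {
        format!("{name}.sqsh")
    };
    dirs.iter()
        .map(|dir| dir.join(&file_name))
        .find(|candidate| candidate.is_file())
}

/// Resolve an image name using the process environment.
/// The result is absolute only if the directories themselves are absolute.
fn resolve_container_image(name: &str) -> Option<PathBuf> {
    let image_dir = env::var("SPUR_IMAGE_DIR").ok();
    let home = env::var("HOME").ok();
    let dirs = image_search_dirs(image_dir.as_deref(), home.as_deref());
    resolve_in_dirs(name, &dirs)
}
```

Under these assumptions, `resolve_container_image("ubuntu")` returns the first `ubuntu.sqsh` found in the three directories. Sending that resolved path with the job is what lets an agent on a shared filesystem open the image without repeating the search.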
This was referenced Apr 3, 2026
Summary

- Container image not found on agent despite successful import and listing #41 (`spur image` not found on agent): `sbatch` now resolves `--container-image` to an absolute squashfs path at submit time (checks `SPUR_IMAGE_DIR`, `/var/spool/spur/images`, `~/.spur/images`). On shared-filesystem clusters the agent receives a full path and finds the image directly.
- spur exec fails with "failed to connect to agent" when executed from login node #42 (`spur exec` fails from login node): `exec` now connects to the controller instead of directly to the agent. Added `ExecInJob` to `SlurmController`; the controller looks up which node is running the job and proxies to the correct agent automatically.
- Jobs remain in PENDING state with Reason=Priority despite idle nodes #43 (jobs stay PENDING with Reason=Priority): Two changes:
  1. `submit_job` assigns the default partition when none is specified, so the scheduler's partition filter matches correctly.
  2. `update_pending_reasons()` sets `pending_reason` to `Resources` (no capable node), `NodeDown` (all nodes down/drained), or `Priority` (nodes occupied). Users now see an accurate reason.

Test plan

- `cargo build` clean
- `cargo test`: 726 tests, 0 failures (up from 706)
- `cargo fmt --check` clean

🤖 Generated with Claude Code
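The two changes for #43 can be sketched as follows. This is an illustrative model, not the code in crates/spurctld/src/cluster.rs. The types `Node`, `NodeState`, `Partition`, and `PendingReason`, the functions `pick_partition` and `pending_reason`, and the use of a CPU count as the only capability check are hypothetical. Whether `NodeDown` is decided over all nodes or only over capable ones is not stated in the summary, so the sketch assumes capable nodes only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Idle,
    Allocated,
    Down,
    Drained,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PendingReason {
    Resources,
    NodeDown,
    Priority,
}

struct Node {
    state: NodeState,
    cpus: u32,
}

struct Partition {
    name: String,
    is_default: bool,
}

/// Keep an explicit request; otherwise fall back to the partition marked
/// default, then to the first partition, as the summary describes.
fn pick_partition(requested: Option<&str>, partitions: &[Partition]) -> Option<String> {
    if let Some(name) = requested {
        return Some(name.to_string());
    }
    partitions
        .iter()
        .find(|p| p.is_default)
        .or_else(|| partitions.first())
        .map(|p| p.name.clone())
}

/// Classify why a job is still pending after a scheduling cycle.
/// Assumption: "capable" means the node has at least `cpus_needed` CPUs.
fn pending_reason(nodes: &[Node], cpus_needed: u32) -> PendingReason {
    let capable: Vec<&Node> = nodes.iter().filter(|n| n.cpus >= cpus_needed).collect();
    if capable.is_empty() {
        // No node could ever run this job.
        return PendingReason::Resources;
    }
    let all_unavailable = capable
        .iter()
        .all(|n| matches!(n.state, NodeState::Down | NodeState::Drained));
    if all_unavailable {
        PendingReason::NodeDown
    } else {
        // Capable nodes exist and are up, so the job is waiting its turn.
        PendingReason::Priority
    }
}
```

In this model a job that requests more CPUs than any node offers reports `Resources`, while the same job on a cluster whose only large nodes are drained reports `NodeDown`. That distinction is what the summary means by users seeing an accurate reason.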