Fix config listen port ignored, sinfo NODELIST, and image dir mismatch #39
Merged
Implements the final item from the plan: federation peer forwarding in the scheduler loop, plus test coverage for all 5 polish features.

Changes:
- scheduler_loop.rs: forward unschedulable jobs to federation peer clusters via SlurmControllerClient::submit_job(). Tries peers in order; stops on first successful accept. Adds core_spec_to_proto() helper for the core→proto conversion required for RPC forwarding.
- t50_core.rs: federation config parsing tests (t50_75–78), PMIx env var name tests (t50_79–81): 7 new tests.
- t07_sched.rs: federation forwarding decision tests (t07_18–20), power management state transition tests (t07_21–23): 6 new tests.
- t01_run.rs: srun step mode env var tests (t01_18–20): 3 new tests.

Total: 688 → 706 tests, 0 failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
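The "try peers in order, stop on first accept" rule from this commit can be sketched as below. This is a minimal synchronous illustration, not the code in scheduler_loop.rs: `JobSpec`, `PeerClient`, `MockPeer`, and `forward_to_peers` are hypothetical names, and the real forwarding goes through SlurmControllerClient::submit_job(), whose signature is not shown in this PR.

```rust
// Hypothetical stand-ins for illustration only. The PR forwards jobs through
// SlurmControllerClient::submit_job(), whose real signature is not shown here.
#[derive(Debug, Clone)]
struct JobSpec {
    name: String,
}

trait PeerClient {
    fn peer_name(&self) -> &str;
    /// Ok(remote_job_id) when the peer accepts the job, Err(reason) otherwise.
    fn submit_job(&self, spec: &JobSpec) -> Result<u64, String>;
}

/// Try each peer in order and stop at the first one that accepts the job.
/// Returns the accepting peer's name and remote job id, or None if all refuse.
fn forward_to_peers(peers: &[Box<dyn PeerClient>], spec: &JobSpec) -> Option<(String, u64)> {
    for peer in peers {
        match peer.submit_job(spec) {
            Ok(job_id) => return Some((peer.peer_name().to_string(), job_id)),
            // Rejected or unreachable: fall through to the next peer.
            Err(_reason) => continue,
        }
    }
    None
}

/// Test double that either accepts with a fixed job id or rejects.
struct MockPeer {
    name: &'static str,
    accept_as: Option<u64>,
}

impl PeerClient for MockPeer {
    fn peer_name(&self) -> &str {
        self.name
    }

    fn submit_job(&self, spec: &JobSpec) -> Result<u64, String> {
        self.accept_as
            .ok_or_else(|| format!("{} rejected job {}", self.name, spec.name))
    }
}

fn main() {
    let peers: Vec<Box<dyn PeerClient>> = vec![
        Box::new(MockPeer { name: "cluster-a", accept_as: None }),
        Box::new(MockPeer { name: "cluster-b", accept_as: Some(42) }),
        Box::new(MockPeer { name: "cluster-c", accept_as: Some(7) }),
    ];
    let spec = JobSpec { name: "train".to_string() };
    // cluster-a rejects, cluster-b accepts, cluster-c is never contacted.
    println!("{:?}", forward_to_peers(&peers, &spec));
}
```

Returning `Option` keeps the "no peer accepted" case explicit, so the caller can leave the job pending locally.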
#36 #37 #35)

Three production bugs found on MI250X cluster.

Fix #37: spurctld ignored config file listen_addr and always bound to default [::]:6817. Changed --listen from a defaulted String to an Option<String>. When absent, config.controller.listen_addr is used; when present, the CLI flag overrides. config.controller.listen_addr is updated in-place so downstream code sees the final address.

Fix #36: sinfo NODELIST showed "localhost" because the default config was active (consequence of #37: wrong port → agents couldn't connect → looked like default cluster). After #37 fix the correct config loads. Also improved the 'N' field in partition view to show actual registered node names when available, falling back to the static spec string.

Fix #35: spur image import saves to resolve_image_dir() (checks SPUR_IMAGE_DIR env var, falls back to ~/.spur/images for non-root users), but spurd's container.rs had IMAGE_DIR hardcoded to /var/spool/spur/images, so the directory diverges for non-root users, making the agent unable to find images imported by the CLI. Replaced the hardcoded const with image_dir() that checks SPUR_IMAGE_DIR first, matching CLI behavior. Also improved the "not found" error to include the actual directory searched.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
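The override rule in the #37 fix can be sketched as follows. This is an illustration under assumptions, not the spurctld source: only the `config.controller.listen_addr` field path and the `Option<String>` flag type come from the commit message, while the struct definitions and `apply_listen_override` are hypothetical, and the real CLI parsing is omitted.

```rust
/// Hypothetical config shape; only the controller.listen_addr path is from the PR.
struct ControllerConfig {
    listen_addr: String,
}

struct Config {
    controller: ControllerConfig,
}

/// The CLI flag wins when present; otherwise the config file value is kept.
/// The config is updated in place so later code reads a single final address.
fn apply_listen_override(config: &mut Config, cli_listen: Option<String>) {
    if let Some(addr) = cli_listen {
        config.controller.listen_addr = addr;
    }
}

fn main() {
    // Pretend this value was loaded from the config file.
    let mut config = Config {
        controller: ControllerConfig {
            listen_addr: "10.0.0.1:7000".to_string(),
        },
    };

    // No --listen flag: the config file value is used as-is.
    apply_listen_override(&mut config, None);
    println!("binding to {}", config.controller.listen_addr);

    // --listen given: the flag overrides the config file.
    apply_listen_override(&mut config, Some("[::]:7001".to_string()));
    println!("binding to {}", config.controller.listen_addr);
}
```

The earlier bug follows from the flag being a `String` with a default: a defaulted value is indistinguishable from one the user typed, so the default always overwrote the config. `Option` restores that distinction.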
powderluv added a commit that referenced this pull request on Mar 27, 2026
090d9f9 to 9ecdc56
Summary
Fixes three production bugs reported by amd-kmundiga on MI250X cluster.
- `spurctld` ignored `listen_addr` in config file and always bound to default `[::]:6817`. Changed `--listen` to `Option<String>`: absent means use config, present overrides config.
- `sinfo` NODELIST showed `localhost` (consequence of #37: wrong port → default config used → `nodes = "localhost"`). Fixed by the #37 fix. Also improved NODELIST to show actual registered node names when available.
- `spur image import` and `spurd` used different directory resolution: the CLI respects `SPUR_IMAGE_DIR` with a `~/.spur/images` fallback, while the agent had `/var/spool/spur/images` hardcoded. Replaced hardcoded const with `image_dir()` that checks `SPUR_IMAGE_DIR` first. Error message now includes the directory searched.

Test plan
- `cargo test`: 706 tests, 0 failures
- `cargo fmt --check`: clean
- `spurctld --config /etc/spur/spur.conf` binds to the port in config
- `SPUR_IMAGE_DIR=/shared/images spur image import ...` + agent finds image

🤖 Generated with Claude Code
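For the image directory fix, the agent-side lookup might look roughly like the sketch below. Several parts are assumptions rather than the contents of container.rs: the fallback to `/var/spool/spur/images` when `SPUR_IMAGE_DIR` is unset, the treatment of an empty value as unset, the helper names `image_dir_from` and `find_image_in`, and the example file name.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Assumed fallback: the previously hardcoded agent directory.
const DEFAULT_IMAGE_DIR: &str = "/var/spool/spur/images";

/// Pure helper so the resolution rule can be tested without touching the
/// process environment. Treating an empty value as unset is an assumption.
fn image_dir_from(env_value: Option<String>) -> PathBuf {
    match env_value {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(DEFAULT_IMAGE_DIR),
    }
}

/// Check SPUR_IMAGE_DIR first, as the CLI does, then fall back.
fn image_dir() -> PathBuf {
    image_dir_from(env::var("SPUR_IMAGE_DIR").ok())
}

/// Look up an image file and report the searched directory on failure.
fn find_image_in(dir: &Path, name: &str) -> Result<PathBuf, String> {
    let path = dir.join(name);
    if path.exists() {
        Ok(path)
    } else {
        Err(format!("image '{}' not found in {}", name, dir.display()))
    }
}

fn main() {
    let dir = image_dir();
    // "example.sqsh" is a hypothetical image file name.
    match find_image_in(&dir, "example.sqsh") {
        Ok(path) => println!("found {}", path.display()),
        Err(msg) => eprintln!("{}", msg),
    }
}
```

Putting the searched directory in the error makes a CLI/agent mismatch visible from the agent log alone.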