Fix config listen port ignored, sinfo NODELIST, and image dir mismatch #39

Merged
powderluv merged 2 commits into main from users/powderluv/fix-issues-35-36-37 on Mar 27, 2026
@powderluv
Collaborator

Summary

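Fixes three production bugs reported by amd-kmundiga on an MI250X cluster: spurctld ignoring the config file listen_addr (#37), sinfo NODELIST showing "localhost" (#36), and the image directory mismatch between the CLI and the agent (#35).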

Test plan

  • cargo test — 706 tests, 0 failures
  • cargo fmt --check — clean
  • Verify spurctld --config /etc/spur/spur.conf binds to the port in config
  • Verify that after SPUR_IMAGE_DIR=/shared/images spur image import ... the agent finds the image

🤖 Generated with Claude Code

powderluv and others added 2 commits March 26, 2026 22:39
Implements the final item from the plan: federation peer forwarding in
the scheduler loop, plus test coverage for all 5 polish features.

Changes:
- scheduler_loop.rs: forward unschedulable jobs to federation peer
  clusters via SlurmControllerClient::submit_job(). Tries peers in
  order; stops on first successful accept. Adds core_spec_to_proto()
  helper for the core→proto conversion required for RPC forwarding.
- t50_core.rs: federation config parsing tests (t50_75–78), PMIx env
  var name tests (t50_79–81) — 7 new tests.
- t07_sched.rs: federation forwarding decision tests (t07_18–20),
  power management state transition tests (t07_21–23) — 6 new tests.
- t01_run.rs: srun step mode env var tests (t01_18–20) — 3 new tests.
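
The "try peers in order, stop on first accept" behavior described for scheduler_loop.rs can be sketched as below. This is only an illustration under stated assumptions: forward_to_first_peer and the submit closure are hypothetical names, and the closure stands in for the real SlurmControllerClient::submit_job() RPC, whose signature is not shown in this PR.

```rust
/// Hypothetical helper: offer a job to each peer in order and stop at the
/// first peer that accepts it. `submit` stands in for the real RPC call.
/// Returns the index of the accepting peer, or None if every peer refused.
pub fn forward_to_first_peer<P, F>(peers: &[P], mut submit: F) -> Option<usize>
where
    F: FnMut(&P) -> Result<(), String>,
{
    // `position` short-circuits, so peers after the first success are
    // never contacted.
    peers.iter().position(|peer| submit(peer).is_ok())
}
```

A caller would leave the job pending locally when the function returns None.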

Total: 688 → 706 tests, 0 failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
#36 #37 #35)

Three production bugs found on an MI250X cluster.

Fix #37: spurctld ignored config file listen_addr and always bound to
default [::]:6817. Changed --listen from a defaulted String to an
Option<String>. When absent, config.controller.listen_addr is used;
when present, the CLI flag overrides. config.controller.listen_addr is
updated in-place so downstream code sees the final address.
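
A minimal sketch of the override rule described above, assuming a simplified config shape. ControllerConfig and apply_listen_override are hypothetical names for illustration; the actual structs and argument parsing in spurctld are not shown in this PR.

```rust
/// Hypothetical stand-in for the controller section of the config file.
#[derive(Debug, Clone, PartialEq)]
pub struct ControllerConfig {
    pub listen_addr: String,
}

/// Apply the CLI flag on top of the loaded config.
/// - `None` (flag absent): the address from the config file is kept.
/// - `Some(addr)` (flag present): the CLI value overrides the config.
/// The config is updated in place so later code reads one final address.
pub fn apply_listen_override(cli_listen: Option<String>, controller: &mut ControllerConfig) {
    if let Some(addr) = cli_listen {
        controller.listen_addr = addr;
    }
}
```

Using Option<String> instead of a defaulted String is what lets the code tell "flag not given" apart from "flag given with the default value".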

Fix #36: sinfo NODELIST showed "localhost" because the default config
was active. This was a consequence of #37: the controller bound to the
wrong port, agents couldn't connect, and the cluster looked like the
default one. After the #37 fix the correct config loads. Also improved
the 'N' field in partition view to show actual registered node names
when available, falling back to the static spec string.
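
The fallback for the 'N' field could look roughly like this. partition_nodelist is a hypothetical name, and joining names with commas is an assumption made for the sketch; the real output formatting is not shown in this PR.

```rust
/// Hypothetical helper: prefer the names of nodes that actually registered,
/// and fall back to the static spec string from the config when none have.
pub fn partition_nodelist(registered: &[String], static_spec: &str) -> String {
    if registered.is_empty() {
        static_spec.to_string()
    } else {
        // Assumption: a plain comma-separated list, with no range compression.
        registered.join(",")
    }
}
```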

Fix #35: spur image import saves to resolve_image_dir() (checks
SPUR_IMAGE_DIR env var, falls back to ~/.spur/images for non-root
users), but spurd's container.rs had IMAGE_DIR hardcoded to
/var/spool/spur/images. The two directories diverge for non-root
users, so the agent could not find images imported by the CLI.
Replaced the hardcoded const with image_dir(), which checks
SPUR_IMAGE_DIR first, matching CLI behavior. Also improved the
"not found" error to include the actual directory searched.
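
A sketch of the env-var-first lookup and the more informative error, under stated assumptions. image_dir() is named in the commit message, but its body here is a guess: the fallback to /var/spool/spur/images and the treatment of an empty variable as unset are assumptions, and image_dir_from and find_image are hypothetical helpers.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Assumed fallback when SPUR_IMAGE_DIR is not set.
const DEFAULT_IMAGE_DIR: &str = "/var/spool/spur/images";

/// Pure helper so the rule can be checked without touching the process env.
pub fn image_dir_from(env_value: Option<String>) -> PathBuf {
    match env_value {
        // Assumption: an empty value is treated the same as an unset one.
        Some(v) if !v.is_empty() => PathBuf::from(v),
        _ => PathBuf::from(DEFAULT_IMAGE_DIR),
    }
}

/// Check SPUR_IMAGE_DIR first, then fall back to the default directory.
pub fn image_dir() -> PathBuf {
    image_dir_from(env::var("SPUR_IMAGE_DIR").ok())
}

/// Look up an image by file name and report the searched directory on failure.
pub fn find_image(dir: &Path, name: &str) -> Result<PathBuf, String> {
    let candidate = dir.join(name);
    if candidate.exists() {
        Ok(candidate)
    } else {
        Err(format!("image '{}' not found in {}", name, dir.display()))
    }
}
```

Including the directory in the error makes a CLI/agent mismatch visible right away, which is the failure mode reported in #35.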

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@powderluv powderluv changed the base branch from users/powderluv/federation-polish to main March 27, 2026 15:43
powderluv added a commit that referenced this pull request Mar 27, 2026
@powderluv powderluv force-pushed the users/powderluv/fix-issues-35-36-37 branch from 090d9f9 to 9ecdc56 Compare March 27, 2026 16:46
@powderluv powderluv merged commit f8394d6 into main Mar 27, 2026
3 checks passed
@powderluv powderluv deleted the users/powderluv/fix-issues-35-36-37 branch March 27, 2026 16:46