K8s deployment fixes #50

Merged
powderluv merged 1 commit into ROCm:main from tensorwavecloud:k8s-deploy on Apr 9, 2026

@Bellk17 (Contributor) commented Apr 8, 2026

Motivation

Fix several issues in the K8s deployment manifests and Dockerfile that prevented successful deployment to a production RKE2 cluster with AMD MI325X GPUs.

Technical Details

  • Fix Dockerfile build: Bumped the builder image from rust:1.82 to rust:latest (the redb crate requires edition 2024, which rust:1.82 does not support) and added protobuf-compiler and libprotobuf-dev to the builder stage
  • Fix configmap config syntax: Changed [partitions.default] (a TOML table) to [[partitions]] with name = "default" (a TOML array of tables) to match the SlurmConfig parser; replaced [cluster] name = ... with top-level cluster_name = ...; added the required nodes field
  • Fix operator args ordering: Removed the run subcommand from args. --controller-addr and the other flags are top-level clap args that must appear before any subcommand, and run is already the default
  • Simplify spool volume: Replaced the volumeClaimTemplate with an emptyDir for the spurctld StatefulSet to avoid PVC provisioning issues on clusters with certain storage backends (e.g., Longhorn)
  • Fix spurrestd controller address: Prepended http:// to the --controller arg in spurrestd.yaml. spurrestd passes the address directly to tonic's connect(), which requires a scheme
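
For reviewers who want to see the configmap change concretely, the fragment below sketches the before and after shapes described in the list above. It is an illustration only: the values `example-cluster` and `example-node` are placeholders, and both the placement of `nodes` inside the partition entry and its string format are assumptions that this description does not confirm.

```toml
# Before (shape that did not match the SlurmConfig parser), shown commented out:
#
# [cluster]
# name = "example-cluster"      # placeholder value
#
# [partitions.default]          # partitions as a table keyed by name

# After: cluster name is a top-level key, partitions is an array of tables.
cluster_name = "example-cluster"  # placeholder value

[[partitions]]
name = "default"
# Assumption: `nodes` lives on the partition entry and takes a string.
# The real field placement and format may differ.
nodes = "example-node"
```

Additional partitions would be added as further `[[partitions]]` entries, each carrying its own `name`.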

Test Plan

  • docker build -f deploy/Dockerfile completes successfully
  • All manifests apply cleanly to an RKE2 cluster
  • spurctld starts and runs the scheduler loop
  • spur-k8s-operator connects to spurctld and registers GPU nodes
  • spurrestd connects to spurctld without 500 errors
  • spur nodes shows registered MI325X node with correct resources

Test Result

Deployed to a single-node RKE2 cluster with AMD MI325X GPUs. All Spur control plane components (spurctld, spur-k8s-operator, spurrestd) start and communicate successfully. GPU node registered with correct CPU, memory, and GPU resources.

- Fix default config values in configmap
- Fix args in operator yaml (remove "run")
- Simplify spool volume
- Prepend FQDN with protocol
@powderluv merged commit 6ddaa72 into ROCm:main on Apr 9, 2026
4 checks passed