ops: set-model.ps1 — atomic vLLM model swap on a running host by jieyao-MilestoneHub · Pull Request #14 · CoreNovus/llm-gateway

jieyao-MilestoneHub · 2026-05-05T12:03:44Z

Summary

Adds scripts/ops/set-model.ps1, which swaps a different model into the running vLLM container without a redeploy. It composes with the existing day-to-day ops loop (setup-ssh.ps1 → fix-and-start.ps1 → develop / experiment → restore-idle-protection.ps1) and shares the same conventions: tag-based instance discovery, IPv4-validated EIP, hardened SSH options, idempotent re-runs, deterministic exit codes.

Use cases

Switch to a different model size or family without redeploying.
Sweep vLLM flags (--gpu-memory-utilization, --max-model-len, --tool-call-parser) for an ablation, with auto-archived rollback.
Pre-pull weights into the host HF cache before a measurement run (-PrePull) so model-load latency does not pollute timings.
Drive a benchmark from any orchestrator that can shell out to PowerShell — the script is non-interactive and exit codes are stable.

Threat model

Threat	Mitigation
T1. Argument injection	Every string parameter is whitelist-validated client-side (`ValidatePattern` / `ValidateSet` / `ValidateRange`). Validation re-runs on the remote host as defence in depth. Values transit as a single `ConvertTo-Json` blob via fd 3 to `bash -s`; the remote script reads them with `jq`, so they never enter the shell-tokenisation surface.
T2. Compose-file corruption	The edit is atomic: an `awk` state machine rewrites a copy, `docker compose config -q` validates the result, then `mv` swaps it in. The state machine only touches the `vllm:` service's `command:` list; sibling services and other keys are untouched. Originals are preserved under `/opt/llm-gateway/deploy/.swap-history/` for forensic diff / manual rollback. Optional flags (`--quantization` / `--tool-call-parser`) are dropped when set to `none` and inserted immediately after `--max-model-len` when newly added.
T3. No source mutation over SSH	Before any swap a READ-ONLY pre-flight searches the deployed `registry.py` for the requested `served_model_name` (`grep -F`). Unknown names exit 1 with a pointer to open a registry PR. Mutating Python source over SSH would defeat the gateway's normal code-review path.
T4. Concurrent operators	`flock -n /var/lock/llm-gateway-set-model.lock` makes parallel swaps fail fast instead of interleaving compose edits.
T5. Bearer-token leakage	The script never reads `.env`, never logs secrets, and sends no traffic to the bearer-gated `/v1` surface.
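
For reviewers less familiar with the T1 approach, here is a minimal Python sketch of the same idea: whitelist-validate each value, then serialise everything into one JSON document that travels as data instead of being spliced into a command line. The regular expression, the numeric range, the field names, and the function name `build_payload` are assumptions made for illustration; the script itself uses PowerShell validation attributes on the client and `jq` on the remote side.

```python
import json
import re

# Hypothetical whitelist: "org/name" with a conservative character set.
# The script's actual ValidatePattern is not shown in this description.
MODEL_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def build_payload(model: str, max_model_len: int) -> str:
    """Validate inputs, then return one JSON blob to hand over as data."""
    if not MODEL_RE.fullmatch(model):
        raise ValueError(f"rejected model id: {model!r}")
    if not 1 <= max_model_len <= 1_000_000:  # assumed range, illustration only
        raise ValueError(f"rejected max_model_len: {max_model_len}")
    return json.dumps({"model": model, "max_model_len": max_model_len})
```

Because the receiver parses the blob with a JSON parser, a value such as `foo/bar; rm -rf /` is rejected by the whitelist before any connection is attempted, and even an accepted value is never tokenised by a shell.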

Idempotency

Re-running with the currently-loaded model is a no-op: the script reads the existing compose file, finds every requested flag already in place, verifies that /ready returns 200, and exits 0 without touching anything.
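
The "every requested flag already in place" check can be pictured roughly as follows. This is a sketch in Python rather than the `awk` the script uses, and it assumes the `command:` list stores each flag and its value as separate items; `flags_match` is a hypothetical name.

```python
from typing import Optional


def flags_match(command: list[str], desired: dict[str, Optional[str]]) -> bool:
    """Return True when every desired flag already has the desired value.

    A desired value of None means the flag must be absent.
    Assumes separate "--flag", "value" items in the command list.
    """
    current = {}
    for i, item in enumerate(command):
        if item.startswith("--") and i + 1 < len(command):
            current[item] = command[i + 1]
    for flag, value in desired.items():
        if value is None:
            if flag in command:
                return False  # flag should have been dropped
        elif current.get(flag) != value:
            return False  # flag missing or set to a different value
    return True
```

Only when this kind of comparison succeeds and the readiness probe also passes would a run be treated as a no-op.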

Exit codes

Code	Meaning
0	compose updated (or already matched) + container healthy + `/ready` 200
1	compose edit failed, registry pre-flight failed, or remote validation failed
2	`docker compose up` failed
3	container not healthy or `/ready` not 200 within budget
4	SSH connection / key / discovery failed

These are stable — adding a new code is a breaking change for any caller that branches on values; prefer reusing an existing one.
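
For orchestrators that branch on these values, one possible way to pin the contract on the caller side is an enum. The member names and the retry policy below are assumptions for illustration, not part of the script.

```python
from enum import IntEnum


class SetModelExit(IntEnum):
    """Exit codes from the table above; member names are hypothetical."""
    OK = 0
    EDIT_OR_VALIDATION_FAILED = 1
    COMPOSE_UP_FAILED = 2
    NOT_READY = 3
    SSH_FAILED = 4


def should_retry(code: int) -> bool:
    """Assumed policy: retry only readiness timeouts and SSH failures."""
    try:
        known = SetModelExit(code)
    except ValueError:
        return False  # unknown code: treat as non-retryable
    return known in (SetModelExit.NOT_READY, SetModelExit.SSH_FAILED)
```

A caller would typically feed in the `returncode` of whatever process launched the script; treating unknown codes as non-retryable keeps the caller safe if a new code is ever added.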

Test plan

This repo's CI is Python-only (no PSScriptAnalyzer / Pester). The script's correctness is verified by manual smoke tests against a real dev EC2 instance:

Idempotent no-op — run with the currently-loaded model; expect exit 0, no compose write, /ready reaffirmed
Real swap — run with a different model that IS in the registry; expect compose edit, container recreate, /ready 200 within budget
Unknown served_name — run with a name NOT in the registry; expect exit 1 + actionable error
PrePull mode — run with -PrePull and a fresh model; expect HF cache populated, no service recreation, exit 0
Bad input — pass -Model 'foo/bar; rm -rf /'; expect ValidatePattern rejection client-side, no SSH attempted
Concurrent run — start two swaps in parallel; expect second to exit 1 with "another swap is in progress"
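
The fail-fast behaviour expected in the concurrent-run case can be reproduced locally with a non-blocking exclusive lock. The sketch below uses Python's `fcntl.flock` as a stand-in for the `flock -n` command the script runs on the host; `try_lock` is a hypothetical helper, and the lock path is supplied by the caller.

```python
import fcntl
import os


def try_lock(path: str):
    """Return an open fd holding an exclusive lock, or None if already locked."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

The lock is released when the descriptor is closed, so a crashed holder does not leave a stale lock behind.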

Footprint

scripts/ops/README.md   | +89
scripts/ops/set-model.ps1 | +545
2 files changed

No code outside scripts/ops/. No CI changes. No Python touch.

SOLID

SRP: one responsibility, namely the atomic compose-flag rewrite plus service recreate. Registry validation is read-only; no source mutation.
OCP: adding a new vLLM flag takes one new awk branch, one verify_present call, and one ValidatePattern. Adding a new candidate model is a separate registry PR; this script does not change.
DIP: discovery is tag-based; the script never hardcodes an instance ID or EIP. Operators inject -InstanceId / -Eip for non-standard layouts without touching code.
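
The per-flag rewrite rule mentioned under OCP and the atomic swap from T2 can be sketched together. This is an illustrative Python rendering, not the script's `awk` state machine: it assumes the command list holds each flag and its value as separate items and that `--max-model-len` is present; `rewrite_command`, `atomic_write`, and the `validate` callback (standing in for `docker compose config -q`) are hypothetical names.

```python
import os
import tempfile
from typing import Callable

ANCHOR = "--max-model-len"


def rewrite_command(command: list[str], flag: str, value: str) -> list[str]:
    """Set, insert, or drop one optional "--flag value" pair."""
    out = list(command)
    if flag in out:
        i = out.index(flag)
        if value == "none":
            del out[i:i + 2]           # drop the flag and its value
        else:
            out[i + 1] = value         # update the value in place
    elif value != "none":
        j = out.index(ANCHOR) + 2      # position right after the anchor pair
        out[j:j] = [flag, value]
    return out


def atomic_write(path: str, text: str, validate: Callable[[str], bool]) -> None:
    """Write a temp file beside path, validate it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        if not validate(tmp):
            raise ValueError("candidate file failed validation")
        os.replace(tmp, path)          # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)             # never leave a half-written candidate
        raise
```

Keeping the temporary file in the same directory as the target is what lets the final rename stay atomic; a failed validation leaves the original untouched.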

jieyao-MilestoneHub force-pushed the ops/set-model branch from ea9afe7 to e5e1aec on May 5, 2026 12:09

jieyao-MilestoneHub changed the title ~~ops: set-model.ps1 — atomic vLLM model swap (sister-script for downstream benchmarks)~~ ops: set-model.ps1 — atomic vLLM model swap on a running host May 5, 2026

jieyao-MilestoneHub wants to merge 1 commit into main from ops/set-model
