ops: set-model.ps1 — atomic vLLM model swap on a running host #14

Open
jieyao-MilestoneHub wants to merge 1 commit into main from ops/set-model

jieyao-MilestoneHub commented May 5, 2026

Summary

Adds scripts/ops/set-model.ps1, which drops a different model into the running vLLM container without a redeploy. It composes with the existing day-to-day ops loop (setup-ssh.ps1 → fix-and-start.ps1 → develop / experiment → restore-idle-protection.ps1) and shares the same conventions: tag-based instance discovery, IPv4-validated EIP, hardened SSH options, idempotent re-runs, deterministic exit codes.

Use cases

  • Switch to a different model size or family without redeploying.
  • Sweep vLLM flags (--gpu-memory-utilization, --max-model-len, --tool-call-parser) for an ablation, with auto-archived rollback.
  • Pre-pull weights into the host HF cache before a measurement run (-PrePull) so model-load latency does not pollute timings.
  • Drive a benchmark from any orchestrator that can shell out to PowerShell — the script is non-interactive and exit codes are stable.

Threat model

| Threat | Mitigation |
|---|---|
| T1. Argument injection | Every string parameter is whitelist-validated client-side (ValidatePattern / ValidateSet / ValidateRange). Validation re-runs on the remote host as defence in depth. Values transit as a single ConvertTo-Json blob via fd 3 to bash -s; the remote script reads them with jq, so they never enter the shell-tokenisation surface. |
| T2. Compose-file corruption | The edit is atomic: an awk state machine rewrites a copy, docker compose config -q validates it, then mv swaps it in. The state machine only touches the vllm: service's command: list; sibling services and other keys are untouched. Originals are preserved under /opt/llm-gateway/deploy/.swap-history/ for forensic diff / manual rollback. Optional flags (--quantization / --tool-call-parser) are dropped when set to none and inserted immediately after --max-model-len when newly added. |
| T3. No source mutation over SSH | Before any swap, a READ-ONLY pre-flight searches the deployed registry.py for the requested served_model_name (grep -F). Unknown names exit 1 with a pointer to open a registry PR. Mutating Python source over SSH would defeat the gateway's normal code-review path. |
| T4. Concurrent operators | flock -n /var/lock/llm-gateway-set-model.lock makes parallel swaps fail fast instead of interleaving compose edits. |
| T5. Bearer-token leakage | The script never reads .env, never logs secrets, and sends no traffic to the bearer-gated /v1 surface. |
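T2's write, validate, swap sequence can be sketched in Python as below. This illustrates the pattern and is not the script's own bash: the `.new` candidate suffix and the timestamped archive filename are hypothetical, while `docker compose -f FILE config -q` is the validation step the PR names. Writing the candidate beside the original keeps the final `os.replace` a same-filesystem rename, which is what makes the swap atomic, and keeps the candidate in the same directory as the live file when Compose validates it.

```python
import os
import shutil
import subprocess
import time
from typing import Callable


def compose_config_ok(path: str) -> bool:
    """Validate a candidate file with `docker compose -f PATH config -q`."""
    result = subprocess.run(["docker", "compose", "-f", path, "config", "-q"], check=False)
    return result.returncode == 0


def atomic_swap(compose_path: str, new_text: str, history_dir: str,
                validate: Callable[[str], bool] = compose_config_ok) -> int:
    """Write a copy, validate it, archive the original, then rename over it.

    Returns 0 on success and 1 (the documented code for a failed compose
    edit) when validation rejects the candidate; the live file is untouched.
    """
    candidate = compose_path + ".new"  # hypothetical name; same directory as the live file
    with open(candidate, "w", encoding="utf-8") as f:
        f.write(new_text)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
    if not validate(candidate):
        os.remove(candidate)
        return 1
    os.makedirs(history_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    shutil.copy2(compose_path, os.path.join(history_dir, f"docker-compose.{stamp}.yml"))
    os.replace(candidate, compose_path)  # atomic within one filesystem, like `mv`
    return 0
```

A caller would pass the output of the awk-style rewrite as `new_text` and the PR's `/opt/llm-gateway/deploy/.swap-history/` as `history_dir`; readers of the compose file see either the old version or the new one, never a half-written file.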

Idempotency

Re-running with the currently-loaded model is a no-op: the script reads the existing compose file, finds every requested flag already in place, confirms /ready returns 200, and exits 0 without touching anything.
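The no-op decision reduces to a check over the vllm `command:` list plus the /ready status. Below is a minimal Python sketch, assuming each flag's value is the next list item and that an optional flag requested as "none" must be absent. `verify_present` borrows the helper name from the OCP note in the commit message, but this body is a hypothetical illustration rather than the script's code.

```python
from typing import Dict, List

# Flags that the PR says are dropped entirely when set to "none".
OPTIONAL_FLAGS = ("--quantization", "--tool-call-parser")


def verify_present(command: List[str], requested: Dict[str, str]) -> bool:
    """True when the command list already reflects every requested flag."""
    for flag, value in requested.items():
        if flag in OPTIONAL_FLAGS and value == "none":
            if flag in command:
                return False  # should have been dropped, so a rewrite is needed
            continue
        if flag not in command:
            return False
        i = command.index(flag)
        if i + 1 >= len(command) or command[i + 1] != value:
            return False  # flag present but with a different (or missing) value
    return True


def is_noop(command: List[str], requested: Dict[str, str], ready_status: int) -> bool:
    """Skip the swap only if nothing would change and /ready already answers 200."""
    return verify_present(command, requested) and ready_status == 200
```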

Exit codes

| Code | Meaning |
|---|---|
| 0 | compose updated (or already matched) + container healthy + /ready 200 |
| 1 | compose edit failed, registry pre-flight failed, or remote validation failed |
| 2 | docker compose up failed |
| 3 | container not healthy or /ready not 200 within budget |
| 4 | SSH connection / key / discovery failed |

These codes are stable. Adding a new code is a breaking change for any caller that branches on the value, so prefer reusing an existing one.
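For an orchestrator that shells out to the script, branching on the exit code might look like the minimal Python sketch below. Only the code-to-meaning mapping comes from the exit-code table; `run_swap` and `EXIT_MEANINGS` are hypothetical names.

```python
import subprocess
import sys
from typing import List

# Meanings copied from the exit-code table; callers should branch on the number.
EXIT_MEANINGS = {
    0: "compose updated (or already matched) + container healthy + /ready 200",
    1: "compose edit failed, registry pre-flight failed, or remote validation failed",
    2: "docker compose up failed",
    3: "container not healthy or /ready not 200 within budget",
    4: "SSH connection / key / discovery failed",
}


def run_swap(argv: List[str]) -> int:
    """Run one swap, log what its exit code means, and return the code unchanged."""
    code = subprocess.run(argv, check=False).returncode
    meaning = EXIT_MEANINGS.get(code, "undocumented exit code")
    print(f"set-model exit {code}: {meaning}", file=sys.stderr)
    return code
```

A sweep driver might call `run_swap(["pwsh", "-NoProfile", "-File", "scripts/ops/set-model.ps1", "-Model", "org/model-name"])` and stop the sweep on any non-zero code; `org/model-name` is a placeholder.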

Test plan

This repo's CI is Python-only (no PSScriptAnalyzer / Pester), so the script's correctness is verified by manual smoke tests against a real dev EC2 instance:

  • Idempotent no-op — run with the currently-loaded model; expect exit 0, no compose write, /ready reaffirmed
  • Real swap — run with a different model that IS in the registry; expect compose edit, container recreate, /ready 200 within budget
  • Unknown served_name — run with a name NOT in the registry; expect exit 1 + actionable error
  • PrePull mode — run with -PrePull and a fresh model; expect HF cache populated, no service recreation, exit 0
  • Bad input — pass -Model 'foo/bar; rm -rf /'; expect ValidatePattern rejection client-side, no SSH attempted
  • Concurrent run — start two swaps in parallel; expect second to exit 1 with "another swap is in progress"
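The "/ready 200 within budget" expectation (and exit code 3 when it fails) amounts to polling against a deadline. Here is a minimal Python sketch with the probe injected so the loop can be exercised without a live host. The 2-second interval is a hypothetical default, and the PR does not state the actual budget.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> int:
    """Return the HTTP status of GET url, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # non-2xx responses still carry a status code
        return exc.code
    except OSError:  # connection refused, reset, timeout, DNS failure
        return 0


def wait_ready(probe: Callable[[], int], budget_s: float, interval_s: float = 2.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until probe() returns 200: 0 on success, 3 once the budget is spent."""
    deadline = clock() + budget_s
    while True:
        if probe() == 200:
            return 0
        if clock() >= deadline:
            return 3  # "container not healthy or /ready not 200 within budget"
        sleep(interval_s)
```

In use this would be something like `wait_ready(lambda: http_probe("http://127.0.0.1:8000/ready"), budget_s=600)`, where the host, port, and budget are placeholders.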

Footprint

scripts/ops/README.md   | +89
scripts/ops/set-model.ps1 | +545
2 files changed

No code outside scripts/ops/. No CI changes. No Python touch.

Drops a different model into the running vLLM container without a
redeploy. Composes with the existing day-to-day ops loop
(setup-ssh.ps1 → fix-and-start.ps1 → develop / experiment →
restore-idle-protection.ps1) and shares the same conventions:
tag-based instance discovery, IPv4-validated EIP, hardened SSH
options, idempotent re-runs, deterministic exit codes.

Threat model

T1. Argument injection. Every parameter is whitelist-validated
    client-side via PowerShell ValidatePattern / ValidateSet /
    ValidateRange. Validation re-runs on the remote host as defence
    in depth. Values transit as a single ConvertTo-Json blob via fd 3
    to bash -s; the remote script reads them with jq, so they never
    enter the shell-tokenisation surface.
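The client-side half of T1 can be sketched in Python: check every value against a whitelist first, then emit one JSON document instead of interpolating values into a command line. The regex, bounds, and allowed sets below are hypothetical stand-ins for the script's ValidatePattern / ValidateSet / ValidateRange attributes, which the PR does not reproduce.

```python
import json
import re

# Hypothetical whitelists; the PR does not show the real patterns or bounds.
MODEL_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")
QUANTIZATIONS = {"none", "awq", "gptq", "fp8"}
TOOL_CALL_PARSERS = {"none", "hermes", "llama3_json", "mistral"}


def build_payload(model: str, gpu_memory_utilization: float, max_model_len: int,
                  quantization: str = "none", tool_call_parser: str = "none") -> str:
    """Whitelist-check every value, then serialise them into one JSON blob."""
    if not MODEL_PATTERN.fullmatch(model):
        raise ValueError(f"model {model!r} does not match the whitelist pattern")
    if not 0.1 <= gpu_memory_utilization <= 0.95:
        raise ValueError("gpu_memory_utilization out of range")
    if not 512 <= max_model_len <= 131072:
        raise ValueError("max_model_len out of range")
    if quantization not in QUANTIZATIONS:
        raise ValueError(f"unsupported quantization {quantization!r}")
    if tool_call_parser not in TOOL_CALL_PARSERS:
        raise ValueError(f"unsupported tool_call_parser {tool_call_parser!r}")
    # One JSON document: the receiving side parses fields out of it (the PR
    # uses jq) instead of splicing them into a shell command line.
    return json.dumps({
        "model": model,
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_model_len": max_model_len,
        "quantization": quantization,
        "tool_call_parser": tool_call_parser,
    })
```

Running the same checks again on the receiving side gives the defence-in-depth layer T1 describes; a value like `foo/bar; rm -rf /` is rejected before any connection is attempted.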

T2. Compose-file corruption. The edit is atomic: an awk state machine
    rewrites a copy, ``docker compose config -q`` validates the
    result, then ``mv`` swaps it in. The state machine only touches
    the ``vllm:`` service's ``command:`` list; sibling services and
    other keys are untouched. Originals are preserved under
    ``/opt/llm-gateway/deploy/.swap-history/`` for forensic diff /
    manual rollback. Optional flags (--quantization /
    --tool-call-parser) are dropped when set to "none" and inserted
    immediately after --max-model-len when newly added.
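The command-list rewrite can be modelled as a small line-oriented state machine (outside, then the vllm: service, then its command: list, then done). The sketch below is a Python analogue of the awk pass, not the script's code, and assumes a hypothetical compose layout: two-space indentation, a `  vllm:` service, one `- item` per line under `    command:`, and each flag's value as the next item. For simplicity it re-emits every command item double-quoted.

```python
import json
import re
from typing import Dict, List, Optional, Tuple

OPTIONAL_FLAGS = ("--quantization", "--tool-call-parser")


def _unquote(item: str) -> str:
    item = item.strip()
    if len(item) >= 2 and item[0] == item[-1] == '"':
        return json.loads(item)
    if len(item) >= 2 and item[0] == item[-1] == "'":
        return item[1:-1].replace("''", "'")
    return item


def _drops(flag: str, value: str) -> bool:
    return flag in OPTIONAL_FLAGS and value == "none"


def _apply(items: List[str], requested: Dict[str, str]) -> List[str]:
    # Pair each flag with its value; a flag followed by another flag is boolean.
    pairs: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(items):
        if i + 1 < len(items) and not items[i + 1].startswith("--"):
            pairs.append((items[i], items[i + 1]))
            i += 2
        else:
            pairs.append((items[i], None))
            i += 1
    out: List[Tuple[str, Optional[str]]] = []
    for flag, value in pairs:
        if flag in requested:
            if _drops(flag, requested[flag]):
                continue  # "none": remove the flag and its value
            value = requested[flag]
        out.append((flag, value))
    present = {flag for flag, _ in out}
    # Newly added flags go immediately after --max-model-len (else at the end).
    at = next((k + 1 for k, (f, _) in enumerate(out) if f == "--max-model-len"), len(out))
    for flag, value in requested.items():
        if flag not in present and not _drops(flag, value):
            out.insert(at, (flag, value))
            at += 1
    return [tok for flag, value in out for tok in ((flag,) if value is None else (flag, value))]


def rewrite_compose(text: str, requested: Dict[str, str]) -> str:
    out: List[str] = []
    items: List[str] = []
    state = "outside"
    for line in text.splitlines():
        if state == "command":
            m = re.fullmatch(r"      - (.*)", line)
            if m:
                items.append(_unquote(m.group(1)))
                continue
            out.extend(f"      - {json.dumps(t)}" for t in _apply(items, requested))
            state = "done"  # everything after the list is copied verbatim
        elif state == "outside" and re.fullmatch(r"  vllm:\s*", line):
            state = "service"
        elif state == "service":
            if re.fullmatch(r"    command:\s*", line):
                state = "command"
            elif re.match(r"\s{0,2}\S", line):
                state = "outside"  # left vllm: without finding command:
        out.append(line)
    if state == "command":  # the list ran to end of file
        out.extend(f"      - {json.dumps(t)}" for t in _apply(items, requested))
        state = "done"
    if state != "done":
        raise ValueError("no command: list found under the vllm: service")
    return "\n".join(out) + "\n"
```

Sibling services and other vllm: keys pass through unchanged, and re-running with the same request reproduces the same text, which is what lets an idempotent re-run skip the write.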

T3. No source-code mutation over SSH. Before any swap, a READ-ONLY
    pre-flight searches the deployed registry.py for the requested
    served_model_name. Unknown names exit 1 with a pointer to open a
    registry PR. Mutating Python source over SSH would defeat the
    gateway's normal code-review path.
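The T3 pre-flight only reads; it never writes. A minimal Python sketch of the same idea: a literal, grep -F style substring test that returns the documented exit code. Matching the name wrapped in double quotes is an assumption about how registry.py spells its keys, chosen here so a shorter name cannot match a longer one that merely contains it.

```python
import sys
from pathlib import Path


def registry_preflight(registry_path: str, served_name: str) -> int:
    """Read-only check that served_name is declared in the deployed registry.py.

    Like grep -F, the name is matched as a literal string, never as a regex,
    so a '.' in the name cannot stand in for an arbitrary character.
    """
    text = Path(registry_path).read_text(encoding="utf-8")  # read only; never written
    if f'"{served_name}"' in text:
        return 0
    print(f"{served_name!r} not found in {registry_path}; "
          "open a registry PR to add it before swapping.", file=sys.stderr)
    return 1
```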

T4. Concurrent operators. ``flock -n
    /var/lock/llm-gateway-set-model.lock`` makes parallel swaps fail
    fast instead of interleaving compose edits.
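The `flock -n` guard has a direct Python counterpart in `fcntl.flock` with `LOCK_EX | LOCK_NB`. Here is a minimal, Unix-only sketch; the lock path is the one the PR names, and the error text mirrors the message the concurrent-run test expects.

```python
import fcntl
import os
import sys
from typing import Optional


def acquire_swap_lock(path: str = "/var/lock/llm-gateway-set-model.lock") -> Optional[int]:
    """Take an exclusive, non-blocking flock(2) lock, like `flock -n`.

    Returns the open descriptor on success; keep it open for the whole swap,
    because closing it releases the lock. Returns None when another holder
    already has the lock, so a second operator fails fast instead of waiting.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        print("another swap is in progress", file=sys.stderr)
        return None
    except BaseException:
        os.close(fd)
        raise
    return fd
```

flock(2) locks belong to the open file description, so even two separate `os.open` calls in one process contend for the lock. That is enough to exercise the fail-fast path without starting a second operator.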

T5. Bearer-token leakage. The script never reads .env, never logs
    secrets, and sends no traffic to the bearer-gated /v1 surface.

Idempotency. Re-running with the currently-loaded model is a no-op:
the script reads the existing compose file, finds every requested
flag already in place, confirms /ready returns 200, and exits 0
without touching anything.

SOLID

- SRP: one responsibility — atomic compose-flag rewrite + service
  recreate. Registry validation is read-only; no source mutation.
- OCP: adding a new vLLM flag = one new awk branch + one
  verify_present call + one ValidatePattern. Adding a new candidate
  model = a separate registry PR; this script does not change.
- DIP: discovery is tag-based; the script never hardcodes
  instance/EIP. Operators inject -InstanceId / -Eip for non-standard
  layouts without touching code.

jieyao-MilestoneHub changed the title from "ops: set-model.ps1 — atomic vLLM model swap (sister-script for downstream benchmarks)" to "ops: set-model.ps1 — atomic vLLM model swap on a running host" on May 5, 2026.