feat(orchestrator): per-Agent LLM model selection#387
Conversation
ConnysCode
left a comment
There was a problem hiding this comment.
Review: Request changes
One blocker and two majors before merge. The core resolution design (3-tier overlay, hot-apply, the effectiveModel surface) is sound, and I verified the no-schema-change / no-env-var claims hold — findings below are scoped, all traced against the cloned head.
Must change before merge
🔴 BLOCKER — per-Agent model selection is broken on claude-cli (subscription) deployments. The claude-cli branch at applyDiff.ts:178 skips ref→modelId resolution and passes the ref through raw. But the new picker writes provider-qualified ids: on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159), validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '') → 'claude-cli:opus' → CliChatAgent passes it verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608), an invalid model, so every turn for that Agent fails. The CLI expects the bare alias opus. On a keyless deployment there's no working pick, so the headline feature can't be used. Traced end-to-end; fix inline.
🟠 MAJOR — a pinned sub-agent model can never be cleared back to "inherit". updateSubAgent uses model = COALESCE($4, model) with $4 = patch.model ?? null (agentGraphStore.ts:359/366), so the documented null = "clear" is a no-op — COALESCE(NULL, model) keeps the old value. The operator can't revert a sub-agent from a pinned model to inherit-parent. Fix inline.
🟠 MAJOR — a cross-provider main-model pick is accepted at write time, then silently dropped at runtime. The picker lists every connected provider's models, not just the orchestrator's active provider, and validateModelRef (configStore.ts:343) resolves against the full registry with no defaultProvider — so e.g. openai:gpt-5.5 under an Anthropic orchestrator passes validation and persists. At build, resolveModelIdForProvider returns undefined for the cross-provider id and model falls back to the platform default (applyDiff.ts:188). The picker shows GPT-5.5, the Agent runs on Opus, and the only signal is the effectiveModel badge — no error at write time. (deps.provider is a single global provider — confirmed via OrchestratorRegistry → buildForAgent.) #296 scoped cross-provider routing out, but the picker actively offers a choice that silently won't run. Please scope the per-Agent / sub-agent model choice to the active provider: filter the offered list, or resolve main with { defaultProvider: <active provider> } at write time and reject (or warn on) a cross-provider pick.
Scope — sub-agent model selection (your open question)
#296 explicitly lists "Sub-agent model selection (SUB_AGENT_MODEL)" as out of scope, and this PR ships the sub-agent picker + validateModelRef guard + resolveSubAgentModel. There's no schema cost (it reuses the existing agent_subagents.model column — no migration), so keeping it is defensible — but the sub-agent half carries the MAJOR clear-bug above. Recommendation: if you keep it here, fix the updateSubAgent clear path first; if you'd rather land #296 clean, split the sub-agent picker into its own PR (the clear-bug travels with it). Either is fine — just don't merge the half-working clear.
What I verified holds (no action needed)
- Hot-reload is race-free.
setModelRouting→await reload()swapsthis.activesynchronously before the response readseffectiveModel, so acceptance #1/#4 deliver (agentBuilder.ts:194-196,registry/index.ts:212-223). - No schema change / no new env var.
agents.model_routing+agent_subagents.modelboth pre-exist (migration0003); no.sqladded; the threeDEFAULT_ORCHESTRATOR_MODELcopies (agentRuntime.ts,plugin.ts,config.ts) all agree onclaude-opus-4-8. assembleGraph/agentNodeblast radius is clean — every call site passesregistry, andl.registryis wired in production (index.ts:2009).- web-ui ↔ server shapes agree (
effectiveModel,model_id,aliases); the additive/v1/builder/modelschange is behindrequireAuthand has no OpenAPI spec to drift.
PR description — factcheck
| Claim | Reality |
|---|---|
| "registry-known same-provider → modelId; cross-provider → default; registry-unknown → raw" | True for the Anthropic/OpenAI path. Incomplete for claude-cli — those refs are passed raw without resolution → the BLOCKER. |
| "cross-provider → dropped to default (out of scope, would 404)" | True at build — but the write succeeds with no operator signal, and the picker offered the pick. |
"No schema change — reuses agents.model_routing + agent_subagents.model" |
✅ Verified — no migration added. |
| "No new env var" | ✅ Verified. |
| "Hot-applies via the existing per-Agent rebuild — no restart" | ✅ Verified — race-free on the happy path. |
| "typecheck clean / listed tests pass" | Not re-executed here. Note the new tests cover anthropic + openai but not claude-cli — the path that breaks. |
Minimum to merge
- Resolve
claude-clirefs to the baremodelIdbeforebuildOrchestratorstrips-cli(BLOCKER). - Let
updateSubAgentclearmodelback toNULL(MAJOR) — or split the sub-agent picker out. - Scope the model picker / write-validation to the orchestrator's active provider so a cross-provider pick can't silently no-op (MAJOR).
Nits (non-blocking)
applyDiff.ts:188— the platform defaultruntime.modelis sent raw (never resolved). Safe today only because theorchestrator_modelinstall field is an enum of bare ids; a defensiveresolveModelIdForProvider(runtime.model, …) ?? runtime.modelwould harden it if that field ever becomes free-text or a registry picker.InspectorPanel.tsx—useModelCatalog()refetches/v1/builder/modelson every node selection (once per editor mount). Consider lifting/caching the catalog.agentBuilder.tsreload()— the barecatch {}swallows reload failures, after which the response can echo a staleeffectiveModeluntil the next NOTIFY. Low impact.
(Reviewed against head 9a6a2f9.)
| // resolving it would change established behaviour. | ||
| const activeProvider = deps.provider?.id; | ||
| const resolveOverlay = (ref: string | undefined): string | undefined => | ||
| activeProvider === 'claude-cli' |
There was a problem hiding this comment.
BLOCKER (claude-cli deployments) — provider-qualified picker ids reach the CLI unresolved, so every per-Agent turn fails.
This activeProvider === 'claude-cli' branch skips resolution and passes the ref through raw. The new picker, though, writes provider-qualified ids, and on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159). validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '') → 'claude-cli:opus', which CliChatAgent passes verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608). That's an invalid model id (the CLI expects the bare opus), so every turn for that Agent fails — and on a keyless deployment there is no other pick to choose.
Please route the CLI branch through the resolver as well, so it yields the bare modelId first:
const resolveOverlay = (ref: string | undefined): string | undefined =>
resolveModelIdForProvider(ref, activeProvider) ?? ref?.trim() || undefined;resolveModelIdForProvider('claude-cli:opus-cli', 'claude-cli') returns 'opus-cli' (→ .replace(/-cli$/,'') → 'opus'), while a legacy bare 'opus' resolves cross-provider to undefined and still falls through to ref.trim() untouched. Apply the same to the overlayRouting sub-models. Please also add a depsWithProvider('claude-cli') case to orchestratorRegistry.test.ts — the current tests only exercise anthropic/openai, which is why this slipped.
| // Issue #296 follow-up — see `createSubAgent` rationale. `null` is the | ||
| // wire form for "clear" (no validation); empty-string is treated the | ||
| // same so the UI dropdown's "(default)" choice round-trips cleanly. | ||
| if (patch.model != null && patch.model.trim() !== '') { |
There was a problem hiding this comment.
MAJOR — a pinned sub-agent model can never be cleared back to "inherit".
This comment says null is the wire form for "clear", but the UPDATE below uses model = COALESCE($4, model) (line 366) with $4 = patch.model ?? null (line 379). COALESCE(NULL, model) keeps the existing value, so passing model: null is a no-op — an operator can't revert a sub-agent from a pinned model back to inherit-parent. (The sibling setSubAgentSkill uses a direct write precisely because COALESCE can't clear a column; there's no equivalent for model.)
Separately, createSubAgent persists '' rather than NULL for the empty-string "(default)" choice — input.model ?? null (line 336) doesn't catch ''. Benign at runtime (resolveSubAgentModel('') → parent default) but it leaves dirty data.
Please give model a real clear path: branch on patch.model === null → write model = NULL (or add a dedicated clearSubAgentModel direct write, mirroring setSubAgentSkill), and bind input.model?.trim() || null in createSubAgent so the column stays clean.
| @@ -17,6 +19,60 @@ import type { ModelRoutingConfig as RuntimeModelRouting } from '../modelRouter.j | |||
|
|
|||
| const DEFAULT_CLASSIFIER_MODEL = 'claude-haiku-4-5'; | |||
There was a problem hiding this comment.
MINOR — the default classifier is an id your own validator would reject. claude-haiku-4-5 isn't in the registry (the registered Haiku is claude-haiku-4-5-20251001, alias haiku — modelRegistry.ts:12-16), so resolveModelRef('claude-haiku-4-5') is undefined. It survives only because resolveModelIdForProvider passes registry-unknown refs through raw and Anthropic happens to serve the undated alias — yet if an operator typed claude-haiku-4-5 into the triage field, validateModelRef would throw "not registered". Please align this default to the dated claude-haiku-4-5-20251001 (or the haiku alias) so the code's own default and its write-validation agree. Non-blocking.
What
Per-Agent orchestrator LLM model selection (Closes #296). Each Agent can pin its own model (and per-turn triage routing) via the Admin builder instead of one
global
ORCHESTRATOR_MODEL. Model picker is populated from the@omadia/llm-providerregistry.Why
Cost/latency differ per Agent — a high-volume support Agent can run on a cheap model while a reasoning-heavy one stays on the frontier model. Turns an
operator runtime choice into a per-instance setting instead of a deploy-time env decision. The pluggable-provider work only pays off once instances can
actually pick different models.
Test plan
configStoreModelRoutingValidation,agentBuilderAgentNode,agentGraphStoreSubAgentModelValidation,orchestratorRegistry,subAgentModelResolution,agentBuilderSubAgentToolseffectiveModelbadge + registryagent rebuiltlog, turn runs without 404Risk / blast radius
agents.model_routing(JSONB) +agent_subagents.model.ORCHESTRATOR_MODELkept as the fallback tier → existing deployments unchanged.modelraw to a single provider adapter.Per-Agent refs are now resolved to the active provider's bare
modelIdbefore build (resolveModelIdForProvider): registry-known same-provider →modelId; cross-provider → dropped to default (out of scope, would 404 on the wrong adapter); registry-unknown → passed through raw (preservespre-resolution behaviour, e.g. the default classifier).
GET /v1/builder/modelsnow returnsmodel_id(bare vendor id) alongsideid.Naming-decisions still pending
SUB_AGENT_MODEL) is listed out-of-scope in Allow defining the LLM model per orchestrator instance #296 but this PR adds the sub-agent picker + validation. Confirm keep, or split out.