feat(orchestrator): per-Agent LLM model selection by sneumannb5 · Pull Request #387 · byte5ai/omadia

sneumannb5 · 2026-06-29T11:29:36Z

What

Per-Agent orchestrator LLM model selection (Closes #296). Each Agent can pin its own model (and per-turn triage routing) via the Admin builder instead of one
global ORCHESTRATOR_MODEL. Model picker is populated from the @omadia/llm-provider registry.

Why

Cost/latency differ per Agent — a high-volume support Agent can run on a cheap model while a reasoning-heavy one stays on the frontier model. Turns an
operator runtime choice into a per-instance setting instead of a deploy-time env decision. The pluggable-provider work only pays off once instances can
actually pick different models.

Test plan

cd middleware && npm run typecheck -w @omadia/orchestrator
cd middleware && npx tsc --noEmit in middleware and web-ui (clean)
new/affected tests pass: configStoreModelRoutingValidation, agentBuilderAgentNode, agentGraphStoreSubAgentModelValidation, orchestratorRegistry, subAgentModelResolution, agentBuilderSubAgentTools
manual: set per-Agent model in Admin builder → confirm effectiveModel badge + registry agent rebuilt log, turn runs without 404

Note: full middleware suite has 5 pre-existing failures in durableRecall/recallProbe — bisected to 1550268 (#385), date-relative test rot
(fixed 2026-06-01 data vs now()), unrelated to this PR.

Risk / blast radius

No schema change — reuses existing agents.model_routing (JSONB) + agent_subagents.model.
No new env var. ORCHESTRATOR_MODEL kept as the fallback tier → existing deployments unchanged.
Resolution behaviour change (core): the orchestrator, per-turn router, and in-process sub-agents send model raw to a single provider adapter.
Per-Agent refs are now resolved to the active provider's bare modelId before build (resolveModelIdForProvider): registry-known same-provider →
modelId; cross-provider → dropped to default (out of scope, would 404 on the wrong adapter); registry-unknown → passed through raw (preserves
pre-resolution behaviour, e.g. the default classifier).
Public API (additive): GET /v1/builder/models now returns model_id (bare vendor id) alongside id.
Hot-applies via the existing per-Agent rebuild — no restart.

Naming-decisions still pending

Sub-agent model selection (SUB_AGENT_MODEL) is listed out-of-scope in Allow defining the LLM model per orchestrator instance #296 but this PR adds the sub-agent picker + validation. Confirm keep, or split out.

ConnysCode

Review: Request changes

One blocker and two majors before merge. The core resolution design (3-tier overlay, hot-apply, the effectiveModel surface) is sound, and I verified the no-schema-change / no-env-var claims hold — findings below are scoped, all traced against the cloned head.

Must change before merge

🔴 BLOCKER — per-Agent model selection is broken on claude-cli (subscription) deployments. The claude-cli branch at applyDiff.ts:178 skips ref→modelId resolution and passes the ref through raw. But the new picker writes provider-qualified ids: on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159), validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '') → 'claude-cli:opus' → CliChatAgent passes it verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608), an invalid model, so every turn for that Agent fails. The CLI expects the bare alias opus. On a keyless deployment there's no working pick, so the headline feature can't be used. Traced end-to-end; fix inline.

🟠 MAJOR — a pinned sub-agent model can never be cleared back to "inherit". updateSubAgent uses model = COALESCE($4, model) with $4 = patch.model ?? null (agentGraphStore.ts:359/366), so the documented null = "clear" is a no-op — COALESCE(NULL, model) keeps the old value. The operator can't revert a sub-agent from a pinned model to inherit-parent. Fix inline.

🟠 MAJOR — a cross-provider main-model pick is accepted at write time, then silently dropped at runtime. The picker lists every connected provider's models, not just the orchestrator's active provider, and validateModelRef (configStore.ts:343) resolves against the full registry with no defaultProvider — so e.g. openai:gpt-5.5 under an Anthropic orchestrator passes validation and persists. At build, resolveModelIdForProvider returns undefined for the cross-provider id and model falls back to the platform default (applyDiff.ts:188). The picker shows GPT-5.5, the Agent runs on Opus, and the only signal is the effectiveModel badge — no error at write time. (deps.provider is a single global provider — confirmed via OrchestratorRegistry → buildForAgent.) #296 scoped cross-provider routing out, but the picker actively offers a choice that silently won't run. Please scope the per-Agent / sub-agent model choice to the active provider: filter the offered list, or resolve main with { defaultProvider: <active provider> } at write time and reject (or warn on) a cross-provider pick.

Scope — sub-agent model selection (your open question)

#296 explicitly lists "Sub-agent model selection (SUB_AGENT_MODEL)" as out of scope, and this PR ships the sub-agent picker + validateModelRef guard + resolveSubAgentModel. There's no schema cost (it reuses the existing agent_subagents.model column — no migration), so keeping it is defensible — but the sub-agent half carries the MAJOR clear-bug above. Recommendation: if you keep it here, fix the updateSubAgent clear path first; if you'd rather land #296 clean, split the sub-agent picker into its own PR (the clear-bug travels with it). Either is fine — just don't merge the half-working clear.

What I verified holds (no action needed)

Hot-reload is race-free. setModelRouting → await reload() swaps this.active synchronously before the response reads effectiveModel, so acceptance #1/#4 deliver (agentBuilder.ts:194-196, registry/index.ts:212-223).
No schema change / no new env var. agents.model_routing + agent_subagents.model both pre-exist (migration 0003); no .sql added; the three DEFAULT_ORCHESTRATOR_MODEL copies (agentRuntime.ts, plugin.ts, config.ts) all agree on claude-opus-4-8.
assembleGraph / agentNode blast radius is clean — every call site passes registry, and l.registry is wired in production (index.ts:2009).
web-ui ↔ server shapes agree (effectiveModel, model_id, aliases); the additive /v1/builder/models change is behind requireAuth and has no OpenAPI spec to drift.

PR description — factcheck

Claim	Reality
"registry-known same-provider → modelId; cross-provider → default; registry-unknown → raw"	True for the Anthropic/OpenAI path. Incomplete for `claude-cli` — those refs are passed raw without resolution → the BLOCKER.
"cross-provider → dropped to default (out of scope, would 404)"	True at build — but the write succeeds with no operator signal, and the picker offered the pick.
"No schema change — reuses `agents.model_routing` + `agent_subagents.model`"	✅ Verified — no migration added.
"No new env var"	✅ Verified.
"Hot-applies via the existing per-Agent rebuild — no restart"	✅ Verified — race-free on the happy path.
"typecheck clean / listed tests pass"	Not re-executed here. Note the new tests cover `anthropic` + `openai` but not `claude-cli` — the path that breaks.

Minimum to merge

Resolve claude-cli refs to the bare modelId before buildOrchestrator strips -cli (BLOCKER).
Let updateSubAgent clear model back to NULL (MAJOR) — or split the sub-agent picker out.
Scope the model picker / write-validation to the orchestrator's active provider so a cross-provider pick can't silently no-op (MAJOR).

Nits (non-blocking)

applyDiff.ts:188 — the platform default runtime.model is sent raw (never resolved). Safe today only because the orchestrator_model install field is an enum of bare ids; a defensive resolveModelIdForProvider(runtime.model, …) ?? runtime.model would harden it if that field ever becomes free-text or a registry picker.
InspectorPanel.tsx — useModelCatalog() refetches /v1/builder/models on every node selection (once per editor mount). Consider lifting/caching the catalog.
agentBuilder.ts reload() — the bare catch {} swallows reload failures, after which the response can echo a stale effectiveModel until the next NOTIFY. Low impact.

(Reviewed against head 9a6a2f9.)

ConnysCode · 2026-06-30T08:04:01Z

+  // resolving it would change established behaviour.
+  const activeProvider = deps.provider?.id;
+  const resolveOverlay = (ref: string | undefined): string | undefined =>
+    activeProvider === 'claude-cli'


BLOCKER (claude-cli deployments) — provider-qualified picker ids reach the CLI unresolved, so every per-Agent turn fails.

This activeProvider === 'claude-cli' branch skips resolution and passes the ref through raw. The new picker, though, writes provider-qualified ids, and on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159). validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '') → 'claude-cli:opus', which CliChatAgent passes verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608). That's an invalid model id (the CLI expects the bare opus), so every turn for that Agent fails — and on a keyless deployment there is no other pick to choose.

Please route the CLI branch through the resolver as well, so it yields the bare modelId first:

const resolveOverlay = (ref: string | undefined): string | undefined => resolveModelIdForProvider(ref, activeProvider) ?? ref?.trim() || undefined;

resolveModelIdForProvider('claude-cli:opus-cli', 'claude-cli') returns 'opus-cli' (→ .replace(/-cli$/,'') → 'opus'), while a legacy bare 'opus' resolves cross-provider to undefined and still falls through to ref.trim() untouched. Apply the same to the overlayRouting sub-models. Please also add a depsWithProvider('claude-cli') case to orchestratorRegistry.test.ts — the current tests only exercise anthropic/openai, which is why this slipped.

ConnysCode · 2026-06-30T08:04:01Z

+    // Issue #296 follow-up — see `createSubAgent` rationale. `null` is the
+    // wire form for "clear" (no validation); empty-string is treated the
+    // same so the UI dropdown's "(default)" choice round-trips cleanly.
+    if (patch.model != null && patch.model.trim() !== '') {


MAJOR — a pinned sub-agent model can never be cleared back to "inherit".

This comment says null is the wire form for "clear", but the UPDATE below uses model = COALESCE($4, model) (line 366) with $4 = patch.model ?? null (line 379). COALESCE(NULL, model) keeps the existing value, so passing model: null is a no-op — an operator can't revert a sub-agent from a pinned model back to inherit-parent. (The sibling setSubAgentSkill uses a direct write precisely because COALESCE can't clear a column; there's no equivalent for model.)

Separately, createSubAgent persists '' rather than NULL for the empty-string "(default)" choice — input.model ?? null (line 336) doesn't catch ''. Benign at runtime (resolveSubAgentModel('') → parent default) but it leaves dirty data.

Please give model a real clear path: branch on patch.model === null → write model = NULL (or add a dedicated clearSubAgentModel direct write, mirroring setSubAgentSkill), and bind input.model?.trim() || null in createSubAgent so the column stays clean.

ConnysCode · 2026-06-30T08:04:01Z

@@ -17,6 +19,60 @@ import type { ModelRoutingConfig as RuntimeModelRouting } from '../modelRouter.j

 const DEFAULT_CLASSIFIER_MODEL = 'claude-haiku-4-5';


MINOR — the default classifier is an id your own validator would reject. claude-haiku-4-5 isn't in the registry (the registered Haiku is claude-haiku-4-5-20251001, alias haiku — modelRegistry.ts:12-16), so resolveModelRef('claude-haiku-4-5') is undefined. It survives only because resolveModelIdForProvider passes registry-unknown refs through raw and Anthropic happens to serve the undated alias — yet if an operator typed claude-haiku-4-5 into the triage field, validateModelRef would throw "not registered". Please align this default to the dated claude-haiku-4-5-20251001 (or the haiku alias) so the code's own default and its write-validation agree. Non-blocking.

feat(orchestrator): per-Agent LLM model selection

9a6a2f9

ConnysCode requested changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(orchestrator): per-Agent LLM model selection#387

feat(orchestrator): per-Agent LLM model selection#387
sneumannb5 wants to merge 1 commit into
byte5ai:mainfrom
sneumannb5:feat/define-llm-model

sneumannb5 commented Jun 29, 2026

Uh oh!

ConnysCode left a comment

Uh oh!

ConnysCode Jun 30, 2026

Uh oh!

ConnysCode Jun 30, 2026

Uh oh!

ConnysCode Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -17,6 +19,60 @@ import type { ModelRoutingConfig as RuntimeModelRouting } from '../modelRouter.j

		const DEFAULT_CLASSIFIER_MODEL = 'claude-haiku-4-5';

Uh oh!

Conversation

sneumannb5 commented Jun 29, 2026

What

Why

Test plan

Risk / blast radius

Naming-decisions still pending

Uh oh!

ConnysCode left a comment

Choose a reason for hiding this comment

Must change before merge

Scope — sub-agent model selection (your open question)

What I verified holds (no action needed)

PR description — factcheck

Minimum to merge

Uh oh!

ConnysCode Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

ConnysCode Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

ConnysCode Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants