Skip to content

feat(orchestrator): per-Agent LLM model selection#387

Open
sneumannb5 wants to merge 1 commit into
byte5ai:mainfrom
sneumannb5:feat/define-llm-model
Open

feat(orchestrator): per-Agent LLM model selection#387
sneumannb5 wants to merge 1 commit into
byte5ai:mainfrom
sneumannb5:feat/define-llm-model

Conversation

@sneumannb5

Copy link
Copy Markdown
Contributor

What

Per-Agent orchestrator LLM model selection (Closes #296). Each Agent can pin its own model (and per-turn triage routing) via the Admin builder instead of one
global ORCHESTRATOR_MODEL. Model picker is populated from the @omadia/llm-provider registry.

Why

Cost/latency differ per Agent — a high-volume support Agent can run on a cheap model while a reasoning-heavy one stays on the frontier model. Turns an
operator runtime choice into a per-instance setting instead of a deploy-time env decision. The pluggable-provider work only pays off once instances can
actually pick different models.

Test plan

  • cd middleware && npm run typecheck -w @omadia/orchestrator
  • cd middleware && npx tsc --noEmit in middleware and web-ui (clean)
  • new/affected tests pass: configStoreModelRoutingValidation, agentBuilderAgentNode, agentGraphStoreSubAgentModelValidation, orchestratorRegistry, subAgentModelResolution, agentBuilderSubAgentTools
  • manual: set per-Agent model in Admin builder → confirm effectiveModel badge + registry agent rebuilt log, turn runs without 404

Note: full middleware suite has 5 pre-existing failures in durableRecall/recallProbe — bisected to 1550268 (#385), date-relative test rot
(fixed 2026-06-01 data vs now()), unrelated to this PR.

Risk / blast radius

  • No schema change — reuses existing agents.model_routing (JSONB) + agent_subagents.model.
  • No new env var. ORCHESTRATOR_MODEL kept as the fallback tier → existing deployments unchanged.
  • Resolution behaviour change (core): the orchestrator, per-turn router, and in-process sub-agents send model raw to a single provider adapter.
    Per-Agent refs are now resolved to the active provider's bare modelId before build (resolveModelIdForProvider): registry-known same-provider →
    modelId; cross-provider → dropped to default (out of scope, would 404 on the wrong adapter); registry-unknown → passed through raw (preserves
    pre-resolution behaviour, e.g. the default classifier).
  • Public API (additive): GET /v1/builder/models now returns model_id (bare vendor id) alongside id.
  • Hot-applies via the existing per-Agent rebuild — no restart.

Naming-decisions still pending

@ConnysCode ConnysCode left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Request changes

One blocker and two majors before merge. The core resolution design (3-tier overlay, hot-apply, the effectiveModel surface) is sound, and I verified the no-schema-change / no-env-var claims hold — findings below are scoped, all traced against the cloned head.

Must change before merge

🔴 BLOCKER — per-Agent model selection is broken on claude-cli (subscription) deployments. The claude-cli branch at applyDiff.ts:178 skips ref→modelId resolution and passes the ref through raw. But the new picker writes provider-qualified ids: on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159), validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '')'claude-cli:opus'CliChatAgent passes it verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608), an invalid model, so every turn for that Agent fails. The CLI expects the bare alias opus. On a keyless deployment there's no working pick, so the headline feature can't be used. Traced end-to-end; fix inline.

🟠 MAJOR — a pinned sub-agent model can never be cleared back to "inherit". updateSubAgent uses model = COALESCE($4, model) with $4 = patch.model ?? null (agentGraphStore.ts:359/366), so the documented null = "clear" is a no-op — COALESCE(NULL, model) keeps the old value. The operator can't revert a sub-agent from a pinned model to inherit-parent. Fix inline.

🟠 MAJOR — a cross-provider main-model pick is accepted at write time, then silently dropped at runtime. The picker lists every connected provider's models, not just the orchestrator's active provider, and validateModelRef (configStore.ts:343) resolves against the full registry with no defaultProvider — so e.g. openai:gpt-5.5 under an Anthropic orchestrator passes validation and persists. At build, resolveModelIdForProvider returns undefined for the cross-provider id and model falls back to the platform default (applyDiff.ts:188). The picker shows GPT-5.5, the Agent runs on Opus, and the only signal is the effectiveModel badge — no error at write time. (deps.provider is a single global provider — confirmed via OrchestratorRegistrybuildForAgent.) #296 scoped cross-provider routing out, but the picker actively offers a choice that silently won't run. Please scope the per-Agent / sub-agent model choice to the active provider: filter the offered list, or resolve main with { defaultProvider: <active provider> } at write time and reject (or warn on) a cross-provider pick.

Scope — sub-agent model selection (your open question)

#296 explicitly lists "Sub-agent model selection (SUB_AGENT_MODEL)" as out of scope, and this PR ships the sub-agent picker + validateModelRef guard + resolveSubAgentModel. There's no schema cost (it reuses the existing agent_subagents.model column — no migration), so keeping it is defensible — but the sub-agent half carries the MAJOR clear-bug above. Recommendation: if you keep it here, fix the updateSubAgent clear path first; if you'd rather land #296 clean, split the sub-agent picker into its own PR (the clear-bug travels with it). Either is fine — just don't merge the half-working clear.

What I verified holds (no action needed)

  • Hot-reload is race-free. setModelRoutingawait reload() swaps this.active synchronously before the response reads effectiveModel, so acceptance #1/#4 deliver (agentBuilder.ts:194-196, registry/index.ts:212-223).
  • No schema change / no new env var. agents.model_routing + agent_subagents.model both pre-exist (migration 0003); no .sql added; the three DEFAULT_ORCHESTRATOR_MODEL copies (agentRuntime.ts, plugin.ts, config.ts) all agree on claude-opus-4-8.
  • assembleGraph / agentNode blast radius is clean — every call site passes registry, and l.registry is wired in production (index.ts:2009).
  • web-ui ↔ server shapes agree (effectiveModel, model_id, aliases); the additive /v1/builder/models change is behind requireAuth and has no OpenAPI spec to drift.

PR description — factcheck

Claim Reality
"registry-known same-provider → modelId; cross-provider → default; registry-unknown → raw" True for the Anthropic/OpenAI path. Incomplete for claude-cli — those refs are passed raw without resolution → the BLOCKER.
"cross-provider → dropped to default (out of scope, would 404)" True at build — but the write succeeds with no operator signal, and the picker offered the pick.
"No schema change — reuses agents.model_routing + agent_subagents.model" ✅ Verified — no migration added.
"No new env var" ✅ Verified.
"Hot-applies via the existing per-Agent rebuild — no restart" ✅ Verified — race-free on the happy path.
"typecheck clean / listed tests pass" Not re-executed here. Note the new tests cover anthropic + openai but not claude-cli — the path that breaks.

Minimum to merge

  1. Resolve claude-cli refs to the bare modelId before buildOrchestrator strips -cli (BLOCKER).
  2. Let updateSubAgent clear model back to NULL (MAJOR) — or split the sub-agent picker out.
  3. Scope the model picker / write-validation to the orchestrator's active provider so a cross-provider pick can't silently no-op (MAJOR).
Nits (non-blocking)
  • applyDiff.ts:188 — the platform default runtime.model is sent raw (never resolved). Safe today only because the orchestrator_model install field is an enum of bare ids; a defensive resolveModelIdForProvider(runtime.model, …) ?? runtime.model would harden it if that field ever becomes free-text or a registry picker.
  • InspectorPanel.tsxuseModelCatalog() refetches /v1/builder/models on every node selection (once per editor mount). Consider lifting/caching the catalog.
  • agentBuilder.ts reload() — the bare catch {} swallows reload failures, after which the response can echo a stale effectiveModel until the next NOTIFY. Low impact.

(Reviewed against head 9a6a2f9.)

// resolving it would change established behaviour.
const activeProvider = deps.provider?.id;
const resolveOverlay = (ref: string | undefined): string | undefined =>
activeProvider === 'claude-cli'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKER (claude-cli deployments) — provider-qualified picker ids reach the CLI unresolved, so every per-Agent turn fails.

This activeProvider === 'claude-cli' branch skips resolution and passes the ref through raw. The new picker, though, writes provider-qualified ids, and on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159). validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '')'claude-cli:opus', which CliChatAgent passes verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608). That's an invalid model id (the CLI expects the bare opus), so every turn for that Agent fails — and on a keyless deployment there is no other pick to choose.

Please route the CLI branch through the resolver as well, so it yields the bare modelId first:

const resolveOverlay = (ref: string | undefined): string | undefined =>
  resolveModelIdForProvider(ref, activeProvider) ?? ref?.trim() || undefined;

resolveModelIdForProvider('claude-cli:opus-cli', 'claude-cli') returns 'opus-cli' (→ .replace(/-cli$/,'')'opus'), while a legacy bare 'opus' resolves cross-provider to undefined and still falls through to ref.trim() untouched. Apply the same to the overlayRouting sub-models. Please also add a depsWithProvider('claude-cli') case to orchestratorRegistry.test.ts — the current tests only exercise anthropic/openai, which is why this slipped.

// Issue #296 follow-up — see `createSubAgent` rationale. `null` is the
// wire form for "clear" (no validation); empty-string is treated the
// same so the UI dropdown's "(default)" choice round-trips cleanly.
if (patch.model != null && patch.model.trim() !== '') {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR — a pinned sub-agent model can never be cleared back to "inherit".

This comment says null is the wire form for "clear", but the UPDATE below uses model = COALESCE($4, model) (line 366) with $4 = patch.model ?? null (line 379). COALESCE(NULL, model) keeps the existing value, so passing model: null is a no-op — an operator can't revert a sub-agent from a pinned model back to inherit-parent. (The sibling setSubAgentSkill uses a direct write precisely because COALESCE can't clear a column; there's no equivalent for model.)

Separately, createSubAgent persists '' rather than NULL for the empty-string "(default)" choice — input.model ?? null (line 336) doesn't catch ''. Benign at runtime (resolveSubAgentModel('') → parent default) but it leaves dirty data.

Please give model a real clear path: branch on patch.model === null → write model = NULL (or add a dedicated clearSubAgentModel direct write, mirroring setSubAgentSkill), and bind input.model?.trim() || null in createSubAgent so the column stays clean.

@@ -17,6 +19,60 @@ import type { ModelRoutingConfig as RuntimeModelRouting } from '../modelRouter.j

const DEFAULT_CLASSIFIER_MODEL = 'claude-haiku-4-5';

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR — the default classifier is an id your own validator would reject. claude-haiku-4-5 isn't in the registry (the registered Haiku is claude-haiku-4-5-20251001, alias haikumodelRegistry.ts:12-16), so resolveModelRef('claude-haiku-4-5') is undefined. It survives only because resolveModelIdForProvider passes registry-unknown refs through raw and Anthropic happens to serve the undated alias — yet if an operator typed claude-haiku-4-5 into the triage field, validateModelRef would throw "not registered". Please align this default to the dated claude-haiku-4-5-20251001 (or the haiku alias) so the code's own default and its write-validation agree. Non-blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow defining the LLM model per orchestrator instance

2 participants