feat(run-eval): add router-classified-3tier MODELS entry + recursive preflight#3636
feat(run-eval): add router-classified-3tier MODELS entry + recursive preflight#3636juanmichelini wants to merge 2 commits into
router-classified-3tier MODELS entry + recursive preflight#3636Conversation
…preflight Companion change to OpenHands/benchmarks#742 (intelligent per-instance model routing). With this PR the SDK can dispatch a router-shaped llm_config to the evaluation pipeline; the benchmarks side already understands the intelligent-router-v0 shape and will classify each instance and route to the matching tier model. Changes: - New MODELS entry 'router-classified-3tier' (classifier=minimax-m2.7, tiers={kimi-k2.6, minimax-m2.7, gpt-5.5}, default iter5 routing). - New helpers ROUTER_CONFIG_KIND and is_router_config(). - check_model() now detects router entries and recurses into each tier sub-model, aggregating success/failure. - Pydantic validator in tests learns about RouterLLMConfig and the registry's llm_config is now 'RouterLLMConfig | LLMConfig'. - 14 new tests covering the new entry, is_router_config, and recursive preflight. Note: the matching OpenHands/evaluation change to eval-job/scripts/build_matrix.py (handle no-top-level-model router entries when deriving the GCS slug) is required for end-to-end dispatch and will be opened separately. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: router-classified-3tier
Taste Rating
🟢 Good taste - Clean implementation with minimal complexity.
Analysis
This PR adds intelligent per-instance model routing. The design is sound: router config discriminator separates router entries from plain model entries, check_model recurses into tier sub-models during preflight, and pydantic validation catches internal consistency errors at test-time.
What works well:
- is_router_config() is a clean, side-effect-free predicate
- _check_router_tiers aggregates results cleanly without duplicating logic
- RouterLLMConfig model validator enforces reference consistency
- 14 new tests cover key paths with appropriate mocking
Style Notes (minor):
- Block comment (~Line 440-455) explaining routing table is verbose - the table is self-evident from the code
- Comment referencing build_matrix.py (~Line 572) may drift since that code is out-of-scope per PR
Risk Assessment: 🟢 LOW
Pure additive change. Existing plain-model paths unchanged. Pydantic union is backward-compatible.
Verdict
✅ Worth merging - Core logic sound, tests comprehensive, design extensible.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
The router entry and recursive preflight behavior work locally, but the real Run Eval resolver CLI currently aborts because the live proxy rejects the kimi-k2.6 tier.
Does this PR achieve its stated goal?
Partially. The PR does add router-classified-3tier and changes preflight from the old model=unknown failure mode into recursive tier validation; I verified that against a local OpenAI-compatible endpoint with real litellm HTTP calls. However, exercising the actual resolver CLI as the workflow would (MODEL_IDS=router-classified-3tier) fails preflight against the default live proxy because moonshot/kimi-k2.6 is rejected, so the new model is not currently dispatch-ready in this environment.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run created/used the project environment and the resolver executed successfully. |
| CI Status | 🟡 At refresh: 22 successful checks, 6 in progress, 3 skipped. I did not run tests/linters locally. |
| Functional Verification |
Functional Verification
Test 1: Model resolution before/after
Step 1 — Establish baseline on origin/main:
Ran a short user-style resolver invocation for find_models_by_id(["router-classified-3tier"]):
has_router_entry= False
ERROR: Model ID 'router-classified-3tier' not found. Available models: ...
find_models_by_id_ok= False
SystemExit 1
This confirms the base branch cannot dispatch this model id at all.
Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9.
Step 3 — Re-run with the fix in place:
Ran the same resolver flow:
type= list
[
{
"display_name": "Router (3-tier, classifier=minimax-m2.7)",
"id": "router-classified-3tier",
"llm_config": {
"classifier_model_id": "minimax-m2.7",
"fallback_model_id": "gpt-5.5",
"kind": "intelligent-router-v0",
"tiers": { ... },
"vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"]
}
}
]
This confirms the new model id resolves to a router-shaped payload with no top-level model.
Test 2: Recursive preflight behavior before/after
Step 1 — Establish baseline on origin/main:
Ran check_model() on a router-shaped config:
success= False
✗ Router Test: Bad request - litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=unknown
This confirms the old code path treated router configs as plain configs and tried model=unknown.
Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9 and started an in-process local OpenAI-compatible HTTP endpoint.
Step 3 — Re-run with the fix in place:
Ran check_model() on the real router-classified-3tier entry using the local endpoint:
has_router_entry= True
resolved_ids= ['router-classified-3tier']
is_router_config= True
top_level_model_present= False
tier_ids= ['gpt-5.5', 'kimi-k2.6', 'minimax-m2.7']
preflight_success= True
Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))
Captured HTTP requests from litellm:
[
{"path": "/chat/completions", "model": "moonshot/kimi-k2.6", "temperature": 1.0, "top_p": null, "reasoning_effort": null},
{"path": "/chat/completions", "model": "minimax/MiniMax-M2.7", "temperature": 1.0, "top_p": 0.95, "reasoning_effort": null},
{"path": "/chat/completions", "model": "openai/gpt-5.5", "temperature": null, "top_p": null, "reasoning_effort": "high"}
]This confirms recursive preflight now hits each tier and forwards the per-tier parameters.
Test 3: Actual workflow-style CLI execution against the live proxy
Step 1 — Run the actual resolver CLI for the new model:
Ran:
LLM_API_KEY="$LLM_API_KEY" LITELLM_API_KEY="$LLM_API_KEY" OPENAI_API_KEY="$LLM_API_KEY" MODEL_IDS=router-classified-3tier GITHUB_OUTPUT=/tmp/resolve_model_config_output.txt uv run python .github/run-eval/resolve_model_config.pyObserved:
Resolved 1 model(s): router-classified-3tier
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
Preflight LLM check for 1 model(s)...
Checking Router (3-tier, classifier=minimax-m2.7)...
Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
✗ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✗ Router (3-tier, classifier=minimax-m2.7): one or more tiers failed
✗ Some models failed preflight check
ERROR: Preflight LLM check failed
exit_code=1
--- GITHUB_OUTPUT ---
(missing)
This shows the real workflow-style dispatch path currently aborts before producing GITHUB_OUTPUT.
Step 2 — Compare the underlying plain tier:
Ran the same CLI for MODEL_IDS=kimi-k2.6:
Resolved 1 model(s): kimi-k2.6
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
Checking Kimi K2.6...
✗ Kimi K2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
ERROR: Preflight LLM check failed
exit_code=1
This suggests the recursion itself is working correctly, but the kimi-k2.6 tier is not currently usable through the live proxy credentials/environment I exercised.
Issues Found
- 🟠 Issue:
MODEL_IDS=router-classified-3tieris not currently dispatch-ready against the live default proxy because thekimi-k2.6tier fails preflight withInvalid model name passed in model=moonshot/kimi-k2.6. The plainkimi-k2.6entry fails the same way, so this looks like a proxy provisioning/model-name issue rather than a recursion bug, but it still blocks the PR’s stated dispatch-readiness goal.
Automated QA review generated by an AI agent (OpenHands) on behalf of the requester.
| "fallback_model_id": "gpt-5.5", | ||
| "tiers": { | ||
| "kimi-k2.6": { | ||
| "model": "litellm_proxy/moonshot/kimi-k2.6", |
There was a problem hiding this comment.
🟠 Important: I exercised the actual resolver CLI with MODEL_IDS=router-classified-3tier against the live default proxy using the available LLM credentials. Preflight recursed correctly, but this tier failed with Invalid model name passed in model=moonshot/kimi-k2.6; running the plain MODEL_IDS=kimi-k2.6 entry failed the same way. Until the proxy/model name is provisioned or this tier is changed to a reachable model, the new router model aborts before writing GITHUB_OUTPUT, so the dispatching end is not fully ready.
Automated QA finding generated by an AI agent (OpenHands) on behalf of the requester.
|
[Automatic Post]: It has been a while since there was any activity on this PR. @juanmichelini, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up. This comment was created by an AI agent (OpenHands) on behalf of the user. |
Summary
Companion PR to
OpenHands/benchmarks#742(per-instance intelligent model routing for the 5 default-agent benchmarks). That PR makes the receiving end ready: when a benchmark's--llm-config-pathpoints at anintelligent-router-v0JSON, each instance is classified once and the agent conversation is routed to the matching tier model.This PR makes the dispatching end ready:
resolve_model_config.MODELSnow contains arouter-classified-3tierentry whosellm_configis exactly that router payload, andcheck_model(preflight) knows how to recurse into the tier sub-models.After both PRs land, dispatching
Run Evalwithmodel_ids=router-classified-3tierwill produce a run that routes per instance instead of running a single model end-to-end. Until then the entry is dormant on the SDK side and harmless to existing flows.What's in this PR
.github/run-eval/resolve_model_config.pyrouter-classified-3tier; new helpersROUTER_CONFIG_KINDandis_router_config();check_model()now detects router entries and recurses into each tier sub-model via a new_check_router_tiers()helper..github/run-eval/ADDINGMODEL.mdtests/cross/test_resolve_model_config.pyRouterLLMConfigpydantic validator (mirrorsLLMConfig);EvalModelConfig.llm_configis nowRouterLLMConfig | LLMConfig; 14 new tests covering the registry entry, the predicate, and the recursive preflight.Total: +412 / −3 across 3 files.
The new MODELS entry
Each tier sub-config is byte-identical to the matching plain MODELS entry (
kimi-k2.6,minimax-m2.7,gpt-5.5), so all proxy provisioning that already works for those models keeps working here. The classifier reusesminimax-m2.7, exactly mirroringOpenHands/benchmarks's sample router config.Preflight: recursing into tier sub-models
A router payload has no top-level
"model"— so the existingcheck_modelwould have calledlitellm.completion(model="unknown", …)and failed in a confusing way. The new shape:_check_router_tiersrunscheck_modelon each tier sub-model and aggregates the result. Per-entry output stays a one-liner in the preflight summary, with indented per-tier diagnostics directly underneath:If any tier fails (provisioning, parameter shape, etc.) the aggregate fails and the per-tier failure line is surfaced so the cause is obvious from the workflow log.
Pydantic validator update
tests/cross/test_resolve_model_config.pyalready enforces that every MODELS entry validates againstEvalModelConfig. Without the router shape that test fails for the new entry because router payloads have nomodelfield. The fix is a newRouterLLMConfig(parallelsLLMConfig) andEvalModelConfig.llm_config: RouterLLMConfig | LLMConfig. Pydantic union resolution picksRouterLLMConfigfor payloads carryingkind: "intelligent-router-v0"andLLMConfigotherwise. Existing models are unaffected.RouterLLMConfigadditionally enforces internal consistency:classifier_model_id,fallback_model_id, everyroutingtarget, and everyvision_capable_model_idsentry must all be keys intiers. This catches typos at test-time instead of at run-time.New tests (14)
TestRouterClassified3Tier(5): the entry is router-shaped, refs are consistent, every tier is a validlitellm_proxy/…config, the iter5 5-category routing table is complete, the payload satisfiesRouterLLMConfig.TestIsRouterConfig(6): plain configs, missing-kind, missing-tiers, wrong-kind, canonical-payload, non-dict inputs.TestCheckModelRouterRecursion(4): all tiers succeed → router passes (withlitellm.completioncalled once per tier andmodel=correctly forwarded); one tier failure → router fails; emptytiersshort-circuits without ever calling litellm; per-tier parameters (temperature,top_p) are forwarded correctly.All tests use the existing
litellm.completion-mock pattern fromTestTestModel; no real network calls.Verification
uv run ruff format .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py— cleanuv run ruff check .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py—All checks passed!uv run pyright .github/run-eval/resolve_model_config.py—0 errors, 0 warnings, 0 informationsuv run pytest tests/cross/test_resolve_model_config.py— 58 passed (44 pre-existing + 14 new), 0 failed.find_models_by_id(["router-classified-3tier"])returns the full routerllm_configas themodels_jsonpayload that would be passed downstream.Out of scope (will be a separate PR)
The matching change to
OpenHands/evaluation/eval-job/scripts/build_matrix.pyis still needed for end-to-end dispatch. That script currently derives the GCS artifact slug fromllm_config["model"]and will exit withERROR: llm_config missing 'model'when handed a router payload. It needs to detectis_router_config(llm_config), fall back to deriving the slug from the entry'sid(e.g."router-classified-3tier"→"router-classified-3tier"), and otherwise pass thellm_configthrough to the benchmark untouched. That's a one-file change I can put up next; opening it separately to keep the two reviews independent.How to test end-to-end after the matching
evaluationPR landsRun Evalwithmodel_ids=router-classified-3tier,benchmark=swebench,eval_limit=10.metadata.routingis non-null in the resultingresults.tar.gz(vs.nullin the gpt-5.4 run we just looked at).benchmarks.utils.intelligent_routinglogger) likeintelligent-routing instance=… category=Frontend model=kimi-k2.6 ….output.jsonl[*].metrics.costs[*].modelcontains a mix of the three tier model strings instead of a single repeated value.This PR was prepared by an AI agent (OpenHands) on behalf of @juanmichelini.
@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:dc25347-pythonRun
All tags pushed for this build
About Multi-Architecture Support
dc25347-python) is a multi-arch manifest supporting both amd64 and arm64dc25347-python-amd64) are also available if needed