Skip to content

feat(run-eval): add router-classified-3tier MODELS entry + recursive preflight#3636

Open
juanmichelini wants to merge 2 commits into
mainfrom
feat/router-classified-3tier-model
Open

feat(run-eval): add router-classified-3tier MODELS entry + recursive preflight#3636
juanmichelini wants to merge 2 commits into
mainfrom
feat/router-classified-3tier-model

Conversation

@juanmichelini

@juanmichelini juanmichelini commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

Companion PR to OpenHands/benchmarks#742 (per-instance intelligent model routing for the 5 default-agent benchmarks). That PR makes the receiving end ready: when a benchmark's --llm-config-path points at an intelligent-router-v0 JSON, each instance is classified once and the agent conversation is routed to the matching tier model.

This PR makes the dispatching end ready: resolve_model_config.MODELS now contains a router-classified-3tier entry whose llm_config is exactly that router payload, and check_model (preflight) knows how to recurse into the tier sub-models.

After both PRs land, dispatching Run Eval with model_ids=router-classified-3tier will produce a run that routes per instance instead of running a single model end-to-end. Until then the entry is dormant on the SDK side and harmless to existing flows.

What's in this PR

File Purpose
.github/run-eval/resolve_model_config.py New MODELS entry router-classified-3tier; new helpers ROUTER_CONFIG_KIND and is_router_config(); check_model() now detects router entries and recurses into each tier sub-model via a new _check_router_tiers() helper.
.github/run-eval/ADDINGMODEL.md New "Two kinds of MODELS entries" section documenting the plain-vs-router distinction and pointing at the canonical router entry.
tests/cross/test_resolve_model_config.py New RouterLLMConfig pydantic validator (mirrors LLMConfig); EvalModelConfig.llm_config is now RouterLLMConfig | LLMConfig; 14 new tests covering the registry entry, the predicate, and the recursive preflight.

Total: +412 / −3 across 3 files.

The new MODELS entry

"router-classified-3tier": {
    "id": "router-classified-3tier",
    "display_name": "Router (3-tier, classifier=minimax-m2.7)",
    "llm_config": {
        "kind": "intelligent-router-v0",
        "classifier_model_id": "minimax-m2.7",
        "fallback_model_id": "gpt-5.5",
        "tiers": {
            "kimi-k2.6":     {"model": "litellm_proxy/moonshot/kimi-k2.6",     "temperature": 1.0, "inline_image_urls": True},
            "minimax-m2.7":  {"model": "litellm_proxy/minimax/MiniMax-M2.7",   "temperature": 1.0, "top_p": 0.95},
            "gpt-5.5":       {"model": "litellm_proxy/openai/gpt-5.5",         "reasoning_effort": "high"},
        },
        "routing": {
            "Frontend":                 "kimi-k2.6",
            "Issue Resolution (other)": "minimax-m2.7",
            "Greenfield":               "gpt-5.5",
            "Testing":                  "gpt-5.5",
            "Information Gathering":    "gpt-5.5",
        },
        "vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"],
    },
},

Each tier sub-config is byte-identical to the matching plain MODELS entry (kimi-k2.6, minimax-m2.7, gpt-5.5), so all proxy provisioning that already works for those models keeps working here. The classifier reuses minimax-m2.7, exactly mirroring OpenHands/benchmarks's sample router config.

Preflight: recursing into tier sub-models

A router payload has no top-level "model" — so the existing check_model would have called litellm.completion(model="unknown", …) and failed in a confusing way. The new shape:

def check_model(model_config, api_key, base_url, timeout=60):
    llm_config = model_config.get("llm_config", {})
    if is_router_config(llm_config):
        return _check_router_tiers(model_config, api_key, base_url, timeout)
    # ... existing plain-model code path, unchanged

_check_router_tiers runs check_model on each tier sub-model and aggregates the result. Per-entry output stays a one-liner in the preflight summary, with indented per-tier diagnostics directly underneath:

  Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
  ✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))

If any tier fails (provisioning, parameter shape, etc.) the aggregate fails and the per-tier failure line is surfaced so the cause is obvious from the workflow log.

Pydantic validator update

tests/cross/test_resolve_model_config.py already enforces that every MODELS entry validates against EvalModelConfig. Without the router shape that test fails for the new entry because router payloads have no model field. The fix is a new RouterLLMConfig (parallels LLMConfig) and EvalModelConfig.llm_config: RouterLLMConfig | LLMConfig. Pydantic union resolution picks RouterLLMConfig for payloads carrying kind: "intelligent-router-v0" and LLMConfig otherwise. Existing models are unaffected.

RouterLLMConfig additionally enforces internal consistency: classifier_model_id, fallback_model_id, every routing target, and every vision_capable_model_ids entry must all be keys in tiers. This catches typos at test-time instead of at run-time.

New tests (14)

  • TestRouterClassified3Tier (5): the entry is router-shaped, refs are consistent, every tier is a valid litellm_proxy/… config, the iter5 5-category routing table is complete, the payload satisfies RouterLLMConfig.
  • TestIsRouterConfig (6): plain configs, missing-kind, missing-tiers, wrong-kind, canonical-payload, non-dict inputs.
  • TestCheckModelRouterRecursion (4): all tiers succeed → router passes (with litellm.completion called once per tier and model= correctly forwarded); one tier failure → router fails; empty tiers short-circuits without ever calling litellm; per-tier parameters (temperature, top_p) are forwarded correctly.

All tests use the existing litellm.completion-mock pattern from TestTestModel; no real network calls.

Verification

  • uv run ruff format .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py — clean
  • uv run ruff check .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.pyAll checks passed!
  • uv run pyright .github/run-eval/resolve_model_config.py0 errors, 0 warnings, 0 informations
  • uv run pytest tests/cross/test_resolve_model_config.py58 passed (44 pre-existing + 14 new), 0 failed.
  • Sanity-checked find_models_by_id(["router-classified-3tier"]) returns the full router llm_config as the models_json payload that would be passed downstream.

Out of scope (will be a separate PR)

The matching change to OpenHands/evaluation/eval-job/scripts/build_matrix.py is still needed for end-to-end dispatch. That script currently derives the GCS artifact slug from llm_config["model"] and will exit with ERROR: llm_config missing 'model' when handed a router payload. It needs to detect is_router_config(llm_config), fall back to deriving the slug from the entry's id (e.g. "router-classified-3tier""router-classified-3tier"), and otherwise pass the llm_config through to the benchmark untouched. That's a one-file change I can put up next; opening it separately to keep the two reviews independent.

How to test end-to-end after the matching evaluation PR lands

  1. Dispatch Run Eval with model_ids=router-classified-3tier, benchmark=swebench, eval_limit=10.
  2. Check that metadata.routing is non-null in the resulting results.tar.gz (vs. null in the gpt-5.4 run we just looked at).
  3. Confirm per-instance routing log lines (benchmarks.utils.intelligent_routing logger) like intelligent-routing instance=… category=Frontend model=kimi-k2.6 ….
  4. Confirm output.jsonl[*].metrics.costs[*].model contains a mix of the three tier model strings instead of a single repeated value.

This PR was prepared by an AI agent (OpenHands) on behalf of @juanmichelini.

@juanmichelini can click here to continue refining the PR


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:dc25347-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-dc25347-python \
  ghcr.io/openhands/agent-server:dc25347-python

All tags pushed for this build

ghcr.io/openhands/agent-server:dc25347-golang-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang-amd64
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:dc25347-golang-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang-arm64
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:dc25347-java-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java-amd64
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:dc25347-java-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java-arm64
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:dc25347-python-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python-amd64
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:dc25347-python-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python-arm64
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:dc25347-golang
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:dc25347-java
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:dc25347-python
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., dc25347-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., dc25347-python-amd64) are also available if needed

…preflight

Companion change to OpenHands/benchmarks#742 (intelligent per-instance model
routing). With this PR the SDK can dispatch a router-shaped llm_config to the
evaluation pipeline; the benchmarks side already understands the
intelligent-router-v0 shape and will classify each instance and route to the
matching tier model.

Changes:
- New MODELS entry 'router-classified-3tier' (classifier=minimax-m2.7,
  tiers={kimi-k2.6, minimax-m2.7, gpt-5.5}, default iter5 routing).
- New helpers ROUTER_CONFIG_KIND and is_router_config().
- check_model() now detects router entries and recurses into each tier
  sub-model, aggregating success/failure.
- Pydantic validator in tests learns about RouterLLMConfig and the registry's
  llm_config is now 'RouterLLMConfig | LLMConfig'.
- 14 new tests covering the new entry, is_router_config, and recursive preflight.

Note: the matching OpenHands/evaluation change to
eval-job/scripts/build_matrix.py (handle no-top-level-model router entries
when deriving the GCS slug) is required for end-to-end dispatch and will be
opened separately.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@juanmichelini juanmichelini marked this pull request as ready for review June 13, 2026 00:31

all-hands-bot commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: router-classified-3tier

Taste Rating

🟢 Good taste - Clean implementation with minimal complexity.

Analysis

This PR adds intelligent per-instance model routing. The design is sound: router config discriminator separates router entries from plain model entries, check_model recurses into tier sub-models during preflight, and pydantic validation catches internal consistency errors at test-time.

What works well:

  • is_router_config() is a clean, side-effect-free predicate
  • _check_router_tiers aggregates results cleanly without duplicating logic
  • RouterLLMConfig model validator enforces reference consistency
  • 14 new tests cover key paths with appropriate mocking

Style Notes (minor):

  • Block comment (~Line 440-455) explaining routing table is verbose - the table is self-evident from the code
  • Comment referencing build_matrix.py (~Line 572) may drift since that code is out-of-scope per PR

Risk Assessment: 🟢 LOW

Pure additive change. Existing plain-model paths unchanged. Pydantic union is backward-compatible.

Verdict

✅ Worth merging - Core logic sound, tests comprehensive, design extensible.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ QA Report: PASS WITH ISSUES

The router entry and recursive preflight behavior work locally, but the real Run Eval resolver CLI currently aborts because the live proxy rejects the kimi-k2.6 tier.

Does this PR achieve its stated goal?

Partially. The PR does add router-classified-3tier and changes preflight from the old model=unknown failure mode into recursive tier validation; I verified that against a local OpenAI-compatible endpoint with real litellm HTTP calls. However, exercising the actual resolver CLI as the workflow would (MODEL_IDS=router-classified-3tier) fails preflight against the default live proxy because moonshot/kimi-k2.6 is rejected, so the new model is not currently dispatch-ready in this environment.

Phase Result
Environment Setup uv run created/used the project environment and the resolver executed successfully.
CI Status 🟡 At refresh: 22 successful checks, 6 in progress, 3 skipped. I did not run tests/linters locally.
Functional Verification ⚠️ Resolver + recursion verified locally; live proxy preflight for the new router model fails.
Functional Verification

Test 1: Model resolution before/after

Step 1 — Establish baseline on origin/main:
Ran a short user-style resolver invocation for find_models_by_id(["router-classified-3tier"]):

has_router_entry= False
ERROR: Model ID 'router-classified-3tier' not found. Available models: ...
find_models_by_id_ok= False
SystemExit 1

This confirms the base branch cannot dispatch this model id at all.

Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9.

Step 3 — Re-run with the fix in place:
Ran the same resolver flow:

type= list
[
  {
    "display_name": "Router (3-tier, classifier=minimax-m2.7)",
    "id": "router-classified-3tier",
    "llm_config": {
      "classifier_model_id": "minimax-m2.7",
      "fallback_model_id": "gpt-5.5",
      "kind": "intelligent-router-v0",
      "tiers": { ... },
      "vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"]
    }
  }
]

This confirms the new model id resolves to a router-shaped payload with no top-level model.

Test 2: Recursive preflight behavior before/after

Step 1 — Establish baseline on origin/main:
Ran check_model() on a router-shaped config:

success= False
✗ Router Test: Bad request - litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=unknown

This confirms the old code path treated router configs as plain configs and tried model=unknown.

Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9 and started an in-process local OpenAI-compatible HTTP endpoint.

Step 3 — Re-run with the fix in place:
Ran check_model() on the real router-classified-3tier entry using the local endpoint:

has_router_entry= True
resolved_ids= ['router-classified-3tier']
is_router_config= True
top_level_model_present= False
tier_ids= ['gpt-5.5', 'kimi-k2.6', 'minimax-m2.7']
preflight_success= True
  Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))

Captured HTTP requests from litellm:

[
  {"path": "/chat/completions", "model": "moonshot/kimi-k2.6", "temperature": 1.0, "top_p": null, "reasoning_effort": null},
  {"path": "/chat/completions", "model": "minimax/MiniMax-M2.7", "temperature": 1.0, "top_p": 0.95, "reasoning_effort": null},
  {"path": "/chat/completions", "model": "openai/gpt-5.5", "temperature": null, "top_p": null, "reasoning_effort": "high"}
]

This confirms recursive preflight now hits each tier and forwards the per-tier parameters.

Test 3: Actual workflow-style CLI execution against the live proxy

Step 1 — Run the actual resolver CLI for the new model:
Ran:

LLM_API_KEY="$LLM_API_KEY" LITELLM_API_KEY="$LLM_API_KEY" OPENAI_API_KEY="$LLM_API_KEY"   MODEL_IDS=router-classified-3tier   GITHUB_OUTPUT=/tmp/resolve_model_config_output.txt   uv run python .github/run-eval/resolve_model_config.py

Observed:

Resolved 1 model(s): router-classified-3tier
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
Preflight LLM check for 1 model(s)...
  Checking Router (3-tier, classifier=minimax-m2.7)...
    Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✗ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✗ Router (3-tier, classifier=minimax-m2.7): one or more tiers failed
✗ Some models failed preflight check
ERROR: Preflight LLM check failed
exit_code=1
--- GITHUB_OUTPUT ---
(missing)

This shows the real workflow-style dispatch path currently aborts before producing GITHUB_OUTPUT.

Step 2 — Compare the underlying plain tier:
Ran the same CLI for MODEL_IDS=kimi-k2.6:

Resolved 1 model(s): kimi-k2.6
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
  Checking Kimi K2.6...
  ✗ Kimi K2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
ERROR: Preflight LLM check failed
exit_code=1

This suggests the recursion itself is working correctly, but the kimi-k2.6 tier is not currently usable through the live proxy credentials/environment I exercised.

Issues Found

  • 🟠 Issue: MODEL_IDS=router-classified-3tier is not currently dispatch-ready against the live default proxy because the kimi-k2.6 tier fails preflight with Invalid model name passed in model=moonshot/kimi-k2.6. The plain kimi-k2.6 entry fails the same way, so this looks like a proxy provisioning/model-name issue rather than a recursion bug, but it still blocks the PR’s stated dispatch-readiness goal.

Automated QA review generated by an AI agent (OpenHands) on behalf of the requester.

"fallback_model_id": "gpt-5.5",
"tiers": {
"kimi-k2.6": {
"model": "litellm_proxy/moonshot/kimi-k2.6",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: I exercised the actual resolver CLI with MODEL_IDS=router-classified-3tier against the live default proxy using the available LLM credentials. Preflight recursed correctly, but this tier failed with Invalid model name passed in model=moonshot/kimi-k2.6; running the plain MODEL_IDS=kimi-k2.6 entry failed the same way. Until the proxy/model name is provisioned or this tier is changed to a reachable model, the new router model aborts before writing GITHUB_OUTPUT, so the dispatching end is not fully ready.

Automated QA finding generated by an AI agent (OpenHands) on behalf of the requester.

@all-hands-bot

Copy link
Copy Markdown
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @juanmichelini, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

This comment was created by an AI agent (OpenHands) on behalf of the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants