feat(run-eval): add `router-classified-3tier` MODELS entry + recursive preflight by juanmichelini · Pull Request #3636 · OpenHands/software-agent-sdk

juanmichelini · 2026-06-10T17:03:16Z

Summary

Companion PR to OpenHands/benchmarks#742 (per-instance intelligent model routing for the 5 default-agent benchmarks). That PR makes the receiving end ready: when a benchmark's --llm-config-path points at an intelligent-router-v0 JSON, each instance is classified once and the agent conversation is routed to the matching tier model.

This PR makes the dispatching end ready: resolve_model_config.MODELS now contains a router-classified-3tier entry whose llm_config is exactly that router payload, and check_model (preflight) knows how to recurse into the tier sub-models.

After both PRs land, dispatching Run Eval with model_ids=router-classified-3tier will produce a run that routes per instance instead of running a single model end-to-end. Until then the entry is dormant on the SDK side and harmless to existing flows.

What's in this PR

File	Purpose
`.github/run-eval/resolve_model_config.py`	New MODELS entry `router-classified-3tier`; new helpers `ROUTER_CONFIG_KIND` and `is_router_config()`; `check_model()` now detects router entries and recurses into each tier sub-model via a new `_check_router_tiers()` helper.
`.github/run-eval/ADDINGMODEL.md`	New "Two kinds of MODELS entries" section documenting the plain-vs-router distinction and pointing at the canonical router entry.
`tests/cross/test_resolve_model_config.py`	New `RouterLLMConfig` pydantic validator (mirrors `LLMConfig`); `EvalModelConfig.llm_config` is now `RouterLLMConfig \| LLMConfig`; 14 new tests covering the registry entry, the predicate, and the recursive preflight.

Total: +412 / −3 across 3 files.

The new MODELS entry

"router-classified-3tier": {
    "id": "router-classified-3tier",
    "display_name": "Router (3-tier, classifier=minimax-m2.7)",
    "llm_config": {
        "kind": "intelligent-router-v0",
        "classifier_model_id": "minimax-m2.7",
        "fallback_model_id": "gpt-5.5",
        "tiers": {
            "kimi-k2.6":     {"model": "litellm_proxy/moonshot/kimi-k2.6",     "temperature": 1.0, "inline_image_urls": True},
            "minimax-m2.7":  {"model": "litellm_proxy/minimax/MiniMax-M2.7",   "temperature": 1.0, "top_p": 0.95},
            "gpt-5.5":       {"model": "litellm_proxy/openai/gpt-5.5",         "reasoning_effort": "high"},
        },
        "routing": {
            "Frontend":                 "kimi-k2.6",
            "Issue Resolution (other)": "minimax-m2.7",
            "Greenfield":               "gpt-5.5",
            "Testing":                  "gpt-5.5",
            "Information Gathering":    "gpt-5.5",
        },
        "vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"],
    },
},

Each tier sub-config is byte-identical to the matching plain MODELS entry (kimi-k2.6, minimax-m2.7, gpt-5.5), so all proxy provisioning that already works for those models keeps working here. The classifier reuses minimax-m2.7, exactly mirroring OpenHands/benchmarks's sample router config.

Preflight: recursing into tier sub-models

A router payload has no top-level "model" — so the existing check_model would have called litellm.completion(model="unknown", …) and failed in a confusing way. The new shape:

def check_model(model_config, api_key, base_url, timeout=60):
    llm_config = model_config.get("llm_config", {})
    if is_router_config(llm_config):
        return _check_router_tiers(model_config, api_key, base_url, timeout)
    # ... existing plain-model code path, unchanged

_check_router_tiers runs check_model on each tier sub-model and aggregates the result. Per-entry output stays a one-liner in the preflight summary, with indented per-tier diagnostics directly underneath:

  Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
  ✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))

If any tier fails (provisioning, parameter shape, etc.) the aggregate fails and the per-tier failure line is surfaced so the cause is obvious from the workflow log.

Pydantic validator update

tests/cross/test_resolve_model_config.py already enforces that every MODELS entry validates against EvalModelConfig. Without the router shape that test fails for the new entry because router payloads have no model field. The fix is a new RouterLLMConfig (parallels LLMConfig) and EvalModelConfig.llm_config: RouterLLMConfig | LLMConfig. Pydantic union resolution picks RouterLLMConfig for payloads carrying kind: "intelligent-router-v0" and LLMConfig otherwise. Existing models are unaffected.

RouterLLMConfig additionally enforces internal consistency: classifier_model_id, fallback_model_id, every routing target, and every vision_capable_model_ids entry must all be keys in tiers. This catches typos at test-time instead of at run-time.

New tests (14)

TestRouterClassified3Tier (5): the entry is router-shaped, refs are consistent, every tier is a valid litellm_proxy/… config, the iter5 5-category routing table is complete, the payload satisfies RouterLLMConfig.
TestIsRouterConfig (6): plain configs, missing-kind, missing-tiers, wrong-kind, canonical-payload, non-dict inputs.
TestCheckModelRouterRecursion (4): all tiers succeed → router passes (with litellm.completion called once per tier and model= correctly forwarded); one tier failure → router fails; empty tiers short-circuits without ever calling litellm; per-tier parameters (temperature, top_p) are forwarded correctly.

All tests use the existing litellm.completion-mock pattern from TestTestModel; no real network calls.

Verification

uv run ruff format .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py — clean
uv run ruff check .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py — All checks passed!
uv run pyright .github/run-eval/resolve_model_config.py — 0 errors, 0 warnings, 0 informations
uv run pytest tests/cross/test_resolve_model_config.py — 58 passed (44 pre-existing + 14 new), 0 failed.
Sanity-checked find_models_by_id(["router-classified-3tier"]) returns the full router llm_config as the models_json payload that would be passed downstream.

Out of scope (will be a separate PR)

The matching change to OpenHands/evaluation/eval-job/scripts/build_matrix.py is still needed for end-to-end dispatch. That script currently derives the GCS artifact slug from llm_config["model"] and will exit with ERROR: llm_config missing 'model' when handed a router payload. It needs to detect is_router_config(llm_config), fall back to deriving the slug from the entry's id (e.g. "router-classified-3tier" → "router-classified-3tier"), and otherwise pass the llm_config through to the benchmark untouched. That's a one-file change I can put up next; opening it separately to keep the two reviews independent.

How to test end-to-end after the matching `evaluation` PR lands

Dispatch Run Eval with model_ids=router-classified-3tier, benchmark=swebench, eval_limit=10.
Check that metadata.routing is non-null in the resulting results.tar.gz (vs. null in the gpt-5.4 run we just looked at).
Confirm per-instance routing log lines (benchmarks.utils.intelligent_routing logger) like intelligent-routing instance=… category=Frontend model=kimi-k2.6 ….
Confirm output.jsonl[*].metrics.costs[*].model contains a mix of the three tier model strings instead of a single repeated value.

This PR was prepared by an AI agent (OpenHands) on behalf of @juanmichelini.

@juanmichelini can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:dc25347-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-dc25347-python \
  ghcr.io/openhands/agent-server:dc25347-python

All tags pushed for this build

ghcr.io/openhands/agent-server:dc25347-golang-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang-amd64
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:dc25347-golang-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang-arm64
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:dc25347-java-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java-amd64
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:dc25347-java-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java-arm64
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:dc25347-python-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python-amd64
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:dc25347-python-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python-arm64
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:dc25347-golang
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:dc25347-java
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:dc25347-python
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., dc25347-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., dc25347-python-amd64) are also available if needed

…preflight Companion change to OpenHands/benchmarks#742 (intelligent per-instance model routing). With this PR the SDK can dispatch a router-shaped llm_config to the evaluation pipeline; the benchmarks side already understands the intelligent-router-v0 shape and will classify each instance and route to the matching tier model. Changes: - New MODELS entry 'router-classified-3tier' (classifier=minimax-m2.7, tiers={kimi-k2.6, minimax-m2.7, gpt-5.5}, default iter5 routing). - New helpers ROUTER_CONFIG_KIND and is_router_config(). - check_model() now detects router entries and recurses into each tier sub-model, aggregating success/failure. - Pydantic validator in tests learns about RouterLLMConfig and the registry's llm_config is now 'RouterLLMConfig | LLMConfig'. - 14 new tests covering the new entry, is_router_config, and recursive preflight. Note: the matching OpenHands/evaluation change to eval-job/scripts/build_matrix.py (handle no-top-level-model router entries when deriving the GCS slug) is required for end-to-end dispatch and will be opened separately. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-06-10T17:03:42Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-10T17:03:55Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

all-hands-bot · 2026-06-13T00:33:01Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: router-classified-3tier

Taste Rating

🟢 Good taste - Clean implementation with minimal complexity.

Analysis

This PR adds intelligent per-instance model routing. The design is sound: router config discriminator separates router entries from plain model entries, check_model recurses into tier sub-models during preflight, and pydantic validation catches internal consistency errors at test-time.

What works well:

is_router_config() is a clean, side-effect-free predicate
_check_router_tiers aggregates results cleanly without duplicating logic
RouterLLMConfig model validator enforces reference consistency
14 new tests cover key paths with appropriate mocking

Style Notes (minor):

Block comment (~Line 440-455) explaining routing table is verbose - the table is self-evident from the code
Comment referencing build_matrix.py (~Line 572) may drift since that code is out-of-scope per PR

Risk Assessment: 🟢 LOW

Pure additive change. Existing plain-model paths unchanged. Pydantic union is backward-compatible.

Verdict

✅ Worth merging - Core logic sound, tests comprehensive, design extensible.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

⚠️ QA Report: PASS WITH ISSUES

The router entry and recursive preflight behavior work locally, but the real Run Eval resolver CLI currently aborts because the live proxy rejects the kimi-k2.6 tier.

Does this PR achieve its stated goal?

Partially. The PR does add router-classified-3tier and changes preflight from the old model=unknown failure mode into recursive tier validation; I verified that against a local OpenAI-compatible endpoint with real litellm HTTP calls. However, exercising the actual resolver CLI as the workflow would (MODEL_IDS=router-classified-3tier) fails preflight against the default live proxy because moonshot/kimi-k2.6 is rejected, so the new model is not currently dispatch-ready in this environment.

Phase	Result
Environment Setup	✅ `uv run` created/used the project environment and the resolver executed successfully.
CI Status	🟡 At refresh: 22 successful checks, 6 in progress, 3 skipped. I did not run tests/linters locally.
Functional Verification	⚠️ Resolver + recursion verified locally; live proxy preflight for the new router model fails.

Functional Verification

Test 1: Model resolution before/after

Step 1 — Establish baseline on origin/main:
Ran a short user-style resolver invocation for find_models_by_id(["router-classified-3tier"]):

has_router_entry= False
ERROR: Model ID 'router-classified-3tier' not found. Available models: ...
find_models_by_id_ok= False
SystemExit 1

This confirms the base branch cannot dispatch this model id at all.

Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9.

Step 3 — Re-run with the fix in place:
Ran the same resolver flow:

type= list
[
  {
    "display_name": "Router (3-tier, classifier=minimax-m2.7)",
    "id": "router-classified-3tier",
    "llm_config": {
      "classifier_model_id": "minimax-m2.7",
      "fallback_model_id": "gpt-5.5",
      "kind": "intelligent-router-v0",
      "tiers": { ... },
      "vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"]
    }
  }
]

This confirms the new model id resolves to a router-shaped payload with no top-level model.

Test 2: Recursive preflight behavior before/after

Step 1 — Establish baseline on origin/main:
Ran check_model() on a router-shaped config:

success= False
✗ Router Test: Bad request - litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=unknown

This confirms the old code path treated router configs as plain configs and tried model=unknown.

Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9 and started an in-process local OpenAI-compatible HTTP endpoint.

Step 3 — Re-run with the fix in place:
Ran check_model() on the real router-classified-3tier entry using the local endpoint:

has_router_entry= True
resolved_ids= ['router-classified-3tier']
is_router_config= True
top_level_model_present= False
tier_ids= ['gpt-5.5', 'kimi-k2.6', 'minimax-m2.7']
preflight_success= True
  Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))

Captured HTTP requests from litellm:

[
  {"path": "/chat/completions", "model": "moonshot/kimi-k2.6", "temperature": 1.0, "top_p": null, "reasoning_effort": null},
  {"path": "/chat/completions", "model": "minimax/MiniMax-M2.7", "temperature": 1.0, "top_p": 0.95, "reasoning_effort": null},
  {"path": "/chat/completions", "model": "openai/gpt-5.5", "temperature": null, "top_p": null, "reasoning_effort": "high"}
]

This confirms recursive preflight now hits each tier and forwards the per-tier parameters.

Test 3: Actual workflow-style CLI execution against the live proxy

Step 1 — Run the actual resolver CLI for the new model:
Ran:

LLM_API_KEY="$LLM_API_KEY" LITELLM_API_KEY="$LLM_API_KEY" OPENAI_API_KEY="$LLM_API_KEY"   MODEL_IDS=router-classified-3tier   GITHUB_OUTPUT=/tmp/resolve_model_config_output.txt   uv run python .github/run-eval/resolve_model_config.py

Observed:

Resolved 1 model(s): router-classified-3tier
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
Preflight LLM check for 1 model(s)...
  Checking Router (3-tier, classifier=minimax-m2.7)...
    Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✗ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✗ Router (3-tier, classifier=minimax-m2.7): one or more tiers failed
✗ Some models failed preflight check
ERROR: Preflight LLM check failed
exit_code=1
--- GITHUB_OUTPUT ---
(missing)

This shows the real workflow-style dispatch path currently aborts before producing GITHUB_OUTPUT.

Step 2 — Compare the underlying plain tier:
Ran the same CLI for MODEL_IDS=kimi-k2.6:

Resolved 1 model(s): kimi-k2.6
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
  Checking Kimi K2.6...
  ✗ Kimi K2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
ERROR: Preflight LLM check failed
exit_code=1

This suggests the recursion itself is working correctly, but the kimi-k2.6 tier is not currently usable through the live proxy credentials/environment I exercised.

Issues Found

🟠 Issue: MODEL_IDS=router-classified-3tier is not currently dispatch-ready against the live default proxy because the kimi-k2.6 tier fails preflight with Invalid model name passed in model=moonshot/kimi-k2.6. The plain kimi-k2.6 entry fails the same way, so this looks like a proxy provisioning/model-name issue rather than a recursion bug, but it still blocks the PR’s stated dispatch-readiness goal.

Automated QA review generated by an AI agent (OpenHands) on behalf of the requester.

all-hands-bot · 2026-06-13T00:40:04Z

+            "fallback_model_id": "gpt-5.5",
+            "tiers": {
+                "kimi-k2.6": {
+                    "model": "litellm_proxy/moonshot/kimi-k2.6",


🟠 Important: I exercised the actual resolver CLI with MODEL_IDS=router-classified-3tier against the live default proxy using the available LLM credentials. Preflight recursed correctly, but this tier failed with Invalid model name passed in model=moonshot/kimi-k2.6; running the plain MODEL_IDS=kimi-k2.6 entry failed the same way. Until the proxy/model name is provisioned or this tier is changed to a reachable model, the new router model aborts before writing GITHUB_OUTPUT, so the dispatching end is not fully ready.

Automated QA finding generated by an AI agent (OpenHands) on behalf of the requester.

all-hands-bot · 2026-06-18T13:37:36Z

[Automatic Post]: It has been a while since there was any activity on this PR. @juanmichelini, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

This comment was created by an AI agent (OpenHands) on behalf of the user.

juanmichelini requested a review from all-hands-bot June 13, 2026 00:31

juanmichelini marked this pull request as ready for review June 13, 2026 00:31

Merge branch 'main' into feat/router-classified-3tier-model

dc25347

all-hands-bot reviewed Jun 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(run-eval): add `router-classified-3tier` MODELS entry + recursive preflight#3636

feat(run-eval): add `router-classified-3tier` MODELS entry + recursive preflight#3636
juanmichelini wants to merge 2 commits into
mainfrom
feat/router-classified-3tier-model

juanmichelini commented Jun 10, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

all-hands-bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot Jun 13, 2026

Uh oh!

all-hands-bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

juanmichelini commented Jun 10, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR

The new MODELS entry

Preflight: recursing into tier sub-models

Pydantic validator update

New tests (14)

Verification

Out of scope (will be a separate PR)

How to test end-to-end after the matching evaluation PR lands

Uh oh!

github-actions Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

all-hands-bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review: router-classified-3tier

Taste Rating

Analysis

What works well:

Style Notes (minor):

Risk Assessment: 🟢 LOW

Verdict

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

⚠️ QA Report: PASS WITH ISSUES

Does this PR achieve its stated goal?

Test 1: Model resolution before/after

Test 2: Recursive preflight behavior before/after

Test 3: Actual workflow-style CLI execution against the live proxy

Issues Found

Uh oh!

all-hands-bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

juanmichelini commented Jun 10, 2026 •

edited by github-actions Bot

Loading

How to test end-to-end after the matching `evaluation` PR lands

github-actions Bot commented Jun 10, 2026 •

edited

Loading

github-actions Bot commented Jun 10, 2026 •

edited

Loading

all-hands-bot commented Jun 13, 2026 •

edited

Loading