Update verified_models.py with latest completed benchmark models #3800
all-hands-bot wants to merge 1 commit
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
all-hands-bot left a comment
✅ QA Report: PASS
The PR successfully adds the completed benchmark model claude-opus-4-5 to the SDK's OpenHands verified model list, with no functional issues found in manual verification.
Does this PR achieve its stated goal?
Yes. The PR goal is to add models that completed all 5 benchmarks to verified_models.py; the referenced benchmark source contains results/claude-opus-4-5, and after applying the PR, SDK consumers importing VERIFIED_OPENHANDS_MODELS see claude-opus-4-5 where base main did not. I verified this by bootstrapping the repo and running the SDK import path directly, not by running tests.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed and installed the editable SDK packages. |
| CI Status | ⚠️ PR Description Check failing; other jobs pending (see Issues Found). |
| Functional Verification | ✅ Base/PR comparison confirms the OpenHands verified list now exposes the completed model. |
Functional Verification
Test 1: Completed benchmark source contains the model
Step 1 — Establish source baseline:
Ran Python against the PR-linked complete-models.json:
{"model-path": "results/claude-opus-4-5", "timestamp": "2026-06-12T13:00:20.000+00:00"}
This shows the authoritative completed-models source includes the OpenHands result path for claude-opus-4-5.
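The lookup above can be sketched in Python as follows. This is a minimal illustration, not the script that was actually run: it assumes complete-models.json is a JSON array of objects shaped like the entry shown, and both `find_completed_entry` and the inline sample are hypothetical.

```python
import json
from typing import Optional

# Inline sample shaped like the entry quoted above. That the real file is a
# JSON array of such objects is an assumption.
SAMPLE = (
    '[{"model-path": "results/claude-opus-4-5", '
    '"timestamp": "2026-06-12T13:00:20.000+00:00"}]'
)


def find_completed_entry(raw_json: str, model_path: str) -> Optional[dict]:
    """Return the first entry whose "model-path" equals model_path, else None."""
    for entry in json.loads(raw_json):
        if entry.get("model-path") == model_path:
            return entry
    return None


entry = find_completed_entry(SAMPLE, "results/claude-opus-4-5")
print(entry)
```

Returning the whole entry rather than a boolean keeps the timestamp available for the report.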
Test 2: SDK verified OpenHands model list before and after the PR
Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main and ran:
OPENHANDS_SUPPRESS_BANNER=1 uv run python -c 'from openhands.sdk.llm.utils import verified_models as vm; target="claude-opus-4-5"; print("branch_context=origin/main"); print("\n".join(f"{n}: contains_target={target in getattr(vm,n)}; size={len(getattr(vm,n))}" for n in ("VERIFIED_ANTHROPIC_MODELS","VERIFIED_OPENHANDS_MODELS")))'
Observed:
branch_context=origin/main
VERIFIED_ANTHROPIC_MODELS: contains_target=True; size=19
VERIFIED_OPENHANDS_MODELS: contains_target=False; size=39
This confirms the base branch already knew the Anthropic model slug, but OpenHands' verified model list did not expose the completed results/claude-opus-4-5 benchmark model.
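For readability, the one-line check used in this step can be unpacked into a small function. This is a sketch only: `summarize_membership` is a hypothetical helper, and a `SimpleNamespace` with placeholder entries stands in for the real `verified_models` module so the snippet runs without the SDK installed.

```python
from types import SimpleNamespace

LIST_NAMES = ("VERIFIED_ANTHROPIC_MODELS", "VERIFIED_OPENHANDS_MODELS")


def summarize_membership(module, target, names=LIST_NAMES):
    """Build one report line per list: is target present, and how long is the list."""
    lines = []
    for name in names:
        models = getattr(module, name)
        lines.append(
            f"{name}: contains_target={target in models}; size={len(models)}"
        )
    return lines


# Stand-in for `from openhands.sdk.llm.utils import verified_models as vm`.
# The entries are placeholders, not the SDK's real lists.
fake_vm = SimpleNamespace(
    VERIFIED_ANTHROPIC_MODELS=["claude-opus-4-5", "placeholder-a"],
    VERIFIED_OPENHANDS_MODELS=["placeholder-b"],
)

print("\n".join(summarize_membership(fake_vm, "claude-opus-4-5")))
```

Passing the module as an argument means the same function can be pointed at the real `vm` on either branch.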
Step 2 — Apply the PR's changes:
Checked out automated/update-verified-models at d404e7fb1ac0c56bbbe323fa08cb26925455d7f5.
Step 3 — Re-run with the fix in place:
Ran the same SDK import check:
branch_context=automated/update-verified-models
VERIFIED_ANTHROPIC_MODELS: contains_target=True; size=19
VERIFIED_OPENHANDS_MODELS: contains_target=True; size=40
This confirms the PR changes the actual SDK runtime data a user imports: claude-opus-4-5 is now present in VERIFIED_OPENHANDS_MODELS, and the list grew by exactly one entry.
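A size change of +1 alone would not rule out, say, two additions and one removal, so the "exactly one entry" claim is easier to trust when the two snapshots are diffed directly. A minimal sketch, assuming each list has been captured as a Python list on its branch; `diff_model_lists` and the sample data are hypothetical.

```python
def diff_model_lists(before, after):
    """Return (added, removed) model names between two snapshots, order preserved."""
    before_set, after_set = set(before), set(after)
    added = [m for m in after if m not in before_set]
    removed = [m for m in before if m not in after_set]
    return added, removed


# Placeholder snapshots; the real lists had 39 and 40 entries.
base = ["placeholder-a", "placeholder-b"]
pr = ["placeholder-a", "placeholder-b", "claude-opus-4-5"]

added, removed = diff_model_lists(base, pr)
print(added, removed)
```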
Test 3: Repository setup
Ran make build:
Dependencies installed successfully.
Pre-commit hooks installed successfully.
Build complete! Development environment is ready.
This confirms the repo could be bootstrapped for runtime verification. I did not run the test suite, linters, formatters, type checkers, or pre-commit checks.
Issues Found
None from functional QA. Note: CI currently has a failing PR Description Check and pending jobs, so merge readiness still depends on CI/human follow-up.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
d404e7f to b2ff154
b2ff154 to d8dbbb0
This PR was automatically created by the Update Complete Models workflow in openhands-index-results.
It adds models that have completed all 5 benchmarks to the verified model lists in verified_models.py.
See complete-models.json for the full list of completed models.
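As a rough illustration of what such an update involves, the sketch below derives model names from completed result paths and lists those not yet verified. It assumes a model name is the result path with a leading `results/` removed, which matches the single example in this PR but is not confirmed as the workflow's actual rule; `models_to_add` is a hypothetical name.

```python
def models_to_add(completed_paths, verified):
    """Return model names from completed_paths that are missing from verified."""
    prefix = "results/"  # assumed layout, based on "results/claude-opus-4-5"
    known = set(verified)
    missing = []
    for path in completed_paths:
        name = path[len(prefix):] if path.startswith(prefix) else path
        # Skip names already verified or already queued for addition.
        if name not in known and name not in missing:
            missing.append(name)
    return missing


print(
    models_to_add(
        ["results/claude-opus-4-5", "results/placeholder-a"],
        ["placeholder-a"],
    )
)
```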
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.13-nodejs22-slim
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d8dbbb0-python
About Multi-Architecture Support
• Each variant tag (e.g. d8dbbb0-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (e.g. d8dbbb0-python-amd64) are also available if needed
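The tag naming can be sketched as follows. It assumes tags follow the `<short-sha>-<variant>` and `<short-sha>-<variant>-<arch>` patterns seen in `d8dbbb0-python` and `d8dbbb0-python-amd64`; `image_tags` is a hypothetical helper, and the existence of an arm64-specific tag is inferred from the amd64 example rather than stated in this PR.

```python
IMAGE = "ghcr.io/openhands/agent-server"


def image_tags(short_sha, variant, arches=("amd64", "arm64")):
    """Return the multi-arch tag followed by per-architecture tags (assumed pattern)."""
    base = f"{IMAGE}:{short_sha}-{variant}"
    return [base] + [f"{base}-{arch}" for arch in arches]


for tag in image_tags("d8dbbb0", "python"):
    print(tag)
```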