Update verified_models.py with latest completed benchmark models #3800

Open
all-hands-bot wants to merge 1 commit into main from automated/update-verified-models

@all-hands-bot all-hands-bot commented Jun 18, 2026

This PR was automatically created by the Update Complete Models workflow in openhands-index-results.

It adds models that have completed all 5 benchmarks to the verified model lists in verified_models.py.

See complete-models.json for the full list of completed models.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d8dbbb0-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-d8dbbb0-python \
  ghcr.io/openhands/agent-server:d8dbbb0-python

All tags pushed for this build

ghcr.io/openhands/agent-server:d8dbbb0-golang-amd64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-golang-amd64
ghcr.io/openhands/agent-server:automated-update-verified-models-golang-amd64
ghcr.io/openhands/agent-server:d8dbbb0-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:d8dbbb0-golang-arm64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-golang-arm64
ghcr.io/openhands/agent-server:automated-update-verified-models-golang-arm64
ghcr.io/openhands/agent-server:d8dbbb0-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:d8dbbb0-java-amd64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-java-amd64
ghcr.io/openhands/agent-server:automated-update-verified-models-java-amd64
ghcr.io/openhands/agent-server:d8dbbb0-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:d8dbbb0-java-arm64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-java-arm64
ghcr.io/openhands/agent-server:automated-update-verified-models-java-arm64
ghcr.io/openhands/agent-server:d8dbbb0-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:d8dbbb0-python-amd64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-python-amd64
ghcr.io/openhands/agent-server:automated-update-verified-models-python-amd64
ghcr.io/openhands/agent-server:d8dbbb0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:d8dbbb0-python-arm64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-python-arm64
ghcr.io/openhands/agent-server:automated-update-verified-models-python-arm64
ghcr.io/openhands/agent-server:d8dbbb0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:d8dbbb0-golang
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-golang
ghcr.io/openhands/agent-server:automated-update-verified-models-golang
ghcr.io/openhands/agent-server:d8dbbb0-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:d8dbbb0-java
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-java
ghcr.io/openhands/agent-server:automated-update-verified-models-java
ghcr.io/openhands/agent-server:d8dbbb0-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:d8dbbb0-python
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-python
ghcr.io/openhands/agent-server:automated-update-verified-models-python
ghcr.io/openhands/agent-server:d8dbbb0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., d8dbbb0-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., d8dbbb0-python-amd64) are also available if needed
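
The tag list above appears to follow a regular naming pattern, which the sketch below reproduces. This is inferred only from the tags shown in this comment and is not the project's actual build script; the function names `sanitize_base_image`, `tags_for`, and `all_tags` are hypothetical, and the substitution rules (`/` becomes `_s_` and `:` becomes `_tag_` in base image names, `/` becomes `-` in the branch name) are assumptions read off the listing.

```python
REPO = "ghcr.io/openhands/agent-server"

# Variant -> base image, as listed in the "Variants & Base Images" table.
VARIANTS = {
    "golang": "golang:1.21-bookworm",
    "java": "eclipse-temurin:17-jdk",
    "python": "nikolaik/python-nodejs:python3.13-nodejs22-slim",
}
ARCHES = ("amd64", "arm64")


def sanitize_base_image(image):
    """Assumed rule: '/' -> '_s_' and ':' -> '_tag_' (inferred from the tags)."""
    return image.replace("/", "_s_").replace(":", "_tag_")


def tags_for(short_sha, full_sha, branch, variant, base_image, arch=None):
    """Return the four tags for one variant; arch=None means the multi-arch tag."""
    suffix = f"-{arch}" if arch else ""
    stems = [
        f"{short_sha}-{variant}",
        f"{full_sha}-{variant}",
        f"{branch.replace('/', '-')}-{variant}",
        f"{short_sha}-{sanitize_base_image(base_image)}",
    ]
    return [f"{REPO}:{stem}{suffix}" for stem in stems]


def all_tags(short_sha, full_sha, branch):
    """Per-architecture tags first, then the multi-arch manifest tags."""
    tags = []
    for variant, base in VARIANTS.items():
        for arch in ARCHES:
            tags += tags_for(short_sha, full_sha, branch, variant, base, arch)
    for variant, base in VARIANTS.items():
        tags += tags_for(short_sha, full_sha, branch, variant, base)
    return tags


if __name__ == "__main__":
    for tag in all_tags(
        "d8dbbb0",
        "d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7",
        "automated/update-verified-models",
    ):
        print(tag)
```

Each variant gets four tags per architecture plus four multi-arch tags, which accounts for the 36 tags pushed for this build.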

github-actions Bot commented Jun 18, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

github-actions Bot commented Jun 18, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

@all-hands-bot all-hands-bot left a comment

✅ QA Report: PASS

The PR successfully adds the completed benchmark model claude-opus-4-5 to the SDK's OpenHands verified model list, with no functional issues found in manual verification.

Does this PR achieve its stated goal?

Yes. The PR goal is to add models that completed all 5 benchmarks to verified_models.py; the referenced benchmark source contains results/claude-opus-4-5, and after applying the PR, SDK consumers importing VERIFIED_OPENHANDS_MODELS see claude-opus-4-5 where base main did not. I verified this by bootstrapping the repo and running the SDK import path directly, not by running tests.

| Phase | Result |
| --- | --- |
| Environment Setup | make build completed and installed the editable SDK packages. |
| CI Status | ⚠️ GitHub reports 19 successful, 1 failing PR Description Check, 11 pending, and 14 skipped checks at review time. |
| Functional Verification | ✅ Base/PR comparison confirms the OpenHands verified list now exposes the completed model. |

Functional Verification

Test 1: Completed benchmark source contains the model

Step 1 — Establish source baseline:
Ran python against the PR-linked complete-models.json:

{"model-path": "results/claude-opus-4-5", "timestamp": "2026-06-12T13:00:20.000+00:00"}

This shows the authoritative completed-models source includes the OpenHands result path for claude-opus-4-5.
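
The source check in Test 1 can be sketched as a small script. This is a minimal illustration, assuming complete-models.json is a JSON array of objects shaped like the entry above and that the model slug is the last path segment of `model-path`; the helper name `completed_model_slugs` and the inline `SAMPLE` string are hypothetical stand-ins, not part of the workflow.

```python
import json


def completed_model_slugs(raw_json):
    """Return model slugs from a complete-models.json payload.

    Assumes a JSON array of objects with a "model-path" key such as
    "results/claude-opus-4-5"; the slug is taken as the last path segment.
    """
    entries = json.loads(raw_json)
    return [entry["model-path"].rsplit("/", 1)[-1] for entry in entries]


# Stand-in payload built from the single entry quoted in this report.
SAMPLE = (
    '[{"model-path": "results/claude-opus-4-5", '
    '"timestamp": "2026-06-12T13:00:20.000+00:00"}]'
)

if __name__ == "__main__":
    slugs = completed_model_slugs(SAMPLE)
    print("claude-opus-4-5" in slugs)
```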

Test 2: SDK verified OpenHands model list before and after the PR

Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main and ran:

OPENHANDS_SUPPRESS_BANNER=1 uv run python -c 'from openhands.sdk.llm.utils import verified_models as vm; target="claude-opus-4-5"; print("branch_context=origin/main"); print("\n".join(f"{n}: contains_target={target in getattr(vm,n)}; size={len(getattr(vm,n))}" for n in ("VERIFIED_ANTHROPIC_MODELS","VERIFIED_OPENHANDS_MODELS")))'

Observed:

branch_context=origin/main
VERIFIED_ANTHROPIC_MODELS: contains_target=True; size=19
VERIFIED_OPENHANDS_MODELS: contains_target=False; size=39

This confirms the base branch already knew the Anthropic model slug, but OpenHands' verified model list did not expose the completed results/claude-opus-4-5 benchmark model.

Step 2 — Apply the PR's changes:
Checked out automated/update-verified-models at d404e7fb1ac0c56bbbe323fa08cb26925455d7f5.

Step 3 — Re-run with the fix in place:
Ran the same SDK import check:

branch_context=automated/update-verified-models
VERIFIED_ANTHROPIC_MODELS: contains_target=True; size=19
VERIFIED_OPENHANDS_MODELS: contains_target=True; size=40

This confirms the PR changes the actual SDK runtime data a user imports: claude-opus-4-5 is now present in VERIFIED_OPENHANDS_MODELS, and the list grew by exactly one entry.
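
For readability, the one-line import check used in Steps 1 and 3 can be written as a small function. This is a sketch rather than the exact command that was run: `membership_report` is a hypothetical helper, and the `SimpleNamespace` objects below are stand-ins with placeholder entries sized to match the observed output, so the example runs without the SDK installed. With the SDK available, the real module from `openhands.sdk.llm.utils import verified_models` could be passed instead.

```python
from types import SimpleNamespace


def membership_report(module, target, list_names):
    """Report, for each named list on `module`, whether it contains `target`."""
    lines = []
    for name in list_names:
        models = getattr(module, name)
        lines.append(
            f"{name}: contains_target={target in models}; size={len(models)}"
        )
    return lines


LIST_NAMES = ("VERIFIED_ANTHROPIC_MODELS", "VERIFIED_OPENHANDS_MODELS")
TARGET = "claude-opus-4-5"

# Placeholder data only: the real lists live in verified_models.py.
anthropic = [TARGET] + [f"placeholder-a-{i}" for i in range(18)]
openhands_base = [f"placeholder-o-{i}" for i in range(39)]

base = SimpleNamespace(
    VERIFIED_ANTHROPIC_MODELS=anthropic,
    VERIFIED_OPENHANDS_MODELS=openhands_base,
)
pr = SimpleNamespace(
    VERIFIED_ANTHROPIC_MODELS=anthropic,
    VERIFIED_OPENHANDS_MODELS=openhands_base + [TARGET],
)

if __name__ == "__main__":
    for label, module in (("origin/main", base), ("PR branch", pr)):
        print(f"branch_context={label}")
        print("\n".join(membership_report(module, TARGET, LIST_NAMES)))
```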

Test 3: Repository setup

Ran make build:

Dependencies installed successfully.
Pre-commit hooks installed successfully.
Build complete! Development environment is ready.

This confirms the repo could be bootstrapped for runtime verification. I did not run the test suite, linters, formatters, type checkers, or pre-commit checks.

Issues Found

None from functional QA. Note: CI currently has a failing PR Description Check and pending jobs, so merge readiness still depends on CI/human follow-up.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot force-pushed the automated/update-verified-models branch from d404e7f to b2ff154 on June 18, 2026 23:14
all-hands-bot force-pushed the automated/update-verified-models branch from b2ff154 to d8dbbb0 on June 19, 2026 18:32