Update verified_models.py with latest completed benchmark models by all-hands-bot · Pull Request #3800 · OpenHands/software-agent-sdk

all-hands-bot · 2026-06-18T20:37:48Z

This PR was automatically created by the Update Complete Models workflow in openhands-index-results.

It adds models that have completed all 5 benchmarks to the verified model lists in verified_models.py.

See complete-models.json for the full list of completed models.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | `eclipse-temurin:17-jdk` | Link |
| python | amd64, arm64 | `nikolaik/python-nodejs:python3.13-nodejs22-slim` | Link |
| golang | amd64, arm64 | `golang:1.21-bookworm` | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d8dbbb0-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-d8dbbb0-python \
  ghcr.io/openhands/agent-server:d8dbbb0-python

All tags pushed for this build

ghcr.io/openhands/agent-server:d8dbbb0-golang-amd64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-golang-amd64
ghcr.io/openhands/agent-server:automated-update-verified-models-golang-amd64
ghcr.io/openhands/agent-server:d8dbbb0-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:d8dbbb0-golang-arm64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-golang-arm64
ghcr.io/openhands/agent-server:automated-update-verified-models-golang-arm64
ghcr.io/openhands/agent-server:d8dbbb0-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:d8dbbb0-java-amd64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-java-amd64
ghcr.io/openhands/agent-server:automated-update-verified-models-java-amd64
ghcr.io/openhands/agent-server:d8dbbb0-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:d8dbbb0-java-arm64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-java-arm64
ghcr.io/openhands/agent-server:automated-update-verified-models-java-arm64
ghcr.io/openhands/agent-server:d8dbbb0-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:d8dbbb0-python-amd64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-python-amd64
ghcr.io/openhands/agent-server:automated-update-verified-models-python-amd64
ghcr.io/openhands/agent-server:d8dbbb0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:d8dbbb0-python-arm64
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-python-arm64
ghcr.io/openhands/agent-server:automated-update-verified-models-python-arm64
ghcr.io/openhands/agent-server:d8dbbb0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:d8dbbb0-golang
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-golang
ghcr.io/openhands/agent-server:automated-update-verified-models-golang
ghcr.io/openhands/agent-server:d8dbbb0-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:d8dbbb0-java
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-java
ghcr.io/openhands/agent-server:automated-update-verified-models-java
ghcr.io/openhands/agent-server:d8dbbb0-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:d8dbbb0-python
ghcr.io/openhands/agent-server:d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7-python
ghcr.io/openhands/agent-server:automated-update-verified-models-python
ghcr.io/openhands/agent-server:d8dbbb0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim
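
The tags listed above appear to follow a regular pattern: four prefixes per variant (short SHA, full SHA, branch name with `/` replaced by `-`, and short SHA plus a slugified base image), each with an optional architecture suffix. The sketch below reproduces that pattern in Python. It is inferred only from the tags in this comment, not taken from the build workflow, and the names `slugify_base_image`, `tags_for`, and `expected_tags` are hypothetical.

```python
REGISTRY = "ghcr.io/openhands/agent-server"

# Variant name -> base image, in the order the variants appear in the tag list.
VARIANTS = {
    "golang": "golang:1.21-bookworm",
    "java": "eclipse-temurin:17-jdk",
    "python": "nikolaik/python-nodejs:python3.13-nodejs22-slim",
}


def slugify_base_image(image: str) -> str:
    """Apply the substitutions visible in the tags: '/' -> '_s_', ':' -> '_tag_'."""
    return image.replace("/", "_s_").replace(":", "_tag_")


def tags_for(short_sha, full_sha, branch_slug, variant, base_image, suffix):
    """Return the four tags for one variant and one (possibly empty) arch suffix."""
    prefixes = [
        f"{short_sha}-{variant}",
        f"{full_sha}-{variant}",
        f"{branch_slug}-{variant}",
        f"{short_sha}-{slugify_base_image(base_image)}",
    ]
    return [f"{REGISTRY}:{prefix}{suffix}" for prefix in prefixes]


def expected_tags(short_sha: str, full_sha: str, branch: str) -> list:
    """Build the tag list in the order shown above (assumed pattern)."""
    branch_slug = branch.replace("/", "-")
    tags = []
    # Per-architecture tags come first.
    for variant, base_image in VARIANTS.items():
        for arch in ("amd64", "arm64"):
            tags += tags_for(short_sha, full_sha, branch_slug, variant, base_image, f"-{arch}")
    # Multi-arch manifest tags carry no architecture suffix.
    for variant, base_image in VARIANTS.items():
        tags += tags_for(short_sha, full_sha, branch_slug, variant, base_image, "")
    return tags


tags = expected_tags(
    "d8dbbb0",
    "d8dbbb0bfc5691a509d2af9456ec95daf20f9fb7",
    "automated/update-verified-models",
)
print(len(tags))
```

Comparing the generated list with the published one is a quick way to spot a missing variant or architecture, under the assumption that the pattern holds for future builds.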

About Multi-Architecture Support

• Each variant tag (e.g., d8dbbb0-python) is a multi-arch manifest supporting both amd64 and arm64.
• Docker automatically pulls the correct architecture for your platform.
• Individual architecture tags (e.g., d8dbbb0-python-amd64) are also available if needed.

github-actions · 2026-06-18T20:38:18Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-18T20:38:30Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

all-hands-bot

✅ QA Report: PASS

The PR successfully adds the completed benchmark model claude-opus-4-5 to the SDK's OpenHands verified model list, with no functional issues found in manual verification.

Does this PR achieve its stated goal?

Yes. The PR goal is to add models that completed all 5 benchmarks to verified_models.py; the referenced benchmark source contains results/claude-opus-4-5, and after applying the PR, SDK consumers importing VERIFIED_OPENHANDS_MODELS see claude-opus-4-5 where base main did not. I verified this by bootstrapping the repo and running the SDK import path directly, not by running tests.

| Phase | Result |
| --- | --- |
| Environment Setup | ✅ `make build` completed and installed the editable SDK packages. |
| CI Status | ⚠️ GitHub reports 19 successful, 1 failing PR Description Check, 11 pending, and 14 skipped checks at review time. |
| Functional Verification | ✅ Base/PR comparison confirms the OpenHands verified list now exposes the completed model. |

Functional Verification

Test 1: Completed benchmark source contains the model

Step 1 — Establish source baseline:
Ran python against the PR-linked complete-models.json:

{"model-path": "results/claude-opus-4-5", "timestamp": "2026-06-12T13:00:20.000+00:00"}

This shows the authoritative completed-models source includes the OpenHands result path for claude-opus-4-5.
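
A check like the one in Step 1 could look roughly like the following. This is a sketch, not the command that was actually run: it assumes complete-models.json is a JSON array of objects shaped like the entry quoted above, and the helper name `has_model_path` is hypothetical.

```python
import json


def has_model_path(raw_json: str, model_path: str) -> bool:
    """Return True if any entry's "model-path" equals model_path.

    Assumes the document is a JSON array of objects like the quoted entry.
    """
    entries = json.loads(raw_json)
    return any(entry.get("model-path") == model_path for entry in entries)


# Inline sample built from the single entry quoted in this report.
sample = json.dumps([
    {
        "model-path": "results/claude-opus-4-5",
        "timestamp": "2026-06-12T13:00:20.000+00:00",
    },
])

print(has_model_path(sample, "results/claude-opus-4-5"))  # True
```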

Test 2: SDK verified OpenHands model list before and after the PR

Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main and ran:

OPENHANDS_SUPPRESS_BANNER=1 uv run python -c 'from openhands.sdk.llm.utils import verified_models as vm; target="claude-opus-4-5"; print("branch_context=origin/main"); print("\n".join(f"{n}: contains_target={target in getattr(vm,n)}; size={len(getattr(vm,n))}" for n in ("VERIFIED_ANTHROPIC_MODELS","VERIFIED_OPENHANDS_MODELS")))'

Observed:

branch_context=origin/main
VERIFIED_ANTHROPIC_MODELS: contains_target=True; size=19
VERIFIED_OPENHANDS_MODELS: contains_target=False; size=39

This confirms the base branch already knew the Anthropic model slug, but OpenHands' verified model list did not expose the completed results/claude-opus-4-5 benchmark model.

Step 2 — Apply the PR's changes:
Checked out automated/update-verified-models at d404e7fb1ac0c56bbbe323fa08cb26925455d7f5.

Step 3 — Re-run with the fix in place:
Ran the same SDK import check:

branch_context=automated/update-verified-models
VERIFIED_ANTHROPIC_MODELS: contains_target=True; size=19
VERIFIED_OPENHANDS_MODELS: contains_target=True; size=40

This confirms the PR changes the actual SDK runtime data a user imports: claude-opus-4-5 is now present in VERIFIED_OPENHANDS_MODELS, and the list grew by exactly one entry.
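
The one-line check used in Steps 1 and 3 is dense, so the same logic is written out below as a small function. This is a readability sketch, not the reviewer's script: `membership_report` is a hypothetical name, and the stand-in namespace only mimics the two list names and the PR-branch sizes reported above so that the example runs without the SDK installed.

```python
from types import SimpleNamespace


def membership_report(module, target, list_names):
    """Return one line per list: whether it contains target, and its size."""
    lines = []
    for name in list_names:
        models = getattr(module, name)
        lines.append(
            f"{name}: contains_target={target in models}; size={len(models)}"
        )
    return lines


# Stand-in for the verified_models module. The placeholder entries are made up;
# only the target's presence and the list sizes mirror the PR-branch output.
fake_vm = SimpleNamespace(
    VERIFIED_ANTHROPIC_MODELS=["claude-opus-4-5"]
    + [f"placeholder-{i}" for i in range(18)],
    VERIFIED_OPENHANDS_MODELS=["claude-opus-4-5"]
    + [f"placeholder-{i}" for i in range(39)],
)

names = ("VERIFIED_ANTHROPIC_MODELS", "VERIFIED_OPENHANDS_MODELS")
print("\n".join(membership_report(fake_vm, "claude-opus-4-5", names)))
```

With the SDK installed, passing the real module (`from openhands.sdk.llm.utils import verified_models as vm`) in place of `fake_vm` should produce the same kind of report for whichever branch is checked out.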

Test 3: Repository setup

Ran make build:

Dependencies installed successfully.
Pre-commit hooks installed successfully.
Build complete! Development environment is ready.

This confirms the repo could be bootstrapped for runtime verification. I did not run the test suite, linters, formatters, type checkers, or pre-commit checks.

Issues Found

None from functional QA. Note: CI currently has a failing PR Description Check and pending jobs, so merge readiness still depends on CI/human follow-up.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot added the automated label Jun 18, 2026

all-hands-bot assigned juanmichelini Jun 18, 2026

all-hands-bot force-pushed the automated/update-verified-models branch from d404e7f to b2ff154 Compare June 18, 2026 23:14

Update verified_models.py with latest completed benchmark models

d8dbbb0

all-hands-bot force-pushed the automated/update-verified-models branch from b2ff154 to d8dbbb0 Compare June 19, 2026 18:32

all-hands-bot wants to merge 1 commit into main from automated/update-verified-models

3 participants
