feat(sdk): add /goal SDK core (judge-driven goal-completion loop) by VascoSch92 · Pull Request #3769 · OpenHands/software-agent-sdk

VascoSch92 · 2026-06-17T08:26:02Z

HUMAN:

I added the code to the SDK for the /goal command. You can find an example of how it works in the .pr folder. The idea is to follow the same architecture as the Critic, making the goal command an extension that can be applied to a conversation.

A human has tested these changes.

AGENT:

Why

The SDK has no built-in way to drive a long-running objective to verified
completion: the agent finishes a turn, but nothing checks the goal is actually
done. This adds a judge-driven /goal loop (#3569) — the agent works, a second
LLM judges whether the objective is complete, and the loop re-prompts with that
feedback until it is (or a max_iterations cap is hit).

Summary

judge_goal + GoalVerdict: the reusable kernel — renders the conversation
transcript (the agent system prompt is excluded to keep judge token cost down)
and asks a judge LLM for a strict-JSON verdict.
GoalController: transport-agnostic continue-vs-stop decision logic + the
iteration cap, with no I/O.
run_goal: a thin synchronous driver over the controller (send objective →
run agent → judge → re-prompt). It composes with any existing critic — the
critic governs each inner run(), this loop governs the overall objective.

No agent-server dependency; fully self-contained.
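
The three pieces listed above can be pictured with a small, self-contained sketch. This is not the SDK's code: `Verdict`, `Controller`, `drive`, their fields, and the wording of the follow-up prompt are hypothetical stand-ins chosen for illustration, and the agent and judge are plain callables rather than real LLMs. Only the overall shape (send objective, run agent, judge, re-prompt, stop on completion or at the cap) and the empty-objective error message come from this PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    """Hypothetical stand-in for GoalVerdict; the field names are assumptions."""

    complete: bool
    score: float
    feedback: str


class Controller:
    """Pure continue-vs-stop logic plus the iteration cap; performs no I/O."""

    def __init__(self, max_iterations: int) -> None:
        if max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        self.max_iterations = max_iterations
        self.iteration = 0

    def decide(self, verdict: Verdict) -> str:
        """Return 'complete', 'capped', or 'continue' for one audit round."""
        self.iteration += 1
        if verdict.complete:
            return "complete"
        if self.iteration >= self.max_iterations:
            return "capped"
        return "continue"


def drive(
    objective: str,
    run_agent: Callable[[str], None],
    judge: Callable[[str], Verdict],
    max_iterations: int = 5,
) -> tuple[str, int]:
    """Thin synchronous driver: run the agent, judge, re-prompt until done or capped."""
    if not objective.strip():
        raise ValueError("Goal objective must not be empty.")
    controller = Controller(max_iterations)
    prompt = objective
    while True:
        run_agent(prompt)
        verdict = judge(objective)
        status = controller.decide(verdict)
        if status != "continue":
            return status, controller.iteration
        # Assumed follow-up wording; the real prompt text is not shown in this PR page.
        prompt = f"The goal is not yet complete. Outstanding: {verdict.feedback}"


# Scripted stand-ins for the agent and the judge (no API key needed).
history: list[str] = []
verdicts = iter([Verdict(False, 0.3, "tests missing"), Verdict(True, 1.0, "")])
outcome = drive("write tests", history.append, lambda _objective: next(verdicts))
print(outcome, len(history))
```

Keeping the decision logic in a class with no I/O is what would let a different transport, such as the follow-up agent-server loop, reuse it without touching the synchronous driver.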

Issue Number

#3569

How to Test

Unit tests:

uv run pytest tests/sdk/conversation/goal/ -q

End-to-end, no API key required (deterministic scripted LLMs). This proves
the goal work lands in the same conversation history as the main chat (no fork):

uv run python .pr/goal_shared_history.py

Observed tail:

===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))

Run against a real LLM instead (creates files, runs pytest):

GOAL_DEMO_REAL=1 LLM_API_KEY=sk-... LLM_MODEL=gpt-5.5 uv run python .pr/goal_shared_history.py

Video/Screenshots

N/A — library change. The .pr/goal_shared_history.py output above is the
end-to-end evidence.

Type

Feature

Notes

Stacked work: the agent-server integration (HTTP endpoint, background loop,
goal-status events, stop/resume) is a follow-up PR opened against this branch.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:5982000-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-5982000-python \
  ghcr.io/openhands/agent-server:5982000-python

All tags pushed for this build

ghcr.io/openhands/agent-server:5982000-golang-amd64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-golang-amd64
ghcr.io/openhands/agent-server:vasco-goal-sdk-golang-amd64
ghcr.io/openhands/agent-server:5982000-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:5982000-golang-arm64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-golang-arm64
ghcr.io/openhands/agent-server:vasco-goal-sdk-golang-arm64
ghcr.io/openhands/agent-server:5982000-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:5982000-java-amd64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-java-amd64
ghcr.io/openhands/agent-server:vasco-goal-sdk-java-amd64
ghcr.io/openhands/agent-server:5982000-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:5982000-java-arm64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-java-arm64
ghcr.io/openhands/agent-server:vasco-goal-sdk-java-arm64
ghcr.io/openhands/agent-server:5982000-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:5982000-python-amd64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-python-amd64
ghcr.io/openhands/agent-server:vasco-goal-sdk-python-amd64
ghcr.io/openhands/agent-server:5982000-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:5982000-python-arm64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-python-arm64
ghcr.io/openhands/agent-server:vasco-goal-sdk-python-arm64
ghcr.io/openhands/agent-server:5982000-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:5982000-golang
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-golang
ghcr.io/openhands/agent-server:vasco-goal-sdk-golang
ghcr.io/openhands/agent-server:5982000-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:5982000-java
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-java
ghcr.io/openhands/agent-server:vasco-goal-sdk-java
ghcr.io/openhands/agent-server:5982000-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:5982000-python
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-python
ghcr.io/openhands/agent-server:vasco-goal-sdk-python
ghcr.io/openhands/agent-server:5982000-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

• Each variant tag (e.g., 5982000-python) is a multi-arch manifest supporting both amd64 and arm64.
• Docker automatically pulls the correct architecture for your platform.
• Individual architecture tags (e.g., 5982000-python-amd64) are also available if needed.

Add openhands.sdk.conversation.goal: a conversation-level "/goal" driver that pursues an objective by running the agent, judging completion with a second LLM, and re-prompting until the goal is done or a cap is reached.

- judge_goal + GoalVerdict: the reusable objective+transcript -> verdict kernel (renders the transcript, excluding the system prompt, and asks a judge LLM for a strict-JSON verdict).
- GoalController: transport-agnostic continue-vs-stop decision logic and the iteration cap.
- run_goal: a thin synchronous driver over the controller that composes with any existing critic (the critic governs each inner run(); this loop governs the overall objective).

Self-contained, with no agent-server dependency. Includes a runnable demo under .pr/ proving the goal work lands in the same conversation history.

Relates to #3569.

github-actions · 2026-06-17T08:26:24Z

✅ PR Artifacts Cleaned Up

The .pr/ directory has been automatically removed.

github-actions · 2026-06-17T08:26:42Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-17T08:26:45Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-17T08:29:07Z

Coverage Report

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/conversation/goal
judge.py	50	2	96%	113–114
TOTAL	32149	8816	72%

all-hands-bot

⚠️ QA Report: PASS WITH ISSUES

The new SDK /goal entrypoint worked end-to-end in deterministic user-style runs: it drove the same conversation through judge feedback, completed after re-prompting, and capped when the judge never approved.

Does this PR achieve its stated goal?

Yes. On main, importing openhands.sdk.conversation.goal fails because no built-in /goal SDK module exists; on this PR, the documented demo and an independent SDK script both run successfully. The demo showed the goal objective and follow-up appended to the same Conversation history (3 -> 7 events, same conversation id, complete after 2 rounds), and the cap script showed conservative handling of an unparseable judge response plus status=capped at max_iterations=2.

Phase	Result
Environment Setup	✅ `make build` completed successfully; no tests/linters/pre-commit were run.
CI Status	⚠️ 19 checks passing, 10 pending, 1 failing: `Validate PR description` (human-only PR description field).
Functional Verification	✅ SDK `/goal` import, shared-history loop, completion, follow-up, and cap behavior were exercised via real `uv run python` commands.

Functional Verification

Test 1: Baseline main does not have a `/goal` SDK entrypoint

Step 1 — Reproduce / establish baseline (without the fix):
Ran cd /tmp/qa-goal-main && uv run python - <<'PY' ... from openhands.sdk.conversation.goal import run_goal ... PY against a detached origin/main worktree:

ModuleNotFoundError: No module named 'openhands.sdk.conversation.goal'

This establishes the pre-PR state: the SDK had no importable built-in /goal core.

Step 2 — Apply the PR's changes:
Used the checked-out PR branch vasco/goal-sdk at commit c32f5dc99430ef9eae52b1ceb490a4e54e3e5bf5.

Step 3 — Re-run with the fix in place:
Ran the documented deterministic demo command:

OPENHANDS_SUPPRESS_BANNER=1 uv run python .pr/goal_shared_history.py

Relevant output:

mode: DETERMINISTIC (scripted TestLLM)
===== AFTER MAIN CONVERSATION TURN =====
conversation id : ce5aec49-267c-4d64-aa7e-4f88de8633bd
total events    : 3
...
Goal audit 1/5: score=0.30 complete=False
Goal audit 2/5: score=1.00 complete=True
===== AFTER /goal LOOP (SAME CONVERSATION) =====
conversation id : ce5aec49-267c-4d64-aa7e-4f88de8633bd
total events    : 7
...
===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))

This confirms the PR delivers the main feature claim: run_goal uses the existing conversation, appends the objective/follow-up/work to the same event log, and stops only after the judge returns complete.
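
The shared-history checks in the proof block can be illustrated with a toy model. This is a sketch under stated assumptions, not the demo script: `ToyConversation`, `toy_goal_loop`, and the event strings are all invented here, and the real Conversation event log holds richer event objects. It only shows what "same conversation id", "grew in place", and "main-convo events still present" mean as checks.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyConversation:
    """Hypothetical stand-in for a Conversation: an id plus an append-only log."""

    id: uuid.UUID = field(default_factory=uuid.uuid4)
    events: list[str] = field(default_factory=list)


def toy_goal_loop(conversation: ToyConversation, objective: str) -> None:
    """Append goal work to the conversation passed in, never to a copy or fork."""
    conversation.events += [
        f"user: {objective}",
        "agent: first attempt",
        "user: follow-up built from judge feedback",
        "agent: second attempt",
    ]


# Three events from the main chat, mirroring the 3 -> 7 growth in the output above.
conversation = ToyConversation(events=["system", "user: hello", "agent: hi"])
id_before = conversation.id
events_before = list(conversation.events)

toy_goal_loop(conversation, "produce a verified artifact")

same_id = conversation.id == id_before
grew_in_place = len(conversation.events) > len(events_before)
prefix_preserved = conversation.events[: len(events_before)] == events_before
objective_in_log = any("produce a verified artifact" in e for e in conversation.events)
print(same_id, grew_in_place, prefix_preserved, objective_in_log)
print(len(events_before), "->", len(conversation.events))
```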

Test 2: Goal loop caps instead of silently succeeding when the judge does not approve

Step 1 — Reproduce / establish baseline (without the fix):
The same baseline import check above shows this behavior could not be exercised on main because openhands.sdk.conversation.goal did not exist.

Step 2 — Apply the PR's changes:
Used the checked-out PR branch and called the SDK directly from a short Python script with a scripted agent and scripted judge.

Step 3 — Re-run with the fix in place:
Ran a direct SDK script that imports run_goal, creates a Conversation, uses an unparseable first judge verdict, then a second incomplete verdict with max_iterations=2.

Relevant output:

judge_goal: could not parse verdict: 'not json at all'
Goal audit 1/2: score=0.00 complete=False
Goal audit 2/2: score=0.40 complete=False
outcome capped 2 False artifact missing
message_count 5
followup_present True
objective_present True
tail produce a verified artifact | first attempt | The goal is NOT yet complete (audit iteration 1). Outstanding: Judge verdict could not be parsed. ... | second attempt

This confirms the loop conservatively continues after an unparseable judge response, preserves the objective/follow-up in the real conversation history, and returns a capped outcome at the configured iteration limit instead of reporting success.
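
The conservative handling seen above (an unparseable reply becomes score=0.00, complete=False, with "Judge verdict could not be parsed." as the outstanding item) might look roughly like the following. This is a minimal sketch, not the contents of judge.py: `ParsedVerdict`, `parse_verdict`, and the JSON key names `complete`, `score`, and `feedback` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedVerdict:
    """Hypothetical verdict shape; the real GoalVerdict fields may differ."""

    complete: bool
    score: float
    feedback: str


def parse_verdict(raw: str) -> ParsedVerdict:
    """Parse a strict-JSON verdict; treat anything malformed as 'not complete'."""
    fallback = ParsedVerdict(False, 0.0, "Judge verdict could not be parsed.")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    # Strict: require a JSON object with a real boolean, not a truthy string.
    if not isinstance(data, dict) or not isinstance(data.get("complete"), bool):
        return fallback
    score = data.get("score", 0.0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return fallback
    clamped = min(max(float(score), 0.0), 1.0)
    return ParsedVerdict(data["complete"], clamped, str(data.get("feedback", "")))


print(parse_verdict("not json at all"))
print(parse_verdict('{"complete": true, "score": 1.0, "feedback": ""}'))
```

Failing closed matters here because a parse error that defaulted to "complete" would let the loop report success without any judgment having been made.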

Issues Found

🟡 Minor: The documented deterministic demo leaves a generated conversation persistence directory in the repository root after a normal run; I added an inline comment on the source line that caused git status to show ?? ce5aec49267c4d64aa7e4f88de8633bd/.
⚠️ CI / process: Validate PR description is failing. Per repository policy, the human-only PR description field must be updated by a human, not by this QA agent.

This review was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot

✅ QA Report: PASS

The new SDK /goal entry point works in deterministic SDK usage: it imports only on the PR branch, completes after judge approval, preserves the same conversation history, caps when still incomplete, and rejects empty objectives.

Does this PR achieve its stated goal?

Yes. The stated goal was to add SDK core for a judge-driven goal-completion loop that drives the existing Conversation, re-prompts from judge feedback, stops on completion or max_iterations, and keeps all work in shared history. I exercised that API with scripted SDK LLMs and the author’s deterministic demo; both showed complete after 2 judge rounds, history growth 3 -> 7 with original events preserved, objective/followup present in the same log, and capped behavior at max_iterations=2.

Phase	Result
Environment Setup	✅ `make build` completed successfully and installed the uv-managed dev environment.
CI Status	⚠️ Snapshot showed `mergeStateStatus=BLOCKED`: 18 success, 1 skipped, 11 in progress, and 1 cancelled review-thread-gate check. I did not run tests locally.
Functional Verification	✅ Deterministic SDK usage and `.pr/goal_shared_history.py` both exercised the new `/goal` behavior successfully.

Functional Verification

Test 1: SDK `/goal` API behavior before and after the PR

Step 1 — Reproduce / establish baseline (without the feature):
Ran git checkout --detach origin/main; uv run python /tmp/qa_goal_behavior.py with a script that imports openhands.sdk.conversation.goal.run_goal and drives a scripted SDK Conversation:

Traceback (most recent call last):
  File "/tmp/qa_goal_behavior.py", line 2, in <module>
    from openhands.sdk.conversation.goal import run_goal
ModuleNotFoundError: No module named 'openhands.sdk.conversation.goal'
baseline_exit=1

This establishes the baseline: the SDK had no /goal module/API on origin/main.

Step 2 — Apply the PR's changes:
Checked out vasco/goal-sdk and reset to 5ee822da8f1d0a85515dd99ee4a5c4523c9fcb13.

Step 3 — Re-run with the PR in place:
Ran the same uv run python /tmp/qa_goal_behavior.py script. It created real SDK Conversation objects with deterministic TestLLMs, sent a normal message, ran run_goal(...), then exercised completion, capped, and invalid-input outcomes:

complete_outcome complete 2
shared_history 3 -> 7 True
objective_in_history True
followup_in_history True
capped_outcome capped 2
empty_objective_error Goal objective must not be empty.
pr_exit=0

This shows the new API works as claimed: the loop continued once on judge feedback, returned complete after the second audit, appended to the same history while preserving existing events, stopped as capped when the judge never completed within the cap, and rejected an empty objective.

Test 2: Author-provided shared-history demo

Step 1 — Baseline:
The same origin/main import failure above establishes that the demo’s /goal dependency does not exist before the PR.

Step 2 — Apply the PR's changes:
Used the PR branch at commit 5ee822da8f1d0a85515dd99ee4a5c4523c9fcb13.

Step 3 — Run the real demo command:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python .pr/goal_shared_history.py:

demo_exit=0
===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))

This independently confirms the user-facing example behaves as the PR description claims: /goal drives the existing conversation rather than forking a sidecar.

Issues Found

None.

Final verdict: PASS.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

xingyaoww

LGTM as long as you can maybe add an example script and show that it works?

Nit: we could also make the GoalController a subclass of "Critic" too!

VascoSch92 · 2026-06-18T09:35:12Z

@OpenHands Create a page in OpenHands/Docs to document the new feature added in this PR. Follow the structure that the other docs pages use. Also, open a draft PR. The branch name should match the name of this PR's branch.

openhands-development · 2026-06-18T09:35:17Z

@VascoSch92 it looks like you haven't created an OpenHands account yet. Please sign up at OpenHands Cloud and try again.

openhands-ai · 2026-06-18T09:35:28Z

I'm on it! VascoSch92 can track my progress at all-hands.dev

openhands-ai · 2026-06-18T09:40:09Z

Summary

I've completed the user's request: create a docs page for the new /goal feature in OpenHands/Docs, following the structure of other docs pages, and open a draft PR with a branch matching the SDK PR's branch name.

Checklist

Docs page created — sdk/guides/convo-goal.mdx, modeled after sdk/guides/critic.mdx and sdk/guides/convo-fork.mdx (same frontmatter, RunExampleCode snippet, section layout: Overview → How It Works → Quick Start → Result tables → Parameters → Composing with a Critic → Lower-Level Building Blocks → Ready-to-run Example → Next Steps).
Followed existing structure — frontmatter title/description, leading "ready-to-run example" pointer, Note callouts, parameter tables, embedded example linked to the SDK PR's examples/01_standalone_sdk/54_goal_completion_loop.py.
Navigation updated — added sdk/guides/convo-goal to docs.json under the Conversation Features group (where conversation-level drivers like convo-fork, convo-ask-agent, and hooks already live).
JSON validated — confirmed docs.json still parses.
Branch name matches the SDK PR — vasco/goal-sdk.
Draft PR opened — docs(sdk): document /goal judge-driven goal-completion loop docs#580 (the repo was renamed Docs → docs; GitHub redirected the push). The PR description links back to SDK PR feat(sdk): add /goal SDK core (judge-driven goal-completion loop) #3769 and includes the AI-disclosure note on behalf of @VascoSch92.

Conciseness

The change set is minimal and scoped to exactly what was requested — only two files touched:

sdk/guides/convo-goal.mdx (new)
docs.json (one line added under "Conversation Features")

No extraneous edits to llms.txt/llms-full.txt (those are auto-generated by Mintlify) and no changes to unrelated guides.

VascoSch92 · 2026-06-18T09:54:22Z

Tested locally that the example is working properly

======================================================================
Goal complete after 2 audit round(s).
Judge score: 1.00
Workspace: /var/folders/27/789vf87x4dn_zh358_3nk83m0000gq/T/goal_demo_idx2wyka
======================================================================
EXAMPLE_COST: 0.189634

VascoSch92 mentioned this pull request Jun 17, 2026

feat(agent-server): add /goal agent-server endpoint, background loop, and stop/resume #3770

Open

2 tasks

VascoSch92 changed the title ~~feat: add /goal SDK core (judge-driven goal-completion loop)~~ feat(sdk): add /goal SDK core (judge-driven goal-completion loop) Jun 17, 2026

VascoSch92 marked this pull request as draft June 17, 2026 08:29

all-hands-bot reviewed Jun 17, 2026

Comment thread .pr/goal_shared_history.py Outdated

VascoSch92 and others added 2 commits June 17, 2026 14:58

chore(sdk): tidy /goal core — temp-dir demo persistence, drop unneede…

2e37bea

…d future imports

Merge branch 'main' into vasco/goal-sdk

5ee822d

VascoSch92 marked this pull request as ready for review June 17, 2026 13:17

all-hands-bot reviewed Jun 17, 2026

xingyaoww approved these changes Jun 17, 2026

allhands-bot and others added 4 commits June 17, 2026 15:26

chore: Remove PR-only artifacts [automated]

6fe198c

Merge branch 'main' into vasco/goal-sdk

ef4b808

Merge branch 'main' into vasco/goal-sdk

8e0f859

docs(examples): add /goal judge-driven completion-loop example

5982000

VascoSch92 mentioned this pull request Jun 18, 2026

docs(sdk): document /goal judge-driven goal-completion loop OpenHands/docs#580

Open

VascoSch92 merged commit cf56307 into main Jun 18, 2026
36 of 37 checks passed

VascoSch92 deleted the vasco/goal-sdk branch June 18, 2026 09:54
