feat(sdk): add /goal SDK core (judge-driven goal-completion loop) #3769
Add openhands.sdk.conversation.goal: a conversation-level "/goal" driver that pursues an objective by running the agent, judging completion with a second LLM, and re-prompting until the goal is done or a cap is reached.
- judge_goal + GoalVerdict: the reusable objective+transcript -> verdict kernel (renders the transcript, excluding the system prompt, and asks a judge LLM for a strict-JSON verdict).
- GoalController: transport-agnostic continue-vs-stop decision logic and the iteration cap.
- run_goal: a thin synchronous driver over the controller that composes with any existing critic (the critic governs each inner run(); this loop governs the overall objective).
Self-contained, with no agent-server dependency. Includes a runnable demo under .pr/ proving the goal work lands in the same conversation history. Relates to #3569.
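The run, judge, re-prompt cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SDK's implementation: `pursue_goal`, `Verdict`, and the callable parameters are hypothetical names, and the real `run_goal` signature is not shown in this PR conversation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Hypothetical stand-in for a judge verdict (not the SDK's GoalVerdict)."""
    complete: bool
    feedback: str = ""


def pursue_goal(
    objective: str,
    run_agent: Callable[[str], None],
    judge: Callable[[str], Verdict],
    max_iterations: int = 5,
) -> tuple[str, int]:
    """Run the agent, judge the result, and re-prompt until done or capped."""
    if not objective.strip():
        raise ValueError("Goal objective must not be empty.")
    prompt = objective
    for round_no in range(1, max_iterations + 1):
        run_agent(prompt)            # one inner agent run
        verdict = judge(objective)   # a second model audits the objective
        if verdict.complete:
            return "complete", round_no
        # Feed the judge's feedback back to the agent as the next prompt.
        prompt = f"The goal is NOT yet complete. Outstanding: {verdict.feedback}"
    return "capped", max_iterations


# Scripted stand-ins: the judge approves on the second audit.
prompts: list[str] = []
verdicts = iter([Verdict(False, "artifact missing"), Verdict(True)])
outcome = pursue_goal(
    "produce a verified artifact", prompts.append, lambda _: next(verdicts)
)
```

Passing the agent and the judge in as plain callables keeps the loop free of transport concerns, which is the same separation the commit message attributes to GoalController and run_goal.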
✅ PR Artifacts Cleaned Up
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
all-hands-bot
left a comment
⚠️ QA Report: PASS WITH ISSUES
The new SDK /goal entrypoint worked end-to-end in deterministic user-style runs: it drove the same conversation through judge feedback, completed after re-prompting, and capped when the judge never approved.
Does this PR achieve its stated goal?
Yes. On main, importing openhands.sdk.conversation.goal fails because no built-in /goal SDK module exists; on this PR, the documented demo and an independent SDK script both run successfully. The demo showed the goal objective and follow-up appended to the same Conversation history (3 -> 7 events, same conversation id, complete after 2 rounds), and the cap script showed conservative handling of an unparseable judge response plus status=capped at max_iterations=2.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; no tests/linters/pre-commit were run. |
| CI Status | ⚠️ Validate PR description is failing (human-only PR description field). |
| Functional Verification | ✅ SDK /goal import, shared-history loop, completion, follow-up, and cap behavior were exercised via real uv run python commands. |
Functional Verification
Test 1: Baseline main does not have a /goal SDK entrypoint
Step 1 — Reproduce / establish baseline (without the fix):
Ran cd /tmp/qa-goal-main && uv run python - <<'PY' ... from openhands.sdk.conversation.goal import run_goal ... PY against a detached origin/main worktree:
ModuleNotFoundError: No module named 'openhands.sdk.conversation.goal'
This establishes the pre-PR state: the SDK had no importable built-in /goal core.
Step 2 — Apply the PR's changes:
Used the checked-out PR branch vasco/goal-sdk at commit c32f5dc99430ef9eae52b1ceb490a4e54e3e5bf5.
Step 3 — Re-run with the fix in place:
Ran the documented deterministic demo command:
OPENHANDS_SUPPRESS_BANNER=1 uv run python .pr/goal_shared_history.py
Relevant output:
mode: DETERMINISTIC (scripted TestLLM)
===== AFTER MAIN CONVERSATION TURN =====
conversation id : ce5aec49-267c-4d64-aa7e-4f88de8633bd
total events : 3
...
Goal audit 1/5: score=0.30 complete=False
Goal audit 2/5: score=1.00 complete=True
===== AFTER /goal LOOP (SAME CONVERSATION) =====
conversation id : ce5aec49-267c-4d64-aa7e-4f88de8633bd
total events : 7
...
===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))
This confirms the PR delivers the main feature claim: run_goal uses the existing conversation, appends the objective/follow-up/work to the same event log, and stops only after the judge returns complete.
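The shared-history claim comes down to two checks: the log object is the same one, and the earlier events survive as a prefix of the longer log. A minimal sketch of those checks follows, using a plain Python list as a stand-in for the conversation's event log. The helper `grew_in_place` and the event labels are hypothetical and only mirror the 3 -> 7 shape of the demo output.

```python
def grew_in_place(before: list[str], after: list[str]) -> bool:
    """Return True if `after` is `before` with new events appended at the end."""
    return len(after) > len(before) and after[: len(before)] == before


# Illustrative event labels only; the real log holds SDK event objects.
events = ["system prompt", "user: hello", "agent: hi"]
snapshot = list(events)       # copy taken after the main conversation turn
log_identity = id(events)

# The goal loop appends to the same list instead of creating a new one.
events.extend([
    "user: goal objective",
    "agent: first attempt",
    "user: judge follow-up",
    "agent: second attempt",
])

same_log = id(events) == log_identity
print(same_log, grew_in_place(snapshot, events), f"{len(snapshot)} -> {len(events)}")
```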
Test 2: Goal loop caps instead of silently succeeding when the judge does not approve
Step 1 — Reproduce / establish baseline (without the fix):
The same baseline import check above shows this behavior could not be exercised on main because openhands.sdk.conversation.goal did not exist.
Step 2 — Apply the PR's changes:
Used the checked-out PR branch and called the SDK directly from a short Python script with a scripted agent and scripted judge.
Step 3 — Re-run with the fix in place:
Ran a direct SDK script that imports run_goal, creates a Conversation, uses an unparseable first judge verdict, then a second incomplete verdict with max_iterations=2.
Relevant output:
judge_goal: could not parse verdict: 'not json at all'
Goal audit 1/2: score=0.00 complete=False
Goal audit 2/2: score=0.40 complete=False
outcome capped 2 False artifact missing
message_count 5
followup_present True
objective_present True
tail produce a verified artifact | first attempt | The goal is NOT yet complete (audit iteration 1). Outstanding: Judge verdict could not be parsed. ... | second attempt
This confirms the loop conservatively continues after an unparseable judge response, preserves the objective/follow-up in the real conversation history, and returns a capped outcome at the configured iteration limit instead of reporting success.
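The conservative handling of an unparseable verdict can be illustrated with a small parser that treats anything malformed as "not complete". This is a sketch under assumptions: `parse_verdict`, `ParsedVerdict`, and the JSON field names `complete`, `score`, and `feedback` are hypothetical, chosen to match the fields visible in the audit log lines rather than taken from the SDK's schema.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedVerdict:
    complete: bool
    score: float
    feedback: str


def parse_verdict(raw: str) -> ParsedVerdict:
    """Parse a JSON verdict; anything malformed counts as 'not complete'."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("verdict must be a JSON object")
        return ParsedVerdict(
            # Only a literal JSON true counts as completion.
            complete=data["complete"] is True,
            score=float(data.get("score", 0.0)),
            feedback=str(data.get("feedback", "")),
        )
    except (ValueError, KeyError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError.
        return ParsedVerdict(False, 0.0, "Judge verdict could not be parsed.")


bad = parse_verdict("not json at all")
good = parse_verdict('{"complete": true, "score": 1.0}')
```

Falling back to an incomplete verdict means a malformed judge response costs one extra iteration at worst, while the iteration cap still bounds the loop.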
Issues Found
- 🟡 Minor: The documented deterministic demo leaves a generated conversation persistence directory in the repository root after a normal run; I added an inline comment on the source line that caused git status to show ?? ce5aec49267c4d64aa7e4f88de8633bd/.
- ⚠️ CI / process: Validate PR description is failing. Per repository policy, the human-only PR description field must be updated by a human, not by this QA agent.
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
✅ QA Report: PASS
The new SDK /goal entry point works in deterministic SDK usage: it imports only on the PR branch, completes after judge approval, preserves the same conversation history, caps when still incomplete, and rejects empty objectives.
Does this PR achieve its stated goal?
Yes. The stated goal was to add SDK core for a judge-driven goal-completion loop that drives the existing Conversation, re-prompts from judge feedback, stops on completion or max_iterations, and keeps all work in shared history. I exercised that API with scripted SDK LLMs and the author’s deterministic demo; both showed complete after 2 judge rounds, history growth 3 -> 7 with original events preserved, objective/followup present in the same log, and capped behavior at max_iterations=2.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully and installed the uv-managed dev environment. |
| CI Status | mergeStateStatus=BLOCKED: 18 success, 1 skipped, 11 in progress, and 1 cancelled review-thread-gate check. I did not run tests locally. |
| Functional Verification | ✅ Deterministic SDK usage and .pr/goal_shared_history.py both exercised the new /goal behavior successfully. |
Functional Verification
Test 1: SDK /goal API behavior before and after the PR
Step 1 — Reproduce / establish baseline (without the feature):
Ran git checkout --detach origin/main; uv run python /tmp/qa_goal_behavior.py with a script that imports openhands.sdk.conversation.goal.run_goal and drives a scripted SDK Conversation:
Traceback (most recent call last):
File "/tmp/qa_goal_behavior.py", line 2, in <module>
from openhands.sdk.conversation.goal import run_goal
ModuleNotFoundError: No module named 'openhands.sdk.conversation.goal'
baseline_exit=1
This establishes the baseline: the SDK had no /goal module/API on origin/main.
Step 2 — Apply the PR's changes:
Checked out vasco/goal-sdk and reset to 5ee822da8f1d0a85515dd99ee4a5c4523c9fcb13.
Step 3 — Re-run with the PR in place:
Ran the same uv run python /tmp/qa_goal_behavior.py script. It created real SDK Conversation objects with deterministic TestLLMs, sent a normal message, ran run_goal(...), then exercised completion, capped, and invalid-input outcomes:
complete_outcome complete 2
shared_history 3 -> 7 True
objective_in_history True
followup_in_history True
capped_outcome capped 2
empty_objective_error Goal objective must not be empty.
pr_exit=0
This shows the new API works as claimed: the loop continued once on judge feedback, returned complete after the second audit, appended to the same history while preserving existing events, stopped as capped when the judge never completed within the cap, and rejected an empty objective.
Test 2: Author-provided shared-history demo
Step 1 — Baseline:
The same origin/main import failure above establishes that the demo’s /goal dependency does not exist before the PR.
Step 2 — Apply the PR's changes:
Used the PR branch at commit 5ee822da8f1d0a85515dd99ee4a5c4523c9fcb13.
Step 3 — Run the real demo command:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python .pr/goal_shared_history.py:
demo_exit=0
===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))
This independently confirms the user-facing example behaves as the PR description claims: /goal drives the existing conversation rather than forking a sidecar.
Issues Found
None.
Final verdict: PASS.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
xingyaoww
left a comment
LGTM as long as you can maybe add an example script and show that it works?
Nit: we could also make the GoalController a subclass of "Critic" too!
@OpenHands Create a page in OpenHands/Docs to document the new feature added in this PR. Follow the structure that the other docs pages use, and open a draft PR. The branch name should match the name of this PR's branch.
@VascoSch92 it looks like you haven't created an OpenHands account yet. Please sign up at OpenHands Cloud and try again.
I'm on it! VascoSch92 can track my progress at all-hands.dev
Summary
I've completed the user's request: create a docs page for the new
Checklist
Conciseness
The change set is minimal and scoped to exactly what was requested, with only two files touched:
No extraneous edits to
Tested locally that the example is working properly |
HUMAN:
I added the code to the SDK for the /goal command. You can find an example of how it works in the .pr folder. The idea is to follow the same architecture as the Critic, making the goal command an extension that can be applied to a conversation.
AGENT:
Why
The SDK has no built-in way to drive a long-running objective to verified completion: the agent finishes a turn, but nothing checks the goal is actually done. This adds a judge-driven /goal loop (#3569): the agent works, a second LLM judges whether the objective is complete, and the loop re-prompts with that feedback until it is (or a max_iterations cap is hit).
Summary
- judge_goal + GoalVerdict: the reusable kernel. It renders the conversation transcript (the agent system prompt is excluded to keep judge token cost down) and asks a judge LLM for a strict-JSON verdict.
- GoalController: transport-agnostic continue-vs-stop decision logic plus the iteration cap, with no I/O.
- run_goal: a thin synchronous driver over the controller (send objective → run agent → judge → re-prompt). It composes with any existing critic: the critic governs each inner run(), while this loop governs the overall objective.
No agent-server dependency; fully self-contained.
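The transcript-rendering step in the summary (skip the system prompt to save judge tokens) might look roughly like the following. This is a sketch only: `render_transcript` and the `(role, text)` tuple shape are assumptions for illustration, since the SDK's actual message and event types are not shown here.

```python
def render_transcript(messages: list[tuple[str, str]]) -> str:
    """Render (role, text) pairs as plain text, skipping system messages."""
    return "\n".join(
        f"{role}: {text}" for role, text in messages if role != "system"
    )


# Hypothetical conversation: the system prompt is present but not rendered.
messages = [
    ("system", "You are a coding agent with a long tool description."),
    ("user", "produce a verified artifact"),
    ("assistant", "first attempt"),
]
transcript = render_transcript(messages)
```

The judge then sees only the user and assistant turns, so its input does not grow with the length of the agent's system prompt.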
Issue Number
#3569
How to Test
Unit tests:
End-to-end, no API key required (deterministic scripted LLMs). This proves
the goal work lands in the same conversation history as the main chat (no fork):
Observed tail:
Run against a real LLM instead (creates files, runs pytest):
Video/Screenshots
N/A: library change. The .pr/goal_shared_history.py output above is the end-to-end evidence.
Type
Notes
Stacked work: the agent-server integration (HTTP endpoint, background loop,
goal-status events, stop/resume) is a follow-up PR opened against this branch.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22-slim
- golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:5982000-python
Run
All tags pushed for this build
About Multi-Architecture Support
5982000-python) is a multi-arch manifest supporting both amd64 and arm64
5982000-python-amd64) are also available if needed