
feat(sdk): add /goal SDK core (judge-driven goal-completion loop) #3769

Merged
VascoSch92 merged 7 commits into main from vasco/goal-sdk on Jun 18, 2026
Conversation

VascoSch92 (Member) commented Jun 17, 2026

HUMAN:

I added the code to the SDK for the /goal command. You can find an example of how it works in the .pr folder. The idea is to follow the same architecture as the Critic, making the goal command an extension that can be applied to a conversation.

  • A human has tested these changes.

AGENT:

Why

The SDK has no built-in way to drive a long-running objective to verified
completion: the agent finishes a turn, but nothing checks the goal is actually
done. This adds a judge-driven /goal loop (#3569) — the agent works, a second
LLM judges whether the objective is complete, and the loop re-prompts with that
feedback until it is (or a max_iterations cap is hit).

Summary

  • judge_goal + GoalVerdict: the reusable kernel — renders the conversation
    transcript (the agent system prompt is excluded to keep judge token cost down)
    and asks a judge LLM for a strict-JSON verdict.
  • GoalController: transport-agnostic continue-vs-stop decision logic + the
    iteration cap, with no I/O.
  • run_goal: a thin synchronous driver over the controller (send objective →
    run agent → judge → re-prompt). It composes with any existing critic — the
    critic governs each inner run(), this loop governs the overall objective.

No agent-server dependency; fully self-contained.
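
The split between decision logic and driver described above can be illustrated with a minimal, self-contained sketch. This is not the SDK's code: `Verdict`, `LoopController`, and `run_loop` are hypothetical stand-ins for `GoalVerdict`, `GoalController`, and `run_goal`, and the agent and judge are plain callables rather than LLMs. The real signatures may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Hypothetical stand-in for the PR's GoalVerdict."""
    complete: bool
    score: float
    feedback: str


@dataclass
class LoopController:
    """Pure continue-vs-stop logic plus the iteration cap (no I/O)."""
    max_iterations: int
    iteration: int = 0

    def decide(self, verdict: Verdict) -> str:
        self.iteration += 1
        if verdict.complete:
            return "complete"
        if self.iteration >= self.max_iterations:
            return "capped"
        return "continue"


def run_loop(
    objective: str,
    run_agent: Callable[[str], str],
    judge: Callable[[str, list], Verdict],
    max_iterations: int = 5,
):
    """Send objective -> run agent -> judge -> re-prompt, until done or capped."""
    if not objective.strip():
        raise ValueError("Goal objective must not be empty.")
    controller = LoopController(max_iterations)
    history = [objective]          # one shared log, appended to in place
    prompt = objective
    while True:
        history.append(run_agent(prompt))
        verdict = judge(objective, history)
        status = controller.decide(verdict)
        if status != "continue":
            return status, controller.iteration, history
        # Feed the judge's feedback back to the agent as the next prompt.
        prompt = f"The goal is NOT yet complete. Outstanding: {verdict.feedback}"
        history.append(prompt)
```

Because the controller performs no I/O, the same decision logic could in principle sit behind a synchronous driver like this one or behind a server-side loop, which is the motivation the summary gives for keeping it transport-agnostic.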

Issue Number

#3569

How to Test

Unit tests:

uv run pytest tests/sdk/conversation/goal/ -q

End-to-end, no API key required (deterministic scripted LLMs). This proves
the goal work lands in the same conversation history as the main chat (no fork):

uv run python .pr/goal_shared_history.py

Observed tail:

===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))

Run against a real LLM instead (creates files, runs pytest):

GOAL_DEMO_REAL=1 LLM_API_KEY=sk-... LLM_MODEL=gpt-5.5 uv run python .pr/goal_shared_history.py

Video/Screenshots

N/A — library change. The .pr/goal_shared_history.py output above is the
end-to-end evidence.

Type

  • Feature

Notes

Stacked work: the agent-server integration (HTTP endpoint, background loop,
goal-status events, stop/resume) is a follow-up PR opened against this branch.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:5982000-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-5982000-python \
  ghcr.io/openhands/agent-server:5982000-python

All tags pushed for this build

ghcr.io/openhands/agent-server:5982000-golang-amd64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-golang-amd64
ghcr.io/openhands/agent-server:vasco-goal-sdk-golang-amd64
ghcr.io/openhands/agent-server:5982000-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:5982000-golang-arm64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-golang-arm64
ghcr.io/openhands/agent-server:vasco-goal-sdk-golang-arm64
ghcr.io/openhands/agent-server:5982000-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:5982000-java-amd64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-java-amd64
ghcr.io/openhands/agent-server:vasco-goal-sdk-java-amd64
ghcr.io/openhands/agent-server:5982000-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:5982000-java-arm64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-java-arm64
ghcr.io/openhands/agent-server:vasco-goal-sdk-java-arm64
ghcr.io/openhands/agent-server:5982000-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:5982000-python-amd64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-python-amd64
ghcr.io/openhands/agent-server:vasco-goal-sdk-python-amd64
ghcr.io/openhands/agent-server:5982000-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:5982000-python-arm64
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-python-arm64
ghcr.io/openhands/agent-server:vasco-goal-sdk-python-arm64
ghcr.io/openhands/agent-server:5982000-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:5982000-golang
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-golang
ghcr.io/openhands/agent-server:vasco-goal-sdk-golang
ghcr.io/openhands/agent-server:5982000-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:5982000-java
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-java
ghcr.io/openhands/agent-server:vasco-goal-sdk-java
ghcr.io/openhands/agent-server:5982000-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:5982000-python
ghcr.io/openhands/agent-server:59820007b94230d26fb2b3bb4a0108c76ff7e327-python
ghcr.io/openhands/agent-server:vasco-goal-sdk-python
ghcr.io/openhands/agent-server:5982000-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 5982000-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 5982000-python-amd64) are also available if needed

Add openhands.sdk.conversation.goal: a conversation-level "/goal" driver
that pursues an objective by running the agent, judging completion with a
second LLM, and re-prompting until the goal is done or a cap is reached.

- judge_goal + GoalVerdict: the reusable objective+transcript -> verdict
  kernel (renders the transcript, excluding the system prompt, and asks a
  judge LLM for a strict-JSON verdict).
- GoalController: transport-agnostic continue-vs-stop decision logic and
  the iteration cap.
- run_goal: a thin synchronous driver over the controller that composes
  with any existing critic (the critic governs each inner run(); this loop
  governs the overall objective).

Self-contained, with no agent-server dependency. Includes a runnable demo
under .pr/ proving the goal work lands in the same conversation history.

Relates to #3569.

github-actions Bot (Contributor) commented Jun 17, 2026

PR Artifacts Cleaned Up

The .pr/ directory has been automatically removed.

github-actions Bot (Contributor) commented Jun 17, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

github-actions Bot (Contributor) commented Jun 17, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

VascoSch92 changed the title from "feat: add /goal SDK core (judge-driven goal-completion loop)" to "feat(sdk): add /goal SDK core (judge-driven goal-completion loop)" Jun 17, 2026
VascoSch92 marked this pull request as draft June 17, 2026 08:29

github-actions Bot (Contributor) commented Jun 17, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-sdk/openhands/sdk/conversation/goal | | | | |
| judge.py | 50 | 2 | 96% | 113–114 |
| TOTAL | 32149 | 8816 | 72% | |

all-hands-bot (Collaborator) left a comment

⚠️ QA Report: PASS WITH ISSUES

The new SDK /goal entrypoint worked end-to-end in deterministic user-style runs: it drove the same conversation through judge feedback, completed after re-prompting, and capped when the judge never approved.

Does this PR achieve its stated goal?

Yes. On main, importing openhands.sdk.conversation.goal fails because no built-in /goal SDK module exists; on this PR, the documented demo and an independent SDK script both run successfully. The demo showed the goal objective and follow-up appended to the same Conversation history (3 -> 7 events, same conversation id, complete after 2 rounds), and the cap script showed conservative handling of an unparseable judge response plus status=capped at max_iterations=2.

| Phase | Result |
|---|---|
| Environment Setup | make build completed successfully; no tests/linters/pre-commit were run. |
| CI Status | ⚠️ 19 checks passing, 10 pending, 1 failing: Validate PR description (human-only PR description field). |
| Functional Verification | ✅ SDK /goal import, shared-history loop, completion, follow-up, and cap behavior were exercised via real uv run python commands. |

Functional Verification

Test 1: Baseline main does not have a /goal SDK entrypoint

Step 1 — Reproduce / establish baseline (without the fix):
Ran cd /tmp/qa-goal-main && uv run python - <<'PY' ... from openhands.sdk.conversation.goal import run_goal ... PY against a detached origin/main worktree:

ModuleNotFoundError: No module named 'openhands.sdk.conversation.goal'

This establishes the pre-PR state: the SDK had no importable built-in /goal core.

Step 2 — Apply the PR's changes:
Used the checked-out PR branch vasco/goal-sdk at commit c32f5dc99430ef9eae52b1ceb490a4e54e3e5bf5.

Step 3 — Re-run with the fix in place:
Ran the documented deterministic demo command:

OPENHANDS_SUPPRESS_BANNER=1 uv run python .pr/goal_shared_history.py

Relevant output:

mode: DETERMINISTIC (scripted TestLLM)
===== AFTER MAIN CONVERSATION TURN =====
conversation id : ce5aec49-267c-4d64-aa7e-4f88de8633bd
total events    : 3
...
Goal audit 1/5: score=0.30 complete=False
Goal audit 2/5: score=1.00 complete=True
===== AFTER /goal LOOP (SAME CONVERSATION) =====
conversation id : ce5aec49-267c-4d64-aa7e-4f88de8633bd
total events    : 7
...
===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))

This confirms the PR delivers the main feature claim: run_goal uses the existing conversation, appends the objective/follow-up/work to the same event log, and stops only after the judge returns complete.
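
The checks behind the PROOF block can be sketched in a few lines. This is an illustrative reconstruction, not the demo script itself: `shared_history_checks` is a hypothetical helper, and a plain Python list of strings stands in for the conversation's event log.

```python
def shared_history_checks(before: list, after: list, objective: str) -> dict:
    """Compare a snapshot taken before the goal loop with the log afterwards."""
    n = len(before)
    return {
        "grew_in_place": len(after) > n,
        # Earlier events must survive untouched as a prefix of the log.
        "prior_events_preserved": after[:n] == before,
        # The objective must appear among the newly appended events.
        "objective_in_log": objective in after[n:],
    }


# Stand-in event log for the main conversation turn (3 events).
log = ["system prompt", "user: hello", "agent: hi"]
snapshot = list(log)
log_id = id(log)

# The goal loop appends to the SAME list (no copy, no fork): 3 -> 7 events.
log.extend(["objective", "agent: attempt 1", "follow-up", "agent: attempt 2"])

checks = shared_history_checks(snapshot, log, "objective")
same_object = id(log) == log_id
```

With this stand-in data every entry in `checks` is `True` and `same_object` holds, mirroring the "same conversation id" and "event log GREW in place" lines of the demo output.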

Test 2: Goal loop caps instead of silently succeeding when the judge does not approve

Step 1 — Reproduce / establish baseline (without the fix):
The same baseline import check above shows this behavior could not be exercised on main because openhands.sdk.conversation.goal did not exist.

Step 2 — Apply the PR's changes:
Used the checked-out PR branch and called the SDK directly from a short Python script with a scripted agent and scripted judge.

Step 3 — Re-run with the fix in place:
Ran a direct SDK script that imports run_goal, creates a Conversation, uses an unparseable first judge verdict, then a second incomplete verdict with max_iterations=2.

Relevant output:

judge_goal: could not parse verdict: 'not json at all'
Goal audit 1/2: score=0.00 complete=False
Goal audit 2/2: score=0.40 complete=False
outcome capped 2 False artifact missing
message_count 5
followup_present True
objective_present True
tail produce a verified artifact | first attempt | The goal is NOT yet complete (audit iteration 1). Outstanding: Judge verdict could not be parsed. ... | second attempt

This confirms the loop conservatively continues after an unparseable judge response, preserves the objective/follow-up in the real conversation history, and returns a capped outcome at the configured iteration limit instead of reporting success.
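
The conservative handling observed here (an unparseable reply counts as "not complete" with a score of 0.00) can be sketched as follows. The field names `complete`, `score`, and `feedback` are assumptions inferred from the log lines above, and `parse_verdict` is a hypothetical helper rather than the SDK's parser.

```python
import json


def parse_verdict(raw: str) -> dict:
    """Parse a judge reply; treat anything malformed as 'not complete'."""
    fallback = {
        "complete": False,
        "score": 0.0,
        "feedback": "Judge verdict could not be parsed.",
    }
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    # Require a JSON object with a genuine boolean 'complete' field.
    if not isinstance(data, dict) or not isinstance(data.get("complete"), bool):
        return fallback
    score = data.get("score", 0.0)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        score = 0.0
    return {
        "complete": data["complete"],
        "score": float(score),
        "feedback": str(data.get("feedback", "")),
    }
```

Failing closed means a noisy judge can at worst cost extra iterations up to the cap; it can never cause the loop to report success by accident.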

Issues Found

  • 🟡 Minor: The documented deterministic demo leaves a generated conversation persistence directory in the repository root after a normal run; I added an inline comment on the source line that caused git status to show ?? ce5aec49267c4d64aa7e4f88de8633bd/.
  • ⚠️ CI / process: Validate PR description is failing. Per repository policy, the human-only PR description field must be updated by a human, not by this QA agent.

This review was created by an AI agent (OpenHands) on behalf of the user.

Comment thread .pr/goal_shared_history.py Outdated
VascoSch92 marked this pull request as ready for review June 17, 2026 13:17

all-hands-bot (Collaborator) left a comment

✅ QA Report: PASS

The new SDK /goal entry point works in deterministic SDK usage: it imports only on the PR branch, completes after judge approval, preserves the same conversation history, caps when still incomplete, and rejects empty objectives.

Does this PR achieve its stated goal?

Yes. The stated goal was to add SDK core for a judge-driven goal-completion loop that drives the existing Conversation, re-prompts from judge feedback, stops on completion or max_iterations, and keeps all work in shared history. I exercised that API with scripted SDK LLMs and the author’s deterministic demo; both showed complete after 2 judge rounds, history growth 3 -> 7 with original events preserved, objective/followup present in the same log, and capped behavior at max_iterations=2.

| Phase | Result |
|---|---|
| Environment Setup | make build completed successfully and installed the uv-managed dev environment. |
| CI Status | ⚠️ Snapshot showed mergeStateStatus=BLOCKED: 18 success, 1 skipped, 11 in progress, and 1 cancelled review-thread-gate check. I did not run tests locally. |
| Functional Verification | ✅ Deterministic SDK usage and .pr/goal_shared_history.py both exercised the new /goal behavior successfully. |

Functional Verification

Test 1: SDK /goal API behavior before and after the PR

Step 1 — Reproduce / establish baseline (without the feature):
Ran git checkout --detach origin/main; uv run python /tmp/qa_goal_behavior.py with a script that imports openhands.sdk.conversation.goal.run_goal and drives a scripted SDK Conversation:

Traceback (most recent call last):
  File "/tmp/qa_goal_behavior.py", line 2, in <module>
    from openhands.sdk.conversation.goal import run_goal
ModuleNotFoundError: No module named 'openhands.sdk.conversation.goal'
baseline_exit=1

This establishes the baseline: the SDK had no /goal module/API on origin/main.

Step 2 — Apply the PR's changes:
Checked out vasco/goal-sdk and reset to 5ee822da8f1d0a85515dd99ee4a5c4523c9fcb13.

Step 3 — Re-run with the PR in place:
Ran the same uv run python /tmp/qa_goal_behavior.py script. It created real SDK Conversation objects with deterministic TestLLMs, sent a normal message, ran run_goal(...), then exercised completion, capped, and invalid-input outcomes:

complete_outcome complete 2
shared_history 3 -> 7 True
objective_in_history True
followup_in_history True
capped_outcome capped 2
empty_objective_error Goal objective must not be empty.
pr_exit=0

This shows the new API works as claimed: the loop continued once on judge feedback, returned complete after the second audit, appended to the same history while preserving existing events, stopped as capped when the judge never completed within the cap, and rejected an empty objective.

Test 2: Author-provided shared-history demo

Step 1 — Baseline:
The same origin/main import failure above establishes that the demo’s /goal dependency does not exist before the PR.

Step 2 — Apply the PR's changes:
Used the PR branch at commit 5ee822da8f1d0a85515dd99ee4a5c4523c9fcb13.

Step 3 — Run the real demo command:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python .pr/goal_shared_history.py:

demo_exit=0
===== PROOF (shared history) =====
same conversation id .............. True
only one Conversation object ...... True (no fork was created)
event log GREW in place ........... 3 -> 7
main-convo events still present ... True
goal objective is in THIS log ..... True
goal outcome ...................... complete (after 2 round(s))

This independently confirms the user-facing example behaves as the PR description claims: /goal drives the existing conversation rather than forking a sidecar.

Issues Found

None.

Final verdict: PASS.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

xingyaoww (Member) left a comment

LGTM as long as you can maybe add an example script and show that it works?

Nit: we could also make the GoalController a subclass of "Critic" too!

VascoSch92 (Member, Author) commented

@OpenHands Create a page in OpenHands/Docs to document the new feature added in this PR. Follow the structure that the other docs pages use. Also, open a draft PR whose branch name matches the name of this PR's branch.

openhands-development commented

@VascoSch92 it looks like you haven't created an OpenHands account yet. Please sign up at OpenHands Cloud and try again.

openhands-ai Bot commented Jun 18, 2026

I'm on it! VascoSch92 can track my progress at all-hands.dev

openhands-ai Bot commented Jun 18, 2026

Summary

I've completed the user's request: create a docs page for the new /goal feature in OpenHands/Docs, following the structure of other docs pages, and open a draft PR with a branch matching the SDK PR's branch name.

Checklist

  • Docs page created: sdk/guides/convo-goal.mdx, modeled after sdk/guides/critic.mdx and sdk/guides/convo-fork.mdx (same frontmatter, RunExampleCode snippet, section layout: Overview → How It Works → Quick Start → Result tables → Parameters → Composing with a Critic → Lower-Level Building Blocks → Ready-to-run Example → Next Steps).
  • Followed existing structure — frontmatter title/description, leading "ready-to-run example" pointer, Note callouts, parameter tables, embedded example linked to the SDK PR's examples/01_standalone_sdk/54_goal_completion_loop.py.
  • Navigation updated — added sdk/guides/convo-goal to docs.json under the Conversation Features group (where conversation-level drivers like convo-fork, convo-ask-agent, and hooks already live).
  • JSON validated — confirmed docs.json still parses.
  • Branch name matches the SDK PR: vasco/goal-sdk.
  • Draft PR opened: "docs(sdk): document /goal judge-driven goal-completion loop" (docs#580; the repo was renamed Docs → docs, and GitHub redirected the push). The PR description links back to SDK PR #3769, "feat(sdk): add /goal SDK core (judge-driven goal-completion loop)", and includes the AI-disclosure note on behalf of @VascoSch92.

Conciseness

The change set is minimal and scoped to exactly what was requested — only two files touched:

  1. sdk/guides/convo-goal.mdx (new)
  2. docs.json (one line added under "Conversation Features")

No extraneous edits to llms.txt/llms-full.txt (those are auto-generated by Mintlify) and no changes to unrelated guides.

VascoSch92 (Member, Author) commented

Tested locally that the example is working properly

======================================================================
Goal complete after 2 audit round(s).
Judge score: 1.00
Workspace: /var/folders/27/789vf87x4dn_zh358_3nk83m0000gq/T/goal_demo_idx2wyka
======================================================================
EXAMPLE_COST: 0.189634

VascoSch92 merged commit cf56307 into main Jun 18, 2026
36 of 37 checks passed
VascoSch92 deleted the vasco/goal-sdk branch June 18, 2026 09:54