feat: attach expected output in the DSL to seed platform evaluation sets#147

Merged
nsmnds merged 2 commits into main from claude/exciting-dijkstra-H42kd on Jun 5, 2026
@nsmnds nsmnds commented Jun 5, 2026

Collaborator

Why

During an agent's development phase the developer often already has the expected values (e.g. a folder of PDFs plus the expected extracted fields). Today the only way to seed annotations on the Aigentic Platform is to publish runs and then manually annotate each one field-by-field in the UI. This lets the developer attach the expected structured output directly in the agent DSL, so a published run is automatically added to a named evaluation set and scored — no manual annotation step.

Scope v1: structured-output agents only. The wire contract is designed so tool-call expectations can be added later, but tool calls are not implemented now.

Pairs with the companion server-side PR in flock-community/aigentic-platform (same branch). The new evaluation field is optional, so the server is deployable independently — old clients omit it, old gateways ignore it.

Two entry points

Upfront — expected value known at start():

extractor.start(
    Attachment.Base64.pdf(pdf1Base64),
    expected = Expected(
        evaluationSet = "invoice-golden-set",
        output = InvoiceFields("INV-001", "D-100", "1250.00"),
    ),
)

Deferred / human-in-the-loop — only a runId known later (e.g. after your own backend confirms/corrects the value):

val run = extractor.start(Attachment.Base64.pdf(pdf1Base64))
val runId = run.platformRunId ?: error("run was not published")
val confirmed: InvoiceFields = myBackend.review(run.outcome)
extractor.addToEvaluationSet(runId, "invoice-golden-set", expected = confirmed)

Both compile to one uniform wire shape (evaluationSet, expectedResponse-JSON); the platform turns it into evaluation fields + annotations.
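
As a rough illustration of how both entry points can collapse into the same wire shape, here is a minimal, self-contained sketch. The types below are hypothetical stand-ins, not the library's real classes: the field names of InvoiceFields, the String type of expectedResponse, the helper toEvaluationDto and the hand-rolled invoiceJson serializer are all assumptions made for the example.

```kotlin
// Hypothetical stand-ins for illustration; the real types live in the client library.
data class Expected<O>(val evaluationSet: String, val output: O)
data class RunEvaluationDto(val evaluationSet: String, val expectedResponse: String)

// Field names are assumed; the PR only shows positional constructor arguments.
data class InvoiceFields(val invoiceNumber: String, val debtorNumber: String, val total: String)

// One mapping shared by both entry points: the expected output is serialized
// by the caller-supplied serializer and paired with the evaluation set name.
fun <O> toEvaluationDto(expected: Expected<O>, serialize: (O) -> String): RunEvaluationDto =
    RunEvaluationDto(expected.evaluationSet, serialize(expected.output))

// Hand-rolled JSON for the sketch only (no escaping); the real client reuses its outputSerializer.
fun invoiceJson(f: InvoiceFields): String =
    """{"invoiceNumber":"${f.invoiceNumber}","debtorNumber":"${f.debtorNumber}","total":"${f.total}"}"""

fun main() {
    val fields = InvoiceFields("INV-001", "D-100", "1250.00")
    // Upfront: the expectation is known when the run starts.
    val upfront = toEvaluationDto(Expected("invoice-golden-set", fields), ::invoiceJson)
    // Deferred: the same value arrives later, after review, and maps identically.
    val deferred = toEvaluationDto(Expected("invoice-golden-set", fields.copy()), ::invoiceJson)
    println(upfront == deferred)
}
```

Passing the serializer in as a parameter mirrors the point made under "What changed": the expected JSON only lines up with the run's response if both go through the same encoder.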

What changed

Contract (src/platform/wirespec/gateway.ws, kept identical to the platform repo):

  • type RunEvaluationDto { evaluationSet, expectedResponse } and type RunCreatedDto { runId }
  • optional evaluation: RunEvaluationDto? on RunDto
  • POST /gateway/runs 201 now returns RunCreatedDto (was Unit) so the client learns the run id
  • new POST /gateway/runs/{runId}/evaluation (AddToEvaluationSet) reusing RunEvaluationDto
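
The contract bullets above could be mirrored in Kotlin roughly as follows. This is a sketch under assumptions, not the generated wirespec code: all field types are assumed to be String, RunDto's existing fields are left out, and evaluationPath is a hypothetical helper. The path follows the description above; the second commit message mentions renaming the deferred endpoint to /annotations.

```kotlin
// Assumed field types (String); the wirespec definition may differ.
data class RunEvaluationDto(val evaluationSet: String, val expectedResponse: String)
data class RunCreatedDto(val runId: String)

// Only the new optional field is modeled here; RunDto's existing fields are omitted.
// Defaulting to null is what lets old clients simply leave the field out.
data class RunDto(val evaluation: RunEvaluationDto? = null)

// Hypothetical helper building the deferred endpoint path as written in the description.
fun evaluationPath(runId: String): String = "/gateway/runs/$runId/evaluation"

fun main() {
    val legacy = RunDto() // an old client: no evaluation attached
    val seeded = RunDto(RunEvaluationDto("invoice-golden-set", """{"invoiceNumber":"INV-001"}"""))
    println(legacy.evaluation == null)
    println(seeded.evaluation?.evaluationSet)
    println(evaluationPath("run-123"))
}
```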

Client:

  • new Expected<O>(evaluationSet, output) wrapper in core (the platform { } block and Platform interface are unchanged)
  • start(..., expected: Expected<O>? = null) — trailing name-bound param; serialized with the same outputSerializer that already encodes FinishedResultDto.response, so the expected JSON lines up with the run's own response server-side
  • AgentRun.platformRunId: RunId? populated from the 201 RunCreatedDto (null-safe for old gateways that send an empty 201 body)
  • deferred Agent.addToEvaluationSet(runId, evaluationSet, expected) + Platform.addToEvaluationSet(...) + sealed EvaluationSubmitResult
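
A caller-side view of the two nullable or sealed pieces might look like the sketch below. The variants of EvaluationSubmitResult are not listed in this PR, so Success and Error here are assumptions, as are the helper names runIdFrom201Body and describe and the regex-based body parsing (the real client decodes the generated RunCreatedDto).

```kotlin
// Assumed variants; the PR only states that EvaluationSubmitResult is sealed.
sealed interface EvaluationSubmitResult {
    object Success : EvaluationSubmitResult
    data class Error(val message: String) : EvaluationSubmitResult
}

// Old gateways answer 201 with an empty body, so the run id may be absent.
// Regex extraction is for the sketch only, not how the client parses the body.
fun runIdFrom201Body(body: String?): String? {
    if (body.isNullOrBlank()) return null
    return Regex("\"runId\"\\s*:\\s*\"([^\"]+)\"").find(body)?.groupValues?.get(1)
}

// Exhaustive `when`: adding a new variant later forces callers to handle it.
fun describe(result: EvaluationSubmitResult): String = when (result) {
    EvaluationSubmitResult.Success -> "added to evaluation set"
    is EvaluationSubmitResult.Error -> "failed: ${result.message}"
}

fun main() {
    println(runIdFrom201Body("""{"runId":"run-123"}""")) // new gateway
    println(runIdFrom201Body(""))                        // old gateway, empty 201 body
    println(describe(EvaluationSubmitResult.Error("run not found")))
}
```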

Testing

./gradlew --no-build-cache clean :src:core:jvmTest :src:platform:jvmTest green; spotlessCheck clean. New tests cover: mapper builds/omits RunEvaluationDto; start(expected = …) puts the serialized output + set name in the POST body; start() populates platformRunId (and stays null on an empty 201 body); addToEvaluationSet(...) POSTs to the /evaluation path.

https://claude.ai/code/session_01DRjhdMQLdNYJ6SY5ML1LUb


Generated by Claude Code

…ission

Let developers attach an expected structured output so a published run is
auto-stored as an evaluation set on the platform.

- Add Expected<O>(evaluationSet, output) wrapper in core
- start(..., expected = Expected(...)) threads the expected output to the
  gateway via the new RunDto.evaluation field (RunEvaluationDto)
- Capture the created run id from the 201 RunCreatedDto body and expose it as
  AgentRun.platformRunId (nullable; tolerates empty 201 bodies from old gateways)
- Add deferred Agent.addToEvaluationSet(runId, evaluationSet, expected) posting
  to /gateway/runs/{runId}/evaluation, with EvaluationSubmitResult
- Regenerate gateway wirespec types (RunEvaluationDto, RunCreatedDto,
  AddToEvaluationSet endpoint)
- make RunSentResult.Success.runId non-null; map an id-less 201 to an Error
- accept a plain String runId in Agent.addToEvaluationSet
- rename the deferred gateway endpoint to POST /gateway/runs/{runId}/annotations
@nsmnds nsmnds merged commit 4a28a9e into main Jun 5, 2026
5 checks passed
@nsmnds nsmnds deleted the claude/exciting-dijkstra-H42kd branch June 5, 2026 13:13