feat: attach expected output in the DSL to seed platform evaluation sets by nsmnds · Pull Request #147 · flock-community/aigentic

nsmnds · 2026-06-05T10:50:41Z

Why

During an agent's development phase, the developer often already has the expected values (e.g. a folder of PDFs plus the expected extracted fields). Today the only way to seed annotations on the Aigentic Platform is to publish runs and then annotate each one manually, field by field, in the UI. This PR lets the developer attach the expected structured output directly in the agent DSL, so a published run is automatically added to a named evaluation set and scored, with no manual annotation step.

Scope v1: structured-output agents only. The wire contract is designed so tool-call expectations can be added later, but tool calls are not implemented now.

Pairs with the companion server-side PR in flock-community/aigentic-platform (same branch). The new evaluation field is optional, so the server is deployable independently — old clients omit it, old gateways ignore it.

Two entry points

Upfront — expected value known at start():

extractor.start(
    Attachment.Base64.pdf(pdf1Base64),
    expected = Expected(
        evaluationSet = "invoice-golden-set",
        output = InvoiceFields("INV-001", "D-100", "1250.00"),
    ),
)

Deferred / human-in-the-loop: the expected value is only known later, so it is attached by runId (e.g. after your own backend confirms or corrects the value):

val run = extractor.start(Attachment.Base64.pdf(pdf1Base64))
val runId = run.platformRunId ?: error("run was not published")
val confirmed: InvoiceFields = myBackend.review(run.outcome)
extractor.addToEvaluationSet(runId, "invoice-golden-set", expected = confirmed)

Both entry points compile to the same wire shape (evaluationSet plus expectedResponse as serialized JSON); the platform turns it into evaluation fields and annotations.
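To make the "same wire shape" point concrete, here is a minimal Kotlin sketch of how both entry points could reduce to one RunEvaluationDto. It assumes the output serializer can be treated as a plain (O) -> String function; upfrontEvaluation, deferredEvaluation, encodeInvoice, jsonString and the InvoiceFields property names are hypothetical. Only Expected, RunEvaluationDto, evaluationSet and expectedResponse are names taken from this PR.

```kotlin
// Hypothetical mirrors of the SDK wrapper and the wirespec DTO (names from the PR, shapes assumed).
data class Expected<O>(val evaluationSet: String, val output: O)
data class RunEvaluationDto(val evaluationSet: String, val expectedResponse: String)

// Shared step: encode the expected output with the agent's output serializer,
// modelled here as a plain function (assumption).
fun <O> Expected<O>.toDto(encode: (O) -> String): RunEvaluationDto =
    RunEvaluationDto(evaluationSet, encode(output))

// Upfront: start(..., expected = ...) attaches the DTO to the run payload; null means "no evaluation".
fun <O> upfrontEvaluation(expected: Expected<O>?, encode: (O) -> String): RunEvaluationDto? =
    expected?.toDto(encode)

// Deferred: addToEvaluationSet(runId, set, expected) sends the same DTO in a separate request.
fun <O> deferredEvaluation(evaluationSet: String, expected: O, encode: (O) -> String): RunEvaluationDto =
    Expected(evaluationSet, expected).toDto(encode)

// Example output type; the property names are hypothetical.
data class InvoiceFields(val invoiceNumber: String, val debtorId: String, val amount: String)

// Stand-in for the real outputSerializer: a tiny JSON string encoder that
// escapes only quotes, backslashes and newlines.
fun jsonString(s: String): String = buildString {
    append('"')
    for (c in s) {
        when (c) {
            '"' -> append("\\\"")
            '\\' -> append("\\\\")
            '\n' -> append("\\n")
            else -> append(c)
        }
    }
    append('"')
}

fun encodeInvoice(f: InvoiceFields): String =
    """{"invoiceNumber":${jsonString(f.invoiceNumber)},"debtorId":${jsonString(f.debtorId)},"amount":${jsonString(f.amount)}}"""
```

Because both paths call the same encoder, an expected value attached upfront and one confirmed later produce identical payloads, which is what lets the platform line them up against the run's own response.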

What changed

Contract (src/platform/wirespec/gateway.ws, kept identical to the platform repo):

type RunEvaluationDto { evaluationSet, expectedResponse } and type RunCreatedDto { runId }
optional evaluation: RunEvaluationDto? on RunDto
POST /gateway/runs 201 now returns RunCreatedDto (was Unit) so the client learns the run id
new POST /gateway/runs/{runId}/evaluation (AddToEvaluationSet) reusing RunEvaluationDto
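A hedged Kotlin mirror of the contract above, mainly to show why the change is backward compatible: evaluation defaults to null, so a client that never sets it sends the same RunDto as before. The agentName field stands in for the other RunDto fields and, like GatewayCall, is hypothetical; the real types are generated from gateway.ws by wirespec. The sketch also assumes run ids are URL-safe (e.g. UUIDs), since the path is built without escaping. Note that the refactor commit later in this PR renames the deferred path to /gateway/runs/{runId}/annotations.

```kotlin
// Hypothetical Kotlin mirrors of gateway.ws; the real classes are generated by wirespec.
data class RunEvaluationDto(val evaluationSet: String, val expectedResponse: String)
data class RunCreatedDto(val runId: String)

// agentName stands in for the existing RunDto fields (hypothetical). evaluation defaults to null,
// so clients that predate this PR keep sending the same payload.
data class RunDto(val agentName: String, val evaluation: RunEvaluationDto? = null)

// The two POST endpoints touched by this PR (type names hypothetical).
sealed interface GatewayCall {
    val method: String get() = "POST"
    val path: String

    // Creates the run; the 201 response body is now a RunCreatedDto instead of Unit.
    data class CreateRun(val body: RunDto) : GatewayCall {
        override val path: String get() = "/gateway/runs"
    }

    // Deferred submission reusing RunEvaluationDto; assumes runId needs no escaping.
    data class AddToEvaluationSet(val runId: String, val body: RunEvaluationDto) : GatewayCall {
        override val path: String get() = "/gateway/runs/$runId/evaluation"
    }
}
```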

Client:

new Expected<O>(evaluationSet, output) wrapper in core (the platform { } block and Platform interface are unchanged)
start(..., expected: Expected<O>? = null): a trailing, named parameter, serialized with the same outputSerializer that already encodes FinishedResultDto.response, so the expected JSON lines up with the run's own response on the server side
AgentRun.platformRunId: RunId? populated from the 201 RunCreatedDto (null-safe for old gateways that send an empty 201 body)
deferred Agent.addToEvaluationSet(runId, evaluationSet, expected) + Platform.addToEvaluationSet(...) + sealed EvaluationSubmitResult
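The null-safe handling of the 201 body and the sealed result type could look roughly like this. It is a sketch under assumptions: the JSON decoder is passed in as a function, any 2xx response counts as a successful submission, and the EvaluationSubmitResult cases plus the helpers platformRunIdFrom and toSubmitResult are hypothetical. Only RunCreatedDto, RunId and EvaluationSubmitResult are named in the PR.

```kotlin
// Hypothetical mirrors; names follow the PR text, shapes are assumptions.
data class RunCreatedDto(val runId: String)
data class RunId(val value: String)

// Old gateways answer POST /gateway/runs with an empty 201 body, so a missing or
// blank body yields a null run id instead of an exception.
fun platformRunIdFrom(body: String?, decode: (String) -> RunCreatedDto): RunId? =
    body?.takeIf { it.isNotBlank() }
        ?.let(decode)
        ?.let { RunId(it.runId) }

// Outcome of the deferred addToEvaluationSet call; the exact cases are assumed.
sealed interface EvaluationSubmitResult {
    object Success : EvaluationSubmitResult
    data class Error(val status: Int, val message: String) : EvaluationSubmitResult
}

// Assumption: any 2xx counts as success; everything else is surfaced as an Error value.
fun toSubmitResult(status: Int, body: String): EvaluationSubmitResult =
    if (status in 200..299) EvaluationSubmitResult.Success
    else EvaluationSubmitResult.Error(status, body)
```

Returning a sealed result rather than throwing keeps the deferred human-in-the-loop path explicit: the caller's backend can decide whether to retry or log a failed submission.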

Testing

./gradlew --no-build-cache clean :src:core:jvmTest :src:platform:jvmTest — green; spotlessCheck clean. New tests cover: mapper builds/omits RunEvaluationDto; start(expected = …) puts the serialized output + set name in the POST body; start() populates platformRunId (and stays null on an empty 201 body); addToEvaluationSet(...) POSTs to the /evaluation path.

https://claude.ai/code/session_01DRjhdMQLdNYJ6SY5ML1LUb

Generated by Claude Code

…ission Let developers attach an expected structured output so a published run is auto-stored as an evaluation set on the platform.

- Add Expected<O>(evaluationSet, output) wrapper in core
- start(..., expected = Expected(...)) threads the expected output to the gateway via the new RunDto.evaluation field (RunEvaluationDto)
- Capture the created run id from the 201 RunCreatedDto body and expose it as AgentRun.platformRunId (nullable; tolerates empty 201 bodies from old gateways)
- Add deferred Agent.addToEvaluationSet(runId, evaluationSet, expected) posting to /gateway/runs/{runId}/evaluation, with EvaluationSubmitResult
- Regenerate gateway wirespec types (RunEvaluationDto, RunCreatedDto, AddToEvaluationSet endpoint)


nsmnds requested review from Fputker, ceesjansenflock, sjorsdev and wilmveel as code owners June 5, 2026 10:50

refactor: refine SDK evaluation submission API

361069f

- make RunSentResult.Success.runId non-null; map an id-less 201 to an Error
- accept a plain String runId in Agent.addToEvaluationSet
- rename the deferred gateway endpoint to POST /gateway/runs/{runId}/annotations
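The first bullet changes the contract of the send result: instead of a Success carrying a nullable run id, a 201 without an id becomes an Error. A minimal sketch, assuming the run id has already been extracted from the body (or found missing) and that any non-201 status is also an error; the Error payload and the toRunSentResult helper are hypothetical.

```kotlin
// Hypothetical shape of the send result after the refactor; only the names come from the commit.
sealed interface RunSentResult {
    data class Success(val runId: String) : RunSentResult // runId is no longer nullable
    data class Error(val message: String) : RunSentResult
}

// A 201 without a usable run id is reported as an Error rather than a Success with a null id.
fun toRunSentResult(status: Int, runIdFromBody: String?): RunSentResult {
    // Assumption: any status other than 201 is treated as a failed send.
    if (status != 201) return RunSentResult.Error("unexpected status $status")
    val runId = runIdFromBody?.takeIf { it.isNotBlank() }
        ?: return RunSentResult.Error("201 response did not contain a run id")
    return RunSentResult.Success(runId)
}
```

With runId non-null on Success, a caller can pass it straight to addToEvaluationSet as a plain String, which matches the second bullet.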

nsmnds merged commit 4a28a9e into main Jun 5, 2026
5 checks passed

nsmnds deleted the claude/exciting-dijkstra-H42kd branch June 5, 2026 13:13

nsmnds merged 2 commits into main from claude/exciting-dijkstra-H42kd
