feat: attach expected output in the DSL to seed platform evaluation sets #147
Merged
…ission
Let developers attach an expected structured output so a published run is
auto-stored as an evaluation set on the platform.
- Add Expected<O>(evaluationSet, output) wrapper in core
- start(..., expected = Expected(...)) threads the expected output to the
gateway via the new RunDto.evaluation field (RunEvaluationDto)
- Capture the created run id from the 201 RunCreatedDto body and expose it as
AgentRun.platformRunId (nullable; tolerates empty 201 bodies from old gateways)
- Add deferred Agent.addToEvaluationSet(runId, evaluationSet, expected) posting
to /gateway/runs/{runId}/evaluation, with EvaluationSubmitResult
- Regenerate gateway wirespec types (RunEvaluationDto, RunCreatedDto,
AddToEvaluationSet endpoint)
- make RunSentResult.Success.runId non-null; map an id-less 201 to an Error
- accept a plain String runId in Agent.addToEvaluationSet
- rename the deferred gateway endpoint to POST /gateway/runs/{runId}/annotations
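
One possible reading of the `RunSentResult` item above, as a minimal sketch. The type shapes, the `Error` message text, and the helper name `toRunSentResult` are assumptions made for illustration; the real classes in the library may look different.

```kotlin
// Hypothetical stand-ins for the library types; only the names come from the commit message.
sealed interface RunSentResult {
    data class Success(val runId: String) : RunSentResult // runId is non-null by construction
    data class Error(val message: String) : RunSentResult
}

// Assumed shape of the decoded 201 body.
data class RunCreatedDto(val runId: String)

// Maps an HTTP status plus the decoded body (null when the body was empty) to a result.
fun toRunSentResult(status: Int, body: RunCreatedDto?): RunSentResult = when {
    status != 201 -> RunSentResult.Error("Unexpected status $status")
    body == null || body.runId.isBlank() -> RunSentResult.Error("201 response carried no run id")
    else -> RunSentResult.Success(body.runId)
}

fun main() {
    println(toRunSentResult(201, RunCreatedDto("run-42")))
    println(toRunSentResult(201, null))
}
```

With this shape, callers that receive `Success` never need a null check on the run id, and an old gateway answering with an empty 201 body surfaces as an explicit `Error`.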
Why
During an agent's development phase the developer often already has the expected values (e.g. a folder of PDFs plus the expected extracted fields). Today the only way to seed annotations on the Aigentic Platform is to publish runs and then manually annotate each one field-by-field in the UI. This lets the developer attach the expected structured output directly in the agent DSL, so a published run is automatically added to a named evaluation set and scored — no manual annotation step.
Scope v1: structured-output agents only. The wire contract is designed so tool-call expectations can be added later, but tool calls are not implemented now.
Two entry points
Upfront: the expected value is known at `start()`:

    extractor.start(
        Attachment.Base64.pdf(pdf1Base64),
        expected = Expected(
            evaluationSet = "invoice-golden-set",
            output = InvoiceFields("INV-001", "D-100", "1250.00"),
        ),
    )

Deferred / human-in-the-loop: only a `runId` is known later (e.g. after your own backend confirms or corrects the value), and the expected output is submitted via `Agent.addToEvaluationSet(runId, evaluationSet, expected)`.

Both compile to one uniform wire shape (`evaluationSet`, `expectedResponse` JSON); the platform turns it into evaluation fields + annotations.

What changed
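
To make the uniform wire shape from the two entry points concrete, here is a minimal sketch of how an `Expected<O>` could be reduced to a `RunEvaluationDto`. Everything beyond the type names is an assumption: the `String` field types, the `InvoiceFields` property names, the `toDto` helper, and the hand-rolled `invoiceToJson` stand-in for the agent's real output serializer.

```kotlin
// Hypothetical mirrors of types named in this PR; field types are assumptions.
data class Expected<O>(val evaluationSet: String, val output: O)
data class RunEvaluationDto(val evaluationSet: String, val expectedResponse: String)

// Property names are invented for illustration; the PR only shows positional arguments.
data class InvoiceFields(val invoiceNumber: String, val debtorNumber: String, val total: String)

// Stand-in for the agent's output serializer (hand-rolled JSON, no escaping).
fun invoiceToJson(fields: InvoiceFields): String =
    """{"invoiceNumber":"${fields.invoiceNumber}","debtorNumber":"${fields.debtorNumber}","total":"${fields.total}"}"""

// Both entry points reduce to this one conversion before anything is sent.
fun <O> Expected<O>.toDto(serialize: (O) -> String): RunEvaluationDto =
    RunEvaluationDto(evaluationSet, serialize(output))

fun main() {
    val expected = Expected("invoice-golden-set", InvoiceFields("INV-001", "D-100", "1250.00"))
    // Upfront: the DTO would travel inside the run payload (RunDto.evaluation).
    val upfront = expected.toDto(::invoiceToJson)
    // Deferred: the same DTO would be posted on its own for an existing run id.
    val deferred = expected.toDto(::invoiceToJson)
    println(upfront)
    println(upfront == deferred)
}
```

Passing the serializer in as a function keeps the sketch free of any particular JSON library while still showing why the expected JSON and the run's own response can line up: both would come from the same encoder.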
Contract (`src/platform/wirespec/gateway.ws`, kept identical to the platform repo):
- `type RunEvaluationDto { evaluationSet, expectedResponse }` and `type RunCreatedDto { runId }`
- `evaluation: RunEvaluationDto?` on `RunDto`
- `POST /gateway/runs` `201` now returns `RunCreatedDto` (was `Unit`) so the client learns the run id
- `POST /gateway/runs/{runId}/evaluation` (`AddToEvaluationSet`) reusing `RunEvaluationDto`

Client:
- `Expected<O>(evaluationSet, output)` wrapper in core (the `platform { }` block and `Platform` interface are unchanged)
- `start(..., expected: Expected<O>? = null)`: trailing named parameter; serialized with the same `outputSerializer` that already encodes `FinishedResultDto.response`, so the expected JSON lines up with the run's own response server-side
- `AgentRun.platformRunId: RunId?` populated from the `201 RunCreatedDto` (null-safe for old gateways that send an empty 201 body)
- `Agent.addToEvaluationSet(runId, evaluationSet, expected)` + `Platform.addToEvaluationSet(...)` + sealed `EvaluationSubmitResult`

Testing
`./gradlew --no-build-cache clean :src:core:jvmTest :src:platform:jvmTest` is green; `spotlessCheck` is clean. New tests cover:
- the mapper builds or omits `RunEvaluationDto`
- `start(expected = …)` puts the serialized output + set name in the POST body
- `start()` populates `platformRunId` (and it stays null on an empty 201 body)
- `addToEvaluationSet(...)` POSTs to the `/evaluation` path

https://claude.ai/code/session_01DRjhdMQLdNYJ6SY5ML1LUb
Generated by Claude Code
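
For readers who want a feel for the last test listed under Testing, the following is a rough sketch of a path assertion against a recording fake. `RecordingGateway`, the free-standing `addToEvaluationSet` function, and the treatment of any 2xx status as success are all hypothetical; the real tests and client code are not shown on this page. The path uses the `/evaluation` suffix from the Testing notes, while the commit list mentions a later rename to `/annotations`.

```kotlin
// Hypothetical stand-in for the sealed result type named in the PR.
sealed interface EvaluationSubmitResult {
    object Success : EvaluationSubmitResult
    data class Error(val message: String) : EvaluationSubmitResult
}

// Fake gateway that records POST paths instead of doing HTTP.
class RecordingGateway(private val status: Int = 200) {
    val postedPaths = mutableListOf<String>()

    fun post(path: String, body: String): Int {
        postedPaths += path
        return status
    }
}

// Assumed client behaviour: POST the already-serialized body to the run's evaluation path.
fun addToEvaluationSet(gateway: RecordingGateway, runId: String, body: String): EvaluationSubmitResult {
    val status = gateway.post("/gateway/runs/$runId/evaluation", body)
    return if (status in 200..299) {
        EvaluationSubmitResult.Success
    } else {
        EvaluationSubmitResult.Error("Gateway answered $status")
    }
}

fun main() {
    val gateway = RecordingGateway()
    val result = addToEvaluationSet(gateway, "run-42", "{}")
    println(result is EvaluationSubmitResult.Success)
    println(gateway.postedPaths)
}
```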