Show and tell: Agent Execution Proof — structured audit records for tool calls #1

telleroutlook · 2026-06-26T01:23:16Z

telleroutlook
Jun 26, 2026
Maintainer


Hey everyone 👋

We've been working on something we're calling Agent Execution Proof (AEP) — a structured, schema-versioned audit record that captures every tool call an AI agent makes, along with the policy decision that allowed it, the budget it consumed, and the pre/post state digests.

The motivation: logs tell you what happened, but they can't prove in what order things happened or whether the policy gate ran before or after the tool. In multi-agent systems, that distinction matters.

What it looks like in practice

import { AEPEmitter } from "@wasmagent/aep";

const emitter = new AEPEmitter({ run_id: "run-abc", repo_commit: "5c1102f" });

emitter.addAction({
  tool_name: "fs_write",
  state_changing: true,
  capability_decision: { decision: "allow", reason_code: "policy:default-v1" },
  precondition_digest: "sha256:a1b2c3...",
  post_state_digest:   "sha256:d4e5f6...",
});

emitter.setBudgetLedger({
  token_budget: { limit: 4000, spent: 142 },
  risk_budget:  { limit: 1.0,  spent: 0.2 },
});

const record = emitter.build();
// Zod-validated AEPRecord with schema_version, actions[], budget_ledger, etc.

The record is versioned (aep/v0.1, aep/v0.2) and feeds downstream into a compliance evaluation and trace-to-training pipeline.

9 OTel span names

For real-time observability alongside the audit record:

tool.call · policy.check · sandbox.exec · verifier.check
redaction.apply · dataset.export · agent.run · llm.generate · mcp.request

These flow into any OTel collector. The AEPRecord is what you persist.

Repo

Questions for the community:

Are there tool call audit patterns you use today that AEP doesn't cover?
Would you want pre-built OTel dashboard templates (Grafana/Jaeger) for these spans?
Thoughts on the v0.2 delegation chain model for multi-agent trust boundaries?

Happy to discuss any of this here.

armorer-labs · 2026-06-26T03:34:27Z

armorer-labs
Jun 26, 2026

This is a useful direction. The part I would make very explicit in AEP is the boundary between an observability span and an authority receipt.

For the audit record, I would want a few fields that are not only useful for dashboards:

requested action: normalized tool name, schema version, argument hash, and side-effect class
authority decision: policy id/version, decision, approval route, principal/run identity, and resource scope
execution binding: whether the observed tool call matched the approved arguments or drifted before execution
state boundary: pre/post digest plus what those digests actually cover, since a repo hash, sandbox snapshot, database row set, and browser state all mean different things
replay boundary: which inputs are sufficient to reproduce the decision, and which external state is intentionally only referenced
delegation chain: parent run, child run, delegated capability, expiry, and the receipt that proves the child stayed inside the grant

The OTel span names look reasonable for live operations, but I would avoid making the span model carry the whole compliance story. Spans are great for latency, causality, and error paths; the persisted AEP record should be the thing that survives sampling, retention changes, collector outages, and later training-data export.

One pattern that has worked well for agent ops is to store a compact run-level capability snapshot once, then refer to it from individual tool receipts. Otherwise every tool call either repeats too much context or loses the answer to "what could this agent have done at the time?"
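
To make the snapshot-by-reference pattern concrete, here is a minimal sketch. The `CapabilitySnapshot` and `ToolReceipt` shapes, their field names, and both helper functions are hypothetical illustrations, not the AEP or Armorer schema.

```typescript
// Hypothetical shapes for illustration only; not the AEP or Armorer schema.
interface CapabilitySnapshot {
  snapshot_id: string;          // stable id, e.g. a content digest of the snapshot
  run_id: string;
  granted: readonly string[];   // capabilities the agent held for this run
}

interface ToolReceipt {
  tool_call_id: string;
  tool_name: string;
  capability_used: string;
  snapshot_ref: string;         // refers to CapabilitySnapshot.snapshot_id
}

// Answers "what could this agent have done at the time?" for one receipt.
function capabilitiesAt(
  receipt: ToolReceipt,
  snapshots: ReadonlyMap<string, CapabilitySnapshot>,
): readonly string[] {
  const snapshot = snapshots.get(receipt.snapshot_ref);
  if (!snapshot) {
    throw new Error(`unknown snapshot ${receipt.snapshot_ref}`);
  }
  return snapshot.granted;
}

// True when the capability the call used was part of the recorded grant.
function withinGrant(
  receipt: ToolReceipt,
  snapshots: ReadonlyMap<string, CapabilitySnapshot>,
): boolean {
  return capabilitiesAt(receipt, snapshots).includes(receipt.capability_used);
}
```

Each receipt stays small because it carries only `snapshot_ref`, while the full grant is stored once per run.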

Disclosure: I work on Armorer Labs.


telleroutlook · 2026-06-26T08:14:24Z

telleroutlook
Jun 26, 2026
Maintainer Author

Thanks for the detailed read — your taxonomy maps onto fields we
already shipped in aep/v0.2, but you're pointing at three real
gaps. Quick field-map first so we're talking about the same artifact,
then the gaps and the RFC.

Already in aep/v0.2 (see packages/aep/src/types.ts):

| Your category | AEP field(s) |
| --- | --- |
| Requested action: normalized tool name, schema version, arg context | `tool_name`, `tool_descriptor_digest`, `tool_manifest_digest`, `state_changing` |
| Authority decision: policy / decision / principal / resource | `decision` (allow/deny/ask_user/dry_run), `subject`, `resource`, `capability`, `reason_code`, `policy_bundle_digest`, `mcp_server_card_digest` |
| State pre/post digest | `pre_state_digest`, `post_state_digest`, `precondition_digest`, `result_digest` |
| Delegation chain | `parent_action_id`, `causal_chain_id`, `scope_lease_id`; `AgentTeam` carries this end-to-end |
| Run-level capability snapshot | One `CapabilityDecision` per session, referenced by digest from each `ActionEvidence`. Same pattern you described. |

Three real gaps you're surfacing:

Side-effect class beyond a boolean. state_changing: boolean collapses read / mutate-local / mutate-external / network-egress into one bit.
State-digest coverage is implicit. There is no machine-readable way to distinguish a git-tree digest from a sandbox snapshot, DB row set, or browser DOM, and digests can't be compared without knowing the kind.
Execution-binding drift is implicit. approval_context_hash is carried and the runtime checks it, but a historical reader can't tell "matched" from "skipped" from "re-approved on drift".

These are now drafted as point fixes in RFC #7 (aep/v0.3),
along with approval_mode from your other comment. Schema diff is
in the issue; nothing in v0.2 is removed.
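
For readers who have not opened the RFC, the three gaps could be typed roughly as follows. This is a hypothetical sketch, not the RFC #7 schema: only the name `state_digest_kind` comes from this thread, the enum values are lifted from the wording above, and every other identifier is an assumption.

```typescript
// Hypothetical sketch; the actual v0.3 shape is defined in RFC #7 and may differ.
type SideEffectClass = "read" | "mutate_local" | "mutate_external" | "network_egress";
type StateDigestKind = "git_tree" | "sandbox_snapshot" | "db_row_set" | "browser_dom";
type ExecutionBinding = "matched" | "skipped" | "reapproved_on_drift";

interface StateDigest {
  state_digest_kind: StateDigestKind; // what the digest actually covers
  value: string;                      // e.g. "sha256:..."
}

interface ActionSketch {
  tool_name: string;
  side_effect_class: SideEffectClass;   // replaces the single state_changing bit
  pre_state: StateDigest;
  post_state: StateDigest;
  execution_binding: ExecutionBinding;  // readable by a historical reader
}

// Digests of different kinds are never comparable, even if the strings match.
function stateChanged(pre: StateDigest, post: StateDigest): boolean | "incomparable" {
  if (pre.state_digest_kind !== post.state_digest_kind) {
    return "incomparable";
  }
  return pre.value !== post.value;
}
```

Returning `"incomparable"` instead of `false` keeps a kind mismatch from being read as "nothing changed".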

On the OTel question: we agree, and the current split already matches
your recommendation. The OTel spans (@wasmagent/otel-exporter,
GENAI_SEMCONV) carry latency / causality / errors; the AEP record
is the persisted authority receipt with its own schema, signing, and
retention story. aepActionToOtelSpan maps AEP actions to spans for
live observability without conflating the two.

Comments on the RFC welcome, particularly on the state_digest_kind
enum coverage and whether the decision_envelope belongs in v0.3 or
v0.4.

Disclosure: I maintain wasmagent-js, the project being discussed.


telleroutlook · 2026-06-26T08:20:14Z

telleroutlook
Jun 26, 2026
Maintainer Author

Also, since you mentioned you work on Armorer Labs and I just looked
at Armorer-Guard and Armorer: there's real overlap in the MCP
tools/call interception layer, and what looks like a complementary
gap on either side.

We're both MIT-licensed and shipping in public, so let me be direct
about where I think we line up:

Your Armorer-Guard does the fast local scan (Rust, the HF-hosted classifier, the 0.0247 ms latency number). We don't have a comparable in-process ML classifier; our @wasmagent/mcp-firewall vetting is keyword bag + lightweight n-gram logistic, deliberately positioned as "first-line filter, not adversarial-grade detection" (the project's own README says so).
Our @wasmagent/aep does the persistent authority receipt layer: schema-versioned, Ed25519-signed, designed to outlive spans and feed downstream training-data export. Your Armorer-Guard #15 "Add runtime receipts for Guard decisions" is exactly this layer, and if it's useful, the v0.2 schema + emitter + signer are ready-to-consume MIT code. Happy for it to inform your design even if you re-implement in Rust.

If there's appetite for it on your side, two concrete things that
could happen without anyone giving up anything:

Shared schema for the receipt itself. If Armorer-Guard's future runtime receipt and our AEPRecord agree on field names for the things both projects record (tool name, args digest, decision, policy id, scope), then operators running both can feed a single audit pipeline. RFC #7 in this repo is the in-flight version of v0.3; comments on the schema there are welcome on the merits regardless of whether you adopt anything.
Acknowledged inspiration in both directions. If our taint propagation / consent ledger / ApprovalReceipt / ScopeLease primitives are useful inspiration for Armorer's approvals layer, please use them; that's the point of publishing under MIT. We'd do the same with anything in Armorer-Guard that's better than what we have (the in-process classifier is the obvious example).

No need to coordinate roadmap or even agree on naming. Just flagging
that the two projects don't have to converge or compete: there's a
clean split between "fast local scan" and "persistent signed
receipt" that lines up with where each project has been investing.


armorer-labs · 2026-06-26T19:46:40Z

armorer-labs
Jun 26, 2026

Thanks for spelling this out. I agree with the clean split: scanner/runtime-risk decision on one side, persistent authority receipt on the other. The integration point I would keep boring is a receipt schema contract, not runtime coupling.

For the Guard side, the useful receipt fields are probably something like:

guard decision id
model or ruleset version
policy pack/version
normalized prompt/tool payload digest
redaction/provenance summary
action class and target boundary
decision: allow, block, escalate, or allow-with-receipt
reason class, not just free-text explanation
latency/runtime budget metadata

The important part is that this should be joinable to the downstream execution receipt without requiring both projects to share a process. A run id, tool-call id, normalized args digest, and policy/version tuple are enough for an operator to ask: what did the scanner decide before execution, what did the runtime actually execute, and did anything drift between those two points?
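
A rough sketch of that join, assuming both sides export plain records. The interfaces and field names here are hypothetical, chosen to mirror the tuple above; neither project's real receipt format is implied.

```typescript
// Hypothetical receipt shapes: only the shared join fields matter here.
interface JoinFields {
  run_id: string;
  tool_call_id: string;
  normalized_args_digest: string;
  policy_id: string;
  policy_version: string;
}

interface GuardDecision extends JoinFields {
  decision: "allow" | "block" | "escalate" | "allow-with-receipt";
}

interface ExecutionReceipt extends JoinFields {
  executed: boolean;
}

interface Joined {
  guard: GuardDecision;
  exec: ExecutionReceipt;
  argsMatch: boolean;    // false means the payload drifted between scan and execution
  policyMatch: boolean;  // false means the two sides evaluated different policy versions
}

// Call identity only; JSON.stringify of a tuple avoids delimiter collisions.
const callKey = (r: JoinFields): string => JSON.stringify([r.run_id, r.tool_call_id]);

function joinReceipts(guards: GuardDecision[], execs: ExecutionReceipt[]): Joined[] {
  const byCall = new Map<string, GuardDecision>();
  for (const g of guards) byCall.set(callKey(g), g);

  const joined: Joined[] = [];
  for (const e of execs) {
    const g = byCall.get(callKey(e));
    if (!g) continue; // no scanner decision recorded for this call
    joined.push({
      guard: g,
      exec: e,
      argsMatch: g.normalized_args_digest === e.normalized_args_digest,
      policyMatch: g.policy_id === e.policy_id && g.policy_version === e.policy_version,
    });
  }
  return joined;
}
```

The sketch joins on call identity and then compares the digest and policy tuple, rather than putting all four fields in the key, so a drifted call shows up as a joined row with `argsMatch: false` instead of silently failing to join.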

That also keeps the projects from pretending to solve the same layer. A fast local classifier can be wrong in ways the receipt layer later exposes, and a signed receipt can prove that a weak decision was faithfully executed without proving the decision was good. Those are complementary failure modes, so a shared vocabulary around common fields seems more useful than shared roadmap.

Disclosure: I work on Armorer Labs.


telleroutlook · 2026-06-26T23:22:27Z

telleroutlook
Jun 26, 2026
Maintainer Author

The framing on complementary failure modes is exactly right, and I would take it one step further: a strong scanner + weak receipt is a worse combination than weak scanner + strong receipt, because the scanner failure is silent — the operator sees an allow and has no artifact showing what assumption the allow was built on. A signed receipt at least surfaces the question.

On the Guard receipt fields you listed — those map well onto the AEP fields already in aep/v0.2, with two gaps I would call out explicitly:

reason class, not free-text explanation — we have reason_code: z.string() today, but it is still free-form. Making it a typed enum (or at minimum a namespaced token) so two scanners can emit the same reason without coordination is on the v0.3 list under policy_deny_reason_class. Worth aligning vocabulary early if you are thinking about reason_class on the Guard side.
redaction/provenance summary — we have redaction_profile as a top-level reference on AEPRecord, but the per-action link is implicit. v0.3 will make redaction_profile joinable at the action level so a receipt reader knows exactly what was redacted before the digest was computed.

The join key you named — run id, tool-call id, normalized args digest, policy/version — is already our canonical join key between AEPRecord and CapabilityDecision. If Guard receipts and AEP records share that tuple, an operator with both in the same pipeline can answer the three-point question (what did the scanner decide, what did the runtime execute, did anything drift) without requiring any runtime coupling.

If you do wire up receipts on the Guard side and want to test the join against real AEP output, the emitter and signer are in packages/aep/src/ — happy to walk through the v0.2 schema if useful.

Disclosure: I maintain wasmagent-js, the project being discussed.


armorer-labs · 2026-06-27T00:49:40Z

armorer-labs
Jun 27, 2026

That join test would be useful. I would keep the first interop target deliberately small: one Guard decision artifact, one runtime execution receipt, and one negative-path case where the digest or authority boundary does not match.

For reason_class, I would avoid making the first shared surface too product-specific. A namespaced token shape seems like the right compromise:

prompt-injection.indirect
tool-call.scope-escalation
credential.exposure-risk
data-egress.unapproved-target
policy.local-deny

The scanner can still keep richer local explanation text, but the receipt should carry a stable class that a downstream policy engine can group without parsing prose. I would also include the policy/ruleset version next to the class, because the same class can become stricter over time.
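
As a sketch of what a stable class buys a downstream reader, the snippet below validates and groups tokens of the shape listed above. The grammar (lowercase kebab-case segments joined by dots, at least two segments) is an assumption inferred from those five examples, and `ReceiptReason` is a hypothetical shape.

```typescript
// Assumed grammar, inferred from the example tokens; not a published spec.
const SEGMENT = "[a-z][a-z0-9]*(?:-[a-z0-9]+)*";
const REASON_CLASS = new RegExp(`^${SEGMENT}(?:\\.${SEGMENT})+$`);

// Hypothetical receipt fragment: the class travels with its ruleset version.
interface ReceiptReason {
  reason_class: string;
  ruleset_version: string;
}

// Splits "namespace.name" at the first dot; returns null for malformed tokens.
function parseReasonClass(token: string): { namespace: string; name: string } | null {
  if (!REASON_CLASS.test(token)) return null;
  const dot = token.indexOf(".");
  return { namespace: token.slice(0, dot), name: token.slice(dot + 1) };
}

// Groups receipts by namespace without parsing any free-text explanation.
function countByNamespace(reasons: ReceiptReason[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of reasons) {
    const parsed = parseReasonClass(r.reason_class);
    if (!parsed) continue; // malformed tokens are skipped, not guessed at
    counts.set(parsed.namespace, (counts.get(parsed.namespace) ?? 0) + 1);
  }
  return counts;
}
```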

On redaction, I agree the per-action join is the important bit. The artifact should make it clear whether the digest was computed over raw input, redacted input, or a canonicalized form after redaction. Otherwise two systems can appear to agree on normalized_args_digest while actually hashing different material. A small tuple like redaction_profile_id, redaction_profile_version, digest_input_stage, and canonicalization_version would probably be enough for the first pass.
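
One way a reader could enforce this is to refuse to compare digests whose context tuples differ. The sketch below uses the four field names from the paragraph above; the `DigestInputStage` values and the surrounding types are assumptions for illustration.

```typescript
// Stage values are assumed from the three cases named above.
type DigestInputStage = "raw" | "redacted" | "canonicalized_after_redaction";

interface DigestContext {
  redaction_profile_id: string;
  redaction_profile_version: string;
  digest_input_stage: DigestInputStage;
  canonicalization_version: string;
}

interface ContextualDigest {
  digest: string;
  context: DigestContext;
}

type Comparison = "match" | "mismatch" | "not_comparable";

// Equal digest strings only mean something when both sides hashed the same material.
function compareDigests(a: ContextualDigest, b: ContextualDigest): Comparison {
  const sameContext =
    a.context.redaction_profile_id === b.context.redaction_profile_id &&
    a.context.redaction_profile_version === b.context.redaction_profile_version &&
    a.context.digest_input_stage === b.context.digest_input_stage &&
    a.context.canonicalization_version === b.context.canonicalization_version;
  if (!sameContext) return "not_comparable";
  return a.digest === b.digest ? "match" : "mismatch";
}
```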

For an interop fixture, the useful acceptance test is not just the happy path. I would want the reader to prove three outcomes:

scanner allowed and runtime executed the same canonical payload
scanner denied and runtime did not execute, with the deny class preserved
scanner allowed payload A but runtime observed payload B, with drift called out explicitly rather than hidden as a generic deny

That keeps the coupling at the evidence boundary: Guard does not need to know about the runtime internals, and AEP does not need to know how Guard reached the decision, but an operator can still reconstruct whether the scan decision, approval grant, and executed call describe the same event.
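
The three outcomes could be checked by a fixture reader along these lines. This is a sketch under assumed shapes: `ScanDecision`, `RuntimeObservation`, and the outcome labels are hypothetical, and a fourth `unexpected` label is added so that anything outside the three cases is surfaced rather than forced into one of them.

```typescript
// Hypothetical evidence shapes for an interop fixture.
interface ScanDecision {
  decision: "allow" | "deny";
  payload_digest: string;   // canonical payload the scanner evaluated
  reason_class?: string;    // expected to be set when decision is "deny"
}

interface RuntimeObservation {
  observed_payload_digest: string; // canonical payload the runtime saw
  executed: boolean;
  deny_class?: string;             // deny class carried on the runtime receipt
}

type Outcome = "allowed_and_matched" | "denied_and_preserved" | "drift" | "unexpected";

function classify(scan: ScanDecision, run: RuntimeObservation): Outcome {
  if (scan.decision === "deny") {
    const preserved =
      scan.reason_class !== undefined && run.deny_class === scan.reason_class;
    return !run.executed && preserved ? "denied_and_preserved" : "unexpected";
  }
  // Scanner allowed: drift is reported whether or not the runtime went on to execute.
  if (run.observed_payload_digest !== scan.payload_digest) return "drift";
  return run.executed ? "allowed_and_matched" : "unexpected";
}
```

Checking for drift before checking `executed` keeps the "payload A versus payload B" case from being reported as a generic deny.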

Disclosure: I work on Armorer Labs.
