Production AI Reliability & Operational Evidence · OEP | Agent Runtime · Observability · Evals · Replay
Platform artifacts for production AI and agent runtime systems: workflow orchestration, tool-call permission and identity records, agent-step telemetry, release manifests, eval traces, replay / reconstruction packets, rollout gates, and incident evidence.
Author of the Operational Evidence Plane (OEP) for Agentic AI, an open reference architecture for the operational-evidence layer of agent runtime systems. v0.3.0 joins release manifests, runtime events, permission records, traces, evals, replay state, and reconstruction packets under stable decision_id values, and supports counterfactual replay across policy, cost, drift, cache, and identity metadata. Concept DOI 10.5281/zenodo.20051036; v0.3.0 archive DOI 10.5281/zenodo.20363793.
Method spec: Decision Evidence Maturity Model (DEMM) - arXiv:2605.04093. Empirical pilot: arXiv:2605.12078.
PhD research: the Operational Evidence Plane and counterfactual replay for production AI and agent runtime systems.
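The join described above, where heterogeneous evidence records share a stable decision_id, can be sketched roughly as follows. This is a minimal illustration and not the OEP implementation: the record shape (`decision_id`, `kind`), the `REQUIRED_KINDS` tuple, the sample data, and both function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical evidence kinds, named after the record types listed above.
REQUIRED_KINDS = (
    "release_manifest",
    "runtime_event",
    "permission_record",
    "trace",
    "eval",
)


def join_by_decision_id(records):
    """Group evidence records by decision_id, then by record kind."""
    packets = defaultdict(lambda: defaultdict(list))
    for record in records:
        packets[record["decision_id"]][record["kind"]].append(record)
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {did: dict(kinds) for did, kinds in packets.items()}


def missing_kinds(packet):
    """Return the required kinds that a joined packet does not contain."""
    return [kind for kind in REQUIRED_KINDS if kind not in packet]


# Illustrative records only; real records would carry payloads and timestamps.
SAMPLE_RECORDS = [
    {"decision_id": "d-001", "kind": "release_manifest"},
    {"decision_id": "d-001", "kind": "runtime_event"},
    {"decision_id": "d-001", "kind": "permission_record"},
    {"decision_id": "d-001", "kind": "trace"},
    {"decision_id": "d-001", "kind": "eval"},
    {"decision_id": "d-002", "kind": "runtime_event"},
    {"decision_id": "d-002", "kind": "trace"},
]

for decision_id, packet in join_by_decision_id(SAMPLE_RECORDS).items():
    print(decision_id, "missing:", missing_kinds(packet))
```

Keying every record on the same identifier means a gap in the evidence shows up as a missing kind in the joined packet, rather than as a silent hole discovered during an incident.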
- Primary proof: Operational Evidence Plane v0.3.0 - reference implementation for reconstructable agent-runtime evidence.
- Reconstruction proof: Decision Trace Reconstructor - reports evidenced, partial, absent, and opaque facts in agent / automated-decision traces.
- Research anchors: DEMM method preprint, agent-decision reconstructability pilot, and the publication pipeline below.
Agentic AI / DEMM:
- Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification - arXiv:2605.04093.
- Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes - arXiv:2605.12078.
Operational evidence foundation:
- Decision Trace Schema for Governance Evidence - arXiv:2604.09296.
- Evidence Sufficiency / Delayed Ground Truth - arXiv:2604.15740.
- Label-Free Governance Degradation - arXiv:2604.17836.
- Governed Decisioning + Agentic - arXiv:2604.19112.
- Post-Incident Decision Reconstruction - SSRN DOI 10.2139/ssrn.6457861.
Focus areas:
- Production AI Reliability (release manifests, eval-to-release gates, incident evidence)
- Agent Runtime / Workflow Infrastructure (tool-use workflows, state / replay, safe execution evidence)
- Observability & Evals Infrastructure (eval / telemetry linkage, traces, quality loops)
- Operational Evidence & Incident Reconstruction (event identity, lineage, reconstruction packets)
- Agent Permissions / Identity / Policy Controls (tool-call authorization, policy lifecycle, agent-to-service evidence)
- Release Gates & Reliability Engineering (canary, shadow, rollback, postmortem packets)
- Platform / Control Plane Engineering (distributed services, Kubernetes, GitOps, multi-cloud)
- Data & Streaming Infrastructure (events, schemas, evidence joins, delayed-label systems)
Current strongest public proof:
- operational-evidence-plane - v0.3.0 public reference implementation for production AI / agent-runtime operational evidence: release manifests, agent-step events, tool-call permission packets, operational traces, eval results, reconstruction packets, deterministic code-review demo, Bedrock translation, and counterfactual replay across policy / cost / drift / cache / identity metadata. Apache-2.0. Concept DOI: 10.5281/zenodo.20051036; v0.3.0 DOI: 10.5281/zenodo.20363793.
- decision-trace-reconstructor - v0.1.0 trace reconstruction tool that reports evidenced, partial, absent, and opaque decision facts across LangSmith, OpenTelemetry, Bedrock, OpenAI Agents, Anthropic, MCP, and other adapters. Zenodo DOI: 10.5281/zenodo.19851574.
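The four reporting states named above (evidenced, partial, absent, opaque) could be modeled along the following lines. This is a sketch of one plausible reading, not the decision-trace-reconstructor's actual logic: the `FactStatus` enum, the `classify_fact` function, the `opaque` flag, and the field names are all hypothetical.

```python
from enum import Enum


class FactStatus(Enum):
    EVIDENCED = "evidenced"
    PARTIAL = "partial"
    ABSENT = "absent"
    OPAQUE = "opaque"


def classify_fact(observed, required_fields):
    """Classify one decision fact from whatever the trace recorded for it.

    Assumed semantics (not taken from the tool itself):
      - ABSENT: nothing was recorded, or no required field has a value
      - OPAQUE: a record exists but is flagged as not inspectable
      - EVIDENCED: every required field has a value
      - PARTIAL: only some required fields have a value
    """
    if observed is None:
        return FactStatus.ABSENT
    if observed.get("opaque"):
        return FactStatus.OPAQUE
    present = [f for f in required_fields if observed.get(f) is not None]
    if len(present) == len(required_fields):
        return FactStatus.EVIDENCED
    if present:
        return FactStatus.PARTIAL
    return FactStatus.ABSENT


# Hypothetical required fields for a tool-call authorization fact.
fields = ["tool", "caller"]
print(classify_fact({"tool": "search", "caller": "agent-7"}, fields).value)
print(classify_fact({"tool": "search"}, fields).value)
print(classify_fact({"opaque": True}, fields).value)
print(classify_fact(None, fields).value)
```

Separating opaque from absent keeps "a record exists but cannot be read" distinct from "nothing was recorded", which are different findings in a post-incident review.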
Foundational operational-evidence artifacts:
- decision-event-schema - v0.3.0 JSON Schema for decision / action events and reconstruction-oriented evidence identity. Concept DOI: 10.5281/zenodo.18923177.
- evidence-collector-sdk - v0.2.0 SDK for turning raw operational signals into provenance-bearing decision evidence records. Concept DOI: 10.5281/zenodo.19245404.
- evidence-sufficiency-calc - v0.2.0 calculator for scoring whether available operational proof is sufficient for a decision context. Concept DOI: 10.5281/zenodo.19233930.
- governance-drift-toolkit - v0.2.1 toolkit for monitoring degradation of governance evidence in delayed-label environments. Concept DOI: 10.5281/zenodo.19236417.
- governance-benchmark-dataset - v0.2.0 benchmark dataset for comparing evidence-property feasibility across decision-system architectures. Concept DOI: 10.5281/zenodo.19248722.
Supporting policy-as-code project:
- RuleHub - Policy-as-Code ecosystem for AI / ML guardrails, policy enforcement, and reproducible evidence; currently secondary to OEP, serving as a policy / agent-runtime bridge rather than the lead artifact.