3 changes: 3 additions & 0 deletions README.md
@@ -87,9 +87,12 @@ See [`docs/protocol-faep.md`](docs/protocol-faep.md) for the full schema.
|---|---|---|
| `wasmagent-js` | Sandbox / tool-use runtime reference | No |
| `open-agent-audit` | Evidence record enhancement layer | Optional |
| `agent-trust-infra` | Trust Passport & AgentBOM standards for evidence | See docs |
| `trace-pipeline` | Export failure traces as training data | Phase 2 |
| `bscode` | Coding task source / solver baseline | Phase 2 |

See [`docs/audit-integration.md`](docs/audit-integration.md) for details on how FreshArena FAEP records map to the Trust Passport schema.

---

## License
75 changes: 75 additions & 0 deletions docs/audit-integration.md
@@ -29,3 +29,78 @@ FreshArena produces FAEP evaluation records. `open-agent-audit` can serve as an
When connecting, FreshArena emits one `faep_record` JSONL line per evaluation run. open-agent-audit ingests it via its standard evidence adapter. No structural changes to FAEP records are required — open-agent-audit wraps them, it does not replace them.

See [`docs/protocol-faep.md`](protocol-faep.md) for the full FAEP record schema.

---

## Agent Trust Infrastructure Integration

FreshArena evaluation records are designed to align with the Trust Passport and AgentBOM specifications of the **Agent Trust Infrastructure**, defined in the sibling `agent-trust-infra` repository.

### Mapping FAEP Records to Trust Passport

A FreshArena `FaepRecord` serves as an evidence artifact that can be embedded within a Trust Passport. Its fields map to Trust Passport concepts as follows:

| FAEP Field | Trust Passport Concept | Purpose |
|---|---|---|
| `run_id` | `evaluation_run_id` | Links evaluation to a specific test execution |
| `task.id` + `task.seed_hash` | `test_case_id` | Uniquely identifies the evaluated task instance |
| `solver.id` + `solver.track` | `agent_identifier` | Identifies which agent was evaluated |
| `solver.model_metadata_hash` | `agent.config_hash` | Links to AgentBOM component configuration |
| `solver.workflow_hash` | `agent.workflow_hash` | Attests to the agent's workflow/prompt configuration |
| `solver.artifact_hash` | `agent.binary_hash` | Links to the executable artifact |
| `score.canonical_pass` | `evaluation_result.pass` | Primary correctness verdict |
| `score.adversarial_pass` | `evaluation_result.adversarial_check` | Post-commit robustness evidence |
| `verifier.package` + `verifier.version` | `verifier_reference` | Links to deterministic verification standard |
| `verifier.result_hash` | `evidence_fingerprint` | Cryptographic fingerprint of the verification result |
| `replay.command` + `replay.log_hash` | `reproducibility_artifact` | Enables third-party verification |
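
The table above can be expressed as a small lookup, sketched here in Python. This is a minimal illustration, not part of either project: `get_path`, `FIELD_MAP`, and `to_passport_fields` are hypothetical names, the FAEP record is assumed to be a nested dict keyed as the dotted field names suggest, and because the source does not say how two FAEP fields are combined into one concept, the sketch keeps them side by side instead of inventing a join format.

```python
from typing import Any, Dict, List, Tuple


def get_path(record: Dict[str, Any], dotted: str) -> Any:
    """Look up a dotted path such as 'solver.id'; return None if any part is missing."""
    node: Any = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# (Trust Passport concept, FAEP source fields), copied row by row from the table.
FIELD_MAP: List[Tuple[str, List[str]]] = [
    ("evaluation_run_id", ["run_id"]),
    ("test_case_id", ["task.id", "task.seed_hash"]),
    ("agent_identifier", ["solver.id", "solver.track"]),
    ("agent.config_hash", ["solver.model_metadata_hash"]),
    ("agent.workflow_hash", ["solver.workflow_hash"]),
    ("agent.binary_hash", ["solver.artifact_hash"]),
    ("evaluation_result.pass", ["score.canonical_pass"]),
    ("evaluation_result.adversarial_check", ["score.adversarial_pass"]),
    ("verifier_reference", ["verifier.package", "verifier.version"]),
    ("evidence_fingerprint", ["verifier.result_hash"]),
    ("reproducibility_artifact", ["replay.command", "replay.log_hash"]),
]


def to_passport_fields(record: Dict[str, Any]) -> Dict[str, Any]:
    """Project a FAEP record onto Trust Passport concepts.

    Concepts whose source fields are incomplete are skipped. Multi-field
    concepts are returned as {source_field: value} (an assumption, since
    the real combination rule is not specified here).
    """
    out: Dict[str, Any] = {}
    for concept, sources in FIELD_MAP:
        values = [get_path(record, s) for s in sources]
        if any(v is None for v in values):
            continue
        out[concept] = values[0] if len(values) == 1 else dict(zip(sources, values))
    return out
```

Keeping the mapping as data rather than code means a schema change in either project only touches `FIELD_MAP`.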

### FAEP as AgentBOM Evidence Source

FreshArena records provide evidence that can be referenced in an AgentBOM:

1. **Component Verification**: The `solver.model_metadata_hash` and `solver.workflow_hash` provide provenance for the agent's configuration at test time.

2. **Version Evidence**: The `generator.version`, `tester.version`, and `verifier.version` fields document the full evaluation stack.

3. **Deterministic Verification**: The `verifier.result_hash` combined with `task.seed_hash` creates a reproducible fingerprint that can be independently verified.
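
As a rough illustration of the third point, the sketch below derives one fingerprint from `verifier.result_hash` and `task.seed_hash` and compares two records, for example an original run and a third-party replay. The combination scheme (SHA-256 over canonical JSON) and the names `replay_fingerprint` and `same_evidence` are assumptions made for this example; the source does not specify how the two hashes are combined.

```python
import hashlib
import json
from typing import Any, Dict


def replay_fingerprint(record: Dict[str, Any]) -> str:
    """Combine task.seed_hash and verifier.result_hash into one hex digest.

    Assumed scheme: SHA-256 over a canonical JSON encoding, so key order
    and whitespace cannot change the result.
    """
    payload = {
        "task.seed_hash": record["task"]["seed_hash"],
        "verifier.result_hash": record["verifier"]["result_hash"],
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_evidence(original: Dict[str, Any], replayed: Dict[str, Any]) -> bool:
    """True when a replayed record reproduces the original task and verdict."""
    return replay_fingerprint(original) == replay_fingerprint(replayed)
```

A verifier that reruns the task from the same seed and obtains the same result hash would produce an identical fingerprint, which is the property an AgentBOM reference relies on.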

### Consumption Pattern

To integrate FreshArena results with a Trust Passport, reference each FAEP record from an entry in the passport's `evaluations` array:

```json
{
  "trust_passport": {
    "agent_id": "solver:my-agent-v1",
    "evaluations": [
      {
        "source": "FreshArena",
        "faep_record_ref": "faep:run_abc123_task_xyz789",
        "task_family": "json_transform.normalize.v0",
        "evidence_type": "deterministic_verification",
        "timestamp": "2025-01-15T10:30:00Z",
        "result": {
          "canonical_pass": true,
          "adversarial_pass": false,
          "fresh_fixed_gap": 0.15
        }
      }
    ]
  }
}
```
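
An entry of this shape could be built from a single `faep_record` JSONL line roughly as follows. This is a sketch under stated assumptions: `evaluation_entry` is a hypothetical helper, the `faep:<run_id>_<task.id>` reference format is inferred from the example value only, and `task_family`, `timestamp`, and `fresh_fixed_gap` are passed in by the caller because this document does not say which FAEP fields carry them.

```python
import json
from typing import Any, Dict, Optional


def evaluation_entry(
    faep_line: str,
    task_family: str,
    timestamp: str,
    fresh_fixed_gap: Optional[float] = None,
) -> Dict[str, Any]:
    """Build one Trust Passport `evaluations` entry from a FAEP JSONL line."""
    record = json.loads(faep_line)

    result: Dict[str, Any] = {
        "canonical_pass": record["score"]["canonical_pass"],
        "adversarial_pass": record["score"]["adversarial_pass"],
    }
    # The gap is an aggregate over fresh vs fixed tasks, so it is optional here.
    if fresh_fixed_gap is not None:
        result["fresh_fixed_gap"] = fresh_fixed_gap

    return {
        "source": "FreshArena",
        # Assumed reference format, inferred from the example above.
        "faep_record_ref": "faep:{}_{}".format(record["run_id"], record["task"]["id"]),
        "task_family": task_family,
        "evidence_type": "deterministic_verification",
        "timestamp": timestamp,
        "result": result,
    }
```

The entry only references the FAEP record; the record itself stays unchanged, in line with the wrap-not-replace approach described for `open-agent-audit`.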

### Key Differences in Focus

| Aspect | FreshArena (FAEP) | Agent Trust Infrastructure |
|---|---|---|
| Primary Goal | Detect overfitting via fresh task generation | Aggregate and verify agent claims across projects |
| Evidence Type | Per-task evaluation records with adversarial checks | Cross-domain attestation and provenance |
| Replayability | Full deterministic replay via seed + verifier package | Claim verification via linked evidence artifacts |
| Freshness Check | Core: compares fresh vs fixed task performance | Optional: one of many evidence sources |

### References

- **Agent Trust Infrastructure**: https://github.com/WasmAgent/agent-trust-infra
- **Trust Passport Spec**: defines the standard schema for agent evaluation evidence
- **AgentBOM Spec**: defines the standard schema for agent component documentation