Filecoin Cloud Agent Benchmark

This repo is becoming the benchmark harness for Filecoin On Chain Cloud launch readiness.

The benchmark goal is simple:

run an agent regularly against the Filecoin Cloud getting-started guide
require the agent to generate a fresh wallet, fund it, upload a file, download it, and prove integrity
measure true end-to-end wall time, including agent inference and tool use
retain structured evidence and operator-readable feedback over time

Current State

What exists today:

a canonical local harness that runs Codex in a fresh workspace and writes normalized run artifacts under runs/<run-id>/
structured validation, dashboard record generation, local feed aggregation, and alert evaluation
Filecoin Cloud artifact bundle publishing plus per-artifact evidence publishing with a Filecoin-hosted artifact index
a Token Host Builder benchmark registry schema deployed on Filecoin Calibration
local dashboard views that combine feed-backed operator metrics with chain-backed collection pages
one preserved historical manual run plus newer validated benchmark runs under ignored runs/

Historical evidence lives in historical-runs/2026-03-04-initial-manual.

Working Direction

The active repo direction is documented here:

Active Entry Point

Local execution now goes through bin/run-benchmark.sh, which creates a fresh temporary workspace, launches Codex, captures outer timing, and writes run artifacts under ignored runs/<run-id>/ directories.

Each run now produces:

run-summary.json: harness-normalized result record
validation-result.json: validator output derived from structured evidence
dashboard-records.json: dashboard-ready records derived from the validated run
artifact-publish-result.json: Filecoin artifact publication result for the bundle, artifact index, and individually published evidence files
workspace-output.tgz: captured workspace bundle for later publishing
docs-snapshot.html and docs-snapshot.sha256 when the guide can be fetched at run start

Prompt versions now live under prompts/, and the harness selects a version from the benchmark MODE.

Current benchmark modes:

fresh-follow-docs: agent must generate and fund a fresh wallet
inherited-key-follow-docs: agent inherits PRIVATE_KEY from the harness environment and uses that wallet for the guide flow
scripted-regression: deterministic harness sanity mode

Current environment assumptions:

artifact publishing backend target will be the Filecoin storage service under test
dashboard development can run locally first against Filecoin Calibration
scheduling can remain a locally invoked script until the harness output stabilizes

Local operations now support:

npm run benchmark:cycle: run benchmark, publish the bundle plus artifact index/evidence files to Filecoin, rebuild dashboard feed, then evaluate alerts
npm run benchmark:publish: publish the latest run bundle to Filecoin Cloud using the dev wallet
npm run benchmark:alerts: evaluate local alert thresholds without running a new benchmark
PUBLISH_DASHBOARD_APP=1 npm run benchmark:cycle: also publish the finalized run record into the deployed Token Host registry

Dashboard

The benchmark dashboard is defined as a Token Host Builder app schema in dashboard/schema.json.

Current dashboard behavior:

the homepage and /run?id=<runId> use the local validated benchmark feed for operator metrics and artifact inspection
the generated collection routes such as /BenchmarkRun and /BenchmarkIncident read live records from the deployed Calibration app
the dashboard feed is scoped to the current deployed registry by default so wiped/redeployed registries do not mix with stale local history
artifact buttons in the custom run view point at Filecoin-hosted evidence URLs, not localhost routes

Useful commands:

npm run dashboard:dev: start the local Next dashboard with the current manifest, ABI, and feed copied into public/
npm run dashboard:ui-sync: refresh only the generated dashboard UI from the current schema, compiled ABI, manifest, and UI overrides without redeploying the registry contract
npm run dashboard:up: deploy the dashboard schema to Filecoin Calibration and refresh the generated manifest
npm run dashboard:publish: publish the latest finalized run into the deployed BenchmarkRun registry
npm run dashboard:republish-history: republish all local runs/*/dashboard-records.json into the current deployment
npm run dashboard:reset: deploy a fresh registry and republish local benchmark history into it

Latest validated end-to-end inherited-key run:

20260306T000054Z-f0eb33
on-chain BenchmarkRun record #1 at deployment 0x10ce50e27b9dc9b333345aa4ce6cc5f39a2160ce
artifact index: https://calib.ezpdpz.net/piece/bafkzcibdrenqqtcqh2dgfjb5bkxo7rfsp62thiakhg6hgd3c54tm5rjghtsff2qj

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
bin		bin
dashboard		dashboard
docs		docs
historical-runs/2026-03-04-initial-manual		historical-runs/2026-03-04-initial-manual
prompts		prompts
schemas		schemas
scripts		scripts
.gitignore		.gitignore
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
prompt.md		prompt.md
run.sh		run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Filecoin Cloud Agent Benchmark

Current State

Working Direction

Active Entry Point

Dashboard

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Filecoin Cloud Agent Benchmark

Current State

Working Direction

Active Entry Point

Dashboard

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages