This repo is becoming the benchmark harness for Filecoin On Chain Cloud launch readiness.
The benchmark goal is simple:
- run an agent regularly against the Filecoin Cloud getting-started guide
- require the agent to generate a fresh wallet, fund it, upload a file, download it, and prove integrity
- measure true end-to-end wall time, including agent inference and tool use
- retain structured evidence and operator-readable feedback over time
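The "prove integrity" step above can be checked by comparing digests of the uploaded and downloaded bytes. A minimal Python sketch, assuming SHA-256 is an acceptable digest for this purpose; `sha256_file` and `integrity_ok` are hypothetical helper names, not part of this repo:

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def integrity_ok(uploaded: Path, downloaded: Path) -> bool:
    """True when both files hash to the same digest."""
    return sha256_file(uploaded) == sha256_file(downloaded)
```

Reading in chunks keeps memory use flat even for large test files.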
What exists today:
- a canonical local harness that runs Codex in a fresh workspace and writes normalized run artifacts under runs/<run-id>/
- structured validation, dashboard record generation, local feed aggregation, and alert evaluation
- Filecoin Cloud artifact bundle publishing plus per-artifact evidence publishing with a Filecoin-hosted artifact index
- a Token Host Builder benchmark registry schema deployed on Filecoin Calibration
- local dashboard views that combine feed-backed operator metrics with chain-backed collection pages
- one preserved historical manual run plus newer validated benchmark runs under ignored runs/
Historical evidence lives in historical-runs/2026-03-04-initial-manual.
The active repo direction is documented here:
- docs/roadmap.md
- docs/work-log.md
- docs/benchmark-design.md
- schemas/benchmark-run-result.schema.json
- dashboard/schema.json
Local execution now goes through bin/run-benchmark.sh, which creates a fresh temporary workspace, launches Codex, captures outer timing, and writes run artifacts under ignored runs/<run-id>/ directories.
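The outer-timing idea can be sketched in Python as follows. This is an illustration only, not the contents of bin/run-benchmark.sh: the function name `run_timed` and the summary fields `exit_code` and `wall_seconds` are assumptions, and the real record shape is whatever schemas/benchmark-run-result.schema.json defines.

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_timed(command: list[str], run_dir: Path) -> dict:
    """Run a command in a fresh temporary workspace and record outer wall time."""
    run_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(prefix="bench-") as workspace:
        started = time.monotonic()  # monotonic clock is immune to wall-clock jumps
        completed = subprocess.run(command, cwd=workspace)
        elapsed = time.monotonic() - started
    summary = {
        "exit_code": completed.returncode,  # hypothetical field name
        "wall_seconds": round(elapsed, 3),  # hypothetical field name
    }
    (run_dir / "run-summary.json").write_text(json.dumps(summary, indent=2))
    return summary
```

Timing around the whole subprocess, rather than inside the agent, is what makes the measurement include inference and tool use.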
Each run now produces:
- run-summary.json: harness-normalized result record
- validation-result.json: validator output derived from structured evidence
- dashboard-records.json: dashboard-ready records derived from the validated run
- artifact-publish-result.json: Filecoin artifact publication result for the bundle, artifact index, and individually published evidence files
- workspace-output.tgz: captured workspace bundle for later publishing
- docs-snapshot.html and docs-snapshot.sha256 when the guide can be fetched at run start
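A quick completeness check over a run directory could look like the sketch below. It assumes the first five files are always expected and treats the docs snapshot pair as optional, since the snapshot is only written when the guide can be fetched; `missing_artifacts` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the list above; treating all five as required is an assumption.
REQUIRED = [
    "run-summary.json",
    "validation-result.json",
    "dashboard-records.json",
    "artifact-publish-result.json",
    "workspace-output.tgz",
]
# Only present when the guide could be fetched at run start.
OPTIONAL = ["docs-snapshot.html", "docs-snapshot.sha256"]


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the required artifact names that are absent from run_dir."""
    return [name for name in REQUIRED if not (run_dir / name).is_file()]
```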
Prompt versions now live under prompts/, and the harness selects a version from the benchmark MODE.
Current benchmark modes:
- fresh-follow-docs: agent must generate and fund a fresh wallet
- inherited-key-follow-docs: agent inherits PRIVATE_KEY from the harness environment and uses that wallet for the guide flow
- scripted-regression: deterministic harness sanity mode
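Mode-based prompt selection might be sketched like this. The mode names and the MODE and PRIVATE_KEY variables come from this README; the default mode, the `prompts/<mode>.md` file layout, and the `select_prompt` helper are assumptions made for illustration.

```python
from pathlib import Path

MODES = ("fresh-follow-docs", "inherited-key-follow-docs", "scripted-regression")


def select_prompt(env: dict[str, str], prompts_dir: Path = Path("prompts")) -> Path:
    """Pick a prompt file for the benchmark mode named in env["MODE"]."""
    mode = env.get("MODE", "fresh-follow-docs")  # default mode is an assumption
    if mode not in MODES:
        raise ValueError(f"unknown MODE {mode!r}; expected one of {MODES}")
    if mode == "inherited-key-follow-docs" and not env.get("PRIVATE_KEY"):
        raise ValueError("inherited-key-follow-docs requires PRIVATE_KEY")
    return prompts_dir / f"{mode}.md"  # hypothetical file layout
```

Failing fast on a missing PRIVATE_KEY avoids spending a full agent run on a configuration error.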
Current environment assumptions:
- the artifact publishing backend will be the Filecoin storage service under test
- dashboard development can run locally first against Filecoin Calibration
- scheduling can remain a locally invoked script until the harness output stabilizes
Local operations now support:
- npm run benchmark:cycle: run benchmark, publish the bundle plus artifact index/evidence files to Filecoin, rebuild dashboard feed, then evaluate alerts
- npm run benchmark:publish: publish the latest run bundle to Filecoin Cloud using the dev wallet
- npm run benchmark:alerts: evaluate local alert thresholds without running a new benchmark
- PUBLISH_DASHBOARD_APP=1 npm run benchmark:cycle: also publish the finalized run record into the deployed Token Host registry
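Threshold-based alert evaluation could work roughly as in the sketch below. The `Run` fields, the two thresholds, and their default values are all hypothetical; the rules the repo actually applies may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    passed: bool
    wall_seconds: float


def evaluate_alerts(
    runs: list[Run],
    max_wall_seconds: float = 900.0,      # hypothetical threshold
    max_consecutive_failures: int = 2,    # hypothetical threshold
) -> list[str]:
    """Return alert messages for a list of runs ordered oldest to newest."""
    alerts: list[str] = []
    if not runs:
        return alerts
    latest = runs[-1]
    if latest.wall_seconds > max_wall_seconds:
        alerts.append(f"slow-run: {latest.run_id} took {latest.wall_seconds:.0f}s")
    # Count failures from the newest run backwards until the first pass.
    streak = 0
    for run in reversed(runs):
        if run.passed:
            break
        streak += 1
    if streak >= max_consecutive_failures:
        alerts.append(f"failure-streak: {streak} consecutive failed runs")
    return alerts
```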
The benchmark dashboard is defined as a Token Host Builder app schema in dashboard/schema.json.
Current dashboard behavior:
- the homepage and /run?id=<runId> use the local validated benchmark feed for operator metrics and artifact inspection
- the generated collection routes such as /BenchmarkRun and /BenchmarkIncident read live records from the deployed Calibration app
- the dashboard feed is scoped to the current deployed registry by default so wiped/redeployed registries do not mix with stale local history
- artifact buttons in the custom run view point at Filecoin-hosted evidence URLs, not localhost routes
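Scoping the feed to the current registry amounts to filtering records by deployment address. A minimal sketch, assuming each feed record carries a `deployment` field holding the registry contract address; that field name and the `scope_feed` helper are hypothetical.

```python
def scope_feed(records: list[dict], deployment: str) -> list[dict]:
    """Keep only records that belong to the given registry deployment."""
    wanted = deployment.lower()  # hex addresses compare case-insensitively
    return [
        record
        for record in records
        if str(record.get("deployment", "")).lower() == wanted
    ]
```

Records with no deployment field are dropped, which keeps history from a wiped registry out of the scoped view.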
Useful commands:
- npm run dashboard:dev: start the local Next dashboard with the current manifest, ABI, and feed copied into public/
- npm run dashboard:ui-sync: refresh only the generated dashboard UI from the current schema, compiled ABI, manifest, and UI overrides without redeploying the registry contract
- npm run dashboard:up: deploy the dashboard schema to Filecoin Calibration and refresh the generated manifest
- npm run dashboard:publish: publish the latest finalized run into the deployed BenchmarkRun registry
- npm run dashboard:republish-history: republish all local runs/*/dashboard-records.json into the current deployment
- npm run dashboard:reset: deploy a fresh registry and republish local benchmark history into it
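Gathering local history for a republish can be sketched as a glob over runs/*/dashboard-records.json. The `collect_history` helper is hypothetical, and accepting either a single JSON object or a list per file is an assumption about the file shape.

```python
import json
from pathlib import Path


def collect_history(runs_dir: Path) -> list[dict]:
    """Load dashboard records from every run directory, in sorted path order."""
    records: list[dict] = []
    for path in sorted(runs_dir.glob("*/dashboard-records.json")):
        payload = json.loads(path.read_text())
        # Accept a single record object or a list of records (assumption).
        records.extend(payload if isinstance(payload, list) else [payload])
    return records
```

Because run ids begin with a timestamp, sorted path order should also be chronological order.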
Latest validated end-to-end inherited-key run:
- 20260306T000054Z-f0eb33
- on-chain BenchmarkRun record #1 at deployment 0x10ce50e27b9dc9b333345aa4ce6cc5f39a2160ce
- artifact index: https://calib.ezpdpz.net/piece/bafkzcibdrenqqtcqh2dgfjb5bkxo7rfsp62thiakhg6hgd3c54tm5rjghtsff2qj
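The run id above appears to combine a compact UTC timestamp with a six-character hex suffix. A parsing sketch under that assumption follows; the format is inferred from this single example, and `parse_run_id` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# Inferred shape: YYYYMMDDTHHMMSSZ-<6 lowercase hex chars> (assumption).
RUN_ID = re.compile(r"^(\d{8}T\d{6}Z)-([0-9a-f]{6})$")


def parse_run_id(run_id: str) -> tuple[datetime, str]:
    """Split a run id into its UTC start time and random-looking suffix."""
    match = RUN_ID.match(run_id)
    if match is None:
        raise ValueError(f"unrecognized run id: {run_id!r}")
    stamp, suffix = match.groups()
    started = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return started, suffix
```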