Agent Harness Environment is a flight recorder, eval harness, and policy-comparison surface for coding agents. This starter repo is designed to open directly in Cursor and give the agent enough structure to start building, so you do not have to re-explain the product in every session.
Documentation map: docs/INDEX.md — demo path, verification, eval design, architecture, runner/MCP, backlog.
Static/demo milestone handoff (no git tag created in-repo):
| Step | Link / command |
|---|---|
| What v0.1 includes (and excludes) | docs/RELEASE_NOTES_v0.1.md |
| 5–10 min hosted walkthrough | docs/DEPLOYMENT.md § Post-deploy smoke checklist |
| Risks, backlog, non-claims | docs/FINAL_AUDIT.md |
| Deploy settings + HTML smoke | docs/DEPLOYMENT.md |
| Pre-tag checklist | docs/RELEASE_CHECKLIST.md §8 |
pnpm install && pip install -r requirements-dev.txt
pnpm dev # hosted demo
pnpm eval:ci && python -m pytest && pnpm build # verification
pnpm deploy:check && pnpm smoke:hosted:local # optional; preview must be running
- Static hosted demo (Lovable-style numbered narrative): sticky nav, premise → protocol → primitives → cockpit → failure taxonomy → eval comparison → router → implementation evidence → takeaways. 3 task classes (bugfix, adversarial, multi-agent) with precomputed traces; no live LLM in the browser.
- Deterministic Python scorers, static eval suite + CI gate, and synthetic policy-comparison fixture for the hosted table.
- Local runner MVP, MCP tools, and Braintrust/Weave export adapters (dry-run by default; optional live upload).
- Cursor rules, skills, MCP config, and product/UX/eval docs.
For external reviewers or portfolio walkthroughs:
| Goal | Command / link |
|---|---|
| Run hosted demo locally | pnpm install → pnpm dev → docs/DEPLOYMENT.md manual checklist or jump to #cockpit, #evals, #architecture |
| Deploy hosted demo (review) | docs/DEPLOYMENT.md — Vercel recommended; pnpm deploy:check then pnpm build |
| Manual browser checklist | docs/DEPLOYMENT.md § Post-deploy smoke checklist |
| Pre-release verification | docs/RELEASE_CHECKLIST.md |
| Final audit + backlog | docs/FINAL_AUDIT.md |
| Score full static trace suite | pnpm eval:suite (table) or pnpm eval:ci (CI gate) |
| Audit metric source drift | pnpm eval:audit |
| Check local generated artifacts | pnpm repo:status |
| Score one trace locally | pnpm eval (guarded) · pnpm eval:baseline |
| Local runner (optional) | python services/runner/run_task.py guarded_recovery |
| Promote runner trace to candidate | python scripts/promote_run_trace.py runs/<run_id>.json |
| Export shape previews | pnpm export:braintrust:dry-run · pnpm export:weave:dry-run |
| Optional Braintrust upload | pip install -r requirements-braintrust.txt + BRAINTRUST_API_KEY → pnpm export:braintrust:live |
| Optional Weave upload | pip install -r requirements-weave.txt + WANDB_API_KEY → pnpm export:weave:live |
Hosted page: replays static fixtures only — no live LLM, runner, or external APIs in the browser.
Eval table metrics: synthetic portfolio fixture, not production telemetry.
Adapters: dry-run JSON locally by default. Live Braintrust/Weave upload is opt-in (--live + API key + optional requirements files).
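The opt-in rule for adapters can be sketched as a small gate. This is a hypothetical illustration, not the repo's adapter code: `resolve_export_mode` and its parameters are assumed names, and only the `--live` flag and the API-key variable names come from this README.

```python
from typing import Mapping, Sequence


def resolve_export_mode(argv: Sequence[str], env: Mapping[str, str], key_var: str) -> str:
    """Return "dry-run" unless --live is passed AND the API key is set.

    Hypothetical helper: the real adapters may parse flags differently.
    """
    if "--live" not in argv:
        return "dry-run"  # default: print JSON locally, no network
    if not env.get(key_var):
        # Fail loudly rather than silently falling back to dry-run.
        raise SystemExit(f"--live requires {key_var} to be set")
    return "live"
```

For example, `resolve_export_mode(["--live"], os.environ, "BRAINTRUST_API_KEY")` would return `"live"` only when the key is present in the environment.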
pnpm install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements-dev.txt
pnpm dev
Open the URL Next.js prints. Default cockpit: bugfix task, baseline policy (rejected).
Quick local eval smoke:
pnpm validate:fixtures
pnpm eval:ci
pnpm compare
Open this folder in Cursor and start with START_HERE_CURSOR.md and docs/INDEX.md. The repository includes .cursor/rules/*.mdc and .cursor/skills/*/SKILL.md for repeatable agent behavior.
Suggested first Cursor prompt:
Read docs/INDEX.md, docs/FINAL_AUDIT.md (backlog §12), and .cursor/rules/architecture.mdc.
Pick one backlog item and implement the smallest vertical slice with tests. Keep the hosted demo static and CI deterministic.
pnpm dev # Hosted static demo (Next.js)
pnpm validate:fixtures # Trace fixture validation
pnpm eval:ci # Full suite + gates (matches CI)
pnpm eval:suite # Same scoring + human-readable table
pnpm eval # Score one trace (guarded date-parser)
pnpm eval:baseline # Score one trace (baseline date-parser)
pnpm compare # Synthetic policy comparison table
pnpm eval:audit # Metric drift report (hosted vs trace scorers)
pnpm repo:status # Local generated-artifact hygiene (read-only)
pnpm router:decision -- bugfix_date_parser_001
pnpm router:train # Train local contextual-bandit state under runs/router/
pnpm router:export-fixture # Export learned decisions to data/router_decisions.json
pnpm deploy:check # Hosted demo deployment readiness (local)
pnpm preview # Serve production build locally (after pnpm build)
pnpm smoke:hosted:local # HTML smoke vs http://localhost:3000 (needs preview/dev)
pnpm smoke:hosted -- --url URL # HTML smoke vs local or deployed demo URL
pnpm export:braintrust:dry-run
pnpm export:braintrust:live # optional; requires braintrust + BRAINTRUST_API_KEY
pnpm export:weave:dry-run
pnpm export:weave:live # optional; requires weave + WANDB_API_KEY
Not part of pnpm eval:ci or GitHub Actions. Install the optional SDK, set an API key, then pass --live explicitly:
pip install -r requirements-braintrust.txt
export BRAINTRUST_API_KEY=your_key
export BRAINTRUST_PROJECT=agent-harness-environment # optional
pnpm export:braintrust:live
Dry-run (pnpm export:braintrust:dry-run) prints the same compact JSON as before, with no SDK import and no network access. Live mode uploads static task datasets, trace fixture examples, and the suite summary experiment from local fixtures only; it does not run the runner or claim production eval coverage.
Not part of pnpm eval:ci or GitHub Actions. Requires the weave package (install via requirements-weave.txt; wandb alone is not enough):
pip install -r requirements-weave.txt
export WANDB_API_KEY=your_key
export WANDB_PROJECT=agent-harness-environment # optional
export WANDB_ENTITY=your-team # optional
pnpm export:weave:live
Dry-run (pnpm export:weave:dry-run) is unchanged. Live mode uploads static trace spans and suite scorer feedback only.
Batch every supported task/policy pair (6 runs: 3 tasks × baseline + guarded_recovery), score each trace, and print a compact JSON summary:
pnpm runner:batch
pnpm runner:batch:promote # also promote to data/trace_candidates/ (idempotent)
Not part of CI. Use after runner or promotion logic changes to smoke all deterministic paths locally.
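The 3 × 2 enumeration behind the batch command can be sketched as follows. This is an assumption-laden illustration rather than the batch script itself: the adversarial task id is a placeholder (its real id is not given here), and the run-id shape is inferred from the single `runs/run_local_guarded_recovery_multi_agent_contract_001.json` example in this README.

```python
from itertools import product

# Two task ids appear in this README; the adversarial id is a placeholder.
TASKS = [
    "bugfix_date_parser_001",
    "adversarial_task_placeholder",  # hypothetical name
    "multi_agent_contract_001",
]
POLICIES = ["baseline", "guarded_recovery"]


def batch_pairs(tasks=TASKS, policies=POLICIES):
    """Enumerate every (task, policy) pair in a stable order."""
    return list(product(tasks, policies))


def run_id(task: str, policy: str) -> str:
    """Build a run id in the assumed run_local_<policy>_<task> shape."""
    return f"run_local_{policy}_{task}"
```

With the lists above, `batch_pairs()` yields six pairs, matching the six runs described for `pnpm runner:batch`.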
After services/runner/ writes a trace under runs/, promote it to a reviewable candidate without touching curated data/traces/:
python services/runner/run_task.py guarded_recovery multi_agent_contract_001
python scripts/promote_run_trace.py runs/run_local_guarded_recovery_multi_agent_contract_001.json
python packages/evals/run_eval.py data/trace_candidates/run_local_guarded_recovery_multi_agent_contract_001.json
Promotion validates the trace, scores it, normalizes transient fields (timestamps, sandbox paths, long terminal output), writes data/trace_candidates/<run_id>.json, and appends metadata to data/datasets/generated_candidates.jsonl (idempotent by source_run_id). Both directories are gitignored by default. To copy into curated fixtures, pass --write-fixture --fixture-name <name>.json explicitly (refuses overwrite).
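Two of the promotion steps, normalizing transient fields and appending metadata idempotently by `source_run_id`, can be sketched as below. This is not the code in `scripts/promote_run_trace.py`: the key names in `TRANSIENT_KEYS`, the truncation limit, and the choice to drop transient fields (rather than rewrite them) are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Hypothetical field names and limit; the real trace schema may differ.
TRANSIENT_KEYS = {"timestamp", "sandbox_path"}
MAX_OUTPUT_CHARS = 200


def normalize(value):
    """Recursively drop transient keys and truncate long strings."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in TRANSIENT_KEYS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, str) and len(value) > MAX_OUTPUT_CHARS:
        return value[:MAX_OUTPUT_CHARS] + "...[truncated]"
    return value


def append_once(index_path: Path, record: dict) -> bool:
    """Append record to a JSONL index unless its source_run_id is already present."""
    seen = set()
    if index_path.exists():
        with index_path.open() as fh:
            seen = {json.loads(line)["source_run_id"] for line in fh if line.strip()}
    if record["source_run_id"] in seen:
        return False  # idempotent: promoting the same run twice is a no-op
    with index_path.open("a") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return True
```

Keying on `source_run_id` means the promotion can be re-run safely after a failure without duplicating index entries.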
Run from the repo root after pnpm install and pip install -r requirements-dev.txt.
CI command contract — GitHub Actions (.github/workflows/ci.yml) runs the same deterministic gates on every push and pull request, in this order:
pnpm validate:fixtures
pnpm eval:ci
pnpm eval:baseline
pnpm eval
pnpm compare
python -m pytest
pnpm typecheck
pnpm build
No secrets, live LLM calls, or external eval services are required. pnpm eval:ci scores the full static trace suite and enforces fixture gate expectations (see docs/EVAL_DESIGN.md).
Local before PR — run the CI contract above, then optionally:
pnpm eval:suite # full suite table + JSON summary (same scoring as eval:ci)
pnpm test is an alias for python -m pytest. pnpm typecheck does not require a prior build. Generated TypeScript metadata (*.tsbuildinfo) is gitignored.
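One way to run the CI contract locally in the documented order is a small fail-fast wrapper. The sketch below is a hypothetical convenience script, not something shipped in the repo; `run_gates` and its injectable `run` parameter are assumed names, while the command list is copied from the CI contract in this README.

```python
import shlex
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# Same commands, same order, as the CI contract listed in this README.
CI_GATES = [
    "pnpm validate:fixtures",
    "pnpm eval:ci",
    "pnpm eval:baseline",
    "pnpm eval",
    "pnpm compare",
    "python -m pytest",
    "pnpm typecheck",
    "pnpm build",
]


def _run(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(argv).returncode


def run_gates(
    gates: Sequence[str] = CI_GATES,
    run: Callable[[Sequence[str]], int] = _run,
) -> Optional[Tuple[str, int]]:
    """Run gates in order; return (command, exit code) for the first failure, else None."""
    for gate in gates:
        code = run(shlex.split(gate))
        if code != 0:
            return gate, code
    return None
```

Calling `run_gates()` from the repo root would stop at the first failing gate, which keeps local feedback aligned with what CI reports. The `run` parameter exists so the ordering logic can be tested without invoking pnpm.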
| Surface | Data source | Use |
|---|---|---|
| Hosted cockpit / eval table / router | data/traces/, data/evals/ JSON | Demo replay; no network |
| pnpm eval / pnpm eval:ci | Same fixtures | Real deterministic scorer output |
| Hosted eval table (#evals) | data/evals/policy_comparison.json | Synthetic portfolio fixture; not eval:ci output |
| Local RL-lite router | runs/router/history.jsonl, runs/router/state.json | Contextual-bandit policy learning; explicit local command |
| Hosted router fixture | data/router_decisions.json | Static export from learned router state; no browser training |
| python …/audit_metric_drift.py | Compares sources above | Drift/ambiguity report; does not change fixtures |
| services/runner/ | Toy repos → runs/ | Local execution; not used by hosted page |
| Adapter dry-runs | Fixtures → export JSON | Shape preview; no network |
| Braintrust live upload | pnpm export:braintrust:live | Opt-in; static fixtures only; not in CI |
| W&B Weave live upload | pnpm export:weave:live | Opt-in; static traces only; not in CI |
Details: docs/INDEX.md · docs/EVAL_DESIGN.md · docs/RELEASE_CHECKLIST.md
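For readers unfamiliar with the contextual-bandit idea behind the local RL-lite router, here is a generic epsilon-greedy sketch. It is a textbook technique swapped in for illustration, not the router or reward formula in packages/reward; the class name, the use of a discrete task-class context, and the reward values are all assumptions.

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Generic epsilon-greedy contextual bandit keyed by a discrete context."""

    def __init__(self, policies, epsilon=0.1, seed=0):
        self.policies = list(policies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)    # seeded for reproducible exploration
        self.counts = defaultdict(int)    # (context, policy) -> number of updates
        self.values = defaultdict(float)  # (context, policy) -> mean reward

    def choose(self, context):
        """Explore with probability epsilon, otherwise pick the best-known policy."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.policies)
        return max(self.policies, key=lambda p: self.values[(context, p)])

    def update(self, context, policy, reward):
        """Update the running mean reward: v += (r - v) / n."""
        key = (context, policy)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```

A usage pattern under these assumptions: call `choose("bugfix")` to pick a policy, score the resulting trace, then feed the score back with `update("bugfix", policy, reward)`. Persisting `counts` and `values` would play the role that `runs/router/state.json` plays in the table above.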
apps/web Hosted interactive demo shell
packages/harness TypeScript trace + policy primitives
packages/evals Python deterministic and heuristic scorers
packages/reward Contextual-bandit router and reward formula
services/runner Local runner MVP (sandbox + trace emission)
services/trace-store SQLite trace-store starter
tools/mcp_server.py Cursor MCP tool surface
.cursor Cursor rules, skills, MCP config, review rules
data Static hosted-demo fixtures
toy_repos Minimal repos for local eval tasks
docs Product, UX, eval, and demo docs
The hosted page should use static fixtures by default. Live agent execution belongs behind the local runner and should not be required for the portfolio/demo surface.