fix: sync remaining harness + README work into main#7

Merged
forkadarshp merged 4 commits into main from fix/sync-remaining-work on May 29, 2026
@forkadarshp

Owner

What

main fell behind because PRs #5 and #6 were merged at older commits, so four commits never landed. This single PR brings main fully up to date and is self-contained: the harness files and the README that links to them ship together, so there is no broken link and no merge-order dependency.

Brings in

  • iterate.py — prompt-optimization sweep (score trajectory across cumulative prompt revisions), scenario-agnostic
  • ops_routing scenario — harder multi-tool routing fixture (6 tools, multi-arg, compound/ambiguous cases); tests now cover every scenario
  • Tuned enhanced prompts + expanded simulator markers (longer, realistic optimization path)
  • README: benchmarking elevated to a co-headline ("ship... then prove it" + built-in evals)
  • README: "Run it yourself" harness blurb linking harness/README.md

Verified

  • validate_skill.py → valid
  • markdownlint-cli2 "**/*.md" → 0 errors
  • harness tests → pass (both scenarios)
  • README → links harness/README.md, target present in this branch
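
The README link check in the last item can be reproduced with a few lines of Python. This is a minimal sketch, not the command that was actually run for this PR; the function name `missing_link_targets` and the link regex are assumptions made for illustration, and only inline Markdown links are handled.

```python
import re
from pathlib import Path

# Matches inline Markdown links such as [text](target).
# Reference-style links are not handled in this sketch.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme (http:, https:, mailto:, ...).
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def missing_link_targets(readme: Path) -> list[str]:
    """Return relative link targets in `readme` that do not exist on disk."""
    missing = []
    for target in LINK_RE.findall(readme.read_text(encoding="utf-8")):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip external URLs and in-page anchors
        path = target.split("#", 1)[0]  # drop any fragment before checking
        if not (readme.parent / path).exists():
            missing.append(target)
    return missing
```

Calling `missing_link_targets(Path("README.md"))` from the repository root would return an empty list when every relative link, including `harness/README.md`, resolves.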

After merge, the stale branches feat/benchmark-harness and docs/behavioral-migration-messaging can be deleted.

iterate.py runs cumulative enhanced-prompt revisions through the benchmark and
prints the score trajectory. On the bundled scenario the curve climbs from
composite 0.64 to 0.75 over eight steps (task/contract/tool 64% -> 93%, failing
cases 6 -> 1), then plateaus — the last two steps add cost without lifting the
capped scores, and the final failing case needs a fallback/validator, not more
prompting.

- expand SimProvider explicitness markers for a granular optimization path
- set the scenario's enhanced prompt to the pre-plateau optimum (adds arg
  extraction): ModelPort arm now 93/93/93, skill delta +0.14, net +0.16
- document the sweep + plateau insight in harness/README
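
The sweep described above (score the base prompt, append one revision at a time, record the score after each step, then look for the plateau) can be sketched as follows. This is an illustration under stated assumptions, not the code in iterate.py: the names `sweep` and `plateau_step` are hypothetical, and `score` stands in for a benchmark run whose real interface is not shown in this PR.

```python
from typing import Callable, Sequence


def sweep(base: str, clauses: Sequence[str], score: Callable[[str], float]) -> list[float]:
    """Score the base prompt, then the prompt after each clause is appended."""
    prompt = base
    trajectory = [score(prompt)]
    for clause in clauses:
        prompt = f"{prompt}\n{clause}"  # revisions accumulate; nothing is removed
        trajectory.append(score(prompt))
    return trajectory


def plateau_step(trajectory: Sequence[float], eps: float = 1e-9) -> int:
    """Return the earliest step whose score is within eps of the best score."""
    best = max(trajectory)
    for step, value in enumerate(trajectory):
        if value >= best - eps:
            return step
    return len(trajectory) - 1  # unreachable for non-empty input; kept for safety


if __name__ == "__main__":
    # Toy scorer (an assumption): the score is capped once two clauses are present.
    toy_score = lambda prompt: min(prompt.count("\n"), 2) / 2
    scores = sweep("role", ["a", "b", "c", "d"], toy_score)
    print(scores, "plateau at step", plateau_step(scores))
```

Steps after the index returned by `plateau_step` add prompt length, and therefore cost, without raising the score, which is the situation the commit message describes for the last two steps.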

- scenarios/ops_routing.json: multi-tool routing (6 tools, multi-arg, several
  compound/ambiguous cases). Tuned via iterate.py: ModelPort arm 79/79/79,
  skill delta +0.20, net +0.22. Plateaus lower than triage (93%) because a few
  cases need a fallback/validator, not more prompting.
- iterate.py now derives its tool/schema/example clauses from the scenario, so
  the sweep runs on any fixture (not just triage).
- tests now cover every scenario in scenarios/ via subTest.
- add a "role" field to scenarios for the sweep's prompt base.

Migration + evals are the two pillars; benchmarking was buried as "optional".
- Hero now leads with both: "ship... then prove it" + built-in evals framing
- Benchmark section retitled "Benchmark the upgrade — built-in evals"
- Repo description updated to foreground benchmarking

Add a "Run it yourself" blurb under the benchmarking section pointing to
harness/run.py + harness/iterate.py and the two bundled scenarios, linking
harness/README.md. Update the repo description to mention the eval harness.

Note: the harness lands via the harness PR — merge that first so the link
resolves.
@forkadarshp forkadarshp merged commit b1334be into main May 29, 2026
1 check passed