fix: sync remaining harness + README work into main #7
Merged
iterate.py runs cumulative enhanced-prompt revisions through the benchmark and prints the score trajectory. On the bundled scenario the curve climbs from composite 0.64 to 0.75 over eight steps (task/contract/tool 64% -> 93%, failing cases 6 -> 1), then plateaus: the last two steps add cost without lifting the capped scores, and the final failing case needs a fallback/validator, not more prompting.

- expand SimProvider explicitness markers for a granular optimization path
- set the scenario's enhanced prompt to the pre-plateau optimum (adds arg extraction): ModelPort arm now 93/93/93, skill delta +0.14, net +0.16
- document the sweep + plateau insight in harness/README
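A cumulative sweep of this kind can be sketched as follows. This is a minimal illustration, not the actual iterate.py: the function names `sweep` and `pre_plateau_step`, the `min_gain` threshold, and the toy scorer (integer percent scores with an arbitrary cap) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[int, float]]


def sweep(base: str, clauses: List[str], score: Callable[[str], float]) -> Trajectory:
    """Score the base prompt, then each cumulative revision (base + first k clauses)."""
    prompt = base
    trajectory: Trajectory = [(0, score(prompt))]
    for step, clause in enumerate(clauses, start=1):
        prompt = prompt + "\n" + clause  # revisions accumulate; nothing is removed
        trajectory.append((step, score(prompt)))
    return trajectory


def pre_plateau_step(trajectory: Trajectory, min_gain: float = 1) -> int:
    """Return the last step that beat the best earlier score by at least min_gain."""
    best_step, best = trajectory[0]
    for step, value in trajectory[1:]:
        if value - best >= min_gain:
            best_step, best = step, value
    return best_step


def toy_score(prompt: str) -> int:
    """Hypothetical scorer: +10 points per 'MUST' clause, capped at 80."""
    return min(80, 50 + 10 * prompt.count("MUST"))


if __name__ == "__main__":
    revisions = ["MUST name the tool", "MUST follow the schema", "MUST extract args",
                 "MUST restate the task", "MUST double-check"]
    curve = sweep("You are a router.", revisions, toy_score)
    for step, value in curve:
        print(f"step {step}: {value}")
    print("pre-plateau optimum at step", pre_plateau_step(curve))
```

Picking the last step that still improves the score, rather than the final step, mirrors the commit's choice of the pre-plateau optimum: later revisions only make the prompt longer.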
- scenarios/ops_routing.json: multi-tool routing (6 tools, multi-arg, several compound/ambiguous cases). Tuned via iterate.py: ModelPort arm 79/79/79, skill delta +0.20, net +0.22. Plateaus lower than triage (93%) because a few cases need a fallback/validator, not more prompting.
- iterate.py now derives its tool/schema/example clauses from the scenario, so the sweep runs on any fixture (not just triage).
- tests now cover every scenario in scenarios/ via subTest.
- add a "role" field to scenarios for the sweep's prompt base.
Migration + evals are the two pillars; benchmarking was buried as "optional".

- Hero now leads with both: "ship... then prove it" + built-in evals framing
- Benchmark section retitled "Benchmark the upgrade — built-in evals"
- Repo description updated to foreground benchmarking
Add a "Run it yourself" blurb under the benchmarking section pointing to harness/run.py + harness/iterate.py and the two bundled scenarios, linking harness/README.md. Update the repo description to mention the eval harness. Note: the harness lands via the harness PR — merge that first so the link resolves.
What
main fell behind because PRs #5 and #6 were merged at older commits; four commits never landed. This single PR brings main fully up to date, self-contained (harness files + the README that links them ship together, so no broken link and no merge-order dependency).

Brings in
- iterate.py: prompt-optimization sweep (score trajectory across cumulative prompt revisions), scenario-agnostic
- ops_routing scenario: harder multi-tool routing fixture (6 tools, multi-arg, compound/ambiguous cases); tests now cover every scenario
- harness/README.md

Verified
- validate_skill.py → valid
- markdownlint-cli2 "**/*.md" → 0 errors
- harness/README.md, target present in this branch

After merge, the stale branches feat/benchmark-harness and docs/behavioral-migration-messaging can be deleted.
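The "target present" check on the README link can be reproduced locally with a small script along these lines. This is a hedged sketch rather than anything shipped in the PR: the function name `broken_relative_links` and the simplified link regex are assumptions, and the regex only handles inline `[text](target)` links.

```python
import re
from pathlib import Path
from typing import List

# Simplified pattern for inline Markdown links: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything starting with a URI scheme (https:, mailto:, ...) is not a local path.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_file: Path) -> List[str]:
    """Return relative link targets in markdown_file that do not exist on disk."""
    text = markdown_file.read_text(encoding="utf-8")
    broken = []
    for target in LINK_RE.findall(text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor: nothing to check on disk
        path_part = target.split("#", 1)[0]  # drop any trailing #fragment
        if not (markdown_file.parent / path_part).exists():
            broken.append(target)
    return broken


if __name__ == "__main__":
    for link in broken_relative_links(Path("README.md")):
        print("broken link:", link)
```

Running it on the branch before merging shows whether a link such as the one to harness/README.md resolves without depending on merge order.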