fix: sync remaining harness + README work into main by forkadarshp · Pull Request #7 · forkadarshp/MPort

forkadarshp · 2026-05-29T06:12:49Z

What

main fell behind because PRs #5 and #6 were merged at older commits, so four commits never landed. This single PR brings main fully up to date and is self-contained: the harness files and the README that links them ship together, so there is no broken link and no merge-order dependency.

Brings in

- iterate.py: prompt-optimization sweep (score trajectory across cumulative prompt revisions), scenario-agnostic
- ops_routing scenario: harder multi-tool routing fixture (6 tools, multi-arg, compound/ambiguous cases); tests now cover every scenario
- Tuned enhanced prompts + expanded simulator markers (longer, realistic optimization path)
- README: benchmarking elevated to a co-headline ("ship... then prove it" + built-in evals)
- README: "Run it yourself" harness blurb linking harness/README.md

Verified

- validate_skill.py → valid
- markdownlint-cli2 "**/*.md" → 0 errors
- harness tests → pass (both scenarios)
- README → links harness/README.md, target present in this branch

After merge, the stale branches feat/benchmark-harness and docs/behavioral-migration-messaging can be deleted.

iterate.py runs cumulative enhanced-prompt revisions through the benchmark and prints the score trajectory. On the bundled scenario the curve climbs from composite 0.64 to 0.75 over eight steps (task/contract/tool 64% -> 93%, failing cases 6 -> 1), then plateaus: the last two steps add cost without lifting the capped scores, and the final failing case needs a fallback/validator, not more prompting.

- expand SimProvider explicitness markers for a granular optimization path
- set the scenario's enhanced prompt to the pre-plateau optimum (adds arg extraction): ModelPort arm now 93/93/93, skill delta +0.14, net +0.16
- document the sweep + plateau insight in harness/README
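For readers who have not opened iterate.py, the idea of a cumulative sweep with a plateau can be sketched as below. This is an illustration only, not the harness code: `sweep`, `plateau_step`, `toy_score`, and the clause strings are hypothetical names and data, and the toy scorer stands in for the real benchmark run.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[int, float]]


def sweep(base: str, clauses: List[str], score: Callable[[str], float]) -> Trajectory:
    """Score the base prompt, then each cumulative revision (base + first k clauses)."""
    prompt = base
    trajectory: Trajectory = [(0, score(prompt))]
    for step, clause in enumerate(clauses, start=1):
        prompt = prompt + "\n" + clause  # revisions accumulate, none are dropped
        trajectory.append((step, score(prompt)))
    return trajectory


def plateau_step(trajectory: Trajectory, eps: float = 1e-9) -> int:
    """Return the earliest step whose score is within eps of the best score seen."""
    best = max(s for _, s in trajectory)
    for step, s in trajectory:
        if best - s <= eps:
            return step
    raise ValueError("empty trajectory")


# Hypothetical stand-in for a benchmark: fraction of marker words present.
MARKERS = ("json", "schema", "example")


def toy_score(prompt: str) -> float:
    text = prompt.lower()
    return sum(m in text for m in MARKERS) / len(MARKERS)


if __name__ == "__main__":
    clauses = [
        "Reply in JSON.",
        "Follow the schema.",
        "Here is an example.",
        "Be concise.",  # adds length (cost) but no score in this toy setup
    ]
    traj = sweep("You are a router.", clauses, toy_score)
    for step, s in traj:
        print(f"step {step}: {s:.2f}")
    print("pre-plateau optimum at step", plateau_step(traj))
```

Picking the earliest step that reaches the best score mirrors the choice described above: later revisions that do not raise the score only add prompt length.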

- scenarios/ops_routing.json: multi-tool routing (6 tools, multi-arg, several compound/ambiguous cases). Tuned via iterate.py: ModelPort arm 79/79/79, skill delta +0.20, net +0.22. Plateaus lower than triage (93%) because a few cases need a fallback/validator, not more prompting.
- iterate.py now derives its tool/schema/example clauses from the scenario, so the sweep runs on any fixture (not just triage).
- tests now cover every scenario in scenarios/ via subTest.
- add a "role" field to scenarios for the sweep's prompt base.

Migration + evals are the two pillars; benchmarking was buried as "optional".

- Hero now leads with both: "ship... then prove it" + built-in evals framing
- Benchmark section retitled "Benchmark the upgrade — built-in evals"
- Repo description updated to foreground benchmarking

Add a "Run it yourself" blurb under the benchmarking section pointing to harness/run.py + harness/iterate.py and the two bundled scenarios, linking harness/README.md. Update the repo description to mention the eval harness. Note: the harness lands via the harness PR — merge that first so the link resolves.

forkadarshp added 4 commits May 29, 2026 11:42

forkadarshp merged commit b1334be into main May 29, 2026
1 check passed

forkadarshp merged 4 commits into main from fix/sync-remaining-work
