feat: add migration benchmarking + terminal-forward README by forkadarshp · Pull Request #2 · forkadarshp/MPort

forkadarshp · 2026-05-28T23:50:02Z

What

Adds an optional migration benchmarking capability to the skill and gives the README a terminal-forward refresh.

Stacked on #1. The base branch is docs/readme-opus-4-8-launch, so this PR shows only the benchmarking + README-revamp diff. Merge #1 first; GitHub will then retarget this PR to main automatically.

Benchmarking feature

Opt-in, asked once in Phase 0. Measures the migration with three arms on the same eval set so the model delta and the skill's value are attributed separately:

Arm	Config	Captured	Isolates
A — baseline	old model + old prompts	Phase 0 (pre-edit)	starting point
B — naive swap	new model + old prompts	mid Phase 4	raw model delta
C — enhanced	new model + enhanced system	Phase 5	skill's added value
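A minimal sketch of how the three arms separate the two effects, assuming each arm has already been reduced to a single composite score on the shared eval set (higher is better). The `ArmScore` type, the `attribute` helper, and the scores are hypothetical placeholders for illustration; they are not part of the skill and not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmScore:
    """Composite score for one arm on the shared eval set (higher is better)."""
    name: str
    score: float


def attribute(a: ArmScore, b: ArmScore, c: ArmScore) -> dict[str, float]:
    """Split the end-to-end change (A to C) into a model part and a skill part."""
    return {
        "model_delta": b.score - a.score,  # B - A: new model, prompts unchanged
        "skill_delta": c.score - b.score,  # C - B: same model, enhanced system
        "total_delta": c.score - a.score,  # C - A: whole migration
    }


# Placeholder scores chosen only to show the arithmetic.
baseline = ArmScore("A", 80.0)
naive_swap = ArmScore("B", 74.0)
enhanced = ArmScore("C", 88.0)

result = attribute(baseline, naive_swap, enhanced)
print(result)
```

Because B shares its prompts with A and its model with C, the two deltas always sum to the total, which is what lets the report credit the model and the skill separately.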

New references/benchmarking.md — dimensions (quality, output-contract, tool-calling, latency, cost, safety, robustness), composite scoring, honesty rules, leaderboard format.
SKILL.md — opt-in input, per-phase capture points, optional Benchmark Results report section, new anti-patterns, checklist item.
examples/example-output.md — filled three-arm leaderboard (labelled illustrative).
scripts/validate_skill.py — requires the new doc.
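The composite scoring mentioned above could look roughly like the weighted mean below. This is only a sketch under stated assumptions: every dimension is assumed to be pre-normalized to the range 0 to 1 with higher meaning better (so latency and cost would be inverted beforehand), and the equal weights are hypothetical. The actual formula lives in references/benchmarking.md and may differ.

```python
# Dimension names taken from the PR description; everything else is assumed.
DIMENSIONS = (
    "quality",
    "output_contract",
    "tool_calling",
    "latency",
    "cost",
    "safety",
    "robustness",
)


def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean over all dimensions; refuses to score a partial run."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Hypothetical equal weighting and placeholder scores.
equal_weights = {d: 1.0 for d in DIMENSIONS}
example_scores = {d: 0.5 for d in DIMENSIONS}
print(composite(example_scores, equal_weights))
```

Raising on a missing dimension, rather than silently skipping it, keeps the three arms comparable: a composite is only meaningful if all arms were scored on the same dimensions.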

README

Terminal-style hero session replacing the bland Problem/Solution prose.
New "Benchmark your migration" section + leaderboard; benchmark threaded into the feature table, expected-output list, repo structure, and roadmap.

Honesty note

All benchmark numbers in the docs are illustrative and labelled as such. The pattern shown (naive swap regresses output-contract/tool-calling, enhanced arm recovers them) is the documented effect of literal instruction following + retokenization; latency and cost are described as measured-not-assumed.

Validation

python3 scripts/validate_skill.py . → valid
markdownlint-cli2 "**/*.md" → 0 errors (25 files)
All three leaderboard tables verified byte-identical and border-aligned; ASCII uses width-safe glyphs only.
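The leaderboard check in the last item can be approximated with a few lines of Python. This is a hypothetical helper, not the script used for this PR: `check_leaderboards` takes the table text already extracted from each file, and it measures row width with `len()`, which counts code points and therefore only matches display width when every glyph is a single cell wide.

```python
def check_leaderboards(copies: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means all copies pass."""
    problems: list[str] = []
    items = list(copies.items())
    if not items:
        return problems

    # Byte-identical: compare every copy against the first one.
    ref_name, ref_text = items[0]
    for name, text in items[1:]:
        if text.encode("utf-8") != ref_text.encode("utf-8"):
            problems.append(f"{name} differs from {ref_name}")

    # Border-aligned: every row of a bordered table has the same width.
    for name, text in items:
        widths = {len(line) for line in text.splitlines()}
        if len(widths) > 1:
            problems.append(f"{name} has rows of unequal width")
    return problems


# Tiny placeholder table, used only to exercise the check.
table = "+-----+-------+\n| Arm | Score |\n+-----+-------+"
print(check_leaderboards({"README.md": table, "SKILL.md": table}))
```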

Add an optional, opt-in benchmarking capability that quantifies a migration with measured evidence instead of assertions.

Skill changes:
- Ask once in Phase 0 whether the user wants pre/post benchmark stats in the evidence report; capturing the baseline later is impossible, so it is gated up front.
- Three-arm methodology wired into the existing phase boundaries:
  - Arm A (baseline: old model + old prompts) captured in Phase 0, pre-edit
  - Arm B (naive swap: new model + old prompts) captured mid Phase 4, after runtime edits but before prompt tuning -- the honest raw-model control
  - Arm C (new model + enhanced system) captured in Phase 5
- New references/benchmarking.md: dimensions (quality, output-contract, tool calling, latency, cost, safety, robustness), composite scoring, honesty rules, and leaderboard format.
- Phase 6 report gains an optional Benchmark Results section; anti-patterns and the validation checklist updated; validate_skill.py now requires the new doc.
- examples/example-output.md shows a filled three-arm leaderboard (clearly labelled illustrative).

README:
- Replace the bland Problem/Solution prose with a terminal-style hero session contrasting the naive find/replace swap with the ModelPort migration.
- New "Benchmark your migration" section with an illustrative leaderboard and a link to the methodology; benchmark added to the feature table, why-teams list, expected-output list, repo structure, and roadmap.

All benchmark numbers in docs are illustrative and labelled as such; latency and cost are described as measured-not-assumed.

validate_skill.py passes and markdownlint is clean across all 25 markdown files.

forkadarshp added 2 commits May 29, 2026 05:19

fix: align README hero terminal block

db0b645

The borderless hero session had unequal prefix padding, so the inline comment markers and the ✓/x columns drifted out of alignment when rendered. Pad to fixed columns: comment markers at col 32, the three status columns at 2 / 26 / 48.
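The fixed-column padding described in this commit can be sketched as follows. The `place` helper and the sample fragments are hypothetical, and the column numbers are treated as 0-based offsets, which the commit message does not specify.

```python
def place(fragments: list[tuple[int, str]]) -> str:
    """Build one line in which each fragment starts at its given column."""
    line = ""
    for col, text in sorted(fragments):
        if len(line) > col:
            raise ValueError(f"fragment at column {col} overlaps earlier text")
        # Pad with spaces up to the target column, then append the fragment.
        line = line.ljust(col) + text
    return line


# Inline comment marker pinned to column 32.
command_row = place([(0, "$ example command"), (32, "# comment")])

# Three status columns pinned to columns 2, 26 and 48.
status_row = place([(2, "✓ first"), (26, "x second"), (48, "✓ third")])

print(command_row)
print(status_row)
```

Padding to an absolute column, instead of inserting a fixed number of spaces after each fragment, keeps the markers aligned even when the text before them varies in length.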

forkadarshp merged commit d17770b into docs/readme-opus-4-8-launch May 28, 2026
1 check passed

forkadarshp merged 2 commits into docs/readme-opus-4-8-launch from feat/benchmarking
