From 451a7e419d595f10dff578c67c603d31cd1c4124 Mon Sep 17 00:00:00 2001
From: forkadarshp <foradarshpandita@gmail.com>
Date: Fri, 29 May 2026 05:19:22 +0530
Subject: [PATCH] feat: add migration benchmarking + terminal-forward README

Add an optional, opt-in benchmarking capability that quantifies a migration
with measured evidence instead of assertions.

Skill changes:
- Ask once in Phase 0 whether the user wants pre/post benchmark stats in the
  evidence report; capturing the baseline later is impossible, so it is gated
  up front.
- Three-arm methodology wired into the existing phase boundaries:
  - Arm A (baseline: old model + old prompts) captured in Phase 0, pre-edit
  - Arm B (naive swap: new model + old prompts) captured mid Phase 4, after
    runtime edits but before prompt tuning -- the honest raw-model control
  - Arm C (new model + enhanced system) captured in Phase 5
- New references/benchmarking.md: dimensions (quality, output-contract, tool
  calling, latency, cost, safety, robustness), composite scoring, honesty
  rules, and leaderboard format.
- Phase 6 report gains an optional Benchmark Results section; anti-patterns and
  the validation checklist updated; validate_skill.py now requires the new doc.
- examples/example-output.md shows a filled three-arm leaderboard (clearly
  labelled illustrative).

README:
- Replace the bland Problem/Solution prose with a terminal-style hero session
  contrasting the naive find/replace swap with the ModelPort migration.
- New "Benchmark your migration" section with an illustrative leaderboard and a
  link to the methodology; benchmark added to the feature table, why-teams
  list, expected-output list, repo structure, and roadmap.

All benchmark numbers in docs are illustrative and labelled as such; latency and
cost are described as measured-not-assumed. validate_skill.py passes and
markdownlint is clean across all 25 markdown files.
---
 README.md                  |  59 ++++++++++++++++++--
 SKILL.md                   |  27 +++++++++-
 examples/example-output.md |  24 +++++++++
 references/benchmarking.md | 107 +++++++++++++++++++++++++++++++++++++
 scripts/validate_skill.py  |   1 +
 5 files changed, 214 insertions(+), 4 deletions(-)
 create mode 100644 references/benchmarking.md

diff --git a/README.md b/README.md
index f250dd6..ca3aedc 100644
--- a/README.md
+++ b/README.md
@@ -12,8 +12,19 @@
 
 An AI agent skill for safe, production-grade LLM model migrations. Automate behavior-preserving upgrades across prompts, agents, API callers, and tests. Universal and plug-and-play across Codex, Claude Code, Cursor, and more.
 
-**The Problem:** Most model migrations fail after you change the model ID because search-and-replace breaks tool calling, parser behavior, and orchestration.  
-**The Solution:** ModelPort treats migration as a behavior-preservation problem: it maps callers, separates required compatibility fixes from optional tuning, and mandates validation before finishing.
+**The model ID is one line. The behavior around it is everything else.**
+Most migrations break because search-and-replace silently shatters tool calling,
+parsers, and orchestration. ModelPort treats a swap as behavior preservation:
+
+```text
+  # the naive migration                       ┄ what a find/replace does
+  $ grep -rl 'opus-4-7' . | xargs sed -i 's/4-7/4-8/g'
+    x tool calls break     x parser drift     x silent prod regressions
+
+  # the ModelPort migration                   ┄ behavior-preserving
+  > migrate this repo to claude-opus-4-8, keep behavior, prove it
+    ✓ callers mapped       ✓ contract held    ✓ tests + rollback evidence
+```
 
 ## Now shipping: Claude Opus 4.8 (released 2026-05-28)
 
@@ -77,6 +88,43 @@ contracts, no proof, and no way back.
 ╰───────────────────────────────────┴───────────────────────────────────╯
 ```
 
+## Benchmark your migration (optional)
+
+Opt in at the start and ModelPort ends with **measured evidence, not vibes**. It
+runs the same eval set against three configurations so the raw model delta and
+the skill's added value are attributed separately:
+
+- **Baseline** — old model + old prompts (where you started)
+- **Naive swap** — new model + old prompts (what a find/replace would get you)
+- **ModelPort-enhanced** — new model + skill-enhanced system (where you land)
+
+A raw swap tends to regress exactly the things that don't survive a model change
+on their own — output contracts, tool calls, and parse-able format — because
+newer models follow instructions more literally and re-tokenize differently.
+The enhanced arm is where the skill's fixes earn their keep. Illustrative
+leaderboard (your run produces the actual measured numbers):
+
+```text
+╭──────────────────────┬────────────┬────────────┬────────────╮
+│ metric               │ baseline   │ naive swap │ ModelPort  │
+│                      │ (old/old)  │ (new/old)  │ (new/enh.) │
+├──────────────────────┼────────────┼────────────┼────────────┤
+│  task success        │        82% │        84% │        91% │
+│  output contract     │        93% │        89% │        98% │
+│  tool-call accuracy  │        90% │        83% │        96% │
+│  p95 latency         │       3.9s │       2.4s │       2.6s │
+│  cost / req          │     $0.052 │     $0.047 │     $0.045 │
+│  refusal / halluc.   │       2.8% │       2.1% │       1.4% │
+├──────────────────────┼────────────┼────────────┼────────────┤
+│  composite (50/30/20)│       0.80 │       0.83 │       0.91 │
+╰──────────────────────┴────────────┴────────────┴────────────╯
+```
+
+Numbers above are illustrative, not measured results — latency and cost in
+particular move with the target model and are always reported as measured, never
+assumed. Methodology, metric definitions, and composite scoring live in
+[references/benchmarking.md](references/benchmarking.md).
+
 ## Why teams use it
 
 - Replace deprecated model IDs without breaking runtime calls.
@@ -84,6 +132,7 @@ contracts, no proof, and no way back.
 - Separate required compatibility fixes from optional prompt tuning.
 - Choose direct swap, adapter, shadow mode, canary, or phased rollout.
 - Produce validation evidence and rollback notes.
+- Benchmark pre vs. post (and raw-swap control) to quantify the upgrade.
 
 ## Quick start
 
@@ -146,6 +195,7 @@ For detailed migration requests, start from
 | Tools | Schemas, descriptions, validation rules, guardrails |
 | Config | Model registries, routers, feature flags, rollout controls |
 | Tests and docs | Fixtures, assertions, examples, migration notes |
+| Benchmarks | Optional pre/post stats: quality, latency, cost, contract drift |
 
 ## Built for real migrations
 
@@ -185,6 +235,7 @@ The skill guides Codex to produce a migration report with:
 - Proof evidence
 - Risks and follow-up recommendations
 - Rollback notes
+- Benchmark results (when benchmarking was requested)
 
 See [examples/example-output.md](examples/example-output.md) for a sample report.
 
@@ -206,6 +257,7 @@ See [examples/example-output.md](examples/example-output.md) for a sample report
 ├── templates/
 │   └── migration-starter-prompt.md
 ├── references/
+│   ├── benchmarking.md
 │   ├── migration-patterns.md
 │   ├── migration-playbook.md
 │   ├── output-style.md
@@ -251,7 +303,8 @@ npx markdownlint-cli2 "**/*.md"
 - Per-model migration guides for Claude Opus 4.8 / 4.6, Sonnet 4.5, and GPT-5.5
   ship today in `references/models/`. Next: Gemini and self-hosted models.
 - Add regression fixtures for prompt, tool, and structured-output migrations.
-- Add benchmark examples for latency, token usage, and output contract drift.
+- Ship a runnable benchmark harness to auto-collect the three-arm leaderboard
+  (the methodology lands today in `references/benchmarking.md`).
 - Add real-world before/after migration case studies.
 
 ## Contributing
diff --git a/SKILL.md b/SKILL.md
index c12ca8d..6f0ada4 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: modelport-skill
-description: Use when the user asks to migrate, upgrade, or port prompts, agents, tools, or codebases from one LLM model or provider version to another. Also use when auditing or continuing migration work started by another agent. Covers API calls, prompt assets, agent configs, tool definitions, tests, docs, runtime settings, rollout plans, and handoff coordination.
+description: Use when the user asks to migrate, upgrade, or port prompts, agents, tools, or codebases from one LLM model or provider version to another. Also use when auditing or continuing migration work started by another agent. Covers API calls, prompt assets, agent configs, tool definitions, tests, docs, runtime settings, rollout plans, and handoff coordination. Can also benchmark pre/post migration systems (old model and prompts vs new model and enhanced system) and report the stats as evidence.
 ---
 
 # ModelPort Skill
@@ -85,11 +85,17 @@ Collect before editing:
 - Constraints: latency, cost, output format, safety/compliance
 - Success criteria: quality, speed, token/cost, compatibility
 - Evidence requirements: unit tests, evals, live smoke checks, rollout metrics
+- Benchmark report: does the user want pre/post benchmark stats in the final
+  report (baseline old/old vs new/enhanced, plus the raw-swap control)? Opt-in.
+  Ask once, up front — capturing the baseline later is impossible.
 
 If source or target matches a model with a dedicated guide in
 `references/models/`, read that guide before proceeding. Model guides contain
 changelogs, breaking changes, migration steps, and version-specific quirks.
 
+If benchmarking is requested, read `references/benchmarking.md` and capture the
+baseline (Arm A) in Phase 0 before any edit — it cannot be reconstructed later.
+
 Scope not explicit → ask once, wait.
 
 ## Workflow
@@ -103,6 +109,8 @@ Define exactly what will be migrated. Narrow down to a targeted section. Establi
 - Separate: done vs. needs-validation vs. in-progress vs. not-started.
 - Identify work that appears unrelated, incomplete, or owned by another agent.
 - Do not edit until scope is confirmed, targeted, and current state is understood.
+- Benchmarking opted in → capture **Arm A** (baseline: old model + old prompts)
+  on the eval set now, before any edit. See `references/benchmarking.md`.
 
 ### Phase 1: Discovery scan
 
@@ -127,6 +135,8 @@ Read:
   adapters, feature flags, or side-by-side support
 - `references/validation-proof.md` — when planning proof gates, evals,
   smoke tests, or diagnose-patch-retry loops
+- `references/benchmarking.md` — when the user opted into a pre/post benchmark
+  (three-arm methodology, metrics, composite scoring, leaderboard format)
 
 ### Phase 2: Classify before editing
 
@@ -173,6 +183,10 @@ Apply in order:
 
 Precise edits only. No repo-wide blind search-and-replace.
 
+Benchmarking opted in → after step 1 (runtime edits) and before step 2 (prompt
+tuning), capture **Arm B** (naive swap: new model + old prompts). This is the
+honest control for the raw model delta.
+
 ### Phase 5: Validation
 
 Run:
@@ -190,6 +204,10 @@ Verify:
 - Output shape/format still satisfied
 - Observability signals defined for staged rollout when production risk exists
 
+Benchmarking opted in → capture **Arm C** (new model + enhanced system) on the
+same eval set, then compile the three-arm leaderboard and attribution (model
+delta B−A, skill delta C−B, net C−A) per `references/benchmarking.md`.
+
 Failure → diagnose root cause → patch smallest responsible surface → retry
 failed check before claiming completion.
 
@@ -206,6 +224,8 @@ Hyper-concise report. These sections, nothing else:
 7. Proof Evidence
 8. Risks and Follow-up Recommendations
 9. Rollback Notes
+10. Benchmark Results — only if benchmarking was requested. Three-arm
+    leaderboard + attribution. Mark sample/illustrative numbers as such.
 
 Format per `examples/example-output.md`.
 
@@ -220,6 +240,9 @@ Format per `examples/example-output.md`.
 - Skipping validation and claiming complete
 - Treating a synthetic test as proof without a real smoke check, eval, trace,
   or before/after comparison
+- Benchmarking arms on different eval sets, settings, or cache states
+- Presenting illustrative/sample benchmark numbers as measured results
+- Skipping the Arm A baseline capture, then editing — it cannot be recovered
 
 ## References
 
@@ -230,6 +253,7 @@ Format per `examples/example-output.md`.
 - Prompt design research: `references/prompt-template-research.md`
 - Migration patterns: `references/migration-patterns.md`
 - Validation proof gates: `references/validation-proof.md`
+- Migration benchmarking: `references/benchmarking.md` (optional pre/post stats)
 - Starter prompt: `templates/migration-starter-prompt.md`
 - Example prompts: `examples/example-prompts.md`
 - Example report: `examples/example-output.md`
@@ -244,4 +268,5 @@ Format per `examples/example-output.md`.
 - [ ] Optional tuning clearly marked as optional
 - [ ] Tests and smoke checks executed
 - [ ] Proof evidence captured or gap stated
+- [ ] Benchmark opt-in asked; if yes, three arms captured and reported
 - [ ] Report returned in concise format
diff --git a/examples/example-output.md b/examples/example-output.md
index fdea783..e71dc22 100644
--- a/examples/example-output.md
+++ b/examples/example-output.md
@@ -50,3 +50,27 @@
 
 - Toggle `USE_NEW_MODEL=false`
 - Legacy prompts in `prompts/legacy/`
+
+## 10) Benchmark Results (optional)
+
+Requested: yes. Eval set `n=120`. Numbers below are **illustrative**.
+
+```text
+╭──────────────────────┬────────────┬────────────┬────────────╮
+│ metric               │ baseline   │ naive swap │ ModelPort  │
+│                      │ (old/old)  │ (new/old)  │ (new/enh.) │
+├──────────────────────┼────────────┼────────────┼────────────┤
+│  task success        │        82% │        84% │        91% │
+│  output contract     │        93% │        89% │        98% │
+│  tool-call accuracy  │        90% │        83% │        96% │
+│  p95 latency         │       3.9s │       2.4s │       2.6s │
+│  cost / req          │     $0.052 │     $0.047 │     $0.045 │
+│  refusal / halluc.   │       2.8% │       2.1% │       1.4% │
+├──────────────────────┼────────────┼────────────┼────────────┤
+│  composite (50/30/20)│       0.80 │       0.83 │       0.91 │
+╰──────────────────────┴────────────┴────────────┴────────────╯
+```
+
+- Model delta (B−A): +0.03 composite — faster, cheaper, but contract/tool regressions
+- Skill delta (C−B): +0.08 composite — recovers contract + tool accuracy
+- Net (C−A): +0.11 composite
diff --git a/references/benchmarking.md b/references/benchmarking.md
new file mode 100644
index 0000000..f17b003
--- /dev/null
+++ b/references/benchmarking.md
@@ -0,0 +1,107 @@
+# Migration Benchmarking
+
+Optional. Produces a pre/post evidence report that quantifies a migration —
+both the raw model delta and the value the skill's enhancements added.
+
+Read this when the user opts into benchmarking (asked once in Phase 0). If they
+decline, skip all benchmark steps and omit the Benchmark Results report section.
+
+## The three arms
+
+A migration changes two things at once: the model *and* the system around it
+(prompts, tool specs, params). To attribute gains correctly, measure three
+configurations on the **same input set and same metrics**:
+
+| Arm | Config | Isolates |
+| :-- | :----- | :------- |
+| A — baseline | old model + old prompts/system | starting point |
+| B — naive swap | new model + old prompts/system | raw model delta |
+| C — ModelPort-enhanced | new model + skill-enhanced system | skill's added value |
+
+Arm B is the honest control: it shows what a search-and-replace upgrade alone
+would have produced. C − B is the migration work's contribution.
+
+## When to capture each arm
+
+The phase pipeline produces the arms for free — capture at the boundaries:
+
+- **Arm A** — Phase 0, *before any edit*. Once the repo is touched, the baseline
+  is gone. If not captured here, it cannot be reconstructed.
+- **Arm B** — Phase 4, after runtime/API compatibility edits (step 1) but
+  *before* prompt/agent tuning (step 2). At this point it is new model + old
+  prompts.
+- **Arm C** — Phase 5, after all tuning is complete.
+
+## Dimensions
+
+Pick the subset that matters for the workload; do not pad with irrelevant
+metrics. Default migration-relevant set:
+
+| Dimension | Metric | How |
+| :-------- | :----- | :-- |
+| Quality / correctness | task success rate, eval pass@k | fixed eval set; LLM-as-judge or unit checks |
+| Output contract | schema-valid / parse-success rate | run outputs through the real parser/validator |
+| Tool calling | right-tool rate, valid-args rate, invocation rate | compare tool calls vs expected per case |
+| Latency | TTFT, end-to-end p50/p95, tokens/sec | ≥20 runs/case; report percentiles, not means |
+| Cost | input/output tokens, cost per request | token counts × current per-token price |
+| Safety / behavior | refusal rate, hallucination rate | adversarial + factual probe set |
+| Robustness | adversarial / injection / long-context pass rate | stress subset of the eval set |
+
+LLM-as-a-judge: use a fixed rubric and a separate judge model; score all arms
+with the *same* judge and rubric in the same run to keep scores comparable.
+
+## Composite score
+
+Optional single number for a leaderboard. Normalize each dimension to [0,1]
+(higher = better; invert latency/cost/refusal), then weight by business
+priority. Default weighting:
+
+```text
+composite = 0.50 * quality + 0.30 * cost_efficiency + 0.20 * speed
+```
+
+Always print the weights next to the score. Never report a composite without
+the per-dimension rows behind it.
+
+## Method (keep comparisons honest)
+
+- Same eval set, same cases, same order across all three arms.
+- Fix sampling: identical effort/decoding settings per arm where the API allows;
+  note any forced differences (e.g. removed `temperature` on Opus 4.7+).
+- Disable or warm caches identically; cold-cache latency and warm-cache latency
+  are different measurements — label which one.
+- Latency: multiple runs, report p50/p95, exclude the first (cold) run or label
+  it.
+- State sample size `n`. Small `n` → mark deltas as indicative, not significant.
+- Numbers in this repo's examples are **illustrative**, not measured. Never
+  present sample numbers as real results.
+
+## Presentation
+
+Leaderboard table, arms as columns, dimensions as rows, best cell per row called
+out. Example shape (illustrative numbers):
+
+```text
+╭──────────────────────┬────────────┬────────────┬────────────╮
+│ metric               │ baseline   │ naive swap │ ModelPort  │
+│                      │ (old/old)  │ (new/old)  │ (new/enh.) │
+├──────────────────────┼────────────┼────────────┼────────────┤
+│  task success        │        82% │        84% │        91% │
+│  output contract     │        93% │        89% │        98% │
+│  tool-call accuracy  │        90% │        83% │        96% │
+│  p95 latency         │       3.9s │       2.4s │       2.6s │
+│  cost / req          │     $0.052 │     $0.047 │     $0.045 │
+│  refusal / halluc.   │       2.8% │       2.1% │       1.4% │
+├──────────────────────┼────────────┼────────────┼────────────┤
+│  composite (50/30/20)│       0.80 │       0.83 │       0.91 │
+╰──────────────────────┴────────────┴────────────┴────────────╯
+```
+
+Follow the table with one line of attribution: model delta (B − A), skill delta
+(C − B), net (C − A).
+
+## Cross-links
+
+- Proof gates and diagnose-patch-retry: `validation-proof.md`
+- Rollout metrics for staged deploys: `migration-patterns.md`
+- Per-model cost/tokenizer notes for the cost dimension: `models/`
diff --git a/scripts/validate_skill.py b/scripts/validate_skill.py
index 60a22c2..0bbbc2f 100644
--- a/scripts/validate_skill.py
+++ b/scripts/validate_skill.py
@@ -17,6 +17,7 @@
     "references/prompt-template-research.md",
     "references/provider-checklists.md",
     "references/validation-proof.md",
+    "references/benchmarking.md",
     "references/output-style.md",
     "templates/migration-starter-prompt.md",
     "examples/example-prompts.md",