Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 56 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,19 @@

An AI agent skill for safe, production-grade LLM model migrations. Automate behavior-preserving upgrades across prompts, agents, API callers, and tests. Universal and plug-and-play across Codex, Claude Code, Cursor, and more.

**The Problem:** Most model migrations fail after you change the model ID because search-and-replace breaks tool calling, parser behavior, and orchestration.
**The Solution:** ModelPort treats migration as a behavior-preservation problem: it maps callers, separates required compatibility fixes from optional tuning, and mandates validation before finishing.
**The model ID is one line. The behavior around it is everything else.**
Most migrations break because search-and-replace silently shatters tool calling,
parsers, and orchestration. ModelPort treats a swap as behavior preservation:

```text
# the naive migration ┄ what a find/replace does
$ grep -rl 'opus-4-7' . | xargs sed -i 's/4-7/4-8/g'
x tool calls break x parser drift x silent prod regressions

# the ModelPort migration ┄ behavior-preserving
> migrate this repo to claude-opus-4-8, keep behavior, prove it
✓ callers mapped ✓ contract held ✓ tests + rollback evidence
```

## Now shipping: Claude Opus 4.8 (released 2026-05-28)

Expand Down Expand Up @@ -77,13 +88,51 @@ contracts, no proof, and no way back.
╰───────────────────────────────────┴───────────────────────────────────╯
```

## Benchmark your migration (optional)

Opt in at the start and ModelPort ends with **measured evidence, not vibes**. It
runs the same eval set against three configurations so the raw model delta and
the skill's added value are attributed separately:

- **Baseline** — old model + old prompts (where you started)
- **Naive swap** — new model + old prompts (what a find/replace would get you)
- **ModelPort-enhanced** — new model + skill-enhanced system (where you land)

A raw swap tends to regress exactly the things that don't survive a model change
on their own — output contracts, tool calls, and parse-able format — because
newer models follow instructions more literally and re-tokenize differently.
The enhanced arm is where the skill's fixes earn their keep. Illustrative
leaderboard (your run produces the actual measured numbers):

```text
╭──────────────────────┬────────────┬────────────┬────────────╮
│ metric │ baseline │ naive swap │ ModelPort │
│ │ (old/old) │ (new/old) │ (new/enh.) │
├──────────────────────┼────────────┼────────────┼────────────┤
│ task success │ 82% │ 84% │ 91% │
│ output contract │ 93% │ 89% │ 98% │
│ tool-call accuracy │ 90% │ 83% │ 96% │
│ p95 latency │ 3.9s │ 2.4s │ 2.6s │
│ cost / req │ $0.052 │ $0.047 │ $0.045 │
│ refusal / halluc. │ 2.8% │ 2.1% │ 1.4% │
├──────────────────────┼────────────┼────────────┼────────────┤
│ composite (50/30/20)│ 0.80 │ 0.83 │ 0.91 │
╰──────────────────────┴────────────┴────────────┴────────────╯
```

Numbers above are illustrative, not measured results — latency and cost in
particular move with the target model and are always reported as measured, never
assumed. Methodology, metric definitions, and composite scoring live in
[references/benchmarking.md](references/benchmarking.md).

## Why teams use it

- Replace deprecated model IDs without breaking runtime calls.
- Preserve output contracts, parser behavior, and tool calling.
- Separate required compatibility fixes from optional prompt tuning.
- Choose direct swap, adapter, shadow mode, canary, or phased rollout.
- Produce validation evidence and rollback notes.
- Benchmark pre vs. post (and raw-swap control) to quantify the upgrade.

## Quick start

Expand Down Expand Up @@ -146,6 +195,7 @@ For detailed migration requests, start from
| Tools | Schemas, descriptions, validation rules, guardrails |
| Config | Model registries, routers, feature flags, rollout controls |
| Tests and docs | Fixtures, assertions, examples, migration notes |
| Benchmarks | Optional pre/post stats: quality, latency, cost, contract drift |

## Built for real migrations

Expand Down Expand Up @@ -185,6 +235,7 @@ The skill guides Codex to produce a migration report with:
- Proof evidence
- Risks and follow-up recommendations
- Rollback notes
- Benchmark results (when benchmarking was requested)

See [examples/example-output.md](examples/example-output.md) for a sample report.

Expand All @@ -206,6 +257,7 @@ See [examples/example-output.md](examples/example-output.md) for a sample report
├── templates/
│ └── migration-starter-prompt.md
├── references/
│ ├── benchmarking.md
│ ├── migration-patterns.md
│ ├── migration-playbook.md
│ ├── output-style.md
Expand Down Expand Up @@ -251,7 +303,8 @@ npx markdownlint-cli2 "**/*.md"
- Per-model migration guides for Claude Opus 4.8 / 4.6, Sonnet 4.5, and GPT-5.5
ship today in `references/models/`. Next: Gemini and self-hosted models.
- Add regression fixtures for prompt, tool, and structured-output migrations.
- Add benchmark examples for latency, token usage, and output contract drift.
- Ship a runnable benchmark harness to auto-collect the three-arm leaderboard
(the methodology lands today in `references/benchmarking.md`).
- Add real-world before/after migration case studies.

## Contributing
Expand Down
27 changes: 26 additions & 1 deletion SKILL.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
name: modelport-skill
description: Use when the user asks to migrate, upgrade, or port prompts, agents, tools, or codebases from one LLM model or provider version to another. Also use when auditing or continuing migration work started by another agent. Covers API calls, prompt assets, agent configs, tool definitions, tests, docs, runtime settings, rollout plans, and handoff coordination.
description: Use when the user asks to migrate, upgrade, or port prompts, agents, tools, or codebases from one LLM model or provider version to another. Also use when auditing or continuing migration work started by another agent. Covers API calls, prompt assets, agent configs, tool definitions, tests, docs, runtime settings, rollout plans, and handoff coordination. Can also benchmark pre/post migration systems (old model and prompts vs new model and enhanced system) and report the stats as evidence.
---

# ModelPort Skill
Expand Down Expand Up @@ -85,11 +85,17 @@ Collect before editing:
- Constraints: latency, cost, output format, safety/compliance
- Success criteria: quality, speed, token/cost, compatibility
- Evidence requirements: unit tests, evals, live smoke checks, rollout metrics
- Benchmark report: does the user want pre/post benchmark stats in the final
report (baseline old/old vs new/enhanced, plus the raw-swap control)? Opt-in.
Ask once, up front — capturing the baseline later is impossible.

If source or target matches a model with a dedicated guide in
`references/models/`, read that guide before proceeding. Model guides contain
changelogs, breaking changes, migration steps, and version-specific quirks.

If benchmarking is requested, read `references/benchmarking.md` and capture the
baseline (Arm A) in Phase 0 before any edit — it cannot be reconstructed later.

Scope not explicit → ask once, wait.

## Workflow
Expand All @@ -103,6 +109,8 @@ Define exactly what will be migrated. Narrow down to a targeted section. Establi
- Separate: done vs. needs-validation vs. in-progress vs. not-started.
- Identify work that appears unrelated, incomplete, or owned by another agent.
- Do not edit until scope is confirmed, targeted, and current state is understood.
- Benchmarking opted in → capture **Arm A** (baseline: old model + old prompts)
on the eval set now, before any edit. See `references/benchmarking.md`.

### Phase 1: Discovery scan

Expand All @@ -127,6 +135,8 @@ Read:
adapters, feature flags, or side-by-side support
- `references/validation-proof.md` — when planning proof gates, evals,
smoke tests, or diagnose-patch-retry loops
- `references/benchmarking.md` — when the user opted into a pre/post benchmark
(three-arm methodology, metrics, composite scoring, leaderboard format)

### Phase 2: Classify before editing

Expand Down Expand Up @@ -173,6 +183,10 @@ Apply in order:

Precise edits only. No repo-wide blind search-and-replace.

Benchmarking opted in → after step 1 (runtime edits) and before step 2 (prompt
tuning), capture **Arm B** (naive swap: new model + old prompts). This is the
honest control for the raw model delta.

### Phase 5: Validation

Run:
Expand All @@ -190,6 +204,10 @@ Verify:
- Output shape/format still satisfied
- Observability signals defined for staged rollout when production risk exists

Benchmarking opted in → capture **Arm C** (new model + enhanced system) on the
same eval set, then compile the three-arm leaderboard and attribution (model
delta B−A, skill delta C−B, net C−A) per `references/benchmarking.md`.

Failure → diagnose root cause → patch smallest responsible surface → retry
failed check before claiming completion.

Expand All @@ -206,6 +224,8 @@ Hyper-concise report. These sections, nothing else:
7. Proof Evidence
8. Risks and Follow-up Recommendations
9. Rollback Notes
10. Benchmark Results — only if benchmarking was requested. Three-arm
leaderboard + attribution. Mark sample/illustrative numbers as such.

Format per `examples/example-output.md`.

Expand All @@ -220,6 +240,9 @@ Format per `examples/example-output.md`.
- Skipping validation and claiming complete
- Treating a synthetic test as proof without a real smoke check, eval, trace,
or before/after comparison
- Benchmarking arms on different eval sets, settings, or cache states
- Presenting illustrative/sample benchmark numbers as measured results
- Skipping the Arm A baseline capture, then editing — it cannot be recovered

## References

Expand All @@ -230,6 +253,7 @@ Format per `examples/example-output.md`.
- Prompt design research: `references/prompt-template-research.md`
- Migration patterns: `references/migration-patterns.md`
- Validation proof gates: `references/validation-proof.md`
- Migration benchmarking: `references/benchmarking.md` (optional pre/post stats)
- Starter prompt: `templates/migration-starter-prompt.md`
- Example prompts: `examples/example-prompts.md`
- Example report: `examples/example-output.md`
Expand All @@ -244,4 +268,5 @@ Format per `examples/example-output.md`.
- [ ] Optional tuning clearly marked as optional
- [ ] Tests and smoke checks executed
- [ ] Proof evidence captured or gap stated
- [ ] Benchmark opt-in asked; if yes, three arms captured and reported
- [ ] Report returned in concise format
24 changes: 24 additions & 0 deletions examples/example-output.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,3 +50,27 @@

- Toggle `USE_NEW_MODEL=false`
- Legacy prompts in `prompts/legacy/`

## 10) Benchmark Results (optional)

Requested: yes. Eval set `n=120`. Numbers below are **illustrative**.

```text
╭──────────────────────┬────────────┬────────────┬────────────╮
│ metric │ baseline │ naive swap │ ModelPort │
│ │ (old/old) │ (new/old) │ (new/enh.) │
├──────────────────────┼────────────┼────────────┼────────────┤
│ task success │ 82% │ 84% │ 91% │
│ output contract │ 93% │ 89% │ 98% │
│ tool-call accuracy │ 90% │ 83% │ 96% │
│ p95 latency │ 3.9s │ 2.4s │ 2.6s │
│ cost / req │ $0.052 │ $0.047 │ $0.045 │
│ refusal / halluc. │ 2.8% │ 2.1% │ 1.4% │
├──────────────────────┼────────────┼────────────┼────────────┤
│ composite (50/30/20)│ 0.80 │ 0.83 │ 0.91 │
╰──────────────────────┴────────────┴────────────┴────────────╯
```

- Model delta (B−A): +0.03 composite — faster, cheaper, but contract/tool regressions
- Skill delta (C−B): +0.08 composite — recovers contract + tool accuracy
- Net (C−A): +0.11 composite
107 changes: 107 additions & 0 deletions references/benchmarking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
# Migration Benchmarking

Optional. Produces a pre/post evidence report that quantifies a migration —
both the raw model delta and the value the skill's enhancements added.

Read this when the user opts into benchmarking (asked once in Phase 0). If they
decline, skip all benchmark steps and omit the Benchmark Results report section.

## The three arms

A migration changes two things at once: the model *and* the system around it
(prompts, tool specs, params). To attribute gains correctly, measure three
configurations on the **same input set and same metrics**:

| Arm | Config | Isolates |
| :-- | :----- | :------- |
| A — baseline | old model + old prompts/system | starting point |
| B — naive swap | new model + old prompts/system | raw model delta |
| C — ModelPort-enhanced | new model + skill-enhanced system | skill's added value |

Arm B is the honest control: it shows what a search-and-replace upgrade alone
would have produced. C − B is the migration work's contribution.

## When to capture each arm

The phase pipeline produces the arms for free — capture at the boundaries:

- **Arm A** — Phase 0, *before any edit*. Once the repo is touched, the baseline
is gone. If not captured here, it cannot be reconstructed.
- **Arm B** — Phase 4, after runtime/API compatibility edits (step 1) but
*before* prompt/agent tuning (step 2). At this point it is new model + old
prompts.
- **Arm C** — Phase 5, after all tuning is complete.

## Dimensions

Pick the subset that matters for the workload; do not pad with irrelevant
metrics. Default migration-relevant set:

| Dimension | Metric | How |
| :-------- | :----- | :-- |
| Quality / correctness | task success rate, eval pass@k | fixed eval set; LLM-as-judge or unit checks |
| Output contract | schema-valid / parse-success rate | run outputs through the real parser/validator |
| Tool calling | right-tool rate, valid-args rate, invocation rate | compare tool calls vs expected per case |
| Latency | TTFT, end-to-end p50/p95, tokens/sec | ≥20 runs/case; report percentiles, not means |
| Cost | input/output tokens, cost per request | token counts × current per-token price |
| Safety / behavior | refusal rate, hallucination rate | adversarial + factual probe set |
| Robustness | adversarial / injection / long-context pass rate | stress subset of the eval set |

LLM-as-a-judge: use a fixed rubric and a separate judge model; score all arms
with the *same* judge and rubric in the same run to keep scores comparable.

## Composite score

Optional single number for a leaderboard. Normalize each dimension to [0,1]
(higher = better; invert latency/cost/refusal), then weight by business
priority. Default weighting:

```text
composite = 0.50 * quality + 0.30 * cost_efficiency + 0.20 * speed
```

Always print the weights next to the score. Never report a composite without
the per-dimension rows behind it.

## Method (keep comparisons honest)

- Same eval set, same cases, same order across all three arms.
- Fix sampling: identical effort/decoding settings per arm where the API allows;
note any forced differences (e.g. removed `temperature` on Opus 4.7+).
- Disable or warm caches identically; cold-cache latency and warm-cache latency
are different measurements — label which one.
- Latency: multiple runs, report p50/p95, exclude the first (cold) run or label
it.
- State sample size `n`. Small `n` → mark deltas as indicative, not significant.
- Numbers in this repo's examples are **illustrative**, not measured. Never
present sample numbers as real results.

## Presentation

Leaderboard table, arms as columns, dimensions as rows, best cell per row called
out. Example shape (illustrative numbers):

```text
╭──────────────────────┬────────────┬────────────┬────────────╮
│ metric │ baseline │ naive swap │ ModelPort │
│ │ (old/old) │ (new/old) │ (new/enh.) │
├──────────────────────┼────────────┼────────────┼────────────┤
│ task success │ 82% │ 84% │ 91% │
│ output contract │ 93% │ 89% │ 98% │
│ tool-call accuracy │ 90% │ 83% │ 96% │
│ p95 latency │ 3.9s │ 2.4s │ 2.6s │
│ cost / req │ $0.052 │ $0.047 │ $0.045 │
│ refusal / halluc. │ 2.8% │ 2.1% │ 1.4% │
├──────────────────────┼────────────┼────────────┼────────────┤
│ composite (50/30/20)│ 0.80 │ 0.83 │ 0.91 │
╰──────────────────────┴────────────┴────────────┴────────────╯
```

Follow the table with one line of attribution: model delta (B − A), skill delta
(C − B), net (C − A).

## Cross-links

- Proof gates and diagnose-patch-retry: `validation-proof.md`
- Rollout metrics for staged deploys: `migration-patterns.md`
- Per-model cost/tokenizer notes for the cost dimension: `models/`
1 change: 1 addition & 0 deletions scripts/validate_skill.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
"references/prompt-template-research.md",
"references/provider-checklists.md",
"references/validation-proof.md",
"references/benchmarking.md",
"references/output-style.md",
"templates/migration-starter-prompt.md",
"examples/example-prompts.md",
Expand Down
Loading