Ship model upgrades without breaking prod — then prove it.
Two things in one agent skill: behavior-preserving migrations across prompts, agents, tools, API callers, and tests — and built-in evals that score your old vs. new system so every upgrade is backed by numbers, not vibes. Works in Claude Code, Codex, and Cursor.
Install in one line:
npx skills add forkadarshp/MPort— then tell your agent to migrate. See Quickstart.
The model ID is one line. The behavior around it is everything else. Most migrations break because search-and-replace silently shatters tool calling, parsers, and orchestration. ModelPort treats a swap as behavior preservation:
# the naive migration ┄ what a find/replace does
$ grep -rl 'opus-4-7' . | xargs sed -i 's/4-7/4-8/g'
x tool calls break x parser drift x silent prod regressions
# the ModelPort migration ┄ behavior-preserving
> migrate this repo to claude-opus-4-8, keep behavior, prove it
✓ callers mapped ✓ contract held ✓ tests + rollback evidence
Star it if it's saved you a migration headache — it helps other teams find it.
Opus 4.8 is live, and ModelPort already ships a dedicated migration guide for it:
references/models/claude-opus-4-8.md.
It covers the 4.7 → 4.8 drop-in path, the 4.6 → 4.7 breaking changes you
must apply first (non-default sampling params now 400, manual extended
thinking → {type: "adaptive"} + effort, new tokenizer, removed assistant
prefills), and effort re-baselining. Point the skill at your repo:
Use $modelport-skill to migrate this codebase to claude-opus-4-8.
Apply required compatibility fixes first, keep tool + output behavior stable,
and return validation evidence plus rollback notes.
ModelPort treats a model swap as a behavior-preservation problem, not a search-and-replace. The model ID is one line; the behavior around it — tool calls, parsers, output contracts, prompts, orchestration — is everything else. The skill drives a strict, phase-gated pipeline that maps every caller, classifies each hit before editing, keeps required fixes apart from optional tuning, and refuses to claim "done" without proof.
╭──────────────────────────────────────────────────────────────────────╮
│ │
│ [ Opus 4.7 ] ═══> ( ModelPort ) ═══> [ Opus 4.8 ] │
│ drop-in upgrade behavior-preserving GA · 2026-05-28 │
│ │
│ » phase 0 lock scope, read current state ················· [ OK ]│
│ » phase 1 discover callers, prompts, agents, tools ······· [ OK ]│
│ » phase 2 classify every hit before editing ·············· [ OK ]│
│ » phase 3 design: required fixes vs optional tuning ······ [ OK ]│
│ » phase 4 surgical patches, no blind search/replace ······ [ OK ]│
│ » phase 5 validate: tests + live smoke + contract ········ [ OK ]│
│ » phase 6 report: proof evidence + rollback notes ········ [ OK ]│
│ │
╰──────────────────────────────────────────────────────────────────────╯
ModelPort doesn't just rewrite model="...". It updates prompts, tool
descriptions, and guardrails to match how the target model actually behaves —
including the undocumented quirks that never make the changelog.
Take chess: GPT-3.5 played badly, GPT-4o was a leap ahead — not from a rules change, but because 4o's training data held far more chess games. That kind of capability shift never appears in the release notes. ModelPort's research surfaces these advertised and unadvertised behaviors — from observed runs, prior migrations, web research, and provider changelogs — and folds the matching prompt and tooling changes into the migration, so the new model is actually used to its strengths instead of just dropped in.
Skipping the discipline is exactly where migrations look fine in review and then silently regress in production — broken tool calls, drifted output contracts, no proof, and no way back.
╭───────────────────────────────────┬───────────────────────────────────╮
│ WITHOUT ModelPort │ WITH ModelPort │
├───────────────────────────────────┼───────────────────────────────────┤
│ x find/replace the model ID │ ✓ scope-locked discovery │
│ x tool calls break silently │ ✓ callers mapped, not guessed │
│ x parser / output drift │ ✓ output contract preserved │
│ x prompt hacks carry over │ ✓ fixes vs tuning kept apart │
│ x "looks fine" — no proof │ ✓ tests + live smoke evidence │
│ x no way back on breakage │ ✓ rollback notes included │
├───────────────────────────────────┼───────────────────────────────────┤
│ → silent prod regressions │ → behavior-preserving upgrade │
╰───────────────────────────────────┴───────────────────────────────────╯
Migrations shouldn't be a leap of faith. Opt in and ModelPort ends with measured evidence, not vibes — it runs the same eval set against three configurations so the raw model delta and the skill's added value are attributed separately:
- Baseline — old model + old prompts (where you started)
- Naive swap — new model + old prompts (what a find/replace would get you)
- ModelPort-enhanced — new model + skill-enhanced system (where you land)
A raw swap tends to regress exactly the things that don't survive a model change on their own — output contracts, tool calls, and parse-able format — because newer models follow instructions more literally and re-tokenize differently. The enhanced arm is where the skill's fixes earn their keep. Illustrative leaderboard (your run produces the actual measured numbers):
╭──────────────────────┬────────────┬────────────┬────────────╮
│ metric │ baseline │ naive swap │ ModelPort │
│ │ (old/old) │ (new/old) │ (new/enh.) │
├──────────────────────┼────────────┼────────────┼────────────┤
│ task success │ 82% │ 84% │ 91% │
│ output contract │ 93% │ 89% │ 98% │
│ tool-call accuracy │ 90% │ 83% │ 96% │
│ p95 latency │ 3.9s │ 2.4s │ 2.6s │
│ cost / req │ $0.052 │ $0.047 │ $0.045 │
│ refusal / halluc. │ 2.8% │ 2.1% │ 1.4% │
├──────────────────────┼────────────┼────────────┼────────────┤
│ composite (50/30/20)│ 0.80 │ 0.83 │ 0.91 │
╰──────────────────────┴────────────┴────────────┴────────────╯
Numbers above are illustrative, not measured results — latency and cost in particular move with the target model and are always reported as measured, never assumed. Methodology, metric definitions, and composite scoring live in references/benchmarking.md.
Run it yourself. A bundled harness turns this into a closed loop:
python3 harness/run.py scores the three arms on an eval set, and
harness/iterate.py sweeps prompt revisions so you watch the score climb until
it plateaus. Two scenarios ship (support triage + multi-tool routing); it runs
offline by default, or --provider anthropic for measured numbers. See
harness/README.md.
- Replace deprecated model IDs without breaking runtime calls.
- Preserve output contracts, parser behavior, and tool calling.
- Separate required compatibility fixes from optional prompt tuning.
- Choose direct swap, adapter, shadow mode, canary, or phased rollout.
- Produce validation evidence and rollback notes.
- Benchmark pre vs. post (and raw-swap control) to quantify the upgrade.
Install once, use everywhere — the universal skills CLI wires ModelPort into Codex, Claude Code, Cursor, and more:
npx skills add forkadarshp/MPortInstalling for every agent on your machine? Add --agent '*':
npx skills add forkadarshp/MPort --agent '*'If you prefer to fork and modify the skill locally:
Clone:
git clone https://github.com/forkadarshp/MPort.git
cd MPortInstall as a local skill with a symlink (Codex example):
mkdir -p ~/.codex/skills
ln -s "$(pwd)" ~/.codex/skills/modelport-skillOr copy directly:
mkdir -p ~/.codex/skills/modelport-skill
cp -R . ~/.codex/skills/modelport-skillThen ask Codex:
Use $modelport-skill and the starter prompt template to migrate
services/assistant from Model A to Model B. Keep tool behavior stable, separate
required fixes from optional tuning, and include validation evidence plus
rollback notes.
For detailed migration requests, start from templates/migration-starter-prompt.md.
| Area | Migration support |
|---|---|
| API callers | Model IDs, request parameters, response contracts, SDK usage |
| Prompts | System prompts, templates, output contracts, brittle prompt hacks |
| Agents | Tool policy, orchestration behavior, autonomy, handoff instructions |
| Tools | Schemas, descriptions, validation rules, guardrails |
| Config | Model registries, routers, feature flags, rollout controls |
| Tests and docs | Fixtures, assertions, examples, migration notes |
| Benchmarks | Optional pre/post stats: quality, latency, cost, contract drift |
- Scope lock before edits to avoid accidental repo-wide churn.
- Discovery pass across API calls, prompts, agents, tools, tests, and docs.
- File classification before modification.
- Migration-pattern selection across direct swap, adapters, shadow mode, canary, and expand-contract changes.
- Required fixes and optional tuning reported separately.
- Explicit coordination checks for work another agent has already started.
- Validation-first completion criteria with tests, smoke checks, and rollback notes.
- "Migrate all model usage in
src/agents/from our old model config to the new release, keep tool behavior stable, and give me a rollback plan." - "Upgrade this codebase from provider A to provider B and adjust prompts so output format remains identical."
- "Replace deprecated model parameters in our API callers and verify tests still pass."
- "Do a phased migration in
services/searchwith side-by-side support for old and new models."
More examples are in examples/example-prompts.md.
The skill guides Codex to produce a migration report with:
- Migration summary
- Scope and files changed
- Existing work and remaining work
- Required compatibility fixes
- Prompt and agent behavior tuning
- Validation results
- Proof evidence
- Risks and follow-up recommendations
- Rollback notes
- Benchmark results (when benchmarking was requested)
See examples/example-output.md for a sample report.
.
├── .markdownlint-cli2.yaml
├── SKILL.md
├── README.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── SECURITY.md
├── LICENSE
├── assets/
│ └── social-preview.svg
├── scripts/
│ └── validate_skill.py
├── templates/
│ └── migration-starter-prompt.md
├── references/
│ ├── benchmarking.md
│ ├── migration-patterns.md
│ ├── migration-playbook.md
│ ├── output-style.md
│ ├── prompt-template-research.md
│ ├── provider-checklists.md
│ ├── validation-proof.md
│ └── models/
│ ├── claude-opus-4-8.md
│ ├── claude-opus-4-6.md
│ ├── claude-sonnet-4-5.md
│ └── gpt-5-5.md
├── examples/
│ ├── example-prompts.md
│ └── example-output.md
├── agents/
│ └── openai.yaml
├── docs/
│ └── PUBLISHING.md
└── .github/
├── ISSUE_TEMPLATE/
├── pull_request_template.md
└── workflows/ci.yml
python3 scripts/validate_skill.py .
npx markdownlint-cli2 "**/*.md"- MIT license included.
- CI validates Markdown and skill structure.
- Issue templates cover bugs, features, and migration case studies.
- PR template asks contributors to classify migration relevance.
- Security reporting process included via GitHub Security Advisories.
- Skill UI metadata included in
agents/openai.yaml.
- Per-model migration guides for Claude Opus 4.8 / 4.6, Sonnet 4.5, and GPT-5.5
ship today in
references/models/. Next: Gemini and self-hosted models. - Add regression fixtures for prompt, tool, and structured-output migrations.
- Ship a runnable benchmark harness to auto-collect the three-arm leaderboard
(the methodology lands today in
references/benchmarking.md). - Add real-world before/after migration case studies.
Install ModelPort, then ask your agent: "migrate this repo to
claude-opus-4-8." It reads the
Opus 4.8 guide, applies the required
compatibility fixes, preserves tool calling and output contracts, and returns
validation evidence plus rollback notes.
4.7 → 4.8 is a drop-in upgrade with no breaking API changes — the main task is
re-baselining any tuned effort settings. ModelPort makes the model-ID change,
flags the settings to retest, and validates that behavior is unchanged. Coming
from 4.6 or earlier, it applies the 4.7 breaking changes first (sampling params,
adaptive thinking, tokenizer, removed prefills).
Often, yes — and that's the problem ModelPort exists to solve. Newer models follow instructions more literally and re-tokenize differently, so a raw find/replace can silently break tool calls and parsers. ModelPort goes beyond the model ID: it updates prompts, tool descriptions, and guardrails to match the target model's actual behavior — researched from observed runs, web research, and provider changelogs, including unadvertised quirks — then validates the contract before finishing.
Opt into benchmarking at the start and ModelPort runs a three-arm comparison — baseline, naive swap, and the enhanced system — on the same eval set, then reports a leaderboard with quality, latency, cost, and contract metrics. See references/benchmarking.md.
Yes. It's a universal agent skill — install it once and invoke it from Claude Code, Codex, Cursor, or any compatible agent.
Any model or provider swap (Anthropic, OpenAI, and others). It ships dedicated migration guides for Claude Opus 4.8 / 4.6, Claude Sonnet 4.5, and GPT-5.5, and a generic playbook for everything else.
See CONTRIBUTING.md.
See SECURITY.md.
MIT - see LICENSE.