fix: reset checkpoint state on resume for fresh retry budgets by jafreck · Pull Request #229 · jafreck/AAMF

jafreck · 2026-03-29T01:46:27Z

Summary

When resuming a migration that previously failed (e.g. due to terminal exhaustion in Phase 4), the checkpoint retains stale failure markers that cause the resumed run to immediately re-fail or skip tasks instead of giving them fresh retry budgets.

This PR adds a prepareForResume() method that cleans up transient failure state on checkpoint reload, and fixes several related issues.

Changes

Checkpoint resume preparation (`src/core/checkpoint.ts`)

`prepareForResume()`: a new private method, called on every checkpoint reload (resume), that:
- Clears `terminalExhaustion` so the parity-gate loop re-enters instead of immediately raising `TerminalExhaustionError`
- Resets `failedTasks` retry counters so code-migrator and parity-failure-resolver get fresh attempts
- Clears `blockedTasks` so previously blocked tasks re-enter the scheduling pool
- Filters `__phase4FlowCheckpoint.completedExecutionIds` to retain substeps only for fully completed tasks; failed or in-flight tasks re-enter from scratch (a fresh code-migrator run, not just a parity retry)
- Resets both `__flowCheckpoint` and `__phase4FlowCheckpoint` status from 'failed' to 'running' so the Cadre flow runner re-enters correctly
- Clears Phase 4 per-task cursor state for non-completed tasks
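
For reviewers who want a concrete picture of the steps listed above, here is a minimal sketch. It is not the code in `src/core/checkpoint.ts`: only the field names come from this description, while the `CheckpointState` shape, the `completedTasks` field, the `reopen` helper, and the `"<taskId>:<substep>"` execution ID format are assumptions made for illustration. Per-task cursor cleanup is left out.

```typescript
// Sketch only: field names come from the PR text; every shape below is assumed.
type FlowStatus = 'running' | 'failed' | 'completed';

interface FlowCheckpoint {
  status: FlowStatus;
  completedExecutionIds: string[]; // assumed format: "<taskId>:<substep>"
}

interface CheckpointState {
  terminalExhaustion?: { phase: number };           // assumed shape
  completedTasks: string[];                         // assumed field
  failedTasks: Record<string, { retries: number }>; // assumed shape
  blockedTasks: string[];
  __flowCheckpoint?: FlowCheckpoint;
  __phase4FlowCheckpoint?: FlowCheckpoint;
}

// Hypothetical helper: flip a failed flow back to 'running', leave other statuses alone.
function reopen(flow?: FlowCheckpoint): FlowCheckpoint | undefined {
  return flow && flow.status === 'failed' ? { ...flow, status: 'running' } : flow;
}

function prepareForResume(state: CheckpointState): CheckpointState {
  const completed = new Set(state.completedTasks);

  // Fresh retry budget: keep the entries but zero their counters.
  const failedTasks: Record<string, { retries: number }> = {};
  for (const id of Object.keys(state.failedTasks)) {
    failedTasks[id] = { ...state.failedTasks[id], retries: 0 };
  }

  // Keep Phase 4 substeps only for tasks that fully completed.
  const phase4 = reopen(state.__phase4FlowCheckpoint);
  const filteredPhase4 = phase4 && {
    ...phase4,
    completedExecutionIds: phase4.completedExecutionIds.filter((id) =>
      completed.has(id.split(':')[0]),
    ),
  };

  return {
    ...state,
    terminalExhaustion: undefined, // let the parity-gate loop re-enter
    failedTasks,
    blockedTasks: [],              // blocked tasks rejoin the scheduling pool
    __flowCheckpoint: reopen(state.__flowCheckpoint),
    __phase4FlowCheckpoint: filteredPhase4,
  };
}
```

The sketch returns a new object instead of mutating the loaded state; whether the real method mutates in place is not stated here.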

`resetFromPhase()` fix

Resets the flow checkpoint status from 'failed' to 'running' and clears the `error` field so `--from-phase` works correctly with the Cadre runner.
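
The status and error reset described above might look roughly like the following. This is a sketch under assumptions: the `FlowCheckpoint` shape and the `clearFlowFailure` helper name are hypothetical, and the rest of `resetFromPhase()` is not shown.

```typescript
// Sketch only: the flow checkpoint shape is assumed, not taken from the repo.
interface FlowCheckpoint {
  status: 'running' | 'failed' | 'completed';
  error?: string;
}

// Hypothetical helper: the reset that resetFromPhase() applies to a flow checkpoint.
function clearFlowFailure(flow: FlowCheckpoint): FlowCheckpoint {
  if (flow.status !== 'failed') return flow; // nothing to reset
  const next: FlowCheckpoint = { ...flow, status: 'running' };
  delete next.error; // drop the stale error along with the failed status
  return next;
}
```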

Copilot token usage fallback (`src/core/agent-launcher.ts`)

Accumulates `outputTokens` from `assistant.message` events as a fallback when the usage summary block is missing from the Copilot JSONL output.
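
A minimal sketch of this fallback, assuming a simplified event layout: the `assistant.message` event type and the `outputTokens` name come from this description, but the `data.outputTokens` location, the `'usage'` summary event type, and the function names are assumptions and may not match the real Copilot JSONL schema or the parser in `src/core/agent-launcher.ts`.

```typescript
// Sketch only: the event layout below is assumed for illustration.
interface RawEvent {
  type?: string;
  outputTokens?: unknown;            // assumed location on the summary event
  data?: { outputTokens?: unknown }; // assumed location on assistant.message
}

interface TokenUsage {
  outputTokens: number;
  source: 'summary' | 'fallback';
}

// Parse one JSONL line; blank or malformed lines yield undefined.
function parseLine(line: string): RawEvent | undefined {
  try {
    const parsed: unknown = JSON.parse(line);
    return typeof parsed === 'object' && parsed !== null ? (parsed as RawEvent) : undefined;
  } catch {
    return undefined;
  }
}

function parseOutputTokens(jsonl: string): TokenUsage {
  let accumulated = 0;
  let summary: number | undefined;

  for (const line of jsonl.split('\n')) {
    const event = parseLine(line);
    if (!event) continue;

    const perMessage = event.data?.outputTokens;
    if (event.type === 'assistant.message' && typeof perMessage === 'number') {
      accumulated += perMessage; // running total, used only if no summary appears
    } else if (event.type === 'usage' && typeof event.outputTokens === 'number') {
      summary = event.outputTokens;
    }
  }

  // Prefer the summary block; fall back to the per-message total.
  return summary !== undefined
    ? { outputTokens: summary, source: 'summary' }
    : { outputTokens: accumulated, source: 'fallback' };
}
```

The `source` field is only there to make the chosen path visible in the sketch.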

Agent prompt improvements (`agents/templates/_partials/lore-index-first-principle.md`)

- Stronger guidance to prefer Lore MCP tools over view/bash for reading source and target code
- Explicit tool name table so agents use exact MCP tool names

Test fixture config

- `maxParallelAgents`: 12 → 8
- `resume`: true → false

Testing

- Added 5 new tests covering resume preparation scenarios (terminal exhaustion clearing, failed task reset, flow checkpoint status reset, Phase 4 execution ID filtering, no-op for non-failed checkpoints)
- Updated the existing `terminalExhaustion` persistence test to verify the raw JSON instead of re-loading through `CheckpointManager` (which now clears it on reload)
- Added a test for the `resetFromPhase` flow checkpoint status reset

- Add prepareForResume() to clear terminalExhaustion, failed/blocked tasks, and stale Phase 4 flow checkpoint entries on reload
- Reset __flowCheckpoint and __phase4FlowCheckpoint status from 'failed' to 'running' so the Cadre runner re-enters correctly
- Filter Phase 4 completedExecutionIds to only retain substeps for fully-completed tasks; failed/in-flight tasks re-enter from scratch
- Reset flow checkpoint error field in resetFromPhase()
- Accumulate outputTokens from assistant.message events in Copilot JSONL parser as fallback when usage summary is missing
- Improve Lore MCP tool documentation in agent prompt partial with explicit tool names and stronger guidance to prefer Lore over view
- Tune zstd fixture config: maxParallelAgents 12→8, resume false

jafreck merged commit 1eed4ea into main Mar 29, 2026
1 check passed

jafreck merged 1 commit into main from fix/checkpoint-resume-reset
