
fix: reset checkpoint state on resume for fresh retry budgets #229

Merged
jafreck merged 1 commit into main from fix/checkpoint-resume-reset on Mar 29, 2026

@jafreck commented Mar 29, 2026

Summary

When resuming a migration that previously failed (e.g. due to terminal exhaustion in Phase 4), the checkpoint retains stale failure markers that cause the resumed run to immediately re-fail or skip tasks instead of giving them fresh retry budgets.

This PR adds a prepareForResume() method that cleans up transient failure state on checkpoint reload, and fixes several related issues.

Changes

Checkpoint resume preparation (src/core/checkpoint.ts)

  • prepareForResume() — new private method called on every checkpoint reload (resume) that:
    • Clears terminalExhaustion so the parity-gate loop re-enters instead of immediately raising TerminalExhaustionError
    • Resets failedTasks retry counters so code-migrator and parity-failure-resolver get fresh attempts
    • Clears blockedTasks so previously-blocked tasks re-enter the scheduling pool
    • Filters __phase4FlowCheckpoint.completedExecutionIds to only retain substeps for fully-completed tasks — failed/in-flight tasks re-enter from scratch (fresh code-migrator run, not just parity retry)
    • Resets both __flowCheckpoint and __phase4FlowCheckpoint status from 'failed' to 'running' so the Cadre flow runner re-enters correctly
    • Clears Phase 4 per-task cursor state for non-completed tasks
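
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/core/checkpoint.ts: the `CheckpointState` and `FlowCheckpoint` shapes, the `completedTasks` field, and the `taskId:substep` execution ID format are all assumptions made for the example. Only the field names terminalExhaustion, failedTasks, blockedTasks, __flowCheckpoint, __phase4FlowCheckpoint, and completedExecutionIds come from the description. The per-task cursor cleanup is left out because its shape is not described here.

```typescript
// Hypothetical checkpoint shapes, assumed for illustration only.
interface FlowCheckpoint {
  status: "running" | "failed" | "completed";
  error?: string;
  completedExecutionIds: string[];
}

interface CheckpointState {
  terminalExhaustion?: { reason: string };
  completedTasks: string[]; // assumed: IDs of fully-completed tasks
  failedTasks: Record<string, { attempts: number }>;
  blockedTasks: string[];
  __flowCheckpoint?: FlowCheckpoint;
  __phase4FlowCheckpoint?: FlowCheckpoint;
}

// Assumed ID format "taskId:substep"; the real format may differ.
function taskIdOf(executionId: string): string {
  const colon = executionId.indexOf(":");
  return colon === -1 ? executionId : executionId.slice(0, colon);
}

// Flip a failed flow back to running; leave other statuses untouched.
function reopenIfFailed(flow: FlowCheckpoint): FlowCheckpoint {
  return { ...flow, status: flow.status === "failed" ? "running" : flow.status };
}

function prepareForResume(state: CheckpointState): CheckpointState {
  // Fresh retry budgets: drop failure counters and blocked markers.
  const next: CheckpointState = { ...state, failedTasks: {}, blockedTasks: [] };
  // Without this marker the parity-gate loop can re-enter.
  delete next.terminalExhaustion;

  if (state.__flowCheckpoint) {
    next.__flowCheckpoint = reopenIfFailed(state.__flowCheckpoint);
  }
  if (state.__phase4FlowCheckpoint) {
    const completed = new Set(state.completedTasks);
    const flow = reopenIfFailed(state.__phase4FlowCheckpoint);
    // Keep substeps only for fully-completed tasks; others restart from scratch.
    flow.completedExecutionIds = flow.completedExecutionIds.filter((id) =>
      completed.has(taskIdOf(id)),
    );
    next.__phase4FlowCheckpoint = flow;
  }
  return next;
}
```

Returning a new object rather than mutating the input is a choice made for this sketch so the before and after states are easy to compare in tests; the real method may well mutate in place.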

resetFromPhase() fix

  • Reset flow checkpoint status from 'failed' to 'running' and clear error field so --from-phase works correctly with the Cadre runner
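
As a rough sketch of that reset, assuming a flow checkpoint record with a `status` string and an optional `error` field (the `FlowCheckpointRecord` type and the helper name are hypothetical, not taken from the codebase):

```typescript
// Hypothetical record shape; only the 'failed'/'running' status values and
// the error field are mentioned in the PR description.
interface FlowCheckpointRecord {
  status: string;
  error?: string;
}

function resetFlowCheckpointForRerun(flow: FlowCheckpointRecord): FlowCheckpointRecord {
  const reset: FlowCheckpointRecord = { ...flow };
  if (reset.status === "failed") {
    reset.status = "running";
  }
  // Assumption: the stale error is dropped regardless of status.
  delete reset.error;
  return reset;
}
```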

Copilot token usage fallback (src/core/agent-launcher.ts)

  • Accumulate outputTokens from assistant.message events as a fallback when the usage summary block is missing from Copilot JSONL output
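
One way to picture the fallback is shown below. The event shapes are assumptions: the sketch treats each JSONL line as an object with a `type` and a numeric `outputTokens`, and uses a hypothetical `"usage"` type for the summary block. The actual Copilot JSONL schema and the parser in src/core/agent-launcher.ts may look different.

```typescript
// Assumed event layout: { "type": "...", "outputTokens": <number> } per line.
function countOutputTokens(jsonl: string): number {
  let summaryTokens: number | undefined;
  let accumulated = 0;

  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;

    let event: unknown;
    try {
      event = JSON.parse(trimmed);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (typeof event !== "object" || event === null) continue;

    const { type, outputTokens } = event as { type?: unknown; outputTokens?: unknown };
    if (typeof outputTokens !== "number") continue;

    if (type === "usage") {
      summaryTokens = outputTokens; // hypothetical summary event
    } else if (type === "assistant.message") {
      accumulated += outputTokens; // running total used as the fallback
    }
  }

  // Prefer the summary when present; otherwise use the accumulated total.
  return summaryTokens ?? accumulated;
}
```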

Agent prompt improvements (agents/templates/_partials/lore-index-first-principle.md)

  • Stronger guidance to prefer Lore MCP tools over view/bash for reading source and target code
  • Explicit tool name table so agents use exact MCP tool names

Test fixture config

  • maxParallelAgents: 12 → 8
  • resume: true → false

Testing

  • Added 5 new tests covering resume preparation scenarios (terminal exhaustion clearing, failed task reset, flow checkpoint status reset, Phase 4 execution ID filtering, no-op for non-failed checkpoints)
  • Updated existing terminalExhaustion persistence test to verify raw JSON instead of re-loading through CheckpointManager (which now clears it on reload)
  • Added test for resetFromPhase flow checkpoint status reset

- Add prepareForResume() to clear terminalExhaustion, failed/blocked
  tasks, and stale Phase 4 flow checkpoint entries on reload
- Reset __flowCheckpoint and __phase4FlowCheckpoint status from
  'failed' to 'running' so the Cadre runner re-enters correctly
- Filter Phase 4 completedExecutionIds to only retain substeps for
  fully-completed tasks; failed/in-flight tasks re-enter from scratch
- Reset flow checkpoint error field in resetFromPhase()
- Accumulate outputTokens from assistant.message events in Copilot
  JSONL parser as fallback when usage summary is missing
- Improve Lore MCP tool documentation in agent prompt partial with
  explicit tool names and stronger guidance to prefer Lore over view
- Tune zstd fixture config: maxParallelAgents 12→8, resume false
@jafreck merged commit 1eed4ea into main on Mar 29, 2026
1 check passed