
feat(orchestration): resumable + migratable workflow checkpoints (Workflow Phase 4) #54

Merged
ZhiXiao-Lin merged 1 commit into main from feat/orchestration-checkpoint on May 29, 2026

@ZhiXiao-Lin
Contributor

Phase 4 of the Workflow integration (builds on #51#53). Makes orchestration resumable and migratable.

What

  • WorkflowCheckpoint { schema_version, workflow_id, steps, checkpoint_ms } + ensure_loadable(), which rejects future schema versions. This mirrors LoopCheckpoint: WorkflowCheckpoint is its step-boundary analogue one level up.
  • SessionStore gains save/load/delete_workflow_checkpoint (default no-ops). The file store writes crash-atomically (temp + fsync + rename) and rejects future versions on load; the memory store mirrors it.
  • execute_steps_parallel_resumable: loads prior progress, skips completed steps (reusing cached outcomes), runs only the rest, rewrites the checkpoint at each step boundary, merges in spec order, and clears on full success. Records only successful steps — a failed step retries on resume (its effect didn't complete) while a succeeded step's work is never redone.
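
As a rough illustration of the first bullet, the checkpoint shape and its version gate might look like the sketch below. The PR description names only the four fields and `ensure_loadable()`; the field types, the `StepOutcome` and `CheckpointError` types, the `new` constructor, and the `WORKFLOW_CHECKPOINT_VERSION` constant are assumptions made for illustration, not the crate's actual definitions.

```rust
use std::collections::BTreeMap;

/// Newest schema version this build understands (assumed value).
pub const WORKFLOW_CHECKPOINT_VERSION: u32 = 1;

/// Hypothetical cached result of one successful step.
#[derive(Debug, Clone, PartialEq)]
pub struct StepOutcome {
    pub output: String,
}

/// Journal of the steps a workflow has completed so far.
#[derive(Debug, Clone, PartialEq)]
pub struct WorkflowCheckpoint {
    pub schema_version: u32,
    pub workflow_id: String,
    /// Successful steps only, keyed by step id (assumed shape).
    pub steps: BTreeMap<String, StepOutcome>,
    pub checkpoint_ms: u64,
}

/// Hypothetical error type for the version gate.
#[derive(Debug, PartialEq)]
pub enum CheckpointError {
    FutureVersion { found: u32, supported: u32 },
}

impl WorkflowCheckpoint {
    /// Creates an empty checkpoint at the current schema version.
    pub fn new(workflow_id: &str, checkpoint_ms: u64) -> Self {
        Self {
            schema_version: WORKFLOW_CHECKPOINT_VERSION,
            workflow_id: workflow_id.to_string(),
            steps: BTreeMap::new(),
            checkpoint_ms,
        }
    }

    /// Refuses checkpoints written by a newer build; older or equal
    /// versions are accepted.
    pub fn ensure_loadable(&self) -> Result<(), CheckpointError> {
        if self.schema_version > WORKFLOW_CHECKPOINT_VERSION {
            Err(CheckpointError::FutureVersion {
                found: self.schema_version,
                supported: WORKFLOW_CHECKPOINT_VERSION,
            })
        } else {
            Ok(())
        }
    }
}
```

Rejecting only versions above the supported one lets a newer build keep reading older checkpoints, while an older build fails loudly instead of misreading fields it does not know about.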

Because the checkpoint is serializable and the executor is a parameter, a host can resume an interrupted workflow on a different node by passing that node's executor — the migration path, exercised in tests by resuming with a fresh executor.
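
The resume-and-migrate flow can be sketched in simplified form as follows. This is a sequential stand-in, not the real `execute_steps_parallel_resumable` (which runs steps in parallel and whose signature is not shown in this PR description); `MemoryStore`, `run_resumable`, and the string-typed steps and outcomes are all hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the checkpoint methods on SessionStore.
#[derive(Default)]
pub struct MemoryStore {
    checkpoints: HashMap<String, HashMap<String, String>>,
}

impl MemoryStore {
    pub fn load(&self, workflow_id: &str) -> Option<HashMap<String, String>> {
        self.checkpoints.get(workflow_id).cloned()
    }

    pub fn save(&mut self, workflow_id: &str, done: &HashMap<String, String>) {
        self.checkpoints.insert(workflow_id.to_string(), done.clone());
    }

    pub fn delete(&mut self, workflow_id: &str) {
        self.checkpoints.remove(workflow_id);
    }
}

/// Runs `steps` in spec order, skipping any step already journaled.
/// The executor is a parameter, so a different node can pass its own.
pub fn run_resumable<E>(
    store: &mut MemoryStore,
    workflow_id: &str,
    steps: &[&str],
    mut executor: E,
) -> Vec<Result<String, String>>
where
    E: FnMut(&str) -> Result<String, String>,
{
    // Load prior progress, if any.
    let mut done = store.load(workflow_id).unwrap_or_default();
    let mut results = Vec::with_capacity(steps.len());

    for &step in steps {
        // Completed step: reuse the cached outcome, do not re-run it.
        if let Some(cached) = done.get(step) {
            results.push(Ok(cached.clone()));
            continue;
        }
        let outcome = executor(step);
        // Journal successes only, so a failed step retries on resume.
        if let Ok(output) = &outcome {
            done.insert(step.to_string(), output.clone());
            store.save(workflow_id, &done);
        }
        results.push(outcome);
    }

    // Full success: nothing left to resume, so clear the checkpoint.
    if results.iter().all(|r| r.is_ok()) {
        store.delete(workflow_id);
    }
    results
}
```

Calling `run_resumable` a second time with a different closure against the same store models the migration case: only the steps missing from the checkpoint reach the new executor.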

Verification

  • Checkpoint: round-trip, future-version rejection, pre-v1 default.
  • Resumable: skips-completed + clears-on-success (fresh executor = migration), and retains a successes-only checkpoint on partial failure (failed step retries).
  • Real-LLM #[ignore] resumable workflow passing against .a3s/config.acl.
  • Store suite (36) + orchestration (14) green; cargo fmt --all --check + cargo clippy --lib --bins -D warnings clean.

…se 4)

A workflow journals its completed steps so an interrupted run picks up from
the last completed step — on this node or, since the checkpoint is
serializable and the executor is a parameter, on another one (host-driven
migration). Step-boundary analogue of LoopCheckpoint (which checkpoints at
tool-round boundaries one level down).

- WorkflowCheckpoint { schema_version, workflow_id, steps, checkpoint_ms } +
  ensure_loadable() (rejects future schema versions, mirrors LoopCheckpoint).
- SessionStore gains save/load/delete_workflow_checkpoint (default no-ops);
  file store writes crash-atomically (temp+fsync+rename) and rejects future
  versions on load; memory store mirrors it.
- execute_steps_parallel_resumable: loads prior progress, skips completed
  steps (reusing cached outcomes), runs only the rest, rewrites the checkpoint
  at each step boundary, merges in spec order, and clears on full success.
  Records ONLY successful steps — a failed step retries on resume (its effect
  didn't complete) while a succeeded step's work is never redone.
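
The temp+fsync+rename pattern mentioned for the file store can be sketched with the standard library alone. The function name `write_atomic` and the `.tmp` naming scheme are assumptions for illustration; the file store's actual helper is not shown in this PR.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `path` so that a crash leaves either the old file
/// or the new one in place, never a half-written file.
pub fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Temp file in the same directory, so the rename stays on one filesystem.
    let tmp = path.with_extension("tmp");
    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        // Flush file contents and metadata to disk before the rename.
        file.sync_all()?;
    }
    // Replace the destination in a single step.
    fs::rename(&tmp, path)
}
```

A stricter variant might also fsync the parent directory after the rename so the new directory entry itself survives a power loss; that step is omitted here to keep the sketch portable.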

Tests: checkpoint round-trip + future-version rejection + pre-v1 default;
resumable skips-completed + clears-on-success (with a fresh executor = the
migration path) + retains a successes-only checkpoint on partial failure; a
real-LLM #[ignore] resumable workflow passing against .a3s/config.acl. Store
suite (36) + orchestration (14) green; fmt + clippy --lib --bins clean.

@ZhiXiao-Lin merged commit 53425d1 into main on May 29, 2026
1 check passed
@ZhiXiao-Lin deleted the feat/orchestration-checkpoint branch on May 29, 2026 08:40