feat(orchestration): resumable + migratable workflow checkpoints (Workflow Phase 4) #54
Merged
A workflow journals its completed steps so an interrupted run picks up from
the last completed step — on this node or, since the checkpoint is
serializable and the executor is a parameter, on another one (host-driven
migration). Step-boundary analogue of LoopCheckpoint (which checkpoints at
tool-round boundaries one level down).
- WorkflowCheckpoint { schema_version, workflow_id, steps, checkpoint_ms } +
ensure_loadable() (rejects future schema versions, mirrors LoopCheckpoint).
- SessionStore gains save/load/delete_workflow_checkpoint (default no-ops);
file store writes crash-atomically (temp+fsync+rename) and rejects future
versions on load; memory store mirrors it.
- execute_steps_parallel_resumable: loads prior progress, skips completed
steps (reusing cached outcomes), runs only the rest, rewrites the checkpoint
at each step boundary, merges in spec order, and clears on full success.
Records ONLY successful steps — a failed step retries on resume (its effect
didn't complete) while a succeeded step's work is never redone.
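
The file-store behaviour in the list above (temp + fsync + rename on save, future-version rejection on load) could look roughly like the sketch below. It is not the PR's code: the one-line version header, the `save_checkpoint`/`load_checkpoint` helpers, and `CURRENT_SCHEMA_VERSION = 1` are assumptions made for illustration, and only `ensure_loadable` borrows its name from the PR.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Assumed current schema version; the real constant is not shown in this PR text.
const CURRENT_SCHEMA_VERSION: u32 = 1;

/// Reject checkpoints written by a newer build than this one.
fn ensure_loadable(schema_version: u32) -> Result<(), String> {
    if schema_version > CURRENT_SCHEMA_VERSION {
        return Err(format!(
            "checkpoint schema v{} is newer than supported v{}",
            schema_version, CURRENT_SCHEMA_VERSION
        ));
    }
    Ok(())
}

/// Hypothetical save: write to a sibling temp file, fsync it, then rename over
/// the target, so a reader sees either the old file or the complete new one.
/// (A fully durable store would also fsync the parent directory.)
fn save_checkpoint(path: &Path, schema_version: u32, body: &str) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    writeln!(file, "{}", schema_version)?; // hypothetical one-line version header
    file.write_all(body.as_bytes())?;
    file.sync_all()?; // flush file data and metadata to disk
    drop(file);
    fs::rename(&tmp, path)
}

/// Hypothetical load: parse the version header and apply the gate before
/// handing the body to the caller.
fn load_checkpoint(path: &Path) -> Result<String, String> {
    let text = fs::read_to_string(path).map_err(|e| e.to_string())?;
    let (header, body) = text.split_once('\n').ok_or("missing version header")?;
    let version: u32 = header
        .trim()
        .parse()
        .map_err(|e| format!("bad version header: {e}"))?;
    ensure_loadable(version)?;
    Ok(body.to_string())
}

fn main() {
    let name = format!("wf-checkpoint-sketch-{}.ckpt", std::process::id());
    let path = std::env::temp_dir().join(name);

    // A checkpoint at the current version round-trips.
    save_checkpoint(&path, 1, "fetch=done").unwrap();
    assert_eq!(load_checkpoint(&path).unwrap(), "fetch=done");

    // A checkpoint claiming a future version is refused on load.
    save_checkpoint(&path, 2, "fetch=done").unwrap();
    assert!(load_checkpoint(&path).is_err());

    fs::remove_file(&path).unwrap();
}
```

Checking the version on load rather than on save means an older binary refuses a newer node's checkpoint instead of misreading it, which matters once checkpoints can move between nodes.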
Tests: checkpoint round-trip + future-version rejection + pre-v1 default;
resumable skips-completed + clears-on-success (with a fresh executor = the
migration path) + retains a successes-only checkpoint on partial failure; a
real-LLM #[ignore] resumable workflow passing against .a3s/config.acl. Store
suite (36) + orchestration (14) green; fmt + clippy --lib --bins clean.
Phase 4 of the Workflow integration (builds on #51–#53). Makes orchestration resumable and migratable.
What
- WorkflowCheckpoint { schema_version, workflow_id, steps, checkpoint_ms } + ensure_loadable() (rejects future schema versions; mirrors LoopCheckpoint, of which this is the step-boundary analogue one level up).
- SessionStore gains save/load/delete_workflow_checkpoint (default no-ops). The file store writes crash-atomically (temp + fsync + rename) and rejects future versions on load; the memory store mirrors it.
- execute_steps_parallel_resumable: loads prior progress, skips completed steps (reusing cached outcomes), runs only the rest, rewrites the checkpoint at each step boundary, merges in spec order, and clears on full success. Records only successful steps: a failed step retries on resume (its effect didn't complete) while a succeeded step's work is never redone.

Because the checkpoint is serializable and the executor is a parameter, a host can resume an interrupted workflow on a different node by passing that node's executor. This is the migration path, exercised in tests by resuming with a fresh executor.
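
The resume-and-migrate behaviour described above can be illustrated with a minimal sketch. It is deliberately simplified and is not the PR's implementation: steps run sequentially rather than in parallel, outcomes are plain `String`s, and `Checkpoint`, `MemoryStore`, and `run_resumable` are hypothetical names standing in for the real types and for `execute_steps_parallel_resumable`.

```rust
use std::collections::BTreeMap;

/// Hypothetical checkpoint: step id -> cached outcome of a *successful* step.
#[derive(Clone, Debug, Default)]
struct Checkpoint {
    steps: BTreeMap<String, String>,
}

/// Hypothetical in-memory store keyed by workflow id.
#[derive(Default)]
struct MemoryStore {
    checkpoints: BTreeMap<String, Checkpoint>,
}

/// Run `spec` in order, skipping steps already recorded in the checkpoint.
/// The executor is a parameter, so a different node can pass its own.
fn run_resumable<E>(
    store: &mut MemoryStore,
    workflow_id: &str,
    spec: &[&str],
    mut executor: E,
) -> Vec<Result<String, String>>
where
    E: FnMut(&str) -> Result<String, String>,
{
    // Load prior progress, or start from an empty journal.
    let mut checkpoint = store
        .checkpoints
        .get(workflow_id)
        .cloned()
        .unwrap_or_default();

    let mut outcomes = Vec::with_capacity(spec.len());
    for &step in spec {
        // Completed step: reuse the cached outcome, never redo the work.
        if let Some(cached) = checkpoint.steps.get(step) {
            outcomes.push(Ok(cached.clone()));
            continue;
        }
        let result = executor(step);
        // Record only successes, rewriting the checkpoint at the step boundary.
        if let Ok(output) = &result {
            checkpoint.steps.insert(step.to_string(), output.clone());
            store
                .checkpoints
                .insert(workflow_id.to_string(), checkpoint.clone());
        }
        outcomes.push(result);
    }

    // Full success: the journal is no longer needed.
    if outcomes.iter().all(|o| o.is_ok()) {
        store.checkpoints.remove(workflow_id);
    }
    outcomes
}

fn main() {
    let mut store = MemoryStore::default();
    let spec = ["fetch", "build", "publish"];

    // First run: "build" fails, so only the two successes are journaled.
    let first = run_resumable(&mut store, "wf-1", &spec, |s| {
        if s == "build" {
            Err("boom".to_string())
        } else {
            Ok(format!("{s}-done"))
        }
    });
    assert!(first[1].is_err());
    assert_eq!(store.checkpoints["wf-1"].steps.len(), 2);

    // Resume with a fresh executor (the migration path): only "build" runs.
    let mut calls = Vec::new();
    let second = run_resumable(&mut store, "wf-1", &spec, |s| {
        calls.push(s.to_string());
        Ok(format!("{s}-done"))
    });
    assert_eq!(calls, vec!["build".to_string()]);
    assert!(second.iter().all(|o| o.is_ok()));
    assert!(store.checkpoints.get("wf-1").is_none());
}
```

Because the store and the executor are both passed in, nothing in the loop is tied to the node that started the run; in this sketch, migrating amounts to calling the same function elsewhere with the same stored checkpoint.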
Verification
- #[ignore] resumable workflow passing against .a3s/config.acl.
- cargo fmt --all --check + cargo clippy --lib --bins -D warnings clean.