-
Notifications
You must be signed in to change notification settings - Fork 57
LongRunningAgentServer: durable resume via heartbeat + CAS claim #416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
dhruv0811
wants to merge
40
commits into
main
Choose a base branch
from
dhruv0811/durable-execution-resume
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
40 commits
Select commit
Hold shift + click to select a range
afd2d25
LongRunningAgentServer: add durable resume via heartbeat + CAS claim
dhruv0811 466a859
Update test_creates_schema_and_tables for new ADD COLUMN migration
dhruv0811 e7cfedd
Fix AttributeError when injecting conversation_id into a pydantic req…
dhruv0811 19df055
Tolerate InsufficientPrivilege in ADD COLUMN migrations
dhruv0811 97a5dcb
Include attempt_number in retrieve response for observability
dhruv0811 d3adee7
Add opt-in debug-kill endpoint for testing crash-resume on deployed apps
dhruv0811 d7c33b7
Apply ruff format + fix ty diagnostic on request_data dump
dhruv0811 f9f8a73
Log Background response created at INFO so response_id is visible in …
dhruv0811 d5666b2
Tag every SSE frame in stream retrieve with top-level response_id
dhruv0811 f2ffb6e
Self-heal open streams: call _try_claim_and_resume from _stream_retrieve
dhruv0811 6ee9f6c
Tighten heartbeat_stale_threshold_seconds default from 15s to 10s
dhruv0811 c2383f2
Add [durable] INFO-level lifecycle logs across the resume path
dhruv0811 cbd2b0b
Add public durable-resume repair helpers for openai + langchain
dhruv0811 5d70dde
Add pre_model_hook factory; WARN on skipped durability migrations
dhruv0811 62df014
Rename pre_model_hook factory to middleware factory
dhruv0811 4af26f0
Remove unused Callable import in checkpoint.py
dhruv0811 bc573fc
LongRunning: rotate conv_id on resume + full-history input sanitizer
dhruv0811 91360a9
Checkpoint saver: read-time repair so middleware stays optional
dhruv0811 5da7dbd
Session: auto-repair on get_items so middleware-free templates are safe
dhruv0811 51f0a4c
Stamp custom_inputs.attempt_number on resume so handlers can see retries
dhruv0811 f3f8eb3
Synthetic-output text: informative, scoped, nudges against re-running…
dhruv0811 40d7e09
Resume inherits prior attempt's completed tool outputs
dhruv0811 77cd8a8
Resume inheritance: include completed assistant message items
dhruv0811 7fecacd
Resume inheritance: reassemble mid-stream partial text from deltas
dhruv0811 6ef968f
Resume inheritance: also reassemble reasoning + function_call arg str…
dhruv0811 2f26e26
docs: update server.py docstrings for rotate+replay+inherit resume be…
dhruv0811 68ce276
Strip PR to bare minimum essentials for final durable-resume contract
dhruv0811 c23d9b6
Make build_tool_resume_repair internal (rename to _build_tool_resume_…
dhruv0811 8bd0718
server: add asyncio.sleep(0) yield point in stream loop
dhruv0811 4d0756e
Drop event: line from durable SSE frames
dhruv0811 07ded9d
Inheritance: drop completed message items, preserve only tool pairs
dhruv0811 7bcb1f3
Inheritance: hoist narrative messages after tool pairs instead of dro…
dhruv0811 8881563
Stable state — durable execution verified end-to-end
dhruv0811 ef8f86a
Refactor: extract shared sanitize_tool_items helper
dhruv0811 933333c
Consolidate [INTERRUPTED] synthetic output + simplify session API
dhruv0811 3f4cbe4
Merge remote-tracking branch 'origin/main' into pr-416
dhruv0811 d42ceb2
Apply ruff format + drop unused import
dhruv0811 3c1ca10
Move tool_repair.py out of long_running/
dhruv0811 e3db589
Add AGENTS.md for LongRunningAgentServer design
dhruv0811 7b9ae32
Revert "Add AGENTS.md for LongRunningAgentServer design"
dhruv0811 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need these synthetic events?
from our discussion in person, i think it should be sufficient for us to serialize the entire message history from the crashed conversations into a single user message with a prompt to the LLM about "the agent task runnig this crashed, contniue from here"
(maybe can say recovering from crash in the text also to explain why there is duplicatd content like two tool calls in a row)
pros/cons list from talking to claude:
imo just prose recovery is much cleaner and doesn't rely on agent authors to know to use these specific APIs