fix(cli): auto-detect server backend + herdr reconcile fixes#309
Conversation
herdr 0.6.8 replays its entire pane_closed history on every fresh events.subscribe and reuses compact pane_ids when a tab is killed and a new tab takes the same index. CAO triggers a fresh subscribe on every terminal registration via _force_reconnect(), so a replayed close for an old incarnation of a pane arrives mapped to the live terminal now occupying that reused index. _handle_lifecycle_event then deleted a live terminal's DB record. Two observed symptoms from this one cause: - terminal deleted before its handoff message was sent -> message never delivered - terminal deleted milliseconds after send -> agent ran fine but the handoff poller hit 404 every second for minutes (log spam) Neither pane was actually closed; both stayed alive in herdr. Fix: before acting on pane_closed, confirm the terminal's tab label (tmux_window, unique per incarnation and already persisted) is genuinely gone from herdr. Live label means the close is a replayed event for an older incarnation -> skip. Missing label, including user-initiated close, means a real close -> delete. If herdr cannot be queried, fall toward delete so a possibly-closed terminal is never stranded. This mirrors the label-based liveness check _reconcile() already uses. Adds regression tests covering the reused-pane_id skip, the genuine-close delete, and the herdr-unreachable fallback. Assisted-by: Amazon Q Developer
When no response marker was found after escalating through all scrollback steps, get_output always returned [PARTIAL RESPONSE], conflating two distinct cases: - overflow: the marker was produced but pushed past the scrollback cap - no response: the agent completed without producing any text (tool-only turn, crash, or silent timeout) Use buffer fullness as the discriminator. If the retrieved buffer is at least 90% of the last escalation cap, treat it as overflow and keep the [PARTIAL RESPONSE] label. Otherwise emit [NO RESPONSE], which lets the caller distinguish "retry/fetch more" from "agent failed to respond, investigate why" instead of mislabeling a silent turn as partial. Updates tests to cover both the sparse no-response buffer and the near-full overflow buffer. Assisted-by: Amazon Q Developer
HerdrInboxService._reconcile() treated herdr's ephemeral compact pane_ids as stable terminal identity. herdr renumbers pane_ids when a sibling tab in the workspace closes, so a live terminal's stored pane_id falls out of `herdr pane list`. The stale-pane set-diff then deleted the still-running terminal. Every new-worker register_terminal forces a socket reconnect, which runs reconcile over all siblings, so the bug triggered routinely in multi-pane sessions. Reconcile now treats the durable tab label (tmux_window) as identity: - live label -> re-resolve the current pane_id via get_pane_id and re-map both _pane_to_terminal and _terminal_to_pane (no delete) - absent label (or re-resolve failure) -> prune and delete as before Workspace teardown is likewise gated on workspace-label liveness rather than the pane diff, so a workspace whose panes were merely renumbered is never killed out from under working agents. Secondary fix: the workspace.closed handler looked up _workspace_to_session, which is populated only by _reconcile. A workspace that closed before any reconcile cached it produced a silent no-op and leaked the session's terminals as orphan DB rows. Added _resolve_session_from_herdr() to resolve the session from live herdr state when the map misses, treating the event as unresolvable (no destructive action) only after the live query also fails. Removed now-dead _resolve_pane_id() and invalidate_cache() from HerdrBackend (the reconcile path no longer resolves pane_ids by scanning the list) and their tests. Adds 5 regression tests covering pane re-map, delete-on-absent-label, live-workspace protection, uncached workspace.closed resolution, and the unresolvable safe no-op. Assisted by AI
) When cao-server --terminal herdr is used without terminal_backend in config.json, CLI processes (cao launch, cao terminal restore, cao info) defaulted to tmux because they resolve get_backend() independently. - /health endpoint now reports terminal_backend field - New sync_backend_from_server() helper queries /health and aligns the CLI's backend singleton before attach/create_window calls - cao info checks CAO_SESSION_NAME env var (set by herdr) before falling back to tmux display-message
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #309 +/- ##
=======================================
Coverage ? 87.30%
=======================================
Files ? 92
Lines ? 11033
Branches ? 0
=======================================
Hits ? 9632
Misses ? 1401
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR improves CLI/server interoperability by letting the CLI auto-detect the server’s active terminal backend (tmux vs herdr), hardens herdr inbox reconciliation against pane-id renumbering/replay behavior, and makes get_output(..., LAST) differentiate “no response” from “buffer overflow”.
Changes:
- Add
/health.terminal_backendand a CLI helper (sync_backend_from_server) to align the CLI backend with the running server before attach/restore flows. - Fix herdr inbox reconciliation and lifecycle handling to avoid deleting live terminals when compact pane IDs are reused/renumbered.
- Refine
get_output(OutputMode.LAST)messaging to distinguish sparse buffers (no text response) from near-full buffers (likely overflow), with tests.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/utils/test_terminal.py | Adds tests for sync_backend_from_server() behavior and failure handling. |
| test/services/test_terminal_service_full.py | Updates/extends get_output(LAST) tests for “no response” vs overflow. |
| test/cli/commands/test_launch.py | Adds regression tests ensuring backend sync occurs before attach on non-headless launch. |
| test/cli/commands/test_info.py | Adds tests for CAO_SESSION_NAME precedence over tmux detection. |
| test/backends/test_herdr_inbox_service.py | Adds coverage for pane-id reuse/renumbering reconciliation and workspace.closed resolution. |
| test/backends/test_herdr_backend.py | Removes tests for deleted internal cache/pane resolution helpers. |
| test/api/test_api_endpoints.py | Adds assertions that /health reports terminal backend correctly. |
| src/cli_agent_orchestrator/utils/terminal.py | Introduces sync_backend_from_server() helper using /health. |
| src/cli_agent_orchestrator/services/terminal_service.py | Enhances get_output(LAST) to return distinct overflow vs no-response messages. |
| src/cli_agent_orchestrator/services/herdr_inbox_service.py | Prevents stale close/reconcile from deleting live terminals; adds live-label checks and workspace resolution fallback. |
| src/cli_agent_orchestrator/cli/commands/terminal.py | Calls backend sync prior to restore window creation. |
| src/cli_agent_orchestrator/cli/commands/launch.py | Calls backend sync prior to waiting/attaching (non-headless). |
| src/cli_agent_orchestrator/cli/commands/info.py | Uses CAO_SESSION_NAME env var before falling back to tmux session detection. |
| src/cli_agent_orchestrator/backends/herdr_backend.py | Removes unused internal pane-resolution + cache invalidation methods; documents cache usage via get_pane_id(). |
| src/cli_agent_orchestrator/api/main.py | Adds terminal_backend field to /health response. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # response marker was likely produced but pushed past the scrollback | ||
| # limit (overflow). If the buffer is mostly empty, the agent never | ||
| # produced a text response (e.g. only tool calls, crash, or timeout). | ||
| actual_lines = len(full_output.splitlines()) |
| try: | ||
| result = subprocess.run( | ||
| ["herdr", "--session", self._herdr_session, "tab", "list"], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=10, | ||
| ) | ||
| if result.returncode != 0: | ||
| logger.warning( | ||
| "_label_still_live: herdr tab list failed (rc=%s): %s", | ||
| result.returncode, | ||
| result.stderr.strip(), | ||
| ) | ||
| return False | ||
| tab_data = json.loads(result.stdout) | ||
| tabs = tab_data.get("result", {}).get("tabs", []) | ||
| live_labels = {tab.get("label", "") for tab in tabs} | ||
| return window_name in live_labels | ||
| except (subprocess.SubprocessError, json.JSONDecodeError, KeyError, OSError) as e: | ||
| logger.warning("_label_still_live: could not query herdr (%s)", e) | ||
| return False |
Avoids allocating a full line list on large scrollback buffers. Addresses Copilot review comment on PR awslabs#309.
|
Addressed both Copilot comments:
|
|
@anilkmr-a2z Great PR, I found one issue in the remap path that I think should be fixed before merge. In The problem is that So the sequence can be:
This no longer deletes the live terminal, so I would classify it as P2 rather than data-loss severity. But it can still leave event-driven inbox/status delivery pointed at the wrong pane until a later reconnect after cache expiry. Could we make the reconcile remap path bypass/invalidate the pane cache and force the fresh label-based lookup? Focused changed-area tests passed with isolated HOME=$(mktemp -d /tmp/cao-pr309-home.XXXXXX) uv run pytest \
test/backends/test_herdr_inbox_service.py \
test/services/test_terminal_service_full.py \
test/utils/test_terminal.py \
test/cli/commands/test_launch.py \
test/cli/commands/test_info.py \
test/api/test_api_endpoints.py -q
# 284 passedRunning the same focused suite without isolating |
Prevents get_pane_id() from returning the same stale pane_id that reconcile just proved is no longer live, which could leave event delivery pointed at the wrong pane until the cache TTL expires. Addresses review comment by haofeif on PR awslabs#309.
|
Good catch -- the reconcile remap path now invalidates The sequence you described (stale cache -> re-subscribe to dead pane) is no longer possible: the cache entry is popped before the re-resolve call runs. |
haofeif
left a comment
There was a problem hiding this comment.
Thanks @anilkmr-a2z this looks great to me !
Summary
cao launch,cao terminal restore,cao info) now auto-detect the server's terminal backend instead of defaulting to tmuxget_outputdistinguishes buffer overflow from no-responseCloses #308.
Changes
Backend auto-detection (closes #308):
/healthendpoint reportsterminal_backendfield ("tmux"or"herdr")sync_backend_from_server()helper queries/healthand aligns the CLI's backend singleton before attach/create_window callscao infochecksCAO_SESSION_NAMEenv var (set by herdr) before falling back totmux display-messageherdr reconcile fix (pane renumbering):
_label_still_livebefore deletingworkspace.closedhandler gained_resolve_session_from_herdr()fallbackOutput mode fix:
get_outputreturns distinct error for buffer overflow vs empty/no-responseTest plan
uv run pytest --no-cov -p no:cacheprovider -q test/cli/ test/utils/test_terminal.py test/api/test_api_endpoints.py::TestHealthCheck test/backends/test_herdr_inbox_service.py-- all pass