Skip to content

fix(cli): auto-detect server backend + herdr reconcile fixes#309

Merged
haofeif merged 7 commits into
awslabs:mainfrom
anilkmr-a2z:fix/308-cli-backend-detection
Jun 19, 2026
Merged

fix(cli): auto-detect server backend + herdr reconcile fixes#309
haofeif merged 7 commits into
awslabs:mainfrom
anilkmr-a2z:fix/308-cli-backend-detection

Conversation

@anilkmr-a2z

@anilkmr-a2z anilkmr-a2z commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • CLI commands (cao launch, cao terminal restore, cao info) now auto-detect the server's terminal backend instead of defaulting to tmux
  • herdr inbox reconciliation no longer deletes live terminals when pane IDs get renumbered
  • get_output distinguishes buffer overflow from no-response

Closes #308.

Changes

Backend auto-detection (closes #308):

  • /health endpoint reports terminal_backend field ("tmux" or "herdr")
  • New sync_backend_from_server() helper queries /health and aligns the CLI's backend singleton before attach/create_window calls
  • cao info checks CAO_SESSION_NAME env var (set by herdr) before falling back to tmux display-message

herdr reconcile fix (pane renumbering):

  • Reconcile now checks durable tab label via _label_still_live before deleting
  • Live label -> re-resolve pane_id + re-map (continue, no delete); absent label -> prune as before
  • workspace.closed handler gained _resolve_session_from_herdr() fallback

Output mode fix:

  • get_output returns distinct error for buffer overflow vs empty/no-response

Test plan

  • 72 new/modified tests pass across affected areas
  • uv run pytest --no-cov -p no:cacheprovider -q test/cli/ test/utils/test_terminal.py test/api/test_api_endpoints.py::TestHealthCheck test/backends/test_herdr_inbox_service.py -- all pass
  • mypy clean on all modified source files
  • black formatted

herdr 0.6.8 replays its entire pane_closed history on every fresh
events.subscribe and reuses compact pane_ids when a tab is killed and a
new tab takes the same index. CAO triggers a fresh subscribe on every
terminal registration via _force_reconnect(), so a replayed close for an
old incarnation of a pane arrives mapped to the live terminal now
occupying that reused index. _handle_lifecycle_event then deleted a live
terminal's DB record.

Two observed symptoms from this one cause:
- terminal deleted before its handoff message was sent -> message never
  delivered
- terminal deleted milliseconds after send -> agent ran fine but the
  handoff poller hit 404 every second for minutes (log spam)

Neither pane was actually closed; both stayed alive in herdr.

Fix: before acting on pane_closed, confirm the terminal's tab label
(tmux_window, unique per incarnation and already persisted) is genuinely
gone from herdr. Live label means the close is a replayed event for an
older incarnation -> skip. Missing label, including user-initiated close,
means a real close -> delete. If herdr cannot be queried, fall toward
delete so a possibly-closed terminal is never stranded. This mirrors the
label-based liveness check _reconcile() already uses.

Adds regression tests covering the reused-pane_id skip, the genuine-close
delete, and the herdr-unreachable fallback.

Assisted-by: Amazon Q Developer
When no response marker was found after escalating through all scrollback
steps, get_output always returned [PARTIAL RESPONSE], conflating two
distinct cases:
- overflow: the marker was produced but pushed past the scrollback cap
- no response: the agent completed without producing any text (tool-only
  turn, crash, or silent timeout)

Use buffer fullness as the discriminator. If the retrieved buffer is at
least 90% of the last escalation cap, treat it as overflow and keep the
[PARTIAL RESPONSE] label. Otherwise emit [NO RESPONSE], which lets the
caller distinguish "retry/fetch more" from "agent failed to respond,
investigate why" instead of mislabeling a silent turn as partial.

Updates tests to cover both the sparse no-response buffer and the
near-full overflow buffer.

Assisted-by: Amazon Q Developer
HerdrInboxService._reconcile() treated herdr's ephemeral compact
pane_ids as stable terminal identity. herdr renumbers pane_ids when a
sibling tab in the workspace closes, so a live terminal's stored
pane_id falls out of `herdr pane list`. The stale-pane set-diff then
deleted the still-running terminal. Every new-worker register_terminal
forces a socket reconnect, which runs reconcile over all siblings, so
the bug triggered routinely in multi-pane sessions.

Reconcile now treats the durable tab label (tmux_window) as identity:
- live label -> re-resolve the current pane_id via get_pane_id and
  re-map both _pane_to_terminal and _terminal_to_pane (no delete)
- absent label (or re-resolve failure) -> prune and delete as before

Workspace teardown is likewise gated on workspace-label liveness rather
than the pane diff, so a workspace whose panes were merely renumbered is
never killed out from under working agents.

Secondary fix: the workspace.closed handler looked up _workspace_to_session,
which is populated only by _reconcile. A workspace that closed before any
reconcile cached it produced a silent no-op and leaked the session's
terminals as orphan DB rows. Added _resolve_session_from_herdr() to
resolve the session from live herdr state when the map misses, treating
the event as unresolvable (no destructive action) only after the live
query also fails.

Removed now-dead _resolve_pane_id() and invalidate_cache() from
HerdrBackend (the reconcile path no longer resolves pane_ids by scanning
the list) and their tests.

Adds 5 regression tests covering pane re-map, delete-on-absent-label,
live-workspace protection, uncached workspace.closed resolution, and the
unresolvable safe no-op.

Assisted by AI
)

When cao-server --terminal herdr is used without terminal_backend in
config.json, CLI processes (cao launch, cao terminal restore, cao info)
defaulted to tmux because they resolve get_backend() independently.

- /health endpoint now reports terminal_backend field
- New sync_backend_from_server() helper queries /health and aligns the
  CLI's backend singleton before attach/create_window calls
- cao info checks CAO_SESSION_NAME env var (set by herdr) before
  falling back to tmux display-message
@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.73585% with 13 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@bc4d837). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...agent_orchestrator/services/herdr_inbox_service.py 81.94% 13 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #309   +/-   ##
=======================================
  Coverage        ?   87.30%           
=======================================
  Files           ?       92           
  Lines           ?    11033           
  Branches        ?        0           
=======================================
  Hits            ?     9632           
  Misses          ?     1401           
  Partials        ?        0           
Flag Coverage Δ
unittests 87.30% <87.73%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves CLI/server interoperability by letting the CLI auto-detect the server’s active terminal backend (tmux vs herdr), hardens herdr inbox reconciliation against pane-id renumbering/replay behavior, and makes get_output(..., LAST) differentiate “no response” from “buffer overflow”.

Changes:

  • Add /health.terminal_backend and a CLI helper (sync_backend_from_server) to align the CLI backend with the running server before attach/restore flows.
  • Fix herdr inbox reconciliation and lifecycle handling to avoid deleting live terminals when compact pane IDs are reused/renumbered.
  • Refine get_output(OutputMode.LAST) messaging to distinguish sparse buffers (no text response) from near-full buffers (likely overflow), with tests.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/utils/test_terminal.py Adds tests for sync_backend_from_server() behavior and failure handling.
test/services/test_terminal_service_full.py Updates/extends get_output(LAST) tests for “no response” vs overflow.
test/cli/commands/test_launch.py Adds regression tests ensuring backend sync occurs before attach on non-headless launch.
test/cli/commands/test_info.py Adds tests for CAO_SESSION_NAME precedence over tmux detection.
test/backends/test_herdr_inbox_service.py Adds coverage for pane-id reuse/renumbering reconciliation and workspace.closed resolution.
test/backends/test_herdr_backend.py Removes tests for deleted internal cache/pane resolution helpers.
test/api/test_api_endpoints.py Adds assertions that /health reports terminal backend correctly.
src/cli_agent_orchestrator/utils/terminal.py Introduces sync_backend_from_server() helper using /health.
src/cli_agent_orchestrator/services/terminal_service.py Enhances get_output(LAST) to return distinct overflow vs no-response messages.
src/cli_agent_orchestrator/services/herdr_inbox_service.py Prevents stale close/reconcile from deleting live terminals; adds live-label checks and workspace resolution fallback.
src/cli_agent_orchestrator/cli/commands/terminal.py Calls backend sync prior to restore window creation.
src/cli_agent_orchestrator/cli/commands/launch.py Calls backend sync prior to waiting/attaching (non-headless).
src/cli_agent_orchestrator/cli/commands/info.py Uses CAO_SESSION_NAME env var before falling back to tmux session detection.
src/cli_agent_orchestrator/backends/herdr_backend.py Removes unused internal pane-resolution + cache invalidation methods; documents cache usage via get_pane_id().
src/cli_agent_orchestrator/api/main.py Adds terminal_backend field to /health response.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# response marker was likely produced but pushed past the scrollback
# limit (overflow). If the buffer is mostly empty, the agent never
# produced a text response (e.g. only tool calls, crash, or timeout).
actual_lines = len(full_output.splitlines())
Comment on lines +590 to +610
try:
result = subprocess.run(
["herdr", "--session", self._herdr_session, "tab", "list"],
capture_output=True,
text=True,
timeout=10,
)
if result.returncode != 0:
logger.warning(
"_label_still_live: herdr tab list failed (rc=%s): %s",
result.returncode,
result.stderr.strip(),
)
return False
tab_data = json.loads(result.stdout)
tabs = tab_data.get("result", {}).get("tabs", [])
live_labels = {tab.get("label", "") for tab in tabs}
return window_name in live_labels
except (subprocess.SubprocessError, json.JSONDecodeError, KeyError, OSError) as e:
logger.warning("_label_still_live: could not query herdr (%s)", e)
return False
Avoids allocating a full line list on large scrollback buffers.
Addresses Copilot review comment on PR awslabs#309.
@anilkmr-a2z

Copy link
Copy Markdown
Contributor Author

Addressed both Copilot comments:

  1. terminal_service.py splitlines() -- Fixed: switched to full_output.count('\n') + 1 to avoid allocating the full line list on large scrollback buffers. Pushed in 1361e17.

  2. herdr_inbox_service.py tab-label caching -- Acknowledged, but deferring to a follow-up. Adding a TTL cache would introduce new state management and testing surface that's out of scope for this PR. The current implementation is correct (calls herdr tab list per stale pane), just not optimal for large reconcile batches. Will track separately.

@haofeif

haofeif commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@anilkmr-a2z Great PR, I found one issue in the remap path that I think should be fixed before merge.

In _reconcile(), once we detect that the stored pane_id is missing from herdr's live pane list, we confirm the tab label is still live and then call get_backend().get_pane_id(...) to re-resolve the current pane.

The problem is that HerdrBackend.get_pane_id() first returns _pane_cache[terminal_id] while the 5s TTL is valid. That cache was populated during create/register, so in a fast pane-renumber/reconnect path it can return the same stale pane id that _reconcile() just proved is no longer live.

So the sequence can be:

  1. detect pane-old is stale
  2. confirm the tab label is still live
  3. call get_pane_id(...)
  4. get cached pane-old back
  5. write pane-old back into _pane_to_terminal
  6. subscribe to a dead pane id

This no longer deletes the live terminal, so I would classify it as P2 rather than data-loss severity. But it can still leave event-driven inbox/status delivery pointed at the wrong pane until a later reconnect after cache expiry.

Could we make the reconcile remap path bypass/invalidate the pane cache and force the fresh label-based lookup?


## Validation

Checked formatting:

```bash
git diff --check origin/main...HEAD
# passed

Focused changed-area tests passed with isolated HOME so local CAO config did not force herdr during tmux-oriented utility tests:

HOME=$(mktemp -d /tmp/cao-pr309-home.XXXXXX) uv run pytest \
  test/backends/test_herdr_inbox_service.py \
  test/services/test_terminal_service_full.py \
  test/utils/test_terminal.py \
  test/cli/commands/test_launch.py \
  test/cli/commands/test_info.py \
  test/api/test_api_endpoints.py -q

# 284 passed

Running the same focused suite without isolating HOME failed three legacy wait_for_shell tests because this machine's user config resolves the lazy backend to herdr while those tests assume tmux. I did not treat that as a PR blocker for #309, but it is a test hermeticity issue worth fixing separately.

Prevents get_pane_id() from returning the same stale pane_id that
reconcile just proved is no longer live, which could leave event
delivery pointed at the wrong pane until the cache TTL expires.

Addresses review comment by haofeif on PR awslabs#309.
@anilkmr-a2z

Copy link
Copy Markdown
Contributor Author

Good catch -- the reconcile remap path now invalidates _pane_cache[terminal_id] before calling get_pane_id(), forcing the fresh label-based lookup. Pushed in 4b6dcd1.

The sequence you described (stale cache -> re-subscribe to dead pane) is no longer possible: the cache entry is popped before the re-resolve call runs.

@haofeif haofeif left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anilkmr-a2z this looks great to me !

@haofeif haofeif merged commit 1057d5d into awslabs:main Jun 19, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: cao launch tries tmux attach when server uses --terminal herdr

4 participants