
Release v3.3.0 — cluster-grade runtime + review hardening#47

Merged
ZhiXiao-Lin merged 27 commits into main from release/v3.3.0 on May 29, 2026

@ZhiXiao-Lin
Contributor

Summary

a3s-code v3.3.0 — cluster-grade runtime features for hosting long-lived
agent sessions across nodes, plus an adversarial-review hardening pass.
All additions are backward compatible (new methods, new optional
fields, SessionStore trait methods with default no-op impls) → minor bump.

Full details in CHANGELOG.md. Highlights:

Added

  • Lifecycle: real AgentSession::close() (cancels run + subagent tasks +
    HITL), agent-side session registry (list_sessions/close_session/close),
    session-level CancellationToken hierarchy.
  • Identity labels: tenant_id/principal/agent_template_id/correlation_id
    on SessionOptions+SessionData (opaque, host-driven), exposed on both SDKs.
  • BudgetGuard cost/quota contract wired at the LLM call site (Deny →
    BudgetExhausted, SoftLimit → event). SDK bridges (Python class, Node
    setBudgetGuard, fail-closed).
  • HostEnv (IdGenerator + Clock) injection for deterministic replay.
  • Loop checkpoints + resume_run: per-tool-round persistence (crash-atomic
    file writes), resume from the last boundary on any node sharing the store.
  • SessionRetentionLimits: FIFO caps on run/event/trace/terminal-subagent
    state. MCP idle disconnect. Cluster AgentEvent variants. Subagent
    tracker persistence.

Fixed (review hardening)

  • Loop-checkpoint leak (unbounded growth) — now cleared on terminal.
  • event_count corruption on restore. Node BudgetGuard fail-open → fail-closed.
  • MCP timestamp leak. Session-registry Weak pruning. Eviction TOCTOU
    (atomic multi-lock).

Testing

  • 1705 lib + all integration files green; Node 27 + Python 19 cargo tests;
    all SDK smokes; clippy clean (core + both SDKs); cargo fmt --check clean.
  • Real-LLM end-to-end: 5 #[ignore] tests pass against
    openai/MiniMax-M2.7-highspeed (BudgetGuard real usage, deny-blocks-call,
    resume_run metric carry-forward, checkpoint clear, identity labels).
  • Version alignment green at 3.3.0 (scripts/check_release_versions.sh).

Release

Merging this, then pushing the v3.3.0 tag, triggers release.yml
(crates.io → npm → GH Release wheels → PyPI bootstrap → GitHub Release).

Known limitation

Node BudgetGuard callbacks must not throw (napi return-conversion aborts the
process); wrap in try/catch and return a decision. Hangs are fail-closed.
Python is unaffected.

claude added 27 commits May 28, 2026 10:57
Turn AgentSession::close() from a thin cancel() wrapper into a real
cleanup path:

- New `closed: AtomicBool` state; `is_closed()` accessor.
- send / stream / send_with_attachments / stream_with_attachments
  fast-fail with new `CodeError::SessionClosed { session_id }` once
  closed.
- close() is idempotent and runs the full sequence: cancel current
  run -> cancel all in-flight delegated subagent tasks -> cancel
  pending HITL confirmations.

Tests cover the closed-state lifecycle (idempotency, post-close
send/stream fast-fail, close-mid-flight cancellation).
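
The closed-state contract above can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: `SketchSession` and `SketchError` are hypothetical stand-ins for `AgentSession` and `CodeError::SessionClosed`, and the real cleanup sequence is reduced to a comment.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for `CodeError::SessionClosed { session_id }`.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    SessionClosed { session_id: String },
}

/// Hypothetical miniature of a session that tracks its closed state.
pub struct SketchSession {
    id: String,
    closed: AtomicBool,
}

impl SketchSession {
    pub fn new(id: &str) -> Self {
        Self { id: id.to_string(), closed: AtomicBool::new(false) }
    }

    pub fn is_closed(&self) -> bool {
        self.closed.load(Ordering::SeqCst)
    }

    /// Returns true only for the call that actually performed the close.
    pub fn close(&self) -> bool {
        // `swap` returns the previous value, so exactly one caller sees `false`.
        if self.closed.swap(true, Ordering::SeqCst) {
            return false;
        }
        // The real sequence would run here: cancel the current run,
        // then in-flight subagent tasks, then pending HITL confirmations.
        true
    }

    /// Fast-fails once the session is closed.
    pub fn send(&self, prompt: &str) -> Result<String, SketchError> {
        if self.is_closed() {
            return Err(SketchError::SessionClosed { session_id: self.id.clone() });
        }
        Ok(format!("echo: {}", prompt))
    }
}
```

Using `swap` rather than a separate load and store is what keeps `close()` idempotent under concurrent callers: only the first caller runs the cleanup path.
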
Make AgentSession the parent of all in-flight cancellation tokens so
session shutdown cascades automatically and races between close() and
new operations are safe.

- New `session_cancel: CancellationToken` field on AgentSession;
  public `session_cancel_token()` accessor lets embedders observe or
  fire it directly.
- BlockingRunContext and StreamRunContext derive their per-run token
  via `session.session_cancel.child_token()` instead of fresh tokens.
- Run lifecycles classify a failed run as Cancelled vs Failed by
  sampling the per-run token's `is_cancelled()` flag before clearing
  it — so runs killed via session_cancel land as Cancelled in the
  run store, not Failed.
- close() fires session_cancel before per-run bookkeeping so any
  concurrently-spawned child inherits an already-cancelled token.

Tests cover the token lifecycle and propagation to in-flight runs.
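
A rough sketch of the parent/child token behaviour described above, using only the standard library. `SketchToken` is a simplified, hypothetical stand-in for a hierarchical cancellation token (it polls flags and has no async wake-up), and `classify_failed_run` is an illustrative helper, not a function from the crate.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Simplified stand-in for a hierarchical cancellation token.
#[derive(Clone)]
pub struct SketchToken {
    own: Arc<AtomicBool>,
    // Flags of every ancestor, nearest first.
    ancestors: Vec<Arc<AtomicBool>>,
}

impl SketchToken {
    pub fn new() -> Self {
        Self { own: Arc::new(AtomicBool::new(false)), ancestors: Vec::new() }
    }

    /// A child observes its own flag plus every ancestor flag.
    pub fn child_token(&self) -> Self {
        let mut ancestors = vec![Arc::clone(&self.own)];
        ancestors.extend(self.ancestors.iter().cloned());
        Self { own: Arc::new(AtomicBool::new(false)), ancestors }
    }

    /// Cancels this token and, implicitly, every descendant.
    pub fn cancel(&self) {
        self.own.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.own.load(Ordering::SeqCst)
            || self.ancestors.iter().any(|a| a.load(Ordering::SeqCst))
    }
}

/// Classify a failed run by sampling the per-run token before it is cleared.
pub fn classify_failed_run(run_token: &SketchToken) -> &'static str {
    if run_token.is_cancelled() { "Cancelled" } else { "Failed" }
}
```

Because a child reads its ancestors' flags, a child created after the session token has fired is already cancelled, which is the property the close() ordering above relies on. Cancelling a child never affects its parent.
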
Make agent-side proactive shutdown a first-class API: enumerate live
sessions, close one by ID, or close the whole agent (and its global
MCP) in a single call.

- New `agent_api/session_close.rs` module: `SessionCloseHandle` is an
  Arc-shareable bundle of every field needed to perform the close
  sequence. AgentSession holds an `Arc<SessionCloseHandle>`;
  Agent stores `Weak<SessionCloseHandle>` in its registry.
- Agent gains `sessions: Mutex<HashMap<id, Weak<SessionCloseHandle>>>`
  and `closed: AtomicBool`. New public methods: `list_sessions()`,
  `close_session(id)`, `close()`, `is_closed()`.
- AgentSession::close() refactored to a single-line delegate to the
  shared handle, so `agent.close_session(id)` and `session.close()`
  can never drift in behaviour.
- `Agent::close()` also walks `global_mcp.list_connected()` and
  disconnects every server so background workers exit.
- `Agent::session()` / `resume_session()` fail fast with
  `CodeError::SessionClosed` once the agent is closed.
- Dead `Weak` entries are pruned lazily on `list_sessions()` /
  `close_session()` access — no background sweeper needed.

Tests cover live-tracking with drop-driven pruning, close-by-id,
agent-wide close, and post-close session() rejection.
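
The Weak-reference registry with lazy pruning might look roughly like this. `SketchHandle` and `SketchRegistry` are hypothetical miniatures of `SessionCloseHandle` and the agent's `sessions` map; the close sequence itself is omitted.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical stand-in for `SessionCloseHandle`.
pub struct SketchHandle {
    pub id: String,
}

/// Hypothetical miniature of the agent-side session registry.
#[derive(Default)]
pub struct SketchRegistry {
    sessions: Mutex<HashMap<String, Weak<SketchHandle>>>,
}

impl SketchRegistry {
    /// Register a session and hand the only strong reference to the caller.
    pub fn open(&self, id: &str) -> Arc<SketchHandle> {
        let handle = Arc::new(SketchHandle { id: id.to_string() });
        self.sessions
            .lock()
            .unwrap()
            .insert(id.to_string(), Arc::downgrade(&handle));
        handle
    }

    /// List live session ids, pruning entries whose session was dropped.
    pub fn list_sessions(&self) -> Vec<String> {
        let mut map = self.sessions.lock().unwrap();
        map.retain(|_, weak| weak.strong_count() > 0);
        let mut ids: Vec<String> = map.keys().cloned().collect();
        ids.sort();
        ids
    }
}
```

Since the registry only holds `Weak` references, it never keeps a session alive; dropping the session is enough for the next `list_sessions()` call to forget it.
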
Propagate the new core close-and-registry surface to the Node and
Python SDKs.

Node (napi):
- Agent: listSessions / closeSession / close / isClosed.
- Session: isClosed.

Python (pyo3):
- Agent: list_sessions / close_session / close + is_closed getter.
- Session: is_closed getter.

README updated to show the agent-side lifecycle calls alongside the
existing session.cancel() / session.close() examples.

Cover the surfaces unit tests can't reach:

- core/tests/test_session_close_lifecycle.rs (3 tests):
  * close_with_subagent_in_flight_marks_task_cancelled_and_resists_regression
    — verifies the session-close → subagent-tracker cross-module path,
    including the "late SubagentEnd must not regress Cancelled" contract.
  * agent_close_handles_global_mcp_branch_and_is_idempotent
    — exercises Agent::close() both with and without global_mcp, and
    confirms post-close session() rejection.
  * session_drop_prunes_registry_under_concurrency
    — multi-thread runtime spawns 32 sessions, drops half, asserts the
    Weak-ref registry converges to exactly the held set.

- sdk/python/tests/test_session_close.py
- sdk/node/test_session_close.mjs
  Smoke tests for the SDK wrappers: session.is_closed,
  session.close() idempotency, agent.list_sessions(),
  agent.close_session(id), agent.close() rejecting later session().
  Verified runnable locally (maturin develop / napi build).

Adds `AgentSession::subagent_tracker()` accessor so embedders with a
custom subagent execution path can feed lifecycle events through the
same close() pipeline as the built-in `task` tool — also unblocks the
in-flight integration test.

generated.d.ts is regenerated by `napi build` and now reflects the
step-4 Agent / Session close methods.

Close the framework gap that blocked 书安OS from migrating sessions:
session.save() now also persists the InMemorySubagentTaskTracker
snapshots, and resume_session restores them.

- SessionStore trait: new save_subagent_tasks / load_subagent_tasks
  with no-op defaults (backwards compatible for custom backends).
- MemorySessionStore + FileSessionStore implement them; FileSessionStore
  writes to <dir>/subagent_tasks/<id>.json and cleans up on delete().
- InMemorySubagentTaskTracker::replace_snapshots() — restoration entry
  point. Cancellers are intentionally NOT restored (runtime-only).
- session_persistence: wired into both save() and
  restore_persisted_session_state().

Integration test (test_session_close_lifecycle::subagent_tasks_persist
_across_save_and_resume) covers the full matrix of terminal states
(Completed/Failed/Cancelled) and verifies the post-restore semantics:
late cancel attempts on terminal tasks safely return false.

Before this, restoring a session lost its delegated child run history —
which is the materialized view 书安OS needs to make scheduling /
billing / drain decisions on a session about to migrate.
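
The "no-op defaults" approach to extending a store trait can be illustrated as follows. This is a synchronous, simplified sketch: `SketchStore`, `LegacyStore`, and `UpdatedStore` are hypothetical, and tasks are reduced to plain strings rather than real tracker snapshots.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous miniature of a store trait.
pub trait SketchStore {
    fn save_session(&mut self, id: &str, data: &str);

    // Added later with no-op defaults, so older backends compile unchanged.
    fn save_subagent_tasks(&mut self, _id: &str, _tasks: &[String]) {}
    fn load_subagent_tasks(&self, _id: &str) -> Vec<String> {
        Vec::new()
    }
}

/// A backend written before the new methods existed.
#[derive(Default)]
pub struct LegacyStore {
    pub sessions: HashMap<String, String>,
}

impl SketchStore for LegacyStore {
    fn save_session(&mut self, id: &str, data: &str) {
        self.sessions.insert(id.to_string(), data.to_string());
    }
}

/// A backend that opts in to subagent task persistence.
#[derive(Default)]
pub struct UpdatedStore {
    pub sessions: HashMap<String, String>,
    pub tasks: HashMap<String, Vec<String>>,
}

impl SketchStore for UpdatedStore {
    fn save_session(&mut self, id: &str, data: &str) {
        self.sessions.insert(id.to_string(), data.to_string());
    }
    fn save_subagent_tasks(&mut self, id: &str, tasks: &[String]) {
        self.tasks.insert(id.to_string(), tasks.to_vec());
    }
    fn load_subagent_tasks(&self, id: &str) -> Vec<String> {
        self.tasks.get(id).cloned().unwrap_or_default()
    }
}
```

A custom backend that ignores the new methods keeps working but silently restores an empty task history, which is the trade-off of a backward-compatible default.
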
…emplate / correlation (P5)

Add four optional, framework-opaque string slots to SessionOptions ->
AgentSession -> SessionData so the host (书安OS) can drive multi-tenancy
/ accounting / distributed tracing without string-hacking session_id.

- SessionOptions builder helpers: with_tenant_id / with_principal /
  with_agent_template_id / with_correlation_id.
- AgentSession accessors: tenant_id() / principal() /
  agent_template_id() / correlation_id().
- Round-trip via SessionData (#[serde(default, skip_serializing_if =
  "Option::is_none")]) — forward-compatible for stores written by older
  versions.
- apply_persisted_runtime_options restores labels on resume but caller-
  supplied opts take precedence (so a resume can intentionally relabel).

Framework only transports these strings — it never interprets tenant
boundaries, enforces quotas, or routes by template. That belongs to
custom HookExecutor / PermissionChecker / (future) BudgetGuard impls
the host plugs in.

Tests: unit (defaults + round-trip via SessionOptions) + integration
(persist & restore through MemorySessionStore, including caller-side
override).
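
The resume-time precedence rule (caller-supplied labels win, persisted labels fill the gaps) reduces to `Option::or` per field. `SketchLabels` and `merge_labels` are hypothetical names used only for this illustration.

```rust
/// Hypothetical holder for the four opaque identity labels.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct SketchLabels {
    pub tenant_id: Option<String>,
    pub principal: Option<String>,
    pub agent_template_id: Option<String>,
    pub correlation_id: Option<String>,
}

/// Caller-supplied values take precedence; persisted values fill the gaps.
pub fn merge_labels(caller: SketchLabels, persisted: SketchLabels) -> SketchLabels {
    SketchLabels {
        tenant_id: caller.tenant_id.or(persisted.tenant_id),
        principal: caller.principal.or(persisted.principal),
        agent_template_id: caller.agent_template_id.or(persisted.agent_template_id),
        correlation_id: caller.correlation_id.or(persisted.correlation_id),
    }
}
```

The labels stay opaque throughout: nothing in the merge inspects or validates the strings.
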
…PassivationRequested / PeerInvocation (P6)

Extend AgentEvent with three variants that the host (书安OS) emits via
HookExecutor to surface platform-level decisions inside the session
loop:

- BudgetThresholdHit { resource, kind, consumed, limit, message? } —
  host's BudgetGuard fired a soft/hard threshold. In-session policy
  decides what to do (compact, refuse next LLM call, etc).
- PassivationRequested { reason, deadline_ms? } — host wants the
  session to release in-memory caches before close/migration. Framework
  does not act on this; in-session hooks can flush derived state.
- PeerInvocation { from_session_id, from_tenant_id?, correlation_id? }
  — marks a send/stream that originated from a peer session in the
  cluster, so hooks can distinguish human-driven from peer-driven work.

These variants are pure schema additions. The framework never produces
them itself — only the host's transport / control plane does. They
exist so cluster events flow through the same hook/trace pipeline as
agent-loop events, without coupling the framework to a transport.

#[non_exhaustive] on AgentEvent (already present) keeps minor releases
non-breaking. New variants use #[serde(default, skip_serializing_if =
"Option::is_none")] for optional fields to allow forward-compatible
producers.

Test: schema lock unit test verifies stable JSON tags (used by
external producers) and round-trip with minimal payloads.

Define the cluster-grade cost/quota contract a3s-code owns at the
framework level, with enforcement wired into the LLM call path. Real
implementations (per-tenant Redis budgets, per-day USD caps, etc) live
in 书安OS.

- core/budget.rs:
  * BudgetGuard trait with default no-op methods (check_before_llm,
    record_after_llm, check_before_tool).
  * BudgetDecision { Allow, SoftLimit, Deny } — Allow proceeds silently,
    SoftLimit emits a BudgetThresholdHit("soft") event and continues,
    Deny bails with CodeError::BudgetExhausted.
  * NoopBudgetGuard as the framework default.
- SessionOptions::with_budget_guard + AgentSession routes it through
  AgentConfig.budget_guard.
- agent/llm_turn.rs::call_llm_with_circuit_breaker now consults the
  guard *once per turn* (not per retry):
  * Deny  -> emit BudgetThresholdHit("hard") + return Err with
            "Budget exhausted on '{resource}': {reason}"
  * Soft  -> emit BudgetThresholdHit("soft") + proceed
  * Allow -> proceed silently
  And calls record_after_llm with the real provider usage on success.
- New CodeError::BudgetExhausted variant for SDK-level type matching.
- Internal estimate_prompt_tokens helper (char/4 heuristic) — impls
  needing precision should use record_after_llm instead.

Tests: NoopBudgetGuard fast-path; custom guard returning Deny; full
integration via AgentSession::send verifies the LLM is never reached,
history stays clean, record_after_llm doesn't fire on Deny.

The framework still does not interpret tenants / cost / time itself —
the trait is the boundary between framework mechanism and host policy.
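
A synchronous sketch of the "consult the guard once per turn, not per retry" flow. The names (`SketchGuard`, `SketchDecision`, `CountingGuard`, `call_llm_guarded`) and the plain-string events are assumptions for illustration; the real trait is async and carries richer payloads.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum SketchDecision {
    Allow,
    SoftLimit { resource: String },
    Deny { resource: String, reason: String },
}

/// Hypothetical guard trait with no-op defaults.
pub trait SketchGuard {
    fn check_before_llm(&mut self) -> SketchDecision {
        SketchDecision::Allow
    }
    fn record_after_llm(&mut self, _total_tokens: u64) {}
}

/// Test-style guard that counts calls and returns a fixed decision.
pub struct CountingGuard {
    pub decision: SketchDecision,
    pub checks: u32,
    pub recorded: Vec<u64>,
}

impl SketchGuard for CountingGuard {
    fn check_before_llm(&mut self) -> SketchDecision {
        self.checks += 1;
        self.decision.clone()
    }
    fn record_after_llm(&mut self, total_tokens: u64) {
        self.recorded.push(total_tokens);
    }
}

/// `call` stands in for one provider attempt and returns the tokens used.
pub fn call_llm_guarded<G, F>(
    guard: &mut G,
    events: &mut Vec<String>,
    max_attempts: u32,
    mut call: F,
) -> Result<u64, String>
where
    G: SketchGuard,
    F: FnMut(u32) -> Result<u64, String>,
{
    // The guard is consulted once, before the retry loop starts.
    match guard.check_before_llm() {
        SketchDecision::Deny { resource, reason } => {
            events.push("BudgetThresholdHit(hard)".to_string());
            return Err(format!("Budget exhausted on '{}': {}", resource, reason));
        }
        SketchDecision::SoftLimit { .. } => {
            events.push("BudgetThresholdHit(soft)".to_string());
        }
        SketchDecision::Allow => {}
    }
    let mut last_err = String::from("no attempts made");
    for attempt in 0..max_attempts {
        match call(attempt) {
            Ok(tokens) => {
                // Real usage is reported only on success.
                guard.record_after_llm(tokens);
                return Ok(tokens);
            }
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

On Deny the closure is never invoked, so the provider is not reached and `record_after_llm` does not fire, matching the integration test described above.
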
Lift the framework's ambient capabilities — fresh IDs and current time —
out of `uuid::Uuid::new_v4()` / `SystemTime::now()` direct calls and
into a host-provided pair. Unlocks two cluster-grade features:

  1. Deterministic replay of a run on another node (book-keep the seed,
     replay bit-identical).
  2. Time-bending tests without monkey-patching std::time.

- core/host_env.rs:
  * `IdGenerator` + `Clock` traits (Send+Sync+Debug).
  * `HostEnv { id_generator, clock }` bundle so a single Arc slot
    plumbs both.
  * Defaults: `SystemIdGenerator` (UUID v4) + `SystemClock` (wall
    time) — observably identical to pre-P2 behaviour.
  * Replay helpers: `SequentialIdGenerator { prefix, counter }` +
    `FixedClock { atomic now_ms }`. Public so external host crates
    (e.g. 书安OS replay) can use without copying the pattern.

- AgentConfig gains `host_env: Arc<HostEnv>` defaulting to
  `HostEnv::system()`. SessionOptions::with_host_env() lets the host
  swap it.

- Migrated two highest-leverage call sites (proof-of-pattern; the rest
  is leaf work that can move incrementally):
  * session_id generation in session_builder::prepare_session_options
  * run_id  generation in RunControlState::start_run via new
    `InMemoryRunStore::create_run_with_id` overload (back-compat alias
    `create_run` kept).

Tests: trait roundtrips (Seq/FixedClock determinism), and a full
integration that wires SequentialIdGenerator into a fresh Agent and
verifies the resulting sessions have session_id "test-0", "test-1".

The framework still does not interpret IDs — it just consults the
generator. Replay infrastructure (P3) lives on top of this contract.
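
The replay helpers can be approximated with two small traits and atomics. This is a sketch under assumptions: the trait and type names are hypothetical, `next_id` is an invented method name, and `replay` is a convenience constructor that the commit does not mention.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical miniature of the `IdGenerator` trait.
pub trait IdGen: Send + Sync {
    fn next_id(&self) -> String;
}

/// Hypothetical miniature of the `Clock` trait.
pub trait SketchClock: Send + Sync {
    fn now_ms(&self) -> u64;
}

/// Deterministic ids: "<prefix>-0", "<prefix>-1", ...
pub struct SeqIdGen {
    prefix: String,
    counter: AtomicU64,
}

impl IdGen for SeqIdGen {
    fn next_id(&self) -> String {
        let n = self.counter.fetch_add(1, Ordering::SeqCst);
        format!("{}-{}", self.prefix, n)
    }
}

/// A clock frozen at a fixed instant.
pub struct FrozenClock {
    now_ms: AtomicU64,
}

impl SketchClock for FrozenClock {
    fn now_ms(&self) -> u64 {
        self.now_ms.load(Ordering::SeqCst)
    }
}

/// Bundle so a single slot plumbs both capabilities.
pub struct SketchHostEnv {
    pub id_generator: Arc<dyn IdGen>,
    pub clock: Arc<dyn SketchClock>,
}

impl SketchHostEnv {
    /// Builds a fully deterministic environment for replay or tests.
    pub fn replay(prefix: &str, now_ms: u64) -> Self {
        Self {
            id_generator: Arc::new(SeqIdGen {
                prefix: prefix.to_string(),
                counter: AtomicU64::new(0),
            }),
            clock: Arc::new(FrozenClock { now_ms: AtomicU64::new(now_ms) }),
        }
    }
}
```

Two environments built from the same prefix and timestamp yield the same id sequence, which is what makes a recorded run replayable on another node.
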
Land the data contract and persistence wiring for crash-tolerant
runs. After each completed tool round the agent loop snapshots the
boundary into a LoopCheckpoint keyed by run_id; 书安OS picks this up
to migrate / replay a run on another node.

Boundary policy: checkpoints are taken **only between** tool rounds,
never mid-tool. If a process dies mid-tool the work of that round is
lost — the LLM re-deliberates from the previous checkpoint. This
trades retry cost for correctness; re-executing a non-idempotent
tool (write, bash) across the boundary is worse than re-asking.

What this cut delivers:

- core/loop_checkpoint.rs:
  * `LoopCheckpoint` struct (serde, schema_version=1 with forward-
    compatible `#[serde(default)]`).
  * `LoopCheckpointSink` trait — save_checkpoint / load_latest.
  * `SessionStoreCheckpointSink` default adapter — forwards into a
    `SessionStore`; failures are warn-logged but never halt a run.

- `SessionStore` trait: save_loop_checkpoint / load_loop_checkpoint
  with no-op defaults (backwards compatible). MemorySessionStore +
  FileSessionStore implement them; FileSessionStore writes under
  `<dir>/loop_checkpoints/<run_id>.json` (keyed by run_id, not
  session_id — multiple runs per session).

- `AgentLoop`: new `checkpoint_sink` + `checkpoint_run_id` fields;
  builder helpers `with_checkpoint_sink` + `set_checkpoint_run`.
  `build_agent_loop` auto-wires a sink from `session.session_store`
  when one is configured. BlockingRunContext + StreamRunContext bind
  the run id via `set_checkpoint_run` after `start_run` returns.

- `loop_runtime::execute_loop_inner` calls
  `persist_loop_checkpoint` after each successful `execute_tool_turn`,
  capturing `(turn, messages, total_usage, tool_calls_count,
  verification_reports, host_env.now_ms())`. Checkpoint write is
  fire-and-forget — sink errors don't propagate to the loop.

Tests:
- core/src/loop_checkpoint.rs: JSON round-trip + forward-compat
  (missing schema_version defaults to 0).
- integration:
  * loop_checkpoint_round_trips_through_session_store — full store
    contract: save → load → identical, second save overwrites,
    unknown run_id → None.
  * send_without_tool_calls_does_not_emit_loop_checkpoint —
    contractual negative: no tool rounds → no checkpoint pollution.

Cut 2 (separate commit) will add `AgentSession::resume_run(run_id)`
to actually pick up from a checkpoint, plus the end-to-end "crash
mid-run on node A, finish on node B" integration test that needs a
tool-using mock LLM. Data contract lands now so 书安OS can start
building the surrounding control plane.
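
The boundary policy and the fire-and-forget write can be sketched as below. Everything here is a simplified assumption: `SketchCheckpoint` carries only a few of the real fields, `MemorySink` is an in-memory stand-in keyed by run id, and `run_tool_rounds` fakes the agent loop with string messages.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct SketchCheckpoint {
    pub run_id: String,
    pub turn: u32,
    pub messages: Vec<String>,
}

/// Hypothetical miniature of a checkpoint sink.
pub trait SketchSink {
    fn save_checkpoint(&mut self, cp: SketchCheckpoint) -> Result<(), String>;
    fn load_latest(&self, run_id: &str) -> Option<SketchCheckpoint>;
}

#[derive(Default)]
pub struct MemorySink {
    by_run: HashMap<String, SketchCheckpoint>,
    fail_writes: bool,
}

impl MemorySink {
    /// A sink whose writes always fail, to exercise the warn-and-continue path.
    pub fn failing() -> Self {
        Self { by_run: HashMap::new(), fail_writes: true }
    }
}

impl SketchSink for MemorySink {
    fn save_checkpoint(&mut self, cp: SketchCheckpoint) -> Result<(), String> {
        if self.fail_writes {
            return Err("simulated store outage".to_string());
        }
        // Keyed by run id: a later save overwrites the earlier boundary.
        self.by_run.insert(cp.run_id.clone(), cp);
        Ok(())
    }

    fn load_latest(&self, run_id: &str) -> Option<SketchCheckpoint> {
        self.by_run.get(run_id).cloned()
    }
}

/// Runs `rounds` fake tool rounds, snapshotting only after each completed round.
pub fn run_tool_rounds(sink: &mut dyn SketchSink, run_id: &str, rounds: u32) -> Vec<String> {
    let mut messages = vec!["user: start".to_string()];
    for turn in 1..=rounds {
        messages.push(format!("tool-result: round {}", turn));
        let cp = SketchCheckpoint {
            run_id: run_id.to_string(),
            turn,
            messages: messages.clone(),
        };
        // Sink errors are logged and swallowed; they never halt the run.
        if let Err(e) = sink.save_checkpoint(cp) {
            eprintln!("warn: checkpoint write failed: {}", e);
        }
    }
    messages
}
```

A run with zero tool rounds writes nothing, and a failing sink leaves the loop's result untouched, mirroring the two contractual properties listed in the tests above.
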
…3 cut 2)

Land the resume side of the loop-checkpoint contract. After P3 cut 1
the framework persisted per-tool-round boundaries; this cut lets a
new process pick up that boundary and continue.

API:
- `AgentSession::resume_run(checkpoint_run_id: &str) -> Result<AgentResult>`
  loads the latest `LoopCheckpoint` for the given run id from the
  session's `SessionStore` and replays the agent loop from the
  checkpoint's `messages`.

Semantics:
- A **new** run id is allocated for the resumed work — the framework
  does not pretend the old run continues. The relationship between
  old and new run is host metadata (e.g. 书安OS tracks it).
- The new run also writes per-tool-round checkpoints (P3 cut 1
  wiring), so a session can survive multiple node failures.
- Two distinguishable error paths so cluster scheduling code can
  branch:
  * "resume_run requires a session_store" — host should fall back
    to a fresh session.
  * "no loop checkpoint found for run 'X'" — host can retry later
    (race against checkpoint write) or treat the run as lost.

Implementation:
- `conversation_runtime::resume_run` mirrors `send_with_attachments`'
  structure (BlockingRunContext + execute_from_messages) but feeds
  `checkpoint.messages` as the prebuilt message list, so the loop
  resumes at exactly the boundary state instead of appending a new
  user prompt.

Integration test (`resume_run_error_paths_are_distinguishable`)
covers both error branches so the strings stay stable for host-side
matching.

Together with cut 1, P3 now provides the full mechanism 书安OS needs:
the framework persists boundary state automatically during a live
run, and a fresh node can replay from any persisted boundary by
calling resume_run on a freshly-created session bound to the same
store. The framework still does not handle node selection, drain
choreography, or run-graph metadata — those remain 书安OS concerns.
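
Host-side branching on the two error strings might look like the following. The error strings are taken from the commit message above; `resume_run_sketch`, `HostAction`, and `plan_after_resume` are hypothetical, and the store is reduced to a map from run id to messages.

```rust
use std::collections::HashMap;

/// (new run id, messages to replay) on success, error string otherwise.
pub type ResumeOutcome = Result<(String, Vec<String>), String>;

#[derive(Debug, PartialEq)]
pub enum HostAction {
    Resumed(String),
    FreshSession,
    RetryLater,
    TreatAsLost,
}

/// Simplified stand-in for the resume path: the store maps run id to messages.
pub fn resume_run_sketch(
    store: Option<&HashMap<String, Vec<String>>>,
    checkpoint_run_id: &str,
    next_run_id: &str,
) -> ResumeOutcome {
    let store = store.ok_or_else(|| "resume_run requires a session_store".to_string())?;
    let messages = store
        .get(checkpoint_run_id)
        .cloned()
        .ok_or_else(|| format!("no loop checkpoint found for run '{}'", checkpoint_run_id))?;
    // A new run id is allocated; the old run is never "continued".
    Ok((next_run_id.to_string(), messages))
}

/// Hypothetical scheduler policy keyed on the stable error strings.
pub fn plan_after_resume(outcome: &ResumeOutcome) -> HostAction {
    match outcome {
        Ok((new_run_id, _)) => HostAction::Resumed(new_run_id.clone()),
        Err(msg) if msg.contains("requires a session_store") => HostAction::FreshSession,
        Err(msg) if msg.contains("no loop checkpoint found") => HostAction::RetryLater,
        Err(_) => HostAction::TreatAsLost,
    }
}
```

Matching on substrings is brittle by nature, which is why the integration test pins both messages; a host could equally wrap this in its own typed error.
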
Mirror the P5 (identity labels) and P3 cut 2 (resume_run) framework
additions through both SDKs so host code can drive them from JS/TS
or Python without reaching into the Rust core.

Node (napi):
- SessionOptions gains optional `tenantId / principal /
  agentTemplateId / correlationId` fields. js_session_options_to_rust
  forwards each via the matching `with_*` builder.
- Session gets four read-only getters with the same names plus an
  async `resumeRun(checkpointRunId)` that returns the AgentResult of
  the resumed run.
- generated.d.ts regenerated by `napi build` so TypeScript callers
  see the new types.

Python (pyo3):
- PySessionOptions gains `tenant_id / principal /
  agent_template_id / correlation_id` with paired getter/setters
  (matching the existing session_id pattern). build_rust_session_options
  forwards each.
- PySession exposes the same four labels as `@getter` properties plus
  a `resume_run(checkpoint_run_id)` method that raises RuntimeError
  on missing store / missing checkpoint (the framework's two
  distinguishable error strings stay intact for host-side branching).

Verification: Node 27 unit tests + Python 19 unit tests still green;
both SDKs pass `cargo clippy --lib -- -D warnings`.

Add a quick-reference example to the "Main APIs At A Glance" snippets
showing the new identity labels (tenant_id / principal /
agent_template_id / correlation_id) and `resume_run` flow. Same in
both the Python and TypeScript mirror sections so SDK callers see
the surface immediately.

Detailed semantics live in apps/docs/content/docs/{en,cn}/code/
api-contract.mdx ("Cluster-grade extension points").

Long-running sessions accumulated three classes of in-memory state
without bound: run records + per-run event buffers in
InMemoryRunStore, trace events in InMemoryTraceSink, and terminal
subagent task snapshots in InMemorySubagentTaskTracker. Fine for
short sessions, a memory leak for the cluster workloads 书安OS is
expected to host (hours / days / thousands per node).

This commit lands FIFO eviction caps that the host opts into per
session — defaults stay unbounded so existing callers see no
behaviour change.

API:
- New `retention::SessionRetentionLimits` struct with four optional
  caps: max_runs_retained, max_events_per_run, max_trace_events,
  max_terminal_subagent_tasks.
- `SessionOptions::with_retention_limits(...)` plumbs them into
  AgentConfig via session_builder + capabilities.

Eviction policy (oldest-first, idempotent):
- InMemoryRunStore: parallel VecDeque tracks insertion order; on
  create_run past the cap, oldest run + its events are dropped.
  Per-run events are FIFO-trimmed in record_event so the buffer
  stays at most max_events_per_run. `event_count` on RunSnapshot
  remains the cumulative total — not decremented on eviction.
  replace_records (resume path) rebuilds the queue in creation
  order so restored sessions honour the same cap.
- InMemoryTraceSink: drain front past cap; preserves the most
  recent (most useful for debugging) events. drain(..excess) is
  O(n) per push at cap; switch to VecDeque if the trace path
  becomes a perf bottleneck.
- InMemorySubagentTaskTracker: new terminal_order VecDeque records
  every Running → terminal transition (Completed / Failed /
  Cancelled). Running tasks are **never** evicted — only terminal
  snapshots. cancel() and the SubagentEnd record_event path both
  participate idempotently (a late SubagentEnd after cancel
  doesn't double-push).

Tests:
- Three unit-test blocks (one per store) cover: cap enforcement,
  FIFO ordering, default-unbounded behaviour, and the
  running-task-immunity property of the subagent tracker.
- Integration test
  `retention_limits_are_plumbed_into_subagent_tracker` exercises
  the public SessionOptions → AgentSession → tracker path end-to-end
  via the same accessor 书安OS would use.

Test count: 1692 unit (+9 new) + 9 integration (+1 new), clippy
clean on `--lib -- -D warnings`.
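
The run-store eviction policy can be sketched with a `HashMap` plus a parallel `VecDeque`. `SketchRunStore` is a hypothetical miniature: run ids are assumed unique, events are plain strings, and only the two run-store caps are modelled.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Default)]
pub struct SketchRun {
    pub events: VecDeque<String>,
    /// Cumulative total; not decremented when old events are trimmed.
    pub event_count: u64,
}

#[derive(Default)]
pub struct SketchRunStore {
    max_runs_retained: Option<usize>,
    max_events_per_run: Option<usize>,
    runs: HashMap<String, SketchRun>,
    // Insertion order of run ids, oldest at the front.
    order: VecDeque<String>,
}

impl SketchRunStore {
    /// `None` for either cap keeps that dimension unbounded.
    pub fn new(max_runs_retained: Option<usize>, max_events_per_run: Option<usize>) -> Self {
        Self { max_runs_retained, max_events_per_run, ..Default::default() }
    }

    /// Assumes `id` has not been used before.
    pub fn create_run(&mut self, id: &str) {
        self.runs.insert(id.to_string(), SketchRun::default());
        self.order.push_back(id.to_string());
        if let Some(cap) = self.max_runs_retained {
            while self.order.len() > cap {
                if let Some(oldest) = self.order.pop_front() {
                    // Dropping the run also drops its event buffer.
                    self.runs.remove(&oldest);
                }
            }
        }
    }

    pub fn record_event(&mut self, id: &str, event: &str) {
        let cap = self.max_events_per_run;
        if let Some(run) = self.runs.get_mut(id) {
            run.events.push_back(event.to_string());
            run.event_count += 1;
            if let Some(cap) = cap {
                while run.events.len() > cap {
                    run.events.pop_front();
                }
            }
        }
    }

    pub fn run(&self, id: &str) -> Option<&SketchRun> {
        self.runs.get(id)
    }

    pub fn run_ids(&self) -> Vec<String> {
        self.order.iter().cloned().collect()
    }
}
```

Keeping `event_count` cumulative means a consumer can still tell that events were trimmed: the count exceeds the buffer length.
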
The P3 cut-2 resume_run API only had error-path tests
(missing store / missing checkpoint). Lock the happy-path
contract too — write a LoopCheckpoint, call resume_run, verify
the loop picks up from the checkpointed messages.

test_resume_run_picks_up_from_persisted_checkpoint:
- Seed a LoopCheckpoint in MemorySessionStore with messages
  representing one prior tool round.
- Build a session bound to the same store with a StaticStreaming
  mock LLM that produces a final-answer text.
- Call session.resume_run(seeded_run_id).
- Assert:
  * AgentResult.text matches the mock's final response — proving
    the loop fed the checkpoint's messages to the LLM via
    execute_from_messages and ran to completion.
  * runs() contains exactly one new run whose id is NOT the
    seeded run id — framework allocates fresh, never pretends to
    continue the old run.
  * The seeded checkpoint stays in the store under the old run
    id — resume does not delete; retention is the host's call.

Crash simulation is reduced to a manual checkpoint seed because
the in-process agent loop has no "die mid-round" affordance
suitable for unit testing. The contract surface that 书安OS will
sit on (write on node A, hand run id to node B, resume) is fully
exercised through the public API.

1693 lib tests + 9 integration green; clippy clean.

Hosts running thousands of long-lived sessions accumulate MCP
subprocesses + transport connections even when individual servers
fall quiet. Add a periodic-sweep API the host calls (e.g. every 60s)
to drop servers whose last activity is older than a threshold.

McpManager:
- New `last_used_at_ms: HashMap<String, u64>` parallel to `clients`.
  Stamped on `connect` (initial use) and on every successful
  `call_tool` (active use). Cleared on `disconnect`.
- `pub async fn last_used_at_ms(name) -> Option<u64>` for host-side
  observability (e.g. dashboards / Prometheus scrapes).
- `pub async fn touch(name)` so hosts can mark a server as warm
  out-of-band when activity comes via a side channel.
- `pub async fn disconnect_idle(threshold_ms) -> Vec<String>` —
  iterates connected clients, drops every one whose timestamp is
  older than `now - threshold_ms` (or has no timestamp at all —
  treated as infinitely idle). Per-server disconnect failures are
  warn-logged but never panic; the result vec lists every name
  attempted.

Agent facade:
- New `Agent::disconnect_idle_mcp(threshold_ms) -> Vec<String>`
  forwards to the global manager when present, else returns empty.
  This is the entry point a host's idle-reaper will call.

Tests:
- Three unit tests covering touch monotonicity, idle-sweep no-op
  with no live clients, and disconnect cleanup of the timestamp
  entry.
- One Agent-level test covering the no-global-mcp short-circuit
  (contract surface for hosts).

Uses a local `now_epoch_ms` helper rather than HostEnv (the manager
predates HostEnv wiring; threading the host's Clock through to MCP
is a separate change if needed for replay).

1697 unit + 9 integration tests green; clippy clean.
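
A synchronous sketch of the idle sweep. `SketchMcpManager` is hypothetical: connections are placeholder strings, and the current time is passed in explicitly as `now_ms` (the commit's version takes only `threshold_ms` and reads a local clock helper).

```rust
use std::collections::HashMap;

#[derive(Default)]
pub struct SketchMcpManager {
    // Server name mapped to a placeholder for the live connection.
    clients: HashMap<String, String>,
    last_used_at_ms: HashMap<String, u64>,
}

impl SketchMcpManager {
    pub fn connect(&mut self, name: &str, now_ms: u64) {
        self.clients.insert(name.to_string(), "connected".to_string());
        self.last_used_at_ms.insert(name.to_string(), now_ms);
    }

    /// Marks a connected server as recently used.
    pub fn touch(&mut self, name: &str, now_ms: u64) {
        if self.clients.contains_key(name) {
            self.last_used_at_ms.insert(name.to_string(), now_ms);
        }
    }

    pub fn last_used_at_ms(&self, name: &str) -> Option<u64> {
        self.last_used_at_ms.get(name).copied()
    }

    /// Drops the client and its timestamp; returns whether it was connected.
    pub fn disconnect(&mut self, name: &str) -> bool {
        self.last_used_at_ms.remove(name);
        self.clients.remove(name).is_some()
    }

    /// Disconnects every server last used before `now_ms - threshold_ms`.
    /// A server with no timestamp is treated as infinitely idle.
    pub fn disconnect_idle(&mut self, now_ms: u64, threshold_ms: u64) -> Vec<String> {
        let cutoff = now_ms.saturating_sub(threshold_ms);
        let mut idle: Vec<String> = self
            .clients
            .keys()
            .filter(|name| match self.last_used_at_ms.get(*name) {
                Some(ts) => *ts < cutoff,
                None => true,
            })
            .cloned()
            .collect();
        idle.sort();
        for name in &idle {
            self.disconnect(name);
        }
        idle
    }
}
```

A host reaper would call `disconnect_idle` on a timer (the commit suggests every 60 s) and could log the returned names.
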
…Guard

Propagate the framework-side BudgetGuard contract through pyo3 so
Python hosts (the most likely 书安OS surface) can plug in cluster-
wide cost / quota policy without writing Rust.

Bridge:
- `PyBudgetGuard { inner: Py<PyAny> }` implements
  a3s_code_core::budget::BudgetGuard via the async-trait. Each
  method acquires the GIL (Python::with_gil), looks up the named
  method on the held PyObject by `getattr`, and calls it. Missing
  methods behave as Allow / no-op so user classes only need to
  define what they actually want to govern.
- `parse_py_budget_decision` parses the return value:
    None  | {"decision": "allow"}                                 → Allow
    {"decision": "soft", "resource", "consumed", "limit", "message"?} → SoftLimit
    {"decision": "deny",  "resource", "reason"}                       → Deny
  Unknown / malformed shapes default to Allow (fail-safe).
- Python callback raising never propagates: warning is printed to
  stderr, decision falls back to Allow / record_after_llm becomes
  no-op. A misbehaving guard cannot halt a live session.

SDK surface:
- PySessionOptions gains `budget_guard: Option<PyObject>` with
  matching `@getter` / `@setter`. build_rust_session_options wraps
  the held callable as `Arc<dyn BudgetGuard>` and calls
  with_budget_guard.

Test (sdk/python/tests/test_budget_guard.py):
- DenyingGuard returns {"decision":"deny", ...} from
  check_before_llm; session.send raises RuntimeError mentioning
  "Budget exhausted" / "llm_tokens"; check_before_llm fires
  exactly once; record_after_llm does not fire; session_id
  propagates correctly.
- AllowingGuard with only no-op methods constructs a session
  cleanly — proves missing methods are tolerated.
- SessionOptions.budget_guard round-trips through getter/setter
  including None reset.

GIL acquisition blocks the tokio worker thread briefly per call.
Acceptable here because BudgetGuard fires at most once per LLM
turn / tool call, not on hot tool execution paths.

Python SDK: 19 cargo tests + smoke test green, clippy clean.

Propagate the framework-side BudgetGuard contract through napi-rs so
JS hosts can plug in cluster-wide cost / quota policy. The bridge
follows the same pattern as Python's PyBudgetGuard but uses
ThreadsafeFunction for cross-thread JS calls.

Framework support (one small core addition):
- AgentSession::set_budget_guard(Option<Arc<dyn BudgetGuard>>) +
  budget_guard() — runtime-mutable override slot. Read by
  build_agent_loop at every send/stream, takes precedence over
  config.budget_guard. Needed because JsFunction values can't live
  inside the value-typed SessionOptions.

Node SDK surface:
- `session.setBudgetGuard({checkBeforeLlm, recordAfterLlm,
   checkBeforeTool})` — all three handlers optional; missing methods
   fall back to Allow / no-op. Pass `null` for the whole arg to
   clear the guard.
- Returns `BudgetDecision`:
    `null` | `{decision:'allow'}` → Allow
    `{decision:'soft', resource, consumed, limit, message?}` → emits
      `BudgetThresholdHit('soft')`, proceeds
    `{decision:'deny',  resource, reason}` → aborts with
      "Budget exhausted" error
- `BudgetGuardHandlers` napi(object) shape — typed in
  generated.d.ts.

Bridge internals (NodeBudgetGuard):
- ThreadsafeFunction<serde_json::Value, ErrorStrategy::Fatal> per
  handler. `Fatal` (not CalleeHandled) so the JS callback receives
  *only* the positional args — no leading Node-style `err`.
- Args fanned out as `serde_json::Value::Array(...)`; the transformer
  closure unpacks the array into multiple positional JsUnknowns.
- Rust async trait method blocks via `tokio::task::block_in_place`
  on a sync `mpsc::sync_channel` waiting for the JS callback's
  return value. 5-second timeout falls back to Allow.
- Decisions parsed with the same shape as Python:
  - `parse_js_budget_decision(JsUnknown) -> BudgetDecision`
  - Unknown / malformed shapes fall back to Allow (fail-safe).

Tests:
- Unit: `test_runtime_budget_guard_overrides_session_options_value`
  verifies the runtime override slot wins over config.budget_guard
  and that clearing it (`set_budget_guard(None)`) restores the
  unguarded path.
- Smoke: `sdk/node/test_budget_guard.mjs` installs a JS guard whose
  checkBeforeLlm denies, asserts `session.send` throws
  `Budget exhausted`, checkBeforeLlm fires exactly once, and
  recordAfterLlm doesn't fire. Then verifies `setBudgetGuard(null)`
  is accepted without error.

generated.d.ts regenerated by `napi build:debug` — `setBudgetGuard`
and `BudgetGuardHandlers` show up with the documented TS shape.

Core: 1698 unit tests pass; Node: 27 cargo tests + smoke green;
clippy clean across core and Node SDK.
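
The "block on a sync channel with a timeout" part of the bridge can be illustrated with the standard library alone. This is only an analogy: a plain thread stands in for the JS callback, `SketchVerdict` and `ask_with_timeout` are hypothetical, and the fallback is a parameter because this commit falls back to Allow while the PR summary describes the hardened behaviour as fail-closed.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SketchVerdict {
    Allow,
    Deny,
}

/// Runs `callback` on another thread and waits up to `timeout` for its verdict.
/// Returns `on_timeout` if no verdict arrives in time.
pub fn ask_with_timeout<F>(
    callback: F,
    timeout: Duration,
    on_timeout: SketchVerdict,
) -> SketchVerdict
where
    F: FnOnce() -> SketchVerdict + Send + 'static,
{
    let (tx, rx) = mpsc::sync_channel::<SketchVerdict>(1);
    thread::spawn(move || {
        // Ignore send errors: the receiver may already have given up.
        let _ = tx.send(callback());
    });
    match rx.recv_timeout(timeout) {
        Ok(verdict) => verdict,
        // A timeout and a dropped sender (callback panicked) both land here.
        Err(_) => on_timeout,
    }
}
```

Passing `SketchVerdict::Deny` as `on_timeout` gives the fail-closed variant: a hung callback blocks the call instead of letting it through.
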
Extend the Python and TypeScript "Main APIs at a Glance" snippets
with two new sections so the recent additions are discoverable:

- (15) Long-running session ops — SessionRetentionLimits +
  agent.disconnectIdleMcp / disconnect_idle_mcp.
- (16) Budget / cost governance — Python opts.budget_guard
  class-shape and Node session.setBudgetGuard({...}) handler-shape,
  both showing Deny → "Budget exhausted" and the implicit Allow
  fallthrough.

Detailed semantics live in apps/docs/content/docs/{en,cn}/code/
api-contract.mdx (covered in the docs-site commit on the parent
repo).

Propagate the framework-side retention caps through both SDKs so
host code can stop long-running cluster sessions from accumulating
in-memory state. Each cap is optional; missing fields keep the
framework's unbounded default for that store.

Python (pyo3):
- PySessionOptions gains `retention_limits: Option<PyObject>` with
  matching `@getter` / `@setter`.
- `parse_py_retention_limits` accepts a dict shape with optional
  integer keys: max_runs_retained / max_events_per_run /
  max_trace_events / max_terminal_subagent_tasks. Unknown / non-int
  values are ignored (no error), missing keys keep the default.
- Forwarded via SessionOptions::with_retention_limits.

Node (napi):
- New `#[napi(object)] RetentionLimitsObject` with four optional
  `u32` fields (TypeScript-friendly), generated into the .d.ts as
  `RetentionLimitsObject` and surfaced on SessionOptions as
  `retentionLimits?: RetentionLimitsObject`.
- js_session_options_to_rust converts each present field to usize
  and forwards via with_retention_limits.

Verification:
- Python SDK 19 cargo tests pass; clippy clean.
- Node SDK 27 cargo tests pass; clippy clean.
- generated.d.ts exposes the new types with the documented
  camelCase shape (retentionLimits + maxRunsRetained etc.).

Add a single integration test that exercises the **full** cluster-
grade API surface in one realistic two-node lifecycle. This is the
reference flow 书安OS-side scheduling code targets — every new piece
shipped in the cluster pillars work participates.

`cluster_ops_consolidated_session_lifecycle`:

  Node A (one Agent):
    - Creates a session with identity labels (tenant_id / principal /
      agent_template_id / correlation_id), retention caps, and a
      shared MemorySessionStore.
    - Injects a Completed subagent task into the tracker.
    - Persists state via session.save().
    - Seeds a LoopCheckpoint for an "in-flight" run.
    - Drops the agent (simulating node failure / drain).

  Node B (a *different* Agent on the same store):
    - agent_b.resume_session("cluster-ops-target", opts) — Node B
      hydrates the session from the shared store.
    - Asserts identity labels survive verbatim.
    - Asserts the subagent task history (Completed status) survives.
    - Asserts the LoopCheckpoint persists across the migration with
      run id, turn count, message vec, and token usage intact —
      this is the contract resume_run reads from.

The test deliberately doesn't *call* resume_run() because the test
config has no real LLM credentials — that flow is covered by
test_resume_run_picks_up_from_persisted_checkpoint in unit tests
(uses build_session + mock streaming client). Here we lock the
**data contract** for cross-node migration: identity, subagent
view, and checkpoint state must round-trip through the store with
no framework-level interpretation.

Test count: 1698 lib + 10 integration (+1 new), clippy clean.

Adversarial multi-dimension review of the cluster-pillars batch
surfaced 11 confirmed issues; this commit fixes every core-side one.

HIGH:
- H4 Checkpoint leak (unbounded growth). Loop checkpoints were written
  after every tool round and NEVER deleted — the dominant memory/disk
  leak for long-running hosts. Added SessionStore::delete_loop_checkpoint
  (memory + file impls); the run lifecycle now clears the checkpoint on
  any in-process terminal (complete/cancel/fail) in BlockingRunLifecycle
  ::complete and StreamRunLifecycle::wrap. Only a true crash (loop never
  returns) leaves a checkpoint for crash-recovery resume.
  Also: FileSessionStore::save_loop_checkpoint is now crash-atomic
  (temp + fsync + rename) — a checkpoint exists to survive a crash, so a
  half-written one (plain fs::write) that fails to parse defeats the
  purpose. (completeness-critic finding folded in.)
- H3 event_count corruption. replace_records overwrote the persisted
  CUMULATIVE event_count with the (possibly trimmed) buffer length,
  corrupting audit counts for any restored run whose buffer hit
  max_events_per_run. Deleted the offending line; trust the snapshot.
- H2 resume_run dropped checkpoint metrics. execute_from_messages built
  a fresh ExecutionLoopState (zeroed total_usage / tool_calls_count), so
  resumed runs under-reported cumulative cost. Added
  ExecutionSeed + ExecutionLoopState::new_seeded and threaded it through
  execute_from_messages_seeded -> BlockingRunContext -> resume_run, so a
  resumed run continues accounting from the checkpoint.
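For reference, the temp + fsync + rename pattern named in H4 looks roughly like the sketch below. It is a generic std-only illustration, not the FileSessionStore code: the `write_atomic` name and the `.tmp` suffix are assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to `path` so a reader sees either the previous file or the
/// complete new one, never a half-written checkpoint.
pub fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // The temp file sits next to the target so the rename stays on one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Flush the data to disk before the rename makes it visible.
    file.sync_all()?;
    drop(file);

    // Replace the destination in one step; clean up the temp file on failure.
    if let Err(err) = fs::rename(&tmp, path) {
        let _ = fs::remove_file(&tmp);
        return Err(err);
    }
    Ok(())
}
```

A stricter variant would also fsync the parent directory after the rename so the directory entry itself is durable; whether the real store does that is not stated here.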

MEDIUM:
- M1 subagent eviction TOCTOU. mark_terminal_and_evict took
  terminal_order/tasks/cancellers locks separately, letting a concurrent
  record_event re-insert an evicted victim. Now holds all three together
  in one canonical order (callers drop their guards first → no deadlock).
- M2 run-store eviction TOCTOU + lock-ordering. create_run_with_id locked
  runs/events separately (and in the opposite nesting order from the
  rest); now holds order+events+runs together for insert+evict in one
  canonical order.
- M3 MCP timestamp leak. touch() records a timestamp unconditionally
  (even for a never-connected name) and disconnect_idle only scanned
  clients.keys(), so orphan timestamps leaked. disconnect_idle now
  prunes last_used_at_ms down to live clients.
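The M1 fix (hold all three locks together, always in one canonical order) can be illustrated with a small std-only sketch. The `Tracker` type, its field types, and the string statuses are hypothetical stand-ins; only the lock names and the "acquire together in a fixed order" idea come from the description above.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

/// Hypothetical tracker: three collections behind separate locks.
pub struct Tracker {
    terminal_order: Mutex<VecDeque<String>>,
    tasks: Mutex<HashMap<String, String>>,
    cancellers: Mutex<HashMap<String, bool>>,
    max_terminal: usize,
}

impl Tracker {
    pub fn new(max_terminal: usize) -> Self {
        Self {
            terminal_order: Mutex::new(VecDeque::new()),
            tasks: Mutex::new(HashMap::new()),
            cancellers: Mutex::new(HashMap::new()),
            max_terminal,
        }
    }

    pub fn insert_running(&self, id: &str) {
        // Same canonical order as below: tasks before cancellers.
        let mut tasks = self.tasks.lock().unwrap();
        let mut cancellers = self.cancellers.lock().unwrap();
        tasks.insert(id.to_string(), "running".to_string());
        cancellers.insert(id.to_string(), true);
    }

    /// Mark `id` terminal and evict the oldest terminal tasks over the cap.
    /// All three guards are held for the whole operation, so no other thread
    /// can re-insert a victim between the mark and the eviction.
    pub fn mark_terminal_and_evict(&self, id: &str) -> Vec<String> {
        // Canonical order: terminal_order -> tasks -> cancellers.
        let mut order = self.terminal_order.lock().unwrap();
        let mut tasks = self.tasks.lock().unwrap();
        let mut cancellers = self.cancellers.lock().unwrap();

        if let Some(status) = tasks.get_mut(id) {
            *status = "completed".to_string();
        }
        cancellers.remove(id);
        order.push_back(id.to_string());

        let mut evicted = Vec::new();
        while order.len() > self.max_terminal {
            if let Some(victim) = order.pop_front() {
                tasks.remove(&victim);
                cancellers.remove(&victim);
                evicted.push(victim);
            }
        }
        evicted
    }

    pub fn task_count(&self) -> usize {
        self.tasks.lock().unwrap().len()
    }
}
```

Because every method takes the locks in the same relative order, two threads can never each hold one lock while waiting on the other.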

LOW:
- L1 Session registry dangling Weaks. close_agent now retain()s the
  registry before snapshotting handles.
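A rough sketch of the L1 idea, pruning dead Weak entries before taking strong handles, is shown below. The `Session` struct and the `prune_and_snapshot` helper are hypothetical; the real registry's key and value types are not given in this description.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

/// Hypothetical session type; only the id matters for the example.
pub struct Session {
    pub id: String,
}

/// Drop registry entries whose session has already been dropped, then
/// return strong handles to the sessions that are still alive.
pub fn prune_and_snapshot(
    registry: &mut HashMap<String, Weak<Session>>,
) -> Vec<Arc<Session>> {
    // retain() removes dangling Weaks so the map cannot grow without bound.
    registry.retain(|_, weak| weak.strong_count() > 0);
    // upgrade() can still return None if a session is dropped concurrently,
    // so filter_map skips those instead of panicking.
    registry.values().filter_map(|weak| weak.upgrade()).collect()
}
```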

Tests: store delete + crash-atomic (no temp leftovers); lifecycle
clears checkpoint on completion (deterministic run id); event_count
preserved across replace_records after trim; resume_run carries
non-zero checkpoint metrics (1002 not 2); two multi-thread concurrency
stress tests guard the eviction lock-ordering changes against deadlock;
MCP orphan-timestamp purge. 1705 lib + 10 integration green; clippy
clean.
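The seeded-state fix from H2 can be pictured with the sketch below. The type names ExecutionSeed and ExecutionLoopState::new_seeded come from the commit text, but the fields, the `record_round` method, and the numbers are assumptions chosen to echo the "1002 not 2" assertion; which metric the real test checks is not stated here.

```rust
/// Hypothetical seed: cumulative metrics read back from a loop checkpoint.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ExecutionSeed {
    pub total_tokens: u64,
    pub tool_calls_count: u64,
}

/// Hypothetical loop state that accumulates metrics across rounds.
#[derive(Debug, Default)]
pub struct ExecutionLoopState {
    total_tokens: u64,
    tool_calls_count: u64,
}

impl ExecutionLoopState {
    /// Fresh run: all counters start at zero.
    pub fn new() -> Self {
        Self::default()
    }

    /// Resumed run: counters start from the checkpointed totals.
    pub fn new_seeded(seed: ExecutionSeed) -> Self {
        Self {
            total_tokens: seed.total_tokens,
            tool_calls_count: seed.tool_calls_count,
        }
    }

    /// Add the usage of one completed round to the running totals.
    pub fn record_round(&mut self, tokens: u64, tool_calls: u64) {
        self.total_tokens += tokens;
        self.tool_calls_count += tool_calls;
    }

    pub fn total_tokens(&self) -> u64 {
        self.total_tokens
    }

    pub fn tool_calls_count(&self) -> u64 {
        self.tool_calls_count
    }
}
```

With a seed of 1000 tokens, a resumed run that spends 2 more reports 1002; an unseeded state would report only 2, which is the under-reporting the fix removes.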
…/L2)

Closes the SDK-side findings from the cluster-pillars review.

H1 — Node BudgetGuard fail-OPEN hole (was: hung/slow guard silently
ALLOWED, disabling budget enforcement):
- call_decision now fails CLOSED: a guard that does not respond within
  timeoutMs -> Deny{budget_guard_timeout}; an unreadable return value
  -> Deny{budget_guard_error}. Previously both defaulted to Allow.
- timeoutMs is configurable per guard (BudgetGuardHandlers.timeoutMs,
  default 5000).
- Callbacks now receive a single context object — checkBeforeLlm({
  sessionId, estimatedTokens }), recordAfterLlm({ sessionId, usage }),
  checkBeforeTool({ sessionId, toolName }).
- Documented napi-rs constraint: a JS throw from a guard callback
  aborts the host process at return-value conversion (true under BOTH
  Fatal and CalleeHandled in this napi version — empirically verified),
  so callbacks MUST NOT throw; wrap in try/catch and return a decision.
  The fail-closed timeout still covers hangs. Python is unaffected
  (PyBudgetGuard catches exceptions).
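The fail-closed behaviour described for call_decision can be sketched in plain Rust as below. This is an illustration of the policy only, not the napi bridge: the `Decision` enum, the `call_decision_fail_closed` name, and the use of a helper thread with a channel are assumptions; the two reason strings are taken from the text above.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny { reason: String },
}

/// Run `guard` on a helper thread and wait at most `timeout` for its answer.
/// A `None` result models an unreadable return value from the guard.
pub fn call_decision_fail_closed<F>(guard: F, timeout: Duration) -> Decision
where
    F: FnOnce() -> Option<Decision> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may have given up already; ignore the send error.
        let _ = tx.send(guard());
    });

    match rx.recv_timeout(timeout) {
        Ok(Some(decision)) => decision,
        // Unreadable value, or the guard thread died: deny rather than allow.
        Ok(None) | Err(RecvTimeoutError::Disconnected) => Decision::Deny {
            reason: "budget_guard_error".to_string(),
        },
        // No answer within the deadline: deny rather than allow.
        Err(RecvTimeoutError::Timeout) => Decision::Deny {
            reason: "budget_guard_timeout".to_string(),
        },
    }
}
```

Every non-answer path maps to Deny, so a slow or broken guard can never silently disable enforcement. In this sketch a hung guard thread is simply left running after the timeout.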

L2 — Python BudgetGuard re-entrancy doc: warn that calling session/agent
APIs from inside a guard callback risks GIL re-entrancy deadlock.

M4 — disconnect_idle_mcp now exposed in BOTH SDKs (was core-only, yet
already referenced by the docs):
- Node: agent.disconnectIdleMcp(idleThresholdMs) -> Promise<string[]>
- Python: agent.disconnect_idle_mcp(idle_threshold_ms) -> list[str]
The docs that referenced these now describe real methods.

Verification: Node 27 + Python 19 cargo tests, clippy clean on both;
Node smokes (budget deny path + fail-closed config + disconnectIdleMcp)
and all three Python smokes (budget guard, session close +
disconnect_idle_mcp, subagent query) pass against a fresh maturin build.
generated.d.ts regenerated (setBudgetGuard ctx shape + timeoutMs +
disconnectIdleMcp).

Reflow long lines in the BudgetGuard / disconnect_idle_mcp additions
(281dc58) that the core-scoped pre-commit fmt hook did not check.
Formatting only — no logic change (verified: only line wrapping).

Bump all packages 3.2.1 -> 3.3.0 and add the CHANGELOG entry. No push,
no tag — release prep only.

Version sync (scripts/check_release_versions.sh green at 3.3.0):
- core/Cargo.toml, sdk/node/Cargo.toml (+ core dep pin),
  sdk/python/Cargo.toml (+ core dep pin)
- sdk/node/package.json (+ @a3s-lab/code-* optionalDependencies)
- sdk/python/pyproject.toml
- sdk/python-bootstrap/pyproject.toml + _bootstrap.py __version__
- Cargo.lock, sdk/node/package-lock.json, sdk/node/examples/package-lock.json

CHANGELOG 3.3.0 documents the cluster-grade runtime batch (session/agent
lifecycle, identity labels, BudgetGuard, HostEnv, loop checkpoints +
resume_run, retention caps, MCP idle disconnect, cluster AgentEvents,
subagent persistence) plus the adversarial-review hardening fixes, and
notes the Node BudgetGuard no-throw limitation.

Minor bump per semver: all additions are backward compatible (new
methods, new optional fields, SessionStore trait methods with default
no-op impls).

Add #[ignore] integration tests (core/tests/test_real_llm_cluster_features.rs)
that exercise the 3.3.0 LLM-loop features against a real model from
.a3s/config.acl — validating paths mock clients cannot:

- real_budget_guard_allow_records_actual_usage: record_after_llm receives
  the provider's ACTUAL non-zero token usage (mocks return fixed/zero).
- real_budget_guard_deny_blocks_llm_call: Deny aborts before the provider
  is contacted; no usage, no history.
- real_resume_run_carries_checkpoint_metrics_forward: resume_run against
  the live model continues cumulative metrics from the checkpoint.
- real_run_with_store_leaves_no_dangling_checkpoint: completed real run
  clears its loop checkpoint (leak-fix lifecycle, end-to-end).
- real_identity_labels_survive_live_run: tenant/principal/template/
  correlation intact through a live run.

Run with:
  A3S_CONFIG_FILE=/abs/.a3s/config.acl \
    cargo test -p a3s-code-core --test test_real_llm_cluster_features \
      -- --ignored --nocapture

Verified locally: 5 passed against openai/MiniMax-M2.7-highspeed (155s).

@ZhiXiao-Lin ZhiXiao-Lin merged commit d2ed389 into main May 29, 2026
1 check passed
@ZhiXiao-Lin ZhiXiao-Lin deleted the release/v3.3.0 branch May 29, 2026 03:48