From 7cdc598aad1b38bcdf804ad631fd8c71fee70cf6 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 11:15:12 +0800 Subject: [PATCH 01/18] docs(code): document Agent / Session close surface MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Update both en and cn API contract pages with the full graceful-close contract: session.close() / isClosed semantics, agent.listSessions(), agent.closeSession(id), agent.close() (which also disconnects global MCP), and the SessionClosed error returned after agent.close(). Bumps the crates/code submodule pointer to include the new close surface across core (steps 1–3) and the Node/Python SDKs (step 4). --- .../content/docs/cn/code/api-contract.mdx | 12 ++++++++++- .../content/docs/en/code/api-contract.mdx | 21 ++++++++++++++++++- crates/code | 2 +- 3 files changed, 32 insertions(+), 3 deletions(-) diff --git a/apps/docs/content/docs/cn/code/api-contract.mdx b/apps/docs/content/docs/cn/code/api-contract.mdx index 269179e..9039228 100644 --- a/apps/docs/content/docs/cn/code/api-contract.mdx +++ b/apps/docs/content/docs/cn/code/api-contract.mdx @@ -343,7 +343,17 @@ const resumed = agent.resumeSession('docs-contract', { console.log(resumed.history()); ``` -Node 进程需要及时释放 session 级后台资源时,调用 `session.close()`。 +Node 进程需要及时释放 session 级后台资源时,调用 `session.close()`。`close()` 是完整的优雅停止入口:把 `session.isClosed` 翻成 `true`(之后 `send` / `stream` 会以 `Session closed` 错误立即返回),fire session 级 `CancellationToken` 让所有 in-flight run、委派子代理任务、HITL 待确认全部中止,并对当前活跃 run emit AHP `recordRunCancelled` 钩子。重复调用 `close()` 是 no-op。 + +控制面只持有 session ID 时,可以从 Agent 侧触发同样的清理: + +```ts +await agent.listSessions(); // ['session-a', 'session-b'] +await agent.closeSession('session-a'); // 若原本是 open,返回 true +await agent.close(); // 关闭所有活 session + 断开全局 MCP +``` + +`agent.close()` 之后,再调 `agent.session(...)` / `agent.resumeSession(...)` 会立即抛 `Session closed`。幂等。建议在进程退出 handler 中调用,保证没有 session 级 worker 比 agent 活得更久。 ## Delegation diff 
--git a/apps/docs/content/docs/en/code/api-contract.mdx b/apps/docs/content/docs/en/code/api-contract.mdx index 04fa163..38dbfef 100644 --- a/apps/docs/content/docs/en/code/api-contract.mdx +++ b/apps/docs/content/docs/en/code/api-contract.mdx @@ -351,7 +351,26 @@ console.log(resumed.history()); ``` Use `session.close()` when a Node process should release session-scoped -background resources promptly. +background resources promptly. `close()` is the full graceful-stop entry +point: it flips `session.isClosed` to `true` (further `send` / `stream` +calls reject with a `Session closed` error), fires the session-level +`CancellationToken` so every in-flight run, delegated subagent task, and +HITL confirmation aborts, and emits the AHP `recordRunCancelled` hook for +the currently active run. Subsequent `close()` calls are no-ops. + +For control-plane callers that only know the session ID, the same cleanup +is reachable from the agent: + +```ts +await agent.listSessions(); // ['session-a', 'session-b'] +await agent.closeSession('session-a'); // true if it was open +await agent.close(); // close every live session + disconnect global MCP +``` + +After `agent.close()`, subsequent `agent.session(...)` and +`agent.resumeSession(...)` calls reject with a `Session closed` error. +`agent.close()` is idempotent. Use it in process-shutdown handlers to guarantee no +session-scoped workers outlive the agent. 
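As a rough illustration of the agent-side close contract described above, the Python sketch below models it with a stand-in class. `StubAgent`, the snake_case method names `list_sessions` / `close_session`, the `RuntimeError` type, and `install_shutdown_hook` are all assumptions made for this example (the page itself only shows the Node names `listSessions`, `closeSession`, and `close`); it is a toy model of the documented behaviour, not the SDK.

```python
import atexit


class StubAgent:
    """Hypothetical stand-in that mimics the documented close surface."""

    def __init__(self, session_ids):
        self._open = list(session_ids)
        self._closed = False

    def list_sessions(self):
        # IDs of sessions that are still open.
        return list(self._open)

    def close_session(self, session_id):
        # True only if the session was open before this call.
        if session_id in self._open:
            self._open.remove(session_id)
            return True
        return False

    def session(self, session_id):
        # After close(), new sessions are refused with "Session closed".
        if self._closed:
            raise RuntimeError("Session closed")
        self._open.append(session_id)
        return session_id

    def close(self):
        # Idempotent: a repeated call finds nothing left to release.
        self._open.clear()
        self._closed = True


def install_shutdown_hook(agent):
    """Register agent.close() to run at interpreter exit (assumed wiring)."""
    atexit.register(agent.close)
    return agent.close
```

Because `close()` is idempotent, registering it in an exit hook is safe even when the host has already closed the agent explicitly on another path.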
## Delegation diff --git a/crates/code b/crates/code index 6499123..3326a9c 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit 6499123f2b693d6602397dfcd71336bcc5f8f41c +Subproject commit 3326a9c6388858d26373607d5c984ed3c8b81f21 From 9f874a63d7b1fa6e95858ea13fb8b265744d5913 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 13:40:53 +0800 Subject: [PATCH 02/18] test(code): bump submodule for session-close integration tests Picks up the cross-module integration test (core/tests/test_session_close_lifecycle.rs) and SDK smoke tests (sdk/python/tests/test_session_close.py, sdk/node/test_session_close.mjs) plus the AgentSession::subagent_tracker() accessor that unblocks them. --- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index 3326a9c..f56b216 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit 3326a9c6388858d26373607d5c984ed3c8b81f21 +Subproject commit f56b21684c19a8bc654e19d3ecd93eacc3520bf4 From bbb4b9cc0d61cb65d6d39a1dd4c9900b76c5cb91 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 14:55:44 +0800 Subject: [PATCH 03/18] chore(code): bump submodule for framework cluster-pillars P1+P5+P6+P4+P2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up the five framework-only mechanisms 书安OS needs as prerequisites for ultra-scale agent cluster operation. Boundaries respected — no scheduler / placement / transport in core; those remain 书安OS responsibilities. - P1 (e0b7e9b): SessionStore persists subagent task tracker across save/resume — unblocks session migration. - P5 (7c4c58c): tenant / principal / agent_template / correlation identity labels on SessionOptions+SessionData — unblocks multi- tenancy aggregation without string-hacking session_id. 
- P6 (0043844): AgentEvent variants BudgetThresholdHit / PassivationRequested / PeerInvocation — give in-session code a uniform way to observe platform decisions. - P4 (679efb8): BudgetGuard trait wired into the LLM call path — host plugs in cluster-aware quota/cost enforcement; framework emits structured events and bails on Deny. - P2 (9c290ad): HostEnv (IdGenerator + Clock) injection — unlocks deterministic replay of a run on another node. P3 (loop resumable / per-step checkpoint) remains for follow-up. --- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index f56b216..9c290ad 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit f56b21684c19a8bc654e19d3ecd93eacc3520bf4 +Subproject commit 9c290ada89651a0f9c46b514427ac56ed35e094c From 054a1751d693e72d4fc0b1e8ab9f19dba5eb6c43 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 15:24:02 +0800 Subject: [PATCH 04/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20P3=20cut=201=20(loop=20checkpoint=20data=20+=20pers?= =?UTF-8?q?istence)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up: - LoopCheckpoint data contract + SessionStoreCheckpointSink adapter. - SessionStore::save_loop_checkpoint / load_loop_checkpoint (default no-op; MemorySessionStore + FileSessionStore implement). - AgentLoop auto-wires a checkpoint sink from session.session_store in build_agent_loop, and persists after every successful tool round in execute_loop_inner. - Integration tests: store roundtrip + the no-tool-call negative property. Cut 2 (resume_run API) remains in the framework's P3 backlog. 
--- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index 9c290ad..db87a74 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit 9c290ada89651a0f9c46b514427ac56ed35e094c +Subproject commit db87a747e7eae576a149c698635bd4e1a3a54abf From 6327f4724ff13d5176fd01bd09fff74224b2d5c7 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 15:42:44 +0800 Subject: [PATCH 05/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20P3=20complete=20(resume=5Frun=20API)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up `AgentSession::resume_run(checkpoint_run_id)` which loads a LoopCheckpoint via SessionStore and replays the agent loop from that boundary. Together with P3 cut 1 (in the previous submodule bump), the framework now provides full crash-tolerant run semantics — 书安OS plugs in placement / drain choreography on top. Two distinguishable error paths (`session_store` missing vs `loop checkpoint` missing) lock the API for host-side scheduling. --- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index db87a74..e125562 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit db87a747e7eae576a149c698635bd4e1a3a54abf +Subproject commit e12556286ae03c37c78befdf3dc68c6d3c604a29 From d34e8023f84c3002d179f96b4fc3da04e2cf727e Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 17:14:15 +0800 Subject: [PATCH 06/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20SDK=20identity=20labels=20+=20resume=5Frun?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Surfaces the P5 (identity labels) and P3 (resume_run) framework additions through both Node and Python SDKs. 
JS/TS callers get `session.resumeRun(...)` + `session.tenantId` etc; Python callers get `session.resume_run(...)` + matching property getters. --- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index e125562..ef01792 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit e12556286ae03c37c78befdf3dc68c6d3c604a29 +Subproject commit ef01792f02c0c8bebc02adfcf22784f428e24a71 From fe6c388b7be958f299d353b4f5bfadd308a62c55 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 17:16:54 +0800 Subject: [PATCH 07/18] docs(code): cluster-grade extension points (en + cn) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New section in both api-contract pages walking through the five framework-level extension points the host platform (书安OS) sits on: - Identity labels (tenant_id / principal / agent_template_id / correlation_id) — opaque transport, host aggregates. - BudgetGuard — Allow / SoftLimit / Deny decision shape; structured events on threshold hits; LLM call-site enforcement. - Cluster AgentEvent variants — BudgetThresholdHit, PassivationRequested, PeerInvocation; host emits via HookExecutor. - Deterministic IDs / time via HostEnv (SequentialIdGenerator + FixedClock for replay). - LoopCheckpoint + session.resumeRun/resume_run with both error paths documented so cluster scheduling code can branch. Boundary policy ("between tool rounds, never mid-tool") is called out explicitly so host-side reasoning about lost-work semantics matches framework behaviour. Bumps crates/code submodule for the matching README update. 
--- .../content/docs/cn/code/api-contract.mdx | 76 +++++++++++++ .../content/docs/en/code/api-contract.mdx | 100 ++++++++++++++++++ crates/code | 2 +- 3 files changed, 177 insertions(+), 1 deletion(-) diff --git a/apps/docs/content/docs/cn/code/api-contract.mdx b/apps/docs/content/docs/cn/code/api-contract.mdx index 9039228..e1ca951 100644 --- a/apps/docs/content/docs/cn/code/api-contract.mdx +++ b/apps/docs/content/docs/cn/code/api-contract.mdx @@ -462,3 +462,79 @@ new UnixSocketTransport('/tmp/a3s.sock').kind; // 'unix_socket' ``` 该检查不断言 live AHP server exchange。 + +## 集群级扩展点 + +这些契约让集群控制面(例如 书安OS)在**不 fork 框架**的前提下接入多租户、成本管控和容错运行。框架定义"决策点"和"结构化事件",**策略实现由 host 提供**。 + +### 身份标签 + +`SessionOptions` 上四个可选 slot,会透传到 hooks / traces / `SessionData`,框架本身不解释: + +```ts +const session = agent.session(workspace, { + tenantId: 'acme-prod', + principal: 'svc-deploy-bot', + agentTemplateId: 'ci-runner-v7', + correlationId: 'trace-1234abcd', + sessionStore: new FileSessionStore('./sessions'), +}); +session.tenantId; // -> 'acme-prod' +session.correlationId; // -> 'trace-1234abcd' +``` + +resume 时 `apply_persisted_runtime_options` 会从持久化快照里还原标签;但**调用方在 resume_session 时传的 opts 优先**,可以借此 relabel。 + +### 预算 / 成本守卫 + +`BudgetGuard` 在每次 LLM 调用前(以及调用后做用量记录)被询问。`Deny` 返回 `CodeError::BudgetExhausted { resource, reason }`;`SoftLimit` 发射 `AgentEvent::BudgetThresholdHit { kind: "soft", .. }` 后继续执行。 + +目前仅 Rust 层接入(Node/Python wrapper 后续补): + +```rust +let guard: Arc<dyn BudgetGuard> = /* host-supplied impl */; +let opts = SessionOptions::new().with_budget_guard(guard); +``` + +### 集群事件词汇 + +`AgentEvent`(`#[non_exhaustive]`)新增三类平台级事件,host 通过 `HookExecutor` 注入: + +- `BudgetThresholdHit { resource, kind, consumed, limit, message? }` +- `PassivationRequested { reason, deadline_ms? }` +- `PeerInvocation { from_session_id, from_tenant_id?, correlation_id? 
}` + +session 内部 hook 可统一订阅,不必关心 host 用什么传输发过来。 + +### 确定性 ID / 时钟 + +`HostEnv { id_generator, clock }` 替换默认的 `uuid::Uuid::new_v4()` + 墙上时钟。Replay 工具传入 `SequentialIdGenerator` + `FixedClock` 即可在另一台机器上 bit-identical 重放一个 run。 + +### Loop checkpoint + run 恢复 + +配置了 `SessionStore` 后,agent loop **每次 tool round 结束**会持久化一个 `LoopCheckpoint`(按 `run_id` 索引)。任何拥有同一个 store 的节点都能从最近的边界 rehydrate: + +```ts +// Node — host 探测到 A 节点死掉;在 B 节点上: +const session = agentB.session(workspace, { + sessionStore: new FileSessionStore('./sessions'), + sessionId: 'session-from-node-a', +}); +const result = await session.resumeRun('run-id-from-node-a'); +``` + +```python +# Python 等价 +opts = SessionOptions() +opts.session_store = FileSessionStore('./sessions') +opts.session_id = 'session-from-node-a' +session = agent_b.session(workspace, opts) +result = session.resume_run('run-id-from-node-a') +``` + +resume 出来的会**分配一个全新的 run id** — 框架不假装旧 run 还在继续,新旧 run 的关系是 host 的元数据。两个可区分的错误路径方便 host 端调度分支: + +- `"resume_run requires a session_store"` — host 应该回退到新建 session。 +- `"no loop checkpoint found for run 'X'"` — host 可以稍等重试(checkpoint 写入竞态),或当 run 已丢失。 + +**边界策略**:checkpoint 只在 tool round **之间**取,不在工具执行中途取。进程在工具执行中途死掉时,这一轮的工作会丢失,LLM 从前一个边界重新思考。这是用"重试成本"换"正确性" — 把非幂等工具(write、bash)在边界两侧重跑比让 LLM 重想要糟得多。 diff --git a/apps/docs/content/docs/en/code/api-contract.mdx b/apps/docs/content/docs/en/code/api-contract.mdx index 38dbfef..12cb571 100644 --- a/apps/docs/content/docs/en/code/api-contract.mdx +++ b/apps/docs/content/docs/en/code/api-contract.mdx @@ -480,3 +480,103 @@ new UnixSocketTransport('/tmp/a3s.sock').kind; // 'unix_socket' ``` The check does not assert a live AHP server exchange. + +## Cluster-grade extension points + +These contracts let a cluster control plane (e.g. 书安OS) wire +multi-tenancy, cost governance, and crash-tolerant runs **without +forking the framework**. The framework defines decision points and +emits structured events; the host supplies the policy implementations. 
+ +### Identity labels + +Four optional `SessionOptions` slots are propagated through hooks, +traces, and `SessionData` but never interpreted by the framework: + +```ts +const session = agent.session(workspace, { + tenantId: 'acme-prod', + principal: 'svc-deploy-bot', + agentTemplateId: 'ci-runner-v7', + correlationId: 'trace-1234abcd', + sessionStore: new FileSessionStore('./sessions'), +}); +session.tenantId; // -> 'acme-prod' +session.correlationId; // -> 'trace-1234abcd' +``` + +`apply_persisted_runtime_options` restores them on resume; caller- +supplied options on resume take precedence so you can relabel. + +### Budget / cost guard + +`BudgetGuard` is consulted before every LLM call (and after, for +usage accounting). `Deny` returns +`CodeError::BudgetExhausted { resource, reason }`; `SoftLimit` emits +an `AgentEvent::BudgetThresholdHit { kind: "soft", .. }` and proceeds. + +Wire from Rust today (Node/Python wrappers will follow): + +```rust +let guard: Arc<dyn BudgetGuard> = /* host-supplied impl */; +let opts = SessionOptions::new().with_budget_guard(guard); +``` + +### Cluster event vocabulary + +`AgentEvent` (non-exhaustive) carries platform-level events the host +emits via `HookExecutor`: + +- `BudgetThresholdHit { resource, kind, consumed, limit, message? }` +- `PassivationRequested { reason, deadline_ms? }` +- `PeerInvocation { from_session_id, from_tenant_id?, correlation_id? }` + +In-session hooks subscribe to these to react uniformly regardless of +how the host's transport delivers them. + +### Deterministic IDs / time + +`HostEnv { id_generator, clock }` replaces the default +`uuid::Uuid::new_v4()` + wall-clock pair. Replay tooling configures +`SequentialIdGenerator` + `FixedClock` to recreate a run bit-identical +on another node. + +### Loop checkpoints + run resumption + +When a `SessionStore` is configured, the agent loop persists a +`LoopCheckpoint` after each completed tool round, keyed by `run_id`. 
+Any node holding the same store can rehydrate a run from its last +boundary: + +```ts +// Node — host detected node A died mid-run; on node B: +const session = agentB.session(workspace, { + sessionStore: new FileSessionStore('./sessions'), + sessionId: 'session-from-node-a', +}); +const result = await session.resumeRun('run-id-from-node-a'); +``` + +```python +# Python equivalent +opts = SessionOptions() +opts.session_store = FileSessionStore('./sessions') +opts.session_id = 'session-from-node-a' +session = agent_b.session(workspace, opts) +result = session.resume_run('run-id-from-node-a') +``` + +A **new** run id is allocated for the resumed work — the framework +does not pretend the old run continues. Two distinguishable error +paths: + +- `"resume_run requires a session_store"` — host should fall back to + a fresh session. +- `"no loop checkpoint found for run 'X'"` — host can retry later + (race against checkpoint write) or treat the run as lost. + +Boundary policy: checkpoints are taken **only between tool rounds**, +never mid-tool. If a process dies mid-tool the work of that round is +lost; the LLM re-deliberates from the previous boundary. This trades +retry cost for correctness — re-executing a non-idempotent tool +across the boundary is worse than re-asking the LLM. 
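The two error paths above lend themselves to a small host-side dispatcher. The sketch below is one possible shape, assuming the error text reaches Python as the message of a raised exception; the exception type is not stated here. `ResumeAction`, `classify_resume_error`, and `resume_with_retry` are hypothetical helper names, and only the two quoted message fragments come from the contract.

```python
import time
from enum import Enum


class ResumeAction(Enum):
    FRESH_SESSION = "fresh_session"   # no store configured: start over
    RETRY_OR_LOST = "retry_or_lost"   # checkpoint not (yet) visible
    UNKNOWN = "unknown"               # anything else: let the host decide


def classify_resume_error(message):
    """Map a resume_run error message onto a host-side action."""
    if "resume_run requires a session_store" in message:
        return ResumeAction.FRESH_SESSION
    if "no loop checkpoint found for run" in message:
        return ResumeAction.RETRY_OR_LOST
    return ResumeAction.UNKNOWN


def resume_with_retry(resume, run_id, attempts=3, delay_s=0.5):
    """Call resume(run_id), retrying only while the checkpoint is missing.

    `resume` is any callable, e.g. a bound session.resume_run (assumed).
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for _ in range(attempts):
        try:
            return resume(run_id)
        except Exception as exc:
            if classify_resume_error(str(exc)) is not ResumeAction.RETRY_OR_LOST:
                raise  # not a checkpoint race; surface it immediately
            last_error = exc
            time.sleep(delay_s)
    # Retries exhausted: the host may now treat the run as lost.
    raise last_error
```

Matching on message fragments is brittle by nature; if the SDK exposes typed errors, branching on those would be the sturdier choice.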
diff --git a/crates/code b/crates/code index ef01792..975da86 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit ef01792f02c0c8bebc02adfcf22784f428e24a71 +Subproject commit 975da861334d3b94d50164fd655b82ab049d0918 From f6a93f60a7f78b481516e3b3eeb2f1f27adf7069 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 18:39:30 +0800 Subject: [PATCH 08/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20retention=20caps=20for=20in-memory=20stores?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up `SessionRetentionLimits` with four optional FIFO caps: max_runs_retained / max_events_per_run / max_trace_events / max_terminal_subagent_tasks. Plumbs through SessionOptions::with_retention_limits → AgentConfig → store constructors so long-running cluster sessions stop accumulating memory unboundedly. Defaults stay unbounded — existing callers see no behaviour change. Eviction policy preserves the most-recent entries (useful for debugging) and never drops Running subagent tasks. 1692 unit + 9 integration tests green; clippy clean. --- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index 975da86..ef2501f 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit 975da861334d3b94d50164fd655b82ab049d0918 +Subproject commit ef2501f6ec45db09689505c9b573c3752cb3e4b1 From 6ad4a3f331bcdef485b382f5950c3d6384a357dd Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 18:44:53 +0800 Subject: [PATCH 09/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20retention=20caps=20+=20resume=5Frun=20E2E=20test?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up SessionRetentionLimits with FIFO caps on RunStore / TraceSink / SubagentTracker plus the E2E happy-path test for resume_run that locks the P3 contract surface 书安OS will sit on. 
Defaults stay unbounded — pure additions. --- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index ef2501f..c91e267 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit ef2501f6ec45db09689505c9b573c3752cb3e4b1 +Subproject commit c91e2675cdc2dcf1be1120c175d19a9fa907c7a1 From 6ce60af32b1649724d4765175f3a955a11c200f1 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Thu, 28 May 2026 18:54:36 +0800 Subject: [PATCH 10/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20MCP=20idle=20disconnect?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up McpManager::disconnect_idle + Agent::disconnect_idle_mcp. Hosts now have a focused entry point to reap quiet MCP subprocesses without losing the registered config — paired with the in-memory retention caps shipped earlier this batch, the framework no longer leaks memory / FDs across long-running cluster workloads. 
--- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index c91e267..e218137 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit c91e2675cdc2dcf1be1120c175d19a9fa907c7a1 +Subproject commit e2181378d3b5ee1c4d7ce5e87afdcb808becbb73 From d7c35ebb115684ce15438e1c86c3daa639dc45ab Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Fri, 29 May 2026 08:28:18 +0800 Subject: [PATCH 11/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20BudgetGuard=20SDK=20propagation=20(Python=20+=20Nod?= =?UTF-8?q?e)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up Python (PyBudgetGuard via Python::with_gil) and Node (NodeBudgetGuard via ThreadsafeFunction) bridges, plus the small framework addition (AgentSession::set_budget_guard) that lets the Node SDK install a JS-backed guard after session construction — required because JsFunction values can't live in the value-typed SessionOptions struct. Both SDKs use the same decision shape ({decision:'allow'|'soft'|'deny', ...}) and the same fail-safe defaults (unknown shapes / callback errors → Allow). 
--- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index e218137..dedaa4e 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit e2181378d3b5ee1c4d7ce5e87afdcb808becbb73 +Subproject commit dedaa4ea10d35689e934601234aa001129cae40f From 753ddaaf9df25ea91f743071b45e9af38a5755ef Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Fri, 29 May 2026 08:31:40 +0800 Subject: [PATCH 12/18] docs(code): retention caps + MCP idle + BudgetGuard SDK examples (en + cn) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three new sub-sections under "Cluster-grade extension points" so the operational additions ship with discoverable usage examples: - Retention caps for long-running sessions — SessionRetentionLimits.with_max_runs / max_events_per_run / max_trace_events / max_terminal_subagent_tasks. Notes that running subagent tasks are never evicted and that SDK shapes follow later. - MCP idle disconnect — agent.disconnectIdleMcp / disconnect_idle_mcp with a periodic-sweeper example for both SDKs. Calls out McpManager.touch for side-channel keep-warm. - BudgetGuard SDK bridges — decision-shape table (allow/soft/deny) shared across Python and Node, Python class-style attach via SessionOptions.budget_guard, Node setBudgetGuard({...}) handler attach (justified by JsFunction lifetime), and the "callback errors fall back to Allow" fail-safe. Bumps crates/code submodule for the matching README update. 
--- .../content/docs/cn/code/api-contract.mdx | 83 ++++++++++++++ .../content/docs/en/code/api-contract.mdx | 101 ++++++++++++++++++ crates/code | 2 +- 3 files changed, 185 insertions(+), 1 deletion(-) diff --git a/apps/docs/content/docs/cn/code/api-contract.mdx b/apps/docs/content/docs/cn/code/api-contract.mdx index e1ca951..0bc1cea 100644 --- a/apps/docs/content/docs/cn/code/api-contract.mdx +++ b/apps/docs/content/docs/cn/code/api-contract.mdx @@ -538,3 +538,86 @@ resume 出来的会**分配一个全新的 run id** — 框架不假装旧 run - `"no loop checkpoint found for run 'X'"` — host 可以稍等重试(checkpoint 写入竞态),或当 run 已丢失。 **边界策略**:checkpoint 只在 tool round **之间**取,不在工具执行中途取。进程在工具执行中途死掉时,这一轮的工作会丢失,LLM 从前一个边界重新思考。这是用"重试成本"换"正确性" — 把非幂等工具(write、bash)在边界两侧重跑比让 LLM 重想要糟得多。 + +### 长跑 session 的保留上限 + +`SessionRetentionLimits` 让 host 给四种 in-memory 存储设上限:run 记录、每 run 的事件、trace 事件、**终态的** subagent 任务快照。每个字段都是可选的(`None` 保持原本无上限的默认 — 短 session 没问题,小时/天级的就漏内存)。FIFO 严格按插入序丢;**Running 状态的** subagent 任务永不被丢。 + +```rust +use a3s_code_core::retention::SessionRetentionLimits; + +let limits = SessionRetentionLimits::new() + .with_max_runs(100) + .with_max_events_per_run(5_000) + .with_max_trace_events(10_000) + .with_max_terminal_subagent_tasks(1_000); + +let opts = SessionOptions::new().with_retention_limits(limits); +``` + +上限建议跟 host 自己 Prometheus / 观测系统的内存预算保持一致。SDK 直接调用形式后续补。 + +### MCP 闲置断开 + +`Agent::disconnect_idle_mcp(threshold_ms)` 扫描所有已连接的 MCP server,把"最后活跃时间"早于 `now - threshold_ms` 的全部断开。注册的配置**保留** — 后续 tool 调用会按需重连。返回被断开的 server 名称列表。 + +```ts +// Node — 周期回收闲置 MCP 子进程 +setInterval(async () => { + const dropped = await agent.disconnectIdleMcp(5 * 60 * 1000); // 5min + if (dropped.length) { + console.log('reaped idle MCP servers:', dropped); + } +}, 60_000); +``` + +```python +# Python — 等价 +dropped = agent.disconnect_idle_mcp(5 * 60 * 1000) +``` + +每次 `connect` 和成功的 `call_tool` 都会刷新活跃时间。Host 走旁路通道路由 tool 时,可以手动 `McpManager.touch(name)` 把 server 保温。 + +### BudgetGuard 的 SDK 桥接 + +两个 SDK 共用同一个决策返回形状: + 
+| 返回值 | 效果 | +|-----------------------------------------------------------------------|-------------------------------------------------------------------------------| +| `None` / `null` / `{decision:'allow'}` | 静默放行 | +| `{decision:'soft', resource, consumed, limit, message?}` | 发射 `BudgetThresholdHit('soft')` 事件,继续执行 | +| `{decision:'deny', resource, reason}` | 中止调用,Python 抛 `RuntimeError("Budget exhausted...")`/Node reject 同样的错误 | + +guard 对象上缺失的方法 = 宽松默认(Allow / no-op);callback 抛错 → fallback 到 Allow,异常的 guard 不会拖垮 live session。 + +```python +# Python — 通过 SessionOptions 在 session 构造前挂上 +class MyGuard: + def check_before_llm(self, session_id, estimated_tokens): + return {"decision": "deny", "resource": "llm_tokens", "reason": "cap"} + def record_after_llm(self, session_id, usage): + track(session_id, usage["total_tokens"]) + +opts = SessionOptions() +opts.budget_guard = MyGuard() +session = agent.session(workspace, opts) +``` + +```ts +// Node — session 构造后通过 setBudgetGuard 挂上。 +// JsFunction 不能塞进值类型的 SessionOptions,所以 guard 在 Session 上注册, +// 下一次 send/stream 生效。 +session.setBudgetGuard({ + checkBeforeLlm: (sessionId, estimatedTokens) => { + if (overBudget(sessionId)) { + return { decision: 'deny', resource: 'llm_tokens', reason: 'cap' }; + } + return null; + }, + recordAfterLlm: (sessionId, usage) => { + track(sessionId, usage.total_tokens); + }, +}); +``` + +Node 用 `setBudgetGuard(null)` 清除;Python 把 `opts.budget_guard` 设回 `None` 后重建 session。 diff --git a/apps/docs/content/docs/en/code/api-contract.mdx b/apps/docs/content/docs/en/code/api-contract.mdx index 12cb571..56a13b9 100644 --- a/apps/docs/content/docs/en/code/api-contract.mdx +++ b/apps/docs/content/docs/en/code/api-contract.mdx @@ -580,3 +580,104 @@ never mid-tool. If a process dies mid-tool the work of that round is lost; the LLM re-deliberates from the previous boundary. This trades retry cost for correctness — re-executing a non-idempotent tool across the boundary is worse than re-asking the LLM. 
+ +### Retention caps for long-running sessions + +`SessionRetentionLimits` lets the host cap the four in-memory stores +that grow with session age: the run records, per-run event buffers, +trace events, and **terminal** subagent task snapshots. Each cap is +optional (`None` keeps the unbounded default — fine for short +sessions, a memory leak for hour- or day-long ones). Eviction is +strict FIFO; running subagent tasks are never dropped. + +```rust +use a3s_code_core::retention::SessionRetentionLimits; + +let limits = SessionRetentionLimits::new() + .with_max_runs(100) + .with_max_events_per_run(5_000) + .with_max_trace_events(10_000) + .with_max_terminal_subagent_tasks(1_000); + +let opts = SessionOptions::new().with_retention_limits(limits); +``` + +The host should pick caps from the same observability budget that +caps the rest of its in-memory state (Prometheus carries history +anyway). SDK shapes for retention land in a follow-up. + +### MCP idle disconnect + +`Agent::disconnect_idle_mcp(threshold_ms)` walks the connected MCP +servers and drops any whose last activity is older than +`now - threshold_ms`. The server's *registered config* stays — a +later tool call will reconnect on demand. Returns the names of +disconnected servers. + +```ts +// Node — periodically reap quiet MCP subprocesses. +setInterval(async () => { + const dropped = await agent.disconnectIdleMcp(5 * 60 * 1000); // 5 min + if (dropped.length) { + console.log('reaped idle MCP servers:', dropped); + } +}, 60_000); +``` + +```python +# Python — same shape. +dropped = agent.disconnect_idle_mcp(5 * 60 * 1000) +``` + +Activity is stamped on `connect` and on every successful `call_tool`. +Hosts that route tool traffic through a side channel can call +`McpManager.touch(name)` to manually keep a server warm. 
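For Python hosts that want the same periodic sweep as the Node `setInterval` example, a background thread is one option. The sketch below assumes `agent.disconnect_idle_mcp(threshold_ms)` is a blocking call that returns a list of server names, as the one-line Python example suggests; `start_idle_mcp_sweeper` and its parameters are hypothetical names for illustration.

```python
import threading


def start_idle_mcp_sweeper(agent, threshold_ms=5 * 60 * 1000,
                           interval_s=60.0, on_reaped=print):
    """Periodically reap idle MCP servers until the returned event is set."""
    stop = threading.Event()

    def _loop():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop sweeps every interval_s seconds and exits promptly.
        while not stop.wait(interval_s):
            dropped = agent.disconnect_idle_mcp(threshold_ms)
            if dropped:
                on_reaped(dropped)

    thread = threading.Thread(target=_loop, name="mcp-idle-sweeper",
                              daemon=True)
    thread.start()
    return stop, thread
```

Calling `stop.set()` followed by `thread.join()` shuts the sweeper down cleanly. An exception raised by `disconnect_idle_mcp` would end the loop, so a production host would likely wrap the call in its own error handling.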
+ +### BudgetGuard SDK bridges + +Both SDKs accept the same decision shape: + +| Return | Effect | +|-----------------------------------------------------------------------|-----------------------------------------------------------------------| +| `None` / `null` / `{decision:'allow'}` | proceed silently | +| `{decision:'soft', resource, consumed, limit, message?}` | emit `BudgetThresholdHit('soft')` event, proceed | +| `{decision:'deny', resource, reason}` | abort the call, throw `RuntimeError("Budget exhausted...")` (Python) | +| | / reject with `"Budget exhausted..."` (Node) | + +Missing methods on the guard object are treated as a permissive +default (Allow / no-op). Callback errors fall back to Allow — a +misbehaving guard cannot halt a live session. + +```python +# Python — attach via SessionOptions before agent.session(...) +class MyGuard: + def check_before_llm(self, session_id, estimated_tokens): + return {"decision": "deny", "resource": "llm_tokens", "reason": "cap"} + def record_after_llm(self, session_id, usage): + track(session_id, usage["total_tokens"]) + +opts = SessionOptions() +opts.budget_guard = MyGuard() +session = agent.session(workspace, opts) +``` + +```ts +// Node — attach via session.setBudgetGuard after construction. +// JsFunction values can't live inside the value-typed SessionOptions, +// so the guard is installed on the Session itself; takes effect on +// the next send/stream. +session.setBudgetGuard({ + checkBeforeLlm: (sessionId, estimatedTokens) => { + if (overBudget(sessionId)) { + return { decision: 'deny', resource: 'llm_tokens', reason: 'cap' }; + } + return null; + }, + recordAfterLlm: (sessionId, usage) => { + track(sessionId, usage.total_tokens); + }, +}); +``` + +Pass `null` to `setBudgetGuard` (Node) or set `opts.budget_guard = +None` and re-create the session (Python) to clear. 
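To make the three decision shapes concrete, here is a minimal sketch of a guard that tracks tokens per session and walks through allow, soft, and deny. The class name `TokenCapGuard`, the two thresholds, and the assumption that `estimated_tokens` arrives as an integer are illustrative choices; only the method names, the `usage["total_tokens"]` field, and the returned dictionary keys follow the table and example above.

```python
from collections import defaultdict


class TokenCapGuard:
    """Per-session token cap: warn above soft_limit, deny above hard_limit."""

    def __init__(self, soft_limit, hard_limit):
        if soft_limit > hard_limit:
            raise ValueError("soft_limit must not exceed hard_limit")
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self._consumed = defaultdict(int)  # session_id -> tokens used so far

    def check_before_llm(self, session_id, estimated_tokens):
        consumed = self._consumed[session_id]
        projected = consumed + estimated_tokens
        if projected > self.hard_limit:
            return {
                "decision": "deny",
                "resource": "llm_tokens",
                "reason": "projected %d tokens exceeds cap %d"
                          % (projected, self.hard_limit),
            }
        if projected > self.soft_limit:
            return {
                "decision": "soft",
                "resource": "llm_tokens",
                "consumed": consumed,
                "limit": self.soft_limit,
                "message": "approaching token cap",
            }
        return None  # same effect as {"decision": "allow"}

    def record_after_llm(self, session_id, usage):
        # Accumulate actual usage reported after the call completes.
        self._consumed[session_id] += usage["total_tokens"]
```

Attaching it would follow the same pattern as `MyGuard` above, for example `opts.budget_guard = TokenCapGuard(50_000, 100_000)`. In a real cluster the counters would presumably live in shared storage rather than in process memory.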
diff --git a/crates/code b/crates/code index dedaa4e..6431ac5 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit dedaa4ea10d35689e934601234aa001129cae40f +Subproject commit 6431ac515970bc9ce0478a5edd0e88e1ed566ad8 From 318d2c221b9c658651bb2935a4b2ec9eaf7a7012 Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Fri, 29 May 2026 08:39:18 +0800 Subject: [PATCH 13/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20SessionRetentionLimits=20SDK=20propagation?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up Python `opts.retention_limits = {dict}` and Node `opts.retentionLimits = {object}` shapes. Both forward into the framework's SessionRetentionLimits and into the per-session store construction. Missing fields keep the unbounded default. --- crates/code | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/code b/crates/code index 6431ac5..8bfed74 160000 --- a/crates/code +++ b/crates/code @@ -1 +1 @@ -Subproject commit 6431ac515970bc9ce0478a5edd0e88e1ed566ad8 +Subproject commit 8bfed747df3c37ee8b918ae1e101909e3de5835e From f9425c530dbac8e46626865b97863c998554b0ba Mon Sep 17 00:00:00 2001 From: Roy Lin Date: Fri, 29 May 2026 08:42:53 +0800 Subject: [PATCH 14/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?= =?UTF-8?q?=E2=80=94=20cluster=20ops=20consolidation=20test?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Picks up `cluster_ops_consolidated_session_lifecycle`, a single integration test that exercises identity labels + subagent persistence + LoopCheckpoint round-trip across two simulated nodes sharing one MemorySessionStore. Reference flow for 书安OS-side scheduling. 
---
 crates/code | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crates/code b/crates/code
index 8bfed74..9f05154 160000
--- a/crates/code
+++ b/crates/code
@@ -1 +1 @@
-Subproject commit 8bfed747df3c37ee8b918ae1e101909e3de5835e
+Subproject commit 9f05154d086931eb8a7d328b856e68ac3f6c7a0e

From 9fcc0894d71defc4eccc398698206b36aad54446 Mon Sep 17 00:00:00 2001
From: Roy Lin
Date: Fri, 29 May 2026 10:19:02 +0800
Subject: [PATCH 15/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?=
 =?UTF-8?q?=E2=80=94=20cluster-pillars=20review=20hardening=20(11=20fixes)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Folds in the full fix batch from the adversarial multi-dimension review
of the cluster-pillars work (11 confirmed findings, 1 rejected):

  core (4b35537): H4 checkpoint leak + crash-atomic write; H3
    event_count corruption; H2 resume_run metric loss; M1/M2 eviction
    TOCTOU; M3 MCP timestamp leak; L1 registry prune.
  sdk (281dc58): H1 Node BudgetGuard fail-closed (timeout/parse ->
    Deny, not Allow) + documented no-throw constraint; M4
    disconnect_idle_mcp exposed in both SDKs (docs now true); L2
    Python re-entrancy doc.

1705 lib + 10 integration green; Node 27 + Python 19 cargo tests; all
SDK smokes pass; clippy clean across core + both SDKs.
---
 crates/code | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crates/code b/crates/code
index 9f05154..281dc58 160000
--- a/crates/code
+++ b/crates/code
@@ -1 +1 @@
-Subproject commit 9f05154d086931eb8a7d328b856e68ac3f6c7a0e
+Subproject commit 281dc582f3c000f89774fa2975eedfebd4c74bf2

From 14a494dc5c928e63384936a1b51179147f14ef8e Mon Sep 17 00:00:00 2001
From: Roy Lin
Date: Fri, 29 May 2026 11:19:04 +0800
Subject: [PATCH 16/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?=
 =?UTF-8?q?=E2=80=94=20v3.3.0=20release=20prep?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Points to the a3s-code v3.3.0 release-prep commit: all package versions
synced to 3.3.0, CHANGELOG entry added, SDK sources fmt-clean. Full core
suite green (1705 lib + all integration files). Not pushed / not tagged.
---
 crates/code | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crates/code b/crates/code
index 281dc58..cbd5b20 160000
--- a/crates/code
+++ b/crates/code
@@ -1 +1 @@
-Subproject commit 281dc582f3c000f89774fa2975eedfebd4c74bf2
+Subproject commit cbd5b204092e76facacd9fe34aaf82066d50468e

From 85784ea50f7383760c611362775e81e0925ef97b Mon Sep 17 00:00:00 2001
From: Roy Lin
Date: Fri, 29 May 2026 11:36:16 +0800
Subject: [PATCH 17/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?=
 =?UTF-8?q?=E2=80=94=20real-LLM=20cluster-feature=20tests?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Picks up core/tests/test_real_llm_cluster_features.rs: 5 #[ignore]
end-to-end tests validating the 3.3.0 LLM-loop features against a live
provider. Verified passing against openai/MiniMax-M2.7-highspeed.
---
 crates/code | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crates/code b/crates/code
index cbd5b20..cd991a5 160000
--- a/crates/code
+++ b/crates/code
@@ -1 +1 @@
-Subproject commit cbd5b204092e76facacd9fe34aaf82066d50468e
+Subproject commit cd991a5222fcd8a0b8c2fe30aaa7ea6502613bec

From 84ed6d656c8e4ca3a508f663111b6c79cceea08e Mon Sep 17 00:00:00 2001
From: Roy Lin
Date: Fri, 29 May 2026 13:51:55 +0800
Subject: [PATCH 18/18] =?UTF-8?q?chore(code):=20bump=20submodule=20?=
 =?UTF-8?q?=E2=80=94=20v3.3.0=20released=20(crates.io/npm/PyPI/GH)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Points crates/code at 44702931 (v3.3.0 tag + the bootstrap test fix from
AI45Lab/Code#48). Release is live on all four registries.
---
 crates/code | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crates/code b/crates/code
index cd991a5..4470293 160000
--- a/crates/code
+++ b/crates/code
@@ -1 +1 @@
-Subproject commit cd991a5222fcd8a0b8c2fe30aaa7ea6502613bec
+Subproject commit 44702931ea7f0cfb26580ea9e6e1bad58729b908