diff --git a/docs/design-docs/agentic-backend-readiness.md b/docs/design-docs/agentic-backend-readiness.md new file mode 100644 index 000000000..2a4cfea7e --- /dev/null +++ b/docs/design-docs/agentic-backend-readiness.md @@ -0,0 +1,389 @@ +# Agentic Backend Readiness + +Spacebot's architecture — per-agent durable storage, multi-agent communication graph, the `pending_approval` task pattern, sandbox containment, skill injection — already supports a deployment shape that the dev-coding harness lane can't: **embedding Spacebot as a backend dependency of a parent application that runs many agents on behalf of its own tenants**. One agent per customer, per entity, per listing — wired into the parent app via webhooks and an HTTP control surface. + +This doc is the cluster of changes that close the gap between "Spacebot can do this in principle" and "Spacebot is genuinely ready to be the agentic backend of a production multi-tenant application." Five changes, all surgical, none architectural rewrites. + +## The use case + +The parent application has many entities (customers, accounts, listings, projects — domain depends). Each entity benefits from a dedicated, persistent agent that: + +- Owns its entity's history forever (not polluted by other entities' context) +- Holds isolated credentials (vendor keys, OAuth tokens) that other entities' agents must not see +- Wakes only when there's something to do — entities are mostly idle most of the time +- Reports outcomes back to the parent app via the existing HTTP / webhook surface +- Fails gracefully when blocked (captcha, rate limit, vendor outage) without retry-looping +- Can be bounded with predictable wall-clock timeouts so the parent app can plan + +Spacebot supports the *shape* of this today (multi-agent first-class, per-agent SQLite + LanceDB + identity files, `send_agent_message` for cross-agent dispatch, `pending_approval` for human-in-the-loop). The friction is in the operational sharp edges that matter when you go from "a handful of agents on one developer's machine" to "thousands of dormant agents on a shared instance, each holding its tenant's credentials, expected to run reliably without supervision." + +## What needs to change + +| Change | Why | Section | +|---|---|---| +| 1. **Per-agent secret isolation** | Today all agents on one instance share `secrets.redb` — agent A can read agent B's tool secrets. Multi-tenant deployments cannot ship without scoped credentials. | [§1](#1-per-agent-secret-isolation) | +| 2. **Dormant cortex mode** | Today every agent's cortex ticks on `tick_interval_secs`. With thousands of mostly-idle agents this is wasted bulletin generation that's also stale by the time it's used. | [§2](#2-dormant-cortex-mode) | +| 3. **Wall-clock worker timeout** | Today workers have segment cap (15 × 10 = 150 turns) but no wall-clock bound. A stuck browser session can drift indefinitely. | [§3](#3-wall-clock-worker-timeout) | +| 4. **Browser captcha / blocked detection** | Today `tools/browser.rs` has SSRF protection only. Captcha and fraud-detect manifest as opaque retry-loops or generic errors instead of structured "blocked: needs human" outcomes. | [§4](#4-browser-captcha--blocked-detection) | +| 5. **Per-agent cron defaults** | Today cron timeout is 1500s with optional per-job override. Agent-class defaults (e.g. research-heavy agents need 30min default) need to be agent-level, not per-job. | [§5](#5-per-agent-cron-defaults) | + +These are independent changes that can ship in any order. Together they form a deployment-readiness milestone for the agentic-backend use case. + +--- + +## 1. Per-agent secret isolation + +### Today + +`main.rs:1480-1543` consolidated what were once per-agent secret stores into a single instance-level `secrets.redb`. The migration moved data from `/agents//data/secrets.redb` into `/data/secrets.redb`. This was the right call for a single-user instance with a handful of agents — one master key, simpler backup, easier credential reuse across agents. + +For agentic-backend deployments it's the wrong shape. Every agent calling `tool_secret_names()` (`store.rs:532`, no agent argument today) sees the entire instance set. `Sandbox` already holds a `secrets_store: ArcSwap>>` (`sandbox.rs:170`) and `wrap()` injects via `tool_secrets()` (`sandbox.rs:256-262`), but the read is unscoped — same store regardless of which agent is spawning the worker. There is no scoping primitive. + +In a multi-tenant deployment this means a worker spawned by agent-A reads its tenant's vendor API key out of the env, but the same worker, if it deviates from its prompt, can call `printenv` and see the API keys of every other tenant on the instance. That's a confidentiality breach by construction, not by bug. + +### Design + +Keep the master key instance-level (one OS-credential-store entry, one Argon2id-derived KEK). Scope the **records** by agent. + +**Composes with existing `SecretCategory`.** `SecretCategory { System, Tool }` already exists (`secrets/store.rs:78`) and auto-categorizes LLM/messaging keys as `System` (excluded from subprocess env via `auto_categorize` at `:1138-1164`). The new `SecretScope` is **orthogonal** to category — keying becomes effectively `(scope, name)` with category preserved as a row attribute. `System` secrets stay `InstanceShared` always (singleton `LlmManager` / `MessagingManager` consume them). `Tool` secrets default to `Agent(...)` for agentic-backend deployments; admin can promote any `Tool` secret to `InstanceShared` when all agents legitimately share it (e.g. a single-tenant `GH_TOKEN`). + +**Schema:** `secrets.redb` rows keyed by `(scope, name)` instead of `name`. Migration is a one-shot in-place key rewrite — every legacy unprefixed key becomes `InstanceShared:` (preserves current behavior; nothing breaks). Admin reclassifies `Tool` secrets to `Agent(...)` post-migration as needed. + +**`tool_secret_names()` and friends take `agent_id`:** returns the calling agent's tool secrets — i.e. `InstanceShared(Tool) ∪ Agent(this_agent, Tool)`. New signatures: `fn tool_secret_names(&self, agent_id: &AgentId) -> Vec`, same for `tool_env_vars` (`store.rs:499`) and `tool_secret_pairs`. + +**`wrap()` reads scoped secrets:** `Sandbox` is already constructed per-agent (one per `AgentDeps`, see `lib.rs:450`). Pass `agent_id` into `Sandbox::new` once; `Sandbox::tool_secrets()` then calls `store.tool_env_vars(&self.agent_id)`. `wrap()` callers don't change. Injection still flows via `--setenv` (bubblewrap) or `Command::env()` (passthrough / sandbox-exec). + +This composes with the env-sanitization work in `sandbox-hardening.md` — `--clearenv` strips inherited vars, then per-agent tool secrets are re-injected. System secrets (LLM keys, messaging tokens) stay out of every worker subprocess regardless. + +**Instance-shared escape hatch:** some secrets legitimately are instance-level (the LLM key). A typed enum `SecretScope { InstanceShared, Agent(AgentId) }` lets callers be explicit. Default for tool secrets created via the dashboard / API is `Agent(...)`. `InstanceShared` is admin-only and rare. + +### Schema migration + +```rust +// Old key +struct SecretKey { + name: String, +} + +// New key +enum SecretScope { + InstanceShared, + Agent(AgentId), +} + +struct SecretKey { + scope: SecretScope, + name: String, +} +``` + +Migration: existing records become `InstanceShared` rows. The dashboard surfaces both scopes; a one-time admin task may reclassify select tool secrets to per-agent. The migration is non-destructive — `InstanceShared` always remains a valid scope. + +### Files Changed + +| File | Change | +|------|--------| +| `src/secrets/store.rs` | Add `SecretScope` enum; rekey rows by `(scope, name)`; `tool_secret_names(agent_id)`, `tool_env_vars(agent_id)`, `tool_secret_pairs(agent_id)` filter by scope | +| `src/secrets/migration.rs` (new) | One-shot in-place key rewrite: legacy unprefixed → `InstanceShared:`. Idempotent. | +| `src/sandbox.rs` | `Sandbox::new` accepts owning `agent_id`; `tool_secrets()` calls `store.tool_env_vars(&agent_id)`. Sandbox is already per-agent, so `wrap()` callers don't change. | +| `src/agent/cortex.rs:3740`, `src/agent/channel_dispatch.rs:600,1279`, `src/tools/spawn_worker.rs:446` | Pass `&deps.agent_id` to `tool_secret_names()` (4 production callsites) | +| `src/tools.rs:971`, `src/secrets/scrub.rs:229` | Scrubber pair source becomes scoped — only the calling agent's tool secrets are redaction candidates | +| `src/api/secrets.rs` | API surface accepts `scope` on create; lists by scope; new `GET /api/agents/:id/secrets` for per-agent listing | +| `interface/...` | Dashboard scope picker on secret create; per-agent secret listing | + +--- + +## 2. Dormant cortex mode + +### Today + +Every agent spawns four loops on startup (`spawn_warmup_loop`, `spawn_cortex_loop`, `spawn_association_loop`, `spawn_ready_task_loop` — defined at `agent/cortex.rs:1754`, `:1805`, `:3682`, `:3694`, called from `main.rs:3740-3754` and `api/agents.rs:1024-1028`). The cortex loop ticks on `tick_interval_secs`, regenerating the memory bulletin via `memory_recall` and observing system signals; memory consolidation/decay runs inside it. `spawn_ready_task_loop` independently polls `task_store.claim_next_ready(agent_id)` for `ready` tasks. + +This is correct for an actively-used agent — the bulletin needs to be fresh when a user message arrives, ready tasks need to be picked up between user interactions. It's wasteful for an agent that's mostly idle. At thousands of agents on a shared instance, regenerating bulletins every 60 minutes for agents that haven't been triggered in days is meaningful LLM cost and produces bulletins that are stale by the time they're consulted (the agent wakes hours or days later when an external trigger arrives, with a bulletin from the wrong moment). + +### Design + +A new per-agent setting `cortex_mode: active | dormant` (default `active` for backwards compat). + +When `dormant`: + +- None of the four cortex loops are spawned. No periodic bulletin generation, no ready-task polling. +- Tasks get picked up via wake-on-trigger paths instead. +- Memory consolidation / decay (currently passive cortex work) moves to a separate instance-wide janitor cron that walks all agents on a slow schedule. See "Memory maintenance" below. + +Wake triggers (each sends into `wake_tx: mpsc::Sender<(AgentId, WakeTrigger)>`, consumed by a central `WakeManager` that calls `cortex::wake(agent_id, trigger)`): + +1. **`send_agent_message` post-insert hook** — when another agent dispatches a task to a dormant agent, the wake fires synchronously after `task_store.create()` returns (`tools/send_agent_message.rs:226`). +2. **Messaging / webhook routing dispatch** — `MessagingManager` and `webhook.rs` are agent-blind; binding-to-agent resolution happens in the `main.rs:2192` event loop ("Main event loop: route inbound messages to agent channels"). Wake fires from that loop after the binding resolves an `agent_id`, before spawning the channel or dispatching the payload. Covers Discord / Slack / Telegram / webchat / webhook uniformly. +3. **Cron timer fire** — cron jobs still work in dormant mode; they're just one more wake trigger. Each cron fire wakes the agent, runs the cron channel, then re-sleeps after grace. +4. **Internal admin wake API** — `POST /api/agents/:id/wake` for debugging / testing. + +Wake `try_send` is non-blocking; on full channel, log and drop (the trigger work proceeds regardless — wake is best-effort bulletin-warming). + +Wake sequence: + +```text +trigger fires + → cortex::wake(agent_id, trigger) + → check cached bulletin: if within grace period, reuse + → else: synthesize bulletin from memory store (no LLM tool calls — pure memory query + RRF + single LLM synthesis call) + → cache bulletin with TTL = wake_grace_period_secs + → hand off to trigger-appropriate process: + ─ task triggers → spawn worker via existing ready-task path + ─ messaging triggers → spawn channel + ─ webhook triggers → dispatch into the existing handler + ─ cron triggers → spawn ephemeral channel (existing cron pattern) + → grace timer starts; on expiry without new triggers, return to fully dormant +``` + +The grace period matters because triggers cluster — three tasks dispatched to the same agent in 30 seconds shouldn't synthesize the bulletin three times. A short cache (default 5min) amortizes across bursts. + +### Memory maintenance for dormant agents + +Cortex's passive work (memory consolidation, decay, importance recalibration) currently runs as part of the tick loop. In dormant mode it can't, but it shouldn't run on every wake either (latency on the trigger path matters). + +Move it to an **instance-wide memory janitor**. A separate cron job that walks all agents on a slow schedule (default daily at off-peak), running consolidation/decay against each agent's memory store. The janitor doesn't fire wake — it operates directly against the memory store as a privileged background process. + +`active`-mode agents continue running their own cortex passes (the janitor is additive; consolidation is idempotent). `dormant` agents get their maintenance exclusively from the janitor. + +### Configuration + +```toml +[cortex] +mode = "active" # or "dormant" +tick_interval_secs = 3600 # only consulted when mode = "active" +wake_grace_period_secs = 300 # only consulted when mode = "dormant" + +[memory_janitor] +enabled = true +schedule = "0 4 * * *" # daily at 04:00 instance-local +``` + +Per-agent `cortex.mode` override in agent config. `tick_interval_secs` ignored for dormant agents. `wake_grace_period_secs` ignored for active agents. + +### Files Changed + +| File | Change | +|------|--------| +| `src/agent/cortex.rs` | Branch on `cortex_mode`: skip warmup/cortex/association/ready_task loops when dormant; new `wake(agent_id, trigger)` entry point; per-agent bulletin cache with TTL | +| `src/config/types.rs` | `CortexConfig.mode: CortexMode { Active, Dormant }`; `wake_grace_period_secs`; `MemoryJanitorConfig { enabled, schedule }` on top-level `Config` | +| `src/agent/wake.rs` (new) | `WakeManager` registry + `wake_tx` mpsc dispatcher; per-agent `Mutex>` with TTL | +| `src/tools/send_agent_message.rs` | Post-insert hook (after `task_store.create()`) fires `wake_tx.try_send(...)` for the receiving agent | +| `src/cron/scheduler.rs` | Cron fire path fires `wake_tx.try_send(...)` for the owning agent (no-op branch if active) | +| `src/api/agents.rs` | `POST /api/agents/:id/wake` admin endpoint; gate the four cortex loops on `cortex_mode` at agent-create path (~line 1024) | +| `src/main.rs` | Event loop (line 2192) fires `wake_tx.try_send(...)` for messaging + webhook inbound after binding resolves `agent_id`; gate the four cortex loops at agent-startup path (~line 3740); spawn `WakeManager` and memory janitor at startup | +| `src/agent/maintenance.rs` (new) | Instance-wide memory janitor — walks agents on cron schedule; uses existing `memory::maintenance` machinery | + +### Cost framing + +For an instance running 2,000 mostly-idle agents on a 60-minute tick: ~48,000 LLM-backed bulletin generations per day. With dormant mode and an average of ~5 wakes per agent per day, that drops to ~10,000 bulletin generations — a 79% reduction without losing any quality (wake-time bulletins are *fresher* than tick-generated ones, since they're synthesized at the moment of work). For deployments billing through a fixed LLM provider plan, this is the difference between "fits within quota" and "needs metered billing." + +--- + +## 3. Wall-clock worker timeout + +### Today + +`agent/worker.rs:26,48` defines `TURNS_PER_SEGMENT = 15` and `MAX_SEGMENTS = 10`, giving a 150-turn ceiling. There is no wall-clock bound. A worker stuck on slow browser navigation, retried network calls, or a tool that yields slowly can run for hours before the segment ceiling kicks in. For agentic-backend deployments where the parent app expects to plan around predictable per-task latency, this is a problem. + +There's a transient-error retry budget (`MAX_TRANSIENT_RETRIES = 5`, `worker.rs:40`) and an overflow-retry budget (`MAX_OVERFLOW_RETRIES = 2`, `worker.rs:34`), but neither is wall-clock-shaped. + +### Design + +**Naming.** `CortexConfig.worker_timeout_secs` (`config/types.rs:1043`, default 600s) already exists as the supervisor's idle-kill bound, measured from `last_activity_at`. The new wall-clock bound is a different mechanism; pick a non-colliding name. Use `worker_wall_clock_timeout_secs` (or rename the supervisor field to `worker_idle_timeout_secs`). + +**Structured outcome surface.** Today `Worker::run` returns `Result` (`worker.rs:325`). There is no `WorkerOutcome` enum; the LLM signals completion via `set_status(kind: Outcome)` and the cron-only `SetOutcomeTool` writes `CronOutcome { content: String }` for scheduler delivery. This phase invents the structured surface from scratch — it's the shared substrate phase 4 also needs (see [§4](#4-browser-captcha--blocked-detection)). + +```rust +pub enum WorkerOutcome { + Success { result: String }, + Failed { reason: String }, + Partial { result: String, segments_run: usize }, // existing max-segment exit + Timeout { elapsed_secs: u64, segments_run: usize }, // this phase + Blocked { reason: BlockReason, url: Option, evidence: BlockEvidence }, // phase 4 +} +``` + +`Worker::run` becomes `-> Result`. Refactor the segment loop into `run_inner`; wrap the call in `tokio::time::timeout`: + +```rust +pub async fn run(mut self) -> Result { + let timeout = self.resolve_wall_clock_timeout(); // conversation > agent > 1800s default + match tokio::time::timeout(Duration::from_secs(timeout), self.run_inner()).await { + Ok(outcome) => outcome, + Err(_) => { + self.persist_transcript().await?; + self.write_failure_log("wall_clock_timeout").await?; + Ok(WorkerOutcome::Timeout { elapsed_secs: timeout, segments_run: self.segments_run }) + } + } +} +``` + +Wall-clock encompasses the entire worker lifetime, not per-segment. Per-segment bounds come from `MAX_TRANSIENT_RETRIES` and the segment cap; the wall-clock bound is a separate ceiling that catches the slow-drift case. + +**Resolution chain.** `ConversationSettings.worker_wall_clock_timeout_secs` → `ResolvedAgentConfig.worker_wall_clock_timeout_secs` → 1800s system default. Mirrors the existing `ResolvedConversationSettings::resolve` chain. + +```toml +[conversation_settings] +worker_wall_clock_timeout_secs = 1800 # 30 min — generous default for research-heavy work +``` + +Per-task override flows through the task description (already plumbed via `tool_budget` in the task-creation API). + +**Resumed workers.** Each `Worker::run` invocation gets a fresh wall-clock budget — the timer starts at resumption, not at original spawn. Interactive resumption (`worker.rs:387-395`) is treated as a new run for budgeting purposes. + +### Interaction with `set_outcome` + +`SetOutcomeTool` (`tools/set_outcome.rs`) is unrelated — it's the cron-job content-delivery tool that writes a `CronOutcome { content: String }` for the scheduler. It stays as-is; this phase doesn't extend it. The new `WorkerOutcome` is a Rust-side return value, not an LLM-callable tool. The LLM continues to signal completion intent via `set_status(kind: Outcome)`. + +Caller updates: `agent/channel_dispatch.rs`'s `WorkerCompletionError` (line 44) and `map_worker_completion_result` (line 94) are the existing convergence point; extend them to handle the new variants. `cortex.rs:3694+` (`spawn_ready_task_loop` → `pickup_one_ready_task` at `:3718`) and `tools/spawn_worker.rs` also pattern-match on the new enum. + +### Files Changed + +| File | Change | +|------|--------| +| `src/conversation/settings.rs` | Add `worker_wall_clock_timeout_secs: Option` | +| `src/config/types.rs` | Add `worker_wall_clock_timeout_secs: Option` to `ResolvedAgentConfig` | +| `src/agent/worker.rs` | New `WorkerOutcome` enum (shared with phase 4); `Worker::run` returns `Result`; segment loop extracted to `run_inner`; wrapped in `tokio::time::timeout` | +| `src/agent/channel_dispatch.rs`, `src/agent/cortex.rs`, `src/tools/spawn_worker.rs` | Update worker call sites for new return type; pattern-match `WorkerOutcome` | +| `src/api/tasks.rs` | Surface `Timeout` state in task detail responses | + +--- + +## 4. Browser captcha / blocked detection + +### Today + +`tools/browser.rs` lines 72-131 ship SSRF protection — blocks requests to cloud metadata endpoints, private IPs, loopback, and link-local addresses. Solid for what it does. Nothing for captcha, fraud-detect, login walls, or rate-limit ceilings. Browser sessions that hit these manifest as opaque DOM states or generic errors; the LLM sees confusing output and either retry-loops or invents a workaround that doesn't apply. + +For agentic-backend deployments where browser tooling runs against many vendors with varying defenses, this is a daily failure mode. Without a structured "blocked" signal, the agent can't capture the failed path in memory and avoid it next time. + +### Design + +Detection heuristics layered into `tools/browser.rs` after existing SSRF checks: + +**Captcha detection.** On page load and after navigation, scan the DOM and active iframes for known captcha frame patterns: + +- recaptcha: iframe `src` containing `google.com/recaptcha/`, `gstatic.com/recaptcha/` +- hcaptcha: iframe `src` containing `hcaptcha.com/captcha/` +- Cloudflare Turnstile: iframe `src` containing `challenges.cloudflare.com/turnstile/` +- Cloudflare challenge page: `cf-chl-bypass`, `cf-mitigated` headers; `cf_challenge_response` cookie name +- Generic challenge heuristics: presence of ``, ``, etc. + +**Login wall heuristics.** Detect when the navigation target redirected to a login URL: + +- Final URL host matches the navigated host but path is in a small set of login signals (`/login`, `/signin`, `/auth/`, `/account/login`) +- Status 401 / 403 with auth-prompting `WWW-Authenticate` headers + +**Rate-limit / fraud-detect.** Status-code-based: + +- `429 Too Many Requests` — block, capture `Retry-After` if present +- `403 Forbidden` with body matching common WAF banners (Akamai, Cloudflare, AWS WAF, Imperva) +- Empty 200 responses on pages where navigation was expected to land somewhere meaningful (heuristic, opt-in per-tool) + +**On detection.** The browser tool returns `BrowserToolError::Blocked(BlockReason)` with captured evidence. The worker's tool-error handler converts to `WorkerOutcome::Blocked` (the structured surface from [§3](#3-wall-clock-worker-timeout)), terminating the segment loop: + +```rust +WorkerOutcome::Blocked { + reason: BlockReason::Captcha { provider: "cloudflare-turnstile" }, + url: Some("https://example.com/signup".into()), + evidence: BlockEvidence { + screenshot: Option>, + html_snippet: Option, + request_headers: HashMap, + }, +} +``` + +The structured outcome is the worker's output. The calling agent's memory captures the blocked path so future attempts at the same URL/path don't retry the same dead end. + +### What this is not + +This is **detection**, not **bypass**. We don't try to solve captchas, rotate user-agents to evade fraud-detect, or use residential proxies. The signal exists so the agent can fail cleanly and surface to a human; the parent app's escalation path takes over from there. + +### Files Changed + +| File | Change | +|------|--------| +| `src/tools/browser.rs` | Add detection hooks at navigation/click/snapshot entry points; on positive detection return `BrowserToolError::Blocked(...)` with captured evidence | +| `src/tools/browser_detection.rs` (new) | Pure detection logic — testable in isolation. Flat-file alongside `browser.rs` (currently a 2729-line monolith). Module-restructure to `tools/browser/{mod,detection}.rs` is optional and out of scope for this phase. | +| `src/agent/worker.rs` | `BlockReason` and `BlockEvidence` types live next to `WorkerOutcome` from §3; `BrowserToolError::Blocked` → `WorkerOutcome::Blocked` conversion in the worker's tool-error path | + +--- + +## 5. Per-agent cron defaults + +### Today + +`cron/scheduler.rs:1387` defines a 1500s (25min) default timeout with optional per-job override (`CronJob.timeout_secs: Option`, defined at `scheduler.rs:39, 107`). Useful for the typical case but doesn't compose well with agent classes. Research-heavy agents legitimately need longer for a single cron firing (browser navigation + LLM reasoning + tool calls); short-task agents need bounded timeouts to fail fast. + +Setting per-job timeouts on every cron job for an agent class is repetitive and scales poorly. Per-agent default + per-job override is the right shape. + +### Design + +Add `cron_default_timeout_secs` to `ResolvedAgentConfig`. Cron resolution chain becomes: `job.timeout_secs` (per-job override) → `agent.cron_default_timeout_secs` (per-agent default) → `1500` (existing system default, preserved). + +```rust +fn resolve_cron_timeout(job: &CronJob, agent: &ResolvedAgentConfig) -> u64 { + job.timeout_secs + .or(agent.cron_default_timeout_secs) + .unwrap_or(DEFAULT_CRON_TIMEOUT_SECS) +} +``` + +Small change, surfaces in agent config TOML, no migration needed. + +### Files Changed + +| File | Change | +|------|--------| +| `src/config.rs` | `ResolvedAgentConfig.cron_default_timeout_secs: Option` | +| `src/cron/scheduler.rs` | `resolve_cron_timeout` chain | + +--- + +## Phase plan + +Four changes are truly independent. **Phase 4 has a hard dependency on Phase 3** because both surface results through the new `WorkerOutcome` enum that doesn't exist today — Phase 3 builds the enum, Phase 4 just adds the `Blocked` variant + tool-error conversion. + +Suggested sequence based on impact for the agentic-backend deployment shape: + +### Phase 1 — Per-agent secret isolation + +The hardest gating problem. No multi-tenant deployment ships without it. Touches schema, sandbox env injection, dashboard. Ship first because it blocks all other deployment work. + +### Phase 2 — Dormant cortex mode + +The cost-of-scale problem. Required before any deployment exceeding ~100 agents on shared infrastructure. Includes the memory janitor as the maintenance counterpart. + +### Phase 3 — Wall-clock worker timeout + +Predictable per-task latency for the parent app's planning. Builds the `WorkerOutcome` structured-outcome enum that Phase 4 consumes — sequence Phase 3 before Phase 4. + +### Phase 4 — Browser captcha / blocked detection + +Quality-of-failure improvement. Lets agents fail cleanly and surface to humans rather than retry-loop opaquely. Depends on the `WorkerOutcome` enum from Phase 3. + +### Phase 5 — Per-agent cron defaults + +Smallest of the five. Quality-of-life for agent-class config. Ship anytime. + +--- + +## Non-goals + +- **No new agent topology primitives.** The communication graph (`multi-agent-communication-graph.md`) and link channels are unchanged. Agentic-backend deployments use the same `send_agent_message` / `LinkKind` / `LinkDirection` primitives as a single-user instance. +- **No new task primitives.** The `pending_approval` pattern, `assigned_agent_id` ownership, ready-task pickup, and task comments (`autonomy.md`) all stay as-is. Dormant cortex changes *when* tasks get picked up (on wake instead of on tick), not the task model itself. +- **No new sandbox backend.** bubblewrap (Linux) and sandbox-exec (macOS) remain the runtime sandboxes. Per-agent secret isolation composes with the env-sanitization work in `sandbox-hardening.md`. VM-level isolation (`stereos-integration.md`) is a separate, much later, opt-in deployment path. +- **No bypass capabilities.** Captcha / blocked detection is for failing cleanly, not for bypassing protections. The structured outcome is the surface; what the parent app does with it is its decision. +- **No identity-file changes.** `SOUL.md` / `IDENTITY.md` / `USER.md` / `ROLE.md` (per `multi-agent-communication-graph.md`) all stay as-is. Per-agent identity files are already first-class. + +## Cross-references + +- `secret-store.md` — the master-key model and tool/system-secret distinction this work scopes per agent. +- `sandbox-hardening.md` — env sanitization (`--clearenv`) composes with per-agent secret injection in `wrap()`. +- `autonomy.md` — task pickup model. Dormant cortex changes the *trigger* for ready-task pickup (wake vs tick) without changing the task model. +- `multi-agent-communication-graph.md` — the link / topology system. `send_agent_message` post-insert hook is the integration point for dormant wake. +- `cron-outcome-delivery.md` — cron outcome surfacing. `set_outcome` timeout / blocked variants surface here. +- `production-worker-failures.md` — adjacent operational concerns for worker reliability. diff --git a/src/agent.rs b/src/agent.rs index e6aa74db7..66a8730d6 100644 --- a/src/agent.rs +++ b/src/agent.rs @@ -12,9 +12,11 @@ pub mod cortex_chat; pub mod ingestion; #[cfg(test)] mod invariant_harness; +pub mod maintenance; pub mod process_control; pub mod prompt_snapshot; pub mod status; +pub mod wake; pub mod worker; pub(crate) fn panic_payload_to_string(panic_payload: &(dyn std::any::Any + Send)) -> String { diff --git a/src/agent/branch.rs b/src/agent/branch.rs index fb8c95973..2d106dfde 100644 --- a/src/agent/branch.rs +++ b/src/agent/branch.rs @@ -247,7 +247,7 @@ impl Branch { // Layer 1: exact-match redaction of known secrets from the store. // Layer 2: regex-based redaction of unknown secret patterns. let conclusion = if let Some(store) = self.deps.runtime_config.secrets.load().as_ref() { - crate::secrets::scrub::scrub_with_store(&conclusion, store) + crate::secrets::scrub::scrub_with_store(&conclusion, store, &self.deps.agent_id) } else { conclusion }; diff --git a/src/agent/channel.rs b/src/agent/channel.rs index 963fdeab8..b256901f0 100644 --- a/src/agent/channel.rs +++ b/src/agent/channel.rs @@ -870,13 +870,17 @@ impl Channel { let has_links = !crate::links::links_for_agent(&deps.links.load(), &deps.agent_id).is_empty(); if has_links { - Some(crate::tools::SendAgentMessageTool::new( + let mut tool = crate::tools::SendAgentMessageTool::new( deps.agent_id.clone(), deps.links.clone(), deps.agent_names.clone(), deps.task_store.clone(), ConversationLogger::new(deps.sqlite_pool.clone()), - )) + ); + if let Some(wake_tx) = deps.wake_tx.clone() { + tool = tool.with_wake_tx(wake_tx); + } + Some(tool) } else { None } diff --git a/src/agent/channel_dispatch.rs b/src/agent/channel_dispatch.rs index 30402c119..83295ca76 100644 --- a/src/agent/channel_dispatch.rs +++ b/src/agent/channel_dispatch.rs @@ -7,7 +7,7 @@ use crate::agent::branch::{Branch, BranchExecutionConfig}; use crate::agent::channel::ChannelState; use crate::agent::channel_prompt::TemporalContext; -use crate::agent::worker::Worker; +use crate::agent::worker::{Worker, WorkerOutcome}; use crate::conversation::settings::{WorkerContextMode, WorkerHistoryMode}; use crate::error::{AgentError, Error as SpacebotError}; use crate::tools::{BranchToolProfile, MemoryPersistenceContractState}; @@ -36,7 +36,10 @@ pub(crate) fn reserve_worker_slot_local( #[derive(Debug, Clone, Copy, PartialEq, Eq)] enum WorkerCompletionKind { Success, + Partial, Cancelled, + Timeout, + Blocked, Failed, } @@ -68,11 +71,44 @@ impl WorkerCompletionError { } } -fn classify_worker_completion_result( - result: std::result::Result, +fn classify_worker_completion( + outcome: std::result::Result, ) -> (String, WorkerCompletionKind) { - match result { - Ok(text) => (text, WorkerCompletionKind::Success), + match outcome { + Ok(WorkerOutcome::Success { result }) => (result, WorkerCompletionKind::Success), + Ok(WorkerOutcome::Partial { + result, + segments_run, + }) => ( + format!( + "{result}\n\n(reached max segments after {segments_run} attempts — partial result)" + ), + WorkerCompletionKind::Partial, + ), + Ok(WorkerOutcome::Cancelled { reason }) => ( + format!("Worker cancelled: {reason}"), + WorkerCompletionKind::Cancelled, + ), + Ok(WorkerOutcome::Timeout { + elapsed_secs, + segments_run, + }) => ( + format!( + "Worker exceeded {elapsed_secs}s wall-clock timeout after {segments_run} segments." + ), + WorkerCompletionKind::Timeout, + ), + Ok(WorkerOutcome::Blocked { reason, url, .. }) => { + let body = match url { + Some(url) => format!("Worker blocked: {} at {url}", reason.describe()), + None => format!("Worker blocked: {}", reason.describe()), + }; + (body, WorkerCompletionKind::Blocked) + } + Ok(WorkerOutcome::Failed { reason }) => ( + format!("Worker failed: {reason}"), + WorkerCompletionKind::Failed, + ), Err(WorkerCompletionError::Cancelled { reason }) => ( format!("Worker cancelled: {reason}"), WorkerCompletionKind::Cancelled, @@ -86,15 +122,18 @@ fn classify_worker_completion_result( fn completion_flags(kind: WorkerCompletionKind) -> (bool, bool) { let notify = true; - let success = matches!(kind, WorkerCompletionKind::Success); + let success = matches!( + kind, + WorkerCompletionKind::Success | WorkerCompletionKind::Partial + ); (notify, success) } -/// Normalize worker completion into event payload fields. -pub(crate) fn map_worker_completion_result( - result: std::result::Result, +/// Normalize a worker outcome (or terminal error) into event payload fields. +pub(crate) fn map_worker_completion( + outcome: std::result::Result, ) -> (String, bool, bool) { - let (result_text, kind) = classify_worker_completion_result(result); + let (result_text, kind) = classify_worker_completion(outcome); let (notify, success) = completion_flags(kind); (result_text, notify, success) } @@ -347,7 +386,7 @@ async fn spawn_branch( // Layer 2: regex-based redaction of unknown secret patterns. let raw = format!("Branch failed: {error}"); let conclusion = if let Some(store) = secrets_snapshot.as_ref() { - crate::secrets::scrub::scrub_with_store(&raw, store) + crate::secrets::scrub::scrub_with_store(&raw, store, &agent_id) } else { raw }; @@ -598,7 +637,7 @@ async fn spawn_worker_inner( // Collect tool secret names so the worker template can list available credentials. let secrets_guard = rc.secrets.load(); let tool_secret_names = match (*secrets_guard).as_ref() { - Some(store) => store.tool_secret_names(), + Some(store) => store.tool_secret_names(&state.deps.agent_id), None => Vec::new(), }; @@ -969,7 +1008,9 @@ async fn spawn_opencode_worker_inner( }); } - Ok::(result.result_text) + Ok::(WorkerOutcome::Success { + result: result.result_text, + }) } .instrument(worker_span), ); @@ -1031,7 +1072,7 @@ pub(crate) fn spawn_worker_task( future: F, ) -> tokio::task::JoinHandle<()> where - F: std::future::Future> + Send + 'static, + F: std::future::Future> + Send + 'static, { tokio::spawn(async move { #[cfg(feature = "metrics")] @@ -1043,36 +1084,25 @@ where .with_label_values(&[&*agent_id]) .inc(); - let outcome = std::panic::AssertUnwindSafe(future).catch_unwind().await; - let worker_result: std::result::Result = match outcome { - Ok(Ok(text)) => { - // Scrub tool secret values from the result before it reaches - // the channel. The channel never sees raw secret values. - // Layer 1: exact-match redaction of known secrets from the store. - // Layer 2: regex-based redaction of unknown secret patterns. - let scrubbed = if let Some(store) = &secrets_store { - crate::secrets::scrub::scrub_with_store(&text, store) - } else { - text - }; - let scrubbed = crate::secrets::scrub::scrub_leaks(&scrubbed); - Ok(scrubbed) - } - Ok(Err(error)) => { - let failure = WorkerCompletionError::from_spacebot_error(error); - match failure { - WorkerCompletionError::Cancelled { .. } => Err(failure), - WorkerCompletionError::Failed { message } => { - let scrubbed = if let Some(store) = &secrets_store { - crate::secrets::scrub::scrub_with_store(&message, store) - } else { - message - }; - let scrubbed = crate::secrets::scrub::scrub_leaks(&scrubbed); - Err(WorkerCompletionError::Failed { message: scrubbed }) - } + let raw = std::panic::AssertUnwindSafe(future).catch_unwind().await; + let scrub = |text: String| -> String { + let layer1 = if let Some(store) = &secrets_store { + crate::secrets::scrub::scrub_with_store(&text, store, &agent_id) + } else { + text + }; + crate::secrets::scrub::scrub_leaks(&layer1) + }; + let worker_result: std::result::Result = match raw { + Ok(Ok(outcome)) => Ok(scrub_outcome(outcome, &scrub)), + Ok(Err(error)) => match WorkerCompletionError::from_spacebot_error(error) { + WorkerCompletionError::Cancelled { reason } => { + Err(WorkerCompletionError::Cancelled { reason }) } - } + WorkerCompletionError::Failed { message } => Err(WorkerCompletionError::Failed { + message: scrub(message), + }), + }, Err(panic_payload) => { let panic_message = crate::agent::panic_payload_to_string(&*panic_payload); tracing::error!( @@ -1085,12 +1115,18 @@ where ))) } }; - let (result_text, kind) = classify_worker_completion_result(worker_result); + let (result_text, kind) = classify_worker_completion(worker_result); match kind { - WorkerCompletionKind::Success => {} + WorkerCompletionKind::Success | WorkerCompletionKind::Partial => {} WorkerCompletionKind::Cancelled => { tracing::info!(worker_id = %worker_id, result = %result_text, "worker cancelled"); } + WorkerCompletionKind::Timeout => { + tracing::warn!(worker_id = %worker_id, result = %result_text, "worker timed out"); + } + WorkerCompletionKind::Blocked => { + tracing::warn!(worker_id = %worker_id, result = %result_text, "worker blocked"); + } WorkerCompletionKind::Failed => { tracing::error!(worker_id = %worker_id, result = %result_text, "worker failed"); } @@ -1120,6 +1156,58 @@ where }) } +/// Apply scrubbing to any text content carried by a `WorkerOutcome`. +fn scrub_outcome(outcome: WorkerOutcome, scrub: &F) -> WorkerOutcome +where + F: Fn(String) -> String, +{ + match outcome { + WorkerOutcome::Success { result } => WorkerOutcome::Success { + result: scrub(result), + }, + WorkerOutcome::Partial { + result, + segments_run, + } => WorkerOutcome::Partial { + result: scrub(result), + segments_run, + }, + WorkerOutcome::Cancelled { reason } => WorkerOutcome::Cancelled { reason }, + WorkerOutcome::Timeout { + elapsed_secs, + segments_run, + } => WorkerOutcome::Timeout { + elapsed_secs, + segments_run, + }, + WorkerOutcome::Blocked { + reason, + url, + mut evidence, + } => { + // Scrub URL too — query params and path segments routinely + // carry bearer tokens / session ids. The blocked URL flows + // into WorkerComplete.result and channel logs, so an + // un-scrubbed URL is a credential-leak path. + let url = url.map(scrub); + if let Some(snippet) = evidence.html_snippet.take() { + evidence.html_snippet = Some(scrub(snippet)); + } + if let Some(final_url) = evidence.final_url.take() { + evidence.final_url = Some(scrub(final_url)); + } + WorkerOutcome::Blocked { + reason, + url, + evidence, + } + } + WorkerOutcome::Failed { reason } => WorkerOutcome::Failed { + reason: scrub(reason), + }, + } +} + /// Resume an idle interactive worker into a channel's state after restart. /// /// Loads the prior transcript, creates a resumed worker (builtin or opencode), @@ -1225,7 +1313,9 @@ pub async fn resume_idle_worker_into_state( } }); } - Ok::(result.result_text) + Ok::(WorkerOutcome::Success { + result: result.result_text, + }) } .instrument(worker_span), ); @@ -1277,7 +1367,7 @@ pub async fn resume_idle_worker_into_state( let sandbox_write_allowlist = state.deps.sandbox.prompt_write_allowlist(); let secrets_guard = rc.secrets.load(); let tool_secret_names = match (*secrets_guard).as_ref() { - Some(store) => store.tool_secret_names(), + Some(store) => store.tool_secret_names(&state.deps.agent_id), None => Vec::new(), }; let browser_config = (**rc.browser_config.load()).clone(); @@ -1396,7 +1486,7 @@ fn expand_tilde(path: &str) -> std::path::PathBuf { #[cfg(test)] mod tests { - use super::{WorkerCompletionError, map_worker_completion_result, spawn_worker_task}; + use super::{WorkerCompletionError, WorkerOutcome, map_worker_completion, spawn_worker_task}; use crate::{ProcessEvent, WorkerId}; use std::sync::Arc; use std::time::Duration; @@ -1406,7 +1496,7 @@ mod tests { #[test] fn cancelled_errors_are_classified_as_cancelled_results() { let (text, notify, success) = - map_worker_completion_result(Err(WorkerCompletionError::Cancelled { + map_worker_completion(Err(WorkerCompletionError::Cancelled { reason: "user requested".to_string(), })); assert_eq!(text, "Worker cancelled: user requested"); @@ -1414,6 +1504,48 @@ mod tests { assert!(!success); } + #[test] + fn timeout_outcome_is_classified_as_unsuccessful() { + let (text, notify, success) = map_worker_completion(Ok(WorkerOutcome::Timeout { + elapsed_secs: 1800, + segments_run: 7, + })); + assert_eq!( + text, + "Worker exceeded 1800s wall-clock timeout after 7 segments." + ); + assert!(notify); + assert!(!success); + } + + #[test] + fn blocked_outcome_is_classified_as_unsuccessful() { + use crate::agent::worker::{BlockEvidence, BlockReason}; + let (text, notify, success) = map_worker_completion(Ok(WorkerOutcome::Blocked { + reason: BlockReason::Captcha { + provider: "cloudflare-turnstile".to_string(), + }, + url: Some("https://example.com/signup".to_string()), + evidence: Box::new(BlockEvidence::default()), + })); + assert!(text.contains("captcha")); + assert!(text.contains("https://example.com/signup")); + assert!(notify); + assert!(!success); + } + + #[test] + fn partial_outcome_is_classified_as_successful() { + let (text, notify, success) = map_worker_completion(Ok(WorkerOutcome::Partial { + result: "partial body".to_string(), + segments_run: 10, + })); + assert!(text.contains("partial body")); + assert!(text.contains("max segments")); + assert!(notify); + assert!(success); + } + #[tokio::test] async fn spawn_worker_task_emits_cancelled_completion_event() { let (event_tx, mut event_rx) = broadcast::channel(8); @@ -1427,7 +1559,7 @@ mod tests { None, "builtin", async { - Err::( + Err::( crate::error::AgentError::Cancelled { reason: "user requested".to_string(), } @@ -1472,7 +1604,11 @@ mod tests { Some(channel_id.clone()), None, "builtin", - async { Ok::("result".to_string()) }, + async { + Ok::(WorkerOutcome::Success { + result: "result".to_string(), + }) + }, ); let event = tokio::time::timeout(Duration::from_secs(2), event_rx.recv()) diff --git a/src/agent/cortex.rs b/src/agent/cortex.rs index f95007bb3..c560a759a 100644 --- a/src/agent/cortex.rs +++ b/src/agent/cortex.rs @@ -9,18 +9,19 @@ //! The cortex also observes system-wide activity via signals for future use in //! health monitoring and memory consolidation. -use crate::agent::channel_dispatch::{WorkerCompletionError, map_worker_completion_result}; +use crate::agent::channel_dispatch::{WorkerCompletionError, map_worker_completion}; use crate::agent::process_control::{ ControlActionResult, DetachedWorkerControl, ProcessControlRegistry, }; use crate::agent::worker::Worker; +use crate::agent::worker::WorkerOutcome; use crate::error::Result; use crate::hooks::CortexHook; use crate::llm::SpacebotModel; use crate::memory::maintenance as memory_maintenance; use crate::memory::search::{SearchConfig, SearchMode, SearchSort}; use crate::memory::types::{Association, MemoryType, RelationType}; -use crate::tasks::{TaskStatus, UpdateTaskInput}; +use crate::tasks::{TaskStatus, TaskStore, UpdateTaskInput}; use crate::{ AgentDeps, AgentId, BranchId, ChannelId, ProcessEvent, ProcessId, ProcessType, WorkerId, }; @@ -3690,6 +3691,191 @@ pub fn spawn_association_loop( }) } +/// How to route a completed detached worker against the task store. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +enum DetachedRouting { + /// Worker produced a usable result — mark `Done` and notify success. + Success, + /// Terminal failure that won't fix itself (cancelled, wall-clock + /// timeout, captcha / login wall). Mark `Done` so the loop doesn't + /// pick the task up again; emit a failure event so callers can react + /// (escalate to a human, raise a follow-up task, etc.). + Terminal, + /// Transient failure (LLM error, panic, generic Failed). Requeue to + /// `Ready` so the next pickup pass can retry. + Requeue, +} + +#[allow(clippy::too_many_arguments)] +async fn handle_detached_completion( + routing: DetachedRouting, + task: &crate::tasks::Task, + worker_id: WorkerId, + result_text: &str, + task_store: &Arc, + run_logger: &crate::conversation::history::ProcessRunLogger, + logger: &CortexLogger, + event_tx: &tokio::sync::broadcast::Sender, + agent_id: &str, + links: &Arc>>, + agent_names: &Arc>, + sqlite_pool: &sqlx::SqlitePool, + injection_tx: &tokio::sync::mpsc::Sender, +) { + let success = matches!(routing, DetachedRouting::Success); + let new_status = match routing { + DetachedRouting::Success | DetachedRouting::Terminal => TaskStatus::Done, + DetachedRouting::Requeue => TaskStatus::Ready, + }; + let update_input = match routing { + DetachedRouting::Success | DetachedRouting::Terminal => UpdateTaskInput { + status: Some(new_status), + ..Default::default() + }, + DetachedRouting::Requeue => UpdateTaskInput { + status: Some(new_status), + clear_worker_id: true, + ..Default::default() + }, + }; + + let update_result = task_store.update(task.task_number, update_input).await; + let persisted = match update_result { + Ok(Some(_)) => true, + Ok(None) => { + tracing::warn!( + task_number = task.task_number, + routing = ?routing, + "task store returned no row when updating completed task — task may have been deleted" + ); + logger.log( + "task_pickup_no_row", + &format!( + "Picked-up task #{} completed ({routing:?}) but the task row was missing on update", + task.task_number + ), + Some(serde_json::json!({ + "task_number": task.task_number, + "worker_id": worker_id.to_string(), + "routing": format!("{routing:?}"), + })), + ); + run_logger.log_worker_completed(worker_id, result_text, false); + let _ = event_tx.send(ProcessEvent::WorkerComplete { + agent_id: Arc::from(agent_id), + worker_id, + channel_id: None, + result: result_text.to_string(), + notify: true, + success: false, + }); + return; + } + Err(error) => { + tracing::warn!( + %error, + task_number = task.task_number, + routing = ?routing, + "failed to persist task status after worker completion" + ); + run_logger.log_worker_completed(worker_id, result_text, false); + logger.log( + "task_pickup_persist_failure", + &format!( + "Picked-up task #{} ({routing:?}) could not persist status: {error}", + task.task_number + ), + Some(serde_json::json!({ + "task_number": task.task_number, + "worker_id": worker_id.to_string(), + "routing": format!("{routing:?}"), + })), + ); + let _ = event_tx.send(ProcessEvent::WorkerComplete { + agent_id: Arc::from(agent_id), + worker_id, + channel_id: None, + result: result_text.to_string(), + notify: true, + success: false, + }); + return; + } + }; + + if !persisted { + return; + } + + run_logger.log_worker_completed(worker_id, result_text, success); + let _ = event_tx.send(ProcessEvent::TaskUpdated { + agent_id: Arc::from(agent_id), + task_number: task.task_number, + status: new_status.as_str().to_string(), + action: "updated".to_string(), + }); + + let log_event = match routing { + DetachedRouting::Success => "task_pickup_completed", + DetachedRouting::Terminal => "task_pickup_terminal_failure", + DetachedRouting::Requeue => "task_pickup_failed", + }; + let log_message = match routing { + DetachedRouting::Success => format!("Completed picked-up task #{}", task.task_number), + DetachedRouting::Terminal => format!( + "Terminal failure on picked-up task #{}: {result_text}", + task.task_number + ), + DetachedRouting::Requeue => { + format!("Picked-up task #{} failed: {result_text}", task.task_number) + } + }; + logger.log( + log_event, + &log_message, + Some(serde_json::json!({ + "task_number": task.task_number, + "worker_id": worker_id.to_string(), + "routing": format!("{routing:?}"), + })), + ); + + notify_delegation_completion( + task, + result_text, + success, + agent_id, + links, + agent_names, + sqlite_pool, + injection_tx, + ) + .await; + + let _ = event_tx.send(ProcessEvent::WorkerComplete { + agent_id: Arc::from(agent_id), + worker_id, + channel_id: None, + result: result_text.to_string(), + notify: true, + success, + }); +} + +/// One-shot wake for dormant agents. +/// +/// Triggered by `agent::wake::WakeManager` when an external event delivers +/// a wake (cross-agent message, cron fire, admin API). Picks up at most +/// one ready task synchronously, then returns. Idempotent and safe to +/// call on active-mode agents — the spawn_ready_task_loop will simply +/// race with us, which is fine because `task_store.claim_next_ready` is +/// transactional. +pub async fn wake_one(deps: &AgentDeps) -> anyhow::Result<()> { + tracing::debug!(agent_id = %deps.agent_id, "cortex wake fired"); + let logger = CortexLogger::new(deps.sqlite_pool.clone()); + pickup_one_ready_task(deps, &logger).await +} + /// Spawn a background loop that picks up ready tasks when idle. pub fn spawn_ready_task_loop(deps: AgentDeps, logger: CortexLogger) -> tokio::task::JoinHandle<()> { tokio::spawn(async move { @@ -3738,7 +3924,7 @@ async fn pickup_one_ready_task(deps: &AgentDeps, logger: &CortexLogger) -> anyho // Collect tool secret names so the worker template can list available credentials. let secrets_guard = deps.runtime_config.secrets.load(); let tool_secret_names = match (*secrets_guard).as_ref() { - Some(store) => store.tool_secret_names(), + Some(store) => store.tool_secret_names(&deps.agent_id), None => Vec::new(), }; @@ -3891,6 +4077,7 @@ async fn pickup_one_ready_task(deps: &AgentDeps, logger: &CortexLogger) -> anyho ); let task_store = deps.task_store.clone(); + let agent_id_typed = deps.agent_id.clone(); let agent_id = deps.agent_id.to_string(); let event_tx = deps.event_tx.clone(); let logger = logger.clone(); @@ -3906,7 +4093,7 @@ async fn pickup_one_ready_task(deps: &AgentDeps, logger: &CortexLogger) -> anyho // before persisting, logging, or emitting events. let scrub = |text: String| -> String { let scrubbed = if let Some(store) = secrets_snapshot.as_ref() { - crate::secrets::scrub::scrub_with_store(&text, store) + crate::secrets::scrub::scrub_with_store(&text, store, &agent_id_typed) } else { text }; @@ -3928,231 +4115,66 @@ async fn pickup_one_ready_task(deps: &AgentDeps, logger: &CortexLogger) -> anyho if let Some(worker_result) = worker_result { let completion_won = claim_detached_completion(&detached_worker_lifecycle); if completion_won { - match worker_result { - Ok(Ok(raw_result_text)) => { - let result_text = scrub(raw_result_text); - let db_updated = task_store - .update( - task.task_number, - UpdateTaskInput { - status: Some(TaskStatus::Done), - ..Default::default() - }, - ) - .await; - - if let Err(ref error) = db_updated { - tracing::warn!( - %error, - task_number = task.task_number, - "failed to mark picked-up task done" - ); - run_logger.log_worker_completed(worker_id, &result_text, false); - logger.log( - "task_pickup_completed_persist_failure", - &format!( - "Picked-up task #{} completed but could not persist done state: {error}", - task.task_number - ), - Some(serde_json::json!({ - "task_number": task.task_number, - "worker_id": worker_id.to_string(), - })), - ); - let _ = event_tx.send(ProcessEvent::WorkerComplete { - agent_id: Arc::from(agent_id.as_str()), - worker_id, - channel_id: None, - result: result_text, - notify: true, - success: false, - }); - } else { - run_logger.log_worker_completed(worker_id, &result_text, true); - let _ = event_tx.send(ProcessEvent::TaskUpdated { - agent_id: Arc::from(agent_id.as_str()), - task_number: task.task_number, - status: "done".to_string(), - action: "updated".to_string(), - }); - - logger.log( - "task_pickup_completed", - &format!("Completed picked-up task #{}", task.task_number), - Some(serde_json::json!({ - "task_number": task.task_number, - "worker_id": worker_id.to_string(), - })), - ); - - notify_delegation_completion( - &task, - &result_text, - true, - &agent_id, - &links, - &agent_names, - &sqlite_pool, - &injection_tx, - ) - .await; - - let _ = event_tx.send(ProcessEvent::WorkerComplete { - agent_id: Arc::from(agent_id.as_str()), - worker_id, - channel_id: None, - result: result_text, - notify: true, - success: true, - }); - } - } + // Convert the raw worker_result into a Result + // so the routing decision below can inspect the variant kind + // — Timeout / Blocked / Cancelled need terminal handling so + // they don't hot-loop on requeue to Ready. + let outcome_or_error: std::result::Result< + WorkerOutcome, + WorkerCompletionError, + > = match worker_result { + Ok(Ok(outcome)) => Ok(outcome), Ok(Err(error)) => { - let scrubbed_error = scrub(error.to_string()); - let (error_message, _notify, _success) = map_worker_completion_result( - Err(WorkerCompletionError::failed(scrubbed_error.clone())), - ); - let worker_complete_message = format!("Worker failed: {error}"); - run_logger.log_worker_completed(worker_id, &error_message, false); - let requeue_result = task_store - .update( - task.task_number, - UpdateTaskInput { - status: Some(TaskStatus::Ready), - clear_worker_id: true, - ..Default::default() - }, - ) - .await; - - if let Err(ref update_error) = requeue_result { - tracing::warn!( - %update_error, - task_number = task.task_number, - "failed to return task to ready after failure" - ); - logger.log( - "task_pickup_failed_to_persist", - &format!( - "Picked-up task #{} failed but could not persist failure state: {error}", - task.task_number - ), - Some(serde_json::json!({ - "task_number": task.task_number, - "worker_id": worker_id.to_string(), - "error": error.to_string(), - })), - ); - } else { - let _ = event_tx.send(ProcessEvent::TaskUpdated { - agent_id: Arc::from(agent_id.as_str()), - task_number: task.task_number, - status: "ready".to_string(), - action: "updated".to_string(), - }); - - logger.log( - "task_pickup_failed", - &format!( - "Picked-up task #{} failed: {error}", - task.task_number - ), - Some(serde_json::json!({ - "task_number": task.task_number, - "worker_id": worker_id.to_string(), - "error": error.to_string(), - })), - ); - - notify_delegation_completion( - &task, - &error_message, - false, - &agent_id, - &links, - &agent_names, - &sqlite_pool, - &injection_tx, - ) - .await; - } - - let _ = event_tx.send(ProcessEvent::WorkerComplete { - agent_id: Arc::from(agent_id.as_str()), - worker_id, - channel_id: None, - result: worker_complete_message, - notify: true, - success: false, - }); + Err(WorkerCompletionError::failed(scrub(error.to_string()))) } Err(panic_payload) => { - let scrubbed_panic = - scrub(crate::agent::panic_payload_to_string(&*panic_payload)); - let (error_message, _notify, _success) = - map_worker_completion_result(Err(WorkerCompletionError::failed( - format!("worker task panicked: {scrubbed_panic}"), - ))); - run_logger.log_worker_completed(worker_id, &error_message, false); - let requeue_result = task_store - .update( - task.task_number, - UpdateTaskInput { - status: Some(TaskStatus::Ready), - clear_worker_id: true, - ..Default::default() - }, - ) - .await; - - if let Err(ref update_error) = requeue_result { - tracing::warn!( - %update_error, - task_number = task.task_number, - "failed to return task to ready after panic" - ); - logger.log( - "task_pickup_panic_persist_failure", - &format!( - "Picked-up task #{} panicked and could not persist failure state: {error_message}", - task.task_number - ), - Some(serde_json::json!({ - "task_number": task.task_number, - "worker_id": worker_id.to_string(), - })), - ); - } else { - let _ = event_tx.send(ProcessEvent::TaskUpdated { - agent_id: Arc::from(agent_id.as_str()), - task_number: task.task_number, - status: "ready".to_string(), - action: "updated".to_string(), - }); - - notify_delegation_completion( - &task, - &error_message, - false, - &agent_id, - &links, - &agent_names, - &sqlite_pool, - &injection_tx, - ) - .await; - } - - let _ = event_tx.send(ProcessEvent::WorkerComplete { - agent_id: Arc::from(agent_id.as_str()), - worker_id, - channel_id: None, - result: error_message, - notify: true, - success: false, - }); + let panic_message = + crate::agent::panic_payload_to_string(&*panic_payload); + tracing::error!( + %worker_id, + %panic_message, + "cortex worker task panicked" + ); + Err(WorkerCompletionError::failed(format!( + "worker task panicked: {}", + scrub(panic_message) + ))) } - } + }; + + // Routing: Success/Partial → mark Done. + // Cancelled/Timeout/Blocked → terminal (mark Done, do not + // retry — these need human / explicit re-creation, not a + // pickup loop racing the same dead end again). + // Failed/error/panic → requeue Ready (true transient errors + // that might succeed on a future tick). + let routing = match &outcome_or_error { + Ok(WorkerOutcome::Success { .. }) | Ok(WorkerOutcome::Partial { .. }) => { + DetachedRouting::Success + } + Ok(WorkerOutcome::Cancelled { .. }) + | Ok(WorkerOutcome::Timeout { .. }) + | Ok(WorkerOutcome::Blocked { .. }) => DetachedRouting::Terminal, + Ok(WorkerOutcome::Failed { .. }) | Err(_) => DetachedRouting::Requeue, + }; + let (result_text, _notify, _success) = map_worker_completion(outcome_or_error); + let result_text = scrub(result_text); + handle_detached_completion( + routing, + &task, + worker_id, + &result_text, + &task_store, + &run_logger, + &logger, + &event_tx, + &agent_id, + &links, + &agent_names, + &sqlite_pool, + &injection_tx, + ) + .await; let _ = detached_worker_lifecycle.compare_exchange( crate::agent::process_control::DETACHED_WORKER_LIFECYCLE_COMPLETING, diff --git a/src/agent/invariant_harness.rs b/src/agent/invariant_harness.rs index 7e335c102..938405dc4 100644 --- a/src/agent/invariant_harness.rs +++ b/src/agent/invariant_harness.rs @@ -1,8 +1,9 @@ //! Invariant-driven fault harness for agent loop regression tests. use super::channel_dispatch::{ - WorkerCompletionError, map_worker_completion_result, reserve_worker_slot_local, + WorkerCompletionError, map_worker_completion, reserve_worker_slot_local, }; +use super::worker::WorkerOutcome; use rand::rngs::StdRng; use rand::{Rng, SeedableRng}; @@ -97,16 +98,15 @@ impl HarnessState { } } HarnessFault::WorkerFailure => { - let (_text, _notify, success) = map_worker_completion_result(Err( - WorkerCompletionError::failed("worker failed"), - )); + let (_text, _notify, success) = + map_worker_completion(Err(WorkerCompletionError::failed("worker failed"))); if success { self.worker_false_success = true; } } HarnessFault::WorkerCancelled => { let (text, _notify, success) = - map_worker_completion_result(Err(WorkerCompletionError::Cancelled { + map_worker_completion(Err(WorkerCompletionError::Cancelled { reason: "user requested".to_string(), })); if success || !text.starts_with("Worker cancelled:") { @@ -114,7 +114,9 @@ impl HarnessState { } } HarnessFault::WorkerSuccess => { - let (_text, _notify, success) = map_worker_completion_result(Ok("ok".to_string())); + let (_text, _notify, success) = map_worker_completion(Ok(WorkerOutcome::Success { + result: "ok".to_string(), + })); if !success { self.worker_false_success = true; } diff --git a/src/agent/maintenance.rs b/src/agent/maintenance.rs new file mode 100644 index 000000000..871533bd0 --- /dev/null +++ b/src/agent/maintenance.rs @@ -0,0 +1,107 @@ +//! Instance-wide memory maintenance for dormant-mode agents. +//! +//! When `CortexConfig.mode == Dormant`, the cortex's tick loop never spawns, +//! so the periodic memory consolidation / decay / pruning that normally runs +//! inside `spawn_cortex_loop` never fires. The janitor is the alternative +//! path: a single instance-wide cron task that walks every registered agent +//! on a slow schedule and runs the same `memory::maintenance` machinery. +//! +//! Active-mode agents are also walked when the janitor is enabled — the +//! maintenance functions are idempotent, so the additional pass costs a +//! small amount of redundant work but cannot corrupt state. +//! +//! Disabled by default (`MemoryJanitorConfig::enabled = false`); operators +//! opt in once they're running enough dormant agents that periodic +//! maintenance via tick is no longer happening. + +use crate::AgentDeps; +use crate::AgentId; +use std::collections::HashMap; +use std::sync::Arc; +use std::time::Duration; + +/// Per-agent maintenance budget. Generous because LanceDB index rebuilds +/// can take a while for large memory stores, but bounded so a single hung +/// embedding call can't stall the whole janitor. +const PER_AGENT_TIMEOUT_SECS: u64 = 600; + +/// Spawn the janitor task. Returns the join handle so the caller can keep +/// it alive for the process lifetime. +/// +/// `interval` is the gap between full sweeps. Default in `MemoryJanitorConfig` +/// is `86_400` (daily). The first sweep fires after `interval` to give the +/// instance time to settle on startup. +pub fn spawn_memory_janitor( + registry: Arc>>, + interval_secs: u64, +) -> tokio::task::JoinHandle<()> { + tokio::spawn(async move { + let interval = Duration::from_secs(interval_secs.max(60)); + tracing::info!(interval_secs = interval.as_secs(), "memory janitor started"); + loop { + tokio::time::sleep(interval).await; + let agents: Vec = { + let guard = registry.read().await; + guard.values().cloned().collect() + }; + tracing::info!( + agent_count = agents.len(), + "memory janitor running maintenance pass" + ); + for deps in agents { + let agent_id = deps.agent_id.clone(); + let per_agent = tokio::time::timeout( + Duration::from_secs(PER_AGENT_TIMEOUT_SECS), + run_maintenance_for_agent(&deps), + ) + .await; + match per_agent { + Ok(Ok(())) => {} + Ok(Err(error)) => { + tracing::warn!( + %agent_id, + %error, + "janitor maintenance pass failed for agent" + ); + } + Err(_) => { + tracing::warn!( + %agent_id, + timeout_secs = PER_AGENT_TIMEOUT_SECS, + "janitor maintenance for agent exceeded timeout — skipping to next agent" + ); + } + } + } + } + }) +} + +async fn run_maintenance_for_agent(deps: &AgentDeps) -> anyhow::Result<()> { + let cortex = deps.runtime_config.cortex.load(); + let config = crate::memory::maintenance::MaintenanceConfig { + prune_threshold: cortex.maintenance_prune_threshold, + decay_rate: cortex.maintenance_decay_rate, + min_age_days: cortex.maintenance_min_age_days, + merge_similarity_threshold: cortex.maintenance_merge_similarity_threshold, + }; + let memory_search = &deps.memory_search; + let report = crate::memory::maintenance::run_maintenance( + memory_search.store(), + memory_search.embedding_table(), + memory_search.embedding_model_arc(), + &config, + ) + .await + .map_err(|error| anyhow::anyhow!("memory maintenance failed: {error}"))?; + + // Prunes and merges change memory content; decay is importance-only and + // does not dirty knowledge synthesis. Mirrors the cortex-loop path in + // `cortex.rs` so dormant agents don't get a stale memory bulletin / + // knowledge synthesis after a janitor pass that pruned or merged. + if report.pruned > 0 || report.merged > 0 { + deps.runtime_config.bump_knowledge_synthesis_version(); + } + + Ok(()) +} diff --git a/src/agent/wake.rs b/src/agent/wake.rs new file mode 100644 index 000000000..d92185d63 --- /dev/null +++ b/src/agent/wake.rs @@ -0,0 +1,74 @@ +//! Wake infrastructure for dormant-mode agents. +//! +//! When `CortexConfig.mode == Dormant`, the cortex's periodic loops don't +//! spawn — the agent stays idle until an external event delivers a wake. +//! Wake triggers send the receiving agent's ID over an unbounded mpsc; +//! `WakeManager` consumes from that channel and dispatches to the agent's +//! `cortex::wake()` entry point. +//! +//! Active-mode agents are also valid wake targets — wake is a no-op (idempotent) +//! when the cortex's normal loops are already covering pickup. Triggers fire +//! the same wake call regardless of mode; the receiving agent decides whether +//! the wake is meaningful. +//! +//! This module owns no business logic — it's purely the dispatch substrate. +//! The "do work" happens in `cortex::wake_one`, which `WakeManager` calls. + +use crate::AgentDeps; +use crate::AgentId; +use std::collections::HashMap; +use std::sync::Arc; +use tokio::sync::mpsc; + +/// Sender side of the wake channel. Cloned into every `AgentDeps` and into +/// any tool / API endpoint that needs to deliver a wake. +pub type WakeSender = mpsc::UnboundedSender; + +/// Spawn the wake-dispatch loop. Returns a `WakeSender` that can be cloned +/// into `AgentDeps` and other contexts. +/// +/// `registry` is a shared map of agent ID → deps. The dispatcher reads it +/// each time a wake arrives, so registering / unregistering agents at +/// runtime is safe (cortex chat session pushes / removes can happen +/// without restarting the dispatcher). +pub fn spawn_wake_manager( + registry: Arc>>, +) -> WakeSender { + let (tx, mut rx) = mpsc::unbounded_channel::(); + tokio::spawn(async move { + tracing::info!("wake manager started"); + while let Some(agent_id) = rx.recv().await { + let deps = { + let guard = registry.read().await; + guard.get(&agent_id).cloned() + }; + let Some(deps) = deps else { + tracing::debug!(%agent_id, "wake delivered for unknown agent — dropping"); + continue; + }; + tokio::spawn(async move { + if let Err(error) = crate::agent::cortex::wake_one(&deps).await { + tracing::warn!( + agent_id = %deps.agent_id, + %error, + "cortex wake failed", + ); + } + }); + } + tracing::info!("wake manager exiting (channel closed)"); + }); + tx +} + +/// Best-effort wake delivery. Logs and drops when the receiver is gone +/// (the wake manager task panicked or was never spawned). +pub fn fire_wake(sender: &WakeSender, agent_id: &AgentId) { + if let Err(error) = sender.send(agent_id.clone()) { + tracing::warn!( + agent_id = %agent_id, + %error, + "wake send failed — wake manager not running" + ); + } +} diff --git a/src/agent/worker.rs b/src/agent/worker.rs index 7ec35e10e..3f0a80eaf 100644 --- a/src/agent/worker.rs +++ b/src/agent/worker.rs @@ -47,6 +47,168 @@ const TRANSIENT_RETRY_BASE_DELAY: std::time::Duration = std::time::Duration::fro /// without completing the task. const MAX_SEGMENTS: usize = 10; +/// Default wall-clock budget for a worker run when no agent override is set. +pub const DEFAULT_WORKER_WALL_CLOCK_TIMEOUT_SECS: u64 = 1800; + +/// Reason a worker run was blocked by an external defense (captcha, login +/// wall, rate-limit, fraud-detect WAF). Detection only — the agent fails +/// cleanly and surfaces to the parent app's escalation path. +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum BlockReason { + /// Captcha challenge (recaptcha, hcaptcha, Cloudflare Turnstile, etc.). + Captcha { provider: String }, + /// Page redirected to a login URL or returned 401/403 with auth-prompting + /// headers. + LoginWall, + /// HTTP 429 or equivalent rate-limit signal. `retry_after_secs` is set + /// when the response carried a `Retry-After` header. + RateLimit { retry_after_secs: Option }, + /// WAF-style fraud / bot-protect block (Akamai, Cloudflare challenge, + /// AWS WAF, Imperva). `vendor` identifies the WAF where known. + FraudDetect { vendor: String }, + /// Detector fired but couldn't classify into a specific category. + Unknown, +} + +impl BlockReason { + pub fn kind_tag(&self) -> &'static str { + match self { + Self::Captcha { .. } => "captcha", + Self::LoginWall => "login_wall", + Self::RateLimit { .. } => "rate_limit", + Self::FraudDetect { .. } => "fraud_detect", + Self::Unknown => "unknown", + } + } + + pub fn describe(&self) -> String { + match self { + Self::Captcha { provider } => format!("captcha ({provider})"), + Self::LoginWall => "login wall".to_string(), + Self::RateLimit { retry_after_secs } => match retry_after_secs { + Some(secs) => format!("rate limit (retry after {secs}s)"), + None => "rate limit".to_string(), + }, + Self::FraudDetect { vendor } => format!("fraud detect ({vendor})"), + Self::Unknown => "unknown block".to_string(), + } + } +} + +/// Captured artifacts at the moment a block was detected. Surfaced via +/// `WorkerOutcome::Blocked` so the parent app's escalation path has enough +/// to triage (and the agent can capture the dead-end into memory). +#[derive(Debug, Clone, Default)] +pub struct BlockEvidence { + /// Final URL the browser landed on at detection time. + pub final_url: Option, + /// Snippet of page HTML (first ~4KB) for human triage. + pub html_snippet: Option, + /// HTTP status code where available (server-side detected). + pub status: Option, +} + +/// Shared mutable slot that browser tools write to on positive block +/// detection. The worker reads it at each loop iteration and short-circuits +/// to `WorkerOutcome::Blocked` when set. Mirrors the role of +/// `SpacebotHook::outcome_signaled` for terminal completion intent. +pub type BlockSignal = std::sync::Arc>>; + +#[derive(Debug, Clone)] +pub struct BlockSignalData { + pub reason: BlockReason, + pub url: Option, + pub evidence: BlockEvidence, +} + +/// Construct a fresh empty block signal slot. +pub fn new_block_signal() -> BlockSignal { + std::sync::Arc::new(std::sync::Mutex::new(None)) +} + +/// Structured outcome of a worker's `run()` call. +/// +/// Replaces the prior `Result` return so callers can distinguish a +/// clean success from a partial (max-segments) exit, a wall-clock timeout, +/// a tool-driven block (captcha / login wall), a cancellation, or a hard +/// failure. All variants carry enough payload that the caller can emit a +/// `ProcessEvent::WorkerComplete` without losing context. +#[derive(Debug, Clone)] +pub enum WorkerOutcome { + /// Worker completed and produced a final assistant response. + Success { result: String }, + /// Worker hit the segment ceiling (`MAX_SEGMENTS`) before producing a + /// terminal response. `result` is the last assistant text or a synthetic + /// "max segments reached" marker. + Partial { result: String, segments_run: usize }, + /// Worker was cancelled (LLM `PromptCancelled`, manual cancel, etc.). + Cancelled { reason: String }, + /// Worker exceeded the wall-clock timeout. `segments_run` reflects how + /// many segments completed before the timeout fired. + Timeout { + elapsed_secs: u64, + segments_run: usize, + }, + /// Browser (or other tool) detected an external defense — captcha, + /// login wall, rate limit, WAF block. The worker exits cleanly so the + /// agent can capture the dead-end into memory and the parent app can + /// escalate to a human. + Blocked { + reason: BlockReason, + url: Option, + evidence: Box, + }, + /// Worker failed for any other reason (LLM error, transient retries + /// exhausted, context overflow exhausted, empty result without outcome). + Failed { reason: String }, +} + +impl WorkerOutcome { + /// Render the outcome as text suitable for relay to a channel. + pub fn into_text(self) -> String { + match self { + Self::Success { result } => result, + Self::Partial { + result, + segments_run, + } => format!( + "{result}\n\n(reached max segments after {segments_run} attempts — partial result)" + ), + Self::Cancelled { reason } => format!("Worker cancelled: {reason}"), + Self::Timeout { + elapsed_secs, + segments_run, + } => format!( + "Worker exceeded {elapsed_secs}s wall-clock timeout after {segments_run} segments." + ), + Self::Blocked { reason, url, .. } => match url { + Some(url) => format!("Worker blocked: {} at {url}", reason.describe()), + None => format!("Worker blocked: {}", reason.describe()), + }, + Self::Failed { reason } => format!("Worker failed: {reason}"), + } + } + + /// Whether the outcome should be classified as a successful completion + /// for downstream notification / task-state purposes. `Partial` counts as + /// success — the worker did real work, just hit a ceiling. + pub fn is_success(&self) -> bool { + matches!(self, Self::Success { .. } | Self::Partial { .. }) + } + + /// Short stable kind tag for telemetry / event payloads. + pub fn kind_tag(&self) -> &'static str { + match self { + Self::Success { .. } => "success", + Self::Partial { .. } => "partial", + Self::Cancelled { .. } => "cancelled", + Self::Timeout { .. } => "timeout", + Self::Blocked { .. } => "blocked", + Self::Failed { .. } => "failed", + } + } +} + /// Worker state machine. #[derive(Debug, Clone, Copy, PartialEq, Eq)] pub enum WorkerState { @@ -97,6 +259,21 @@ pub struct Worker { pub wiki_write: bool, /// Model override from conversation settings (per-process or blanket). pub model_override: Option, + /// Wall-clock budget for the entire `run()` invocation. Distinct from + /// the supervisor's `CortexConfig.worker_timeout_secs` (which is an + /// idle-kill bound measured from `last_activity_at`). Resolution chain + /// at construction: agent `CortexConfig.worker_wall_clock_timeout_secs` + /// → `DEFAULT_WORKER_WALL_CLOCK_TIMEOUT_SECS`. + pub worker_wall_clock_timeout_secs: u64, + /// Shared segment counter so the outer timeout wrapper can report + /// progress in `WorkerOutcome::Timeout` after the inner future is + /// cancelled. + pub segments_run: std::sync::Arc, + /// Shared slot that browser tools (and any future block-aware tool) + /// write to on positive detection of an external defense. The worker + /// loop reads it at each iteration and short-circuits to + /// `WorkerOutcome::Blocked` when populated. + pub blocked_signal: BlockSignal, } impl Worker { @@ -128,6 +305,12 @@ impl Worker { let (status_tx, status_rx) = watch::channel("starting".to_string()); let (inject_tx, inject_rx) = mpsc::channel(8); + let worker_wall_clock_timeout_secs = deps + .runtime_config + .cortex + .load() + .worker_wall_clock_timeout_secs; + ( Self { id, @@ -153,6 +336,9 @@ impl Worker { worker_memory_mode, wiki_write, model_override, + worker_wall_clock_timeout_secs, + segments_run: std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0)), + blocked_signal: new_block_signal(), }, inject_tx, ) @@ -316,13 +502,47 @@ impl Worker { Ok(()) } + /// Run the worker with the configured wall-clock budget. + /// + /// Wraps `run_inner` in `tokio::time::timeout`. On expiry the inner + /// future is dropped (cancelled mid-step) and we return + /// `WorkerOutcome::Timeout`, reading the segment count from the shared + /// atomic so the caller knows how far the worker got. + pub async fn run(self) -> Result { + let timeout_secs = self.worker_wall_clock_timeout_secs; + let segments_tracker = self.segments_run.clone(); + let worker_id = self.id; + let agent_id = self.deps.agent_id.clone(); + + let inner_fut = self.run_inner(); + tokio::pin!(inner_fut); + + tokio::select! { + outcome = &mut inner_fut => outcome, + _ = tokio::time::sleep(std::time::Duration::from_secs(timeout_secs)) => { + let segments = segments_tracker.load(std::sync::atomic::Ordering::Relaxed); + tracing::warn!( + %worker_id, + %agent_id, + timeout_secs, + segments_run = segments, + "worker exceeded wall-clock timeout" + ); + Ok(WorkerOutcome::Timeout { + elapsed_secs: timeout_secs, + segments_run: segments, + }) + } + } + } + /// Run the worker's LLM agent loop until completion. /// /// Runs in segments of 25 turns. After each segment, checks context usage /// and compacts if the worker is approaching the context window limit. /// This prevents long-running workers from dying mid-task due to context /// exhaustion. - pub async fn run(mut self) -> Result { + async fn run_inner(mut self) -> Result { // Wire the injection receiver into the hook so `on_completion_call` // can drain pending injected context before each LLM turn. if let Some(inject_rx) = self.inject_rx.take() { @@ -361,6 +581,7 @@ impl Worker { self.deps.memory_search.clone(), self.wiki_write, self.deps.wiki_store.clone(), + Some(self.blocked_signal.clone()), ); let routing = self.deps.runtime_config.routing.load(); @@ -404,9 +625,10 @@ impl Worker { // Run the initial task in segments with compaction checkpoints // (skipped entirely for resumed workers). let mut prompt = self.task.clone(); - let mut segments_run = 0; + let mut segments_run: usize = 0; let mut overflow_retries = 0; let mut transient_retries = 0; + let mut hit_max_segments = false; let mut result = if resuming { // For resumed workers, synthesize a "result" from the task @@ -414,7 +636,36 @@ impl Worker { String::new() } else { loop { + // Short-circuit on a tool-detected block (captcha / login wall + // / WAF). Browser tools write to `blocked_signal` after their + // page action; the worker exits cleanly so the agent can + // capture the dead-end into memory and the parent app can + // escalate. + if let Some(data) = self.blocked_signal.lock().ok().and_then(|mut s| s.take()) { + self.state = WorkerState::Failed; + self.hook + .send_status(format!("blocked: {}", data.reason.kind_tag())); + self.write_failure_log( + &history, + &format!("blocked: {}", data.reason.describe()), + ); + self.persist_transcript(&compacted_history, &history).await; + tracing::warn!( + worker_id = %self.id, + reason = %data.reason.kind_tag(), + url = ?data.url, + "worker short-circuited on blocked signal" + ); + return Ok(WorkerOutcome::Blocked { + reason: data.reason, + url: data.url, + evidence: Box::new(data.evidence), + }); + } + segments_run += 1; + self.segments_run + .store(segments_run, std::sync::atomic::Ordering::Relaxed); // Pre-prompt maintenance: dedup stale tool results and check // context usage *before* each LLM call, not just at segment @@ -446,6 +697,7 @@ impl Worker { "worker hit max segments, returning partial result" ); self.hook.send_status("done (max segments)"); + hit_max_segments = true; break crate::agent::extract_last_assistant_text(&history) .unwrap_or_else(|| { "Worker reached maximum segments without a final response." @@ -475,17 +727,20 @@ impl Worker { self.write_failure_log(&history, &format!("cancelled: {reason}")); self.persist_transcript(&compacted_history, &history).await; tracing::info!(worker_id = %self.id, %reason, "worker cancelled"); - return Err(crate::error::AgentError::Cancelled { reason }.into()); + return Ok(WorkerOutcome::Cancelled { reason }); } Err(error) if is_context_overflow_error(&error.to_string()) => { overflow_retries += 1; if overflow_retries > MAX_OVERFLOW_RETRIES { self.state = WorkerState::Failed; self.hook.send_status("failed"); - self.write_failure_log(&history, &format!("context overflow after {MAX_OVERFLOW_RETRIES} compaction attempts: {error}")); + let reason = format!( + "context overflow after {MAX_OVERFLOW_RETRIES} compaction attempts: {error}" + ); + self.write_failure_log(&history, &reason); self.persist_transcript(&compacted_history, &history).await; tracing::error!(worker_id = %self.id, %error, "worker context overflow unrecoverable"); - return Err(crate::error::AgentError::Other(error.into()).into()); + return Ok(WorkerOutcome::Failed { reason }); } tracing::warn!( @@ -508,9 +763,10 @@ impl Worker { if transient_retries > MAX_TRANSIENT_RETRIES { self.state = WorkerState::Failed; self.hook.send_status("failed"); - self.write_failure_log(&history, &format!( + let reason = format!( "transient provider error after {MAX_TRANSIENT_RETRIES} retries: {error}" - )); + ); + self.write_failure_log(&history, &reason); self.persist_transcript(&compacted_history, &history).await; tracing::error!( worker_id = %self.id, @@ -518,7 +774,7 @@ impl Worker { %error, "worker transient error retries exhausted" ); - return Err(crate::error::AgentError::Other(error.into()).into()); + return Ok(WorkerOutcome::Failed { reason }); } let delay = @@ -543,10 +799,11 @@ impl Worker { Err(error) => { self.state = WorkerState::Failed; self.hook.send_status("failed"); - self.write_failure_log(&history, &error.to_string()); + let reason = error.to_string(); + self.write_failure_log(&history, &reason); self.persist_transcript(&compacted_history, &history).await; tracing::error!(worker_id = %self.id, %error, "worker LLM call failed"); - return Err(crate::error::AgentError::Other(error.into()).into()); + return Ok(WorkerOutcome::Failed { reason }); } } } @@ -570,18 +827,18 @@ impl Worker { } else { self.state = WorkerState::Failed; self.hook.send_status("failed (empty result)"); + let reason = "worker produced empty result without signaling a meaningful outcome" + .to_string(); self.write_failure_log(&history, "worker produced empty result — likely a reasoning-only exit that bypassed the outcome gate"); self.persist_transcript(&compacted_history, &history).await; tracing::error!(worker_id = %self.id, "worker produced empty result, treating as failure"); - return Err(crate::error::AgentError::Other(anyhow::anyhow!( - "worker produced empty result without signaling a meaningful outcome" - )) - .into()); + return Ok(WorkerOutcome::Failed { reason }); } } // For interactive workers, enter a follow-up loop let mut follow_up_failure: Option = None; + let mut follow_up_blocked: Option = None; if let Some(mut input_rx) = self.input_rx.take() { if !resuming { // Fresh worker: persist transcript and signal idle for the first time. @@ -596,6 +853,15 @@ impl Worker { self.state = WorkerState::Running; self.hook.send_status("processing follow-up"); + // Honor a blocked signal that may have arrived while we were + // idle waiting for input. Browser tools called during a prior + // follow-up turn may have detected captcha / login / WAF + // after the turn returned. + if let Some(data) = self.blocked_signal.lock().ok().and_then(|mut s| s.take()) { + follow_up_blocked = Some(data); + break; + } + // Dedup stale tool results and compact before follow-up if needed dedup_tool_results(&mut history); self.maybe_compact_history(&mut compacted_history, &mut history) @@ -693,7 +959,11 @@ impl Worker { let scrubbed = if let Some(store) = self.deps.runtime_config.secrets.load().as_ref().as_ref() { - crate::secrets::scrub::scrub_with_store(&response, store) + crate::secrets::scrub::scrub_with_store( + &response, + store, + &self.deps.agent_id, + ) } else { response }; @@ -710,6 +980,28 @@ impl Worker { } } Err(failure_reason) => { + // Surface as Blocked when the failure was caused by a + // browser tool detecting captcha / login / WAF — the + // tool returned an error that bubbled up here, but + // also wrote to `blocked_signal`. Without this check, + // phase 4's structured Blocked outcome would + // regress to a generic Failed for interactive + // workers hitting blocks mid-conversation. + if let Some(data) = + self.blocked_signal.lock().ok().and_then(|mut s| s.take()) + { + follow_up_blocked = Some(data); + self.state = WorkerState::Failed; + self.hook.send_status(format!( + "blocked: {}", + follow_up_blocked + .as_ref() + .expect("just set") + .reason + .kind_tag() + )); + break; + } self.state = WorkerState::Failed; self.hook.send_status("failed"); follow_up_failure = Some(failure_reason); @@ -724,10 +1016,27 @@ impl Worker { } } + if let Some(data) = follow_up_blocked { + self.persist_transcript(&compacted_history, &history).await; + tracing::warn!( + worker_id = %self.id, + reason = %data.reason.kind_tag(), + url = ?data.url, + "worker follow-up short-circuited on blocked signal" + ); + return Ok(WorkerOutcome::Blocked { + reason: data.reason, + url: data.url, + evidence: Box::new(data.evidence), + }); + } + if let Some(failure_reason) = follow_up_failure { self.persist_transcript(&compacted_history, &history).await; tracing::error!(worker_id = %self.id, reason = %failure_reason, "worker failed"); - return Err(crate::error::AgentError::Other(anyhow::anyhow!(failure_reason)).into()); + return Ok(WorkerOutcome::Failed { + reason: failure_reason, + }); } self.state = WorkerState::Done; @@ -757,7 +1066,14 @@ impl Worker { } tracing::info!(worker_id = %self.id, "worker completed"); - Ok(result) + if hit_max_segments { + Ok(WorkerOutcome::Partial { + result, + segments_run, + }) + } else { + Ok(WorkerOutcome::Success { result }) + } } /// Check context usage and compact history if approaching the limit. diff --git a/src/api/agents.rs b/src/api/agents.rs index 590f3a482..3ccc530df 100644 --- a/src/api/agents.rs +++ b/src/api/agents.rs @@ -4,7 +4,7 @@ use crate::agent::cortex::CortexLogger; use crate::conversation::channels::ChannelStore; use axum::Json; -use axum::extract::{Query, State}; +use axum::extract::{Path, Query, State}; use axum::http::StatusCode; use serde::{Deserialize, Serialize}; use sqlx::Row as _; @@ -396,6 +396,53 @@ pub(super) async fn get_warmup_status( Ok(Json(WarmupStatusResponse { statuses })) } +#[derive(serde::Serialize, utoipa::ToSchema)] +pub(super) struct WakeAgentResponse { + pub agent_id: String, + pub fired: bool, + pub message: String, +} + +/// Manually wake a (typically dormant) agent. +/// +/// Fires the same wake path that `send_agent_message`, cron, and other +/// trigger sources use. Useful for debugging dormant deployments and +/// recovering an agent stuck on a missed trigger. +#[utoipa::path( + post, + path = "/agents/{agent_id}/wake", + params(("agent_id" = String, Path, description = "Agent ID")), + responses( + (status = 202, body = WakeAgentResponse), + (status = 503, description = "Wake manager not running"), + ), + tag = "agents", +)] +pub(super) async fn wake_agent( + State(state): State>, + Path(agent_id): Path, +) -> Result, StatusCode> { + let wake_tx_guard = state.wake_tx.load(); + let Some(tx) = wake_tx_guard.as_ref().as_ref() else { + tracing::warn!(%agent_id, "wake requested but wake manager is not running"); + return Err(StatusCode::SERVICE_UNAVAILABLE); + }; + let target: crate::AgentId = std::sync::Arc::from(agent_id.as_str()); + if !state.wake_registry.read().await.contains_key(&target) { + tracing::warn!(%agent_id, "wake requested for unregistered agent"); + return Err(StatusCode::NOT_FOUND); + } + if let Err(error) = tx.send(target) { + tracing::warn!(%agent_id, %error, "wake send failed — manager not receiving"); + return Err(StatusCode::SERVICE_UNAVAILABLE); + } + Ok(Json(WakeAgentResponse { + agent_id, + fired: true, + message: "wake queued".to_string(), + })) +} + /// Trigger warmup for one agent or all agents. #[utoipa::path( post, @@ -514,6 +561,7 @@ pub(super) async fn trigger_warmup( working_memory, api_state: None, wiki_store: None, + wake_tx: None, }; let mut logger = CortexLogger::new(sqlite_pool); if let Some(store) = notif_store_warmup { @@ -878,6 +926,7 @@ pub async fn create_agent_internal( agent_config.workspace.clone(), &instance_dir, agent_config.data_dir.clone(), + std::sync::Arc::from(agent_config.id.as_str()), ) .await, ); @@ -948,6 +997,7 @@ pub async fn create_agent_internal( }, api_state: Some(state.clone()), wiki_store: state.wiki_store.load().as_ref().clone(), + wake_tx: state.wake_tx.load().as_ref().as_ref().cloned(), }; let event_rx = event_tx.subscribe(); @@ -955,6 +1005,15 @@ pub async fn create_agent_internal( let tool_output_rx = tool_output_tx.subscribe(); state.register_tool_output_stream(agent_id.clone(), tool_output_rx); + // Register with the wake manager so dormant-mode triggers can reach + // this agent. Without this, agents created at runtime would never + // receive cross-agent message wakes or admin /wake calls. + state + .wake_registry + .write() + .await + .insert(arc_agent_id.clone(), deps.clone()); + let cron_store = std::sync::Arc::new(crate::cron::CronStore::new(db.sqlite.clone())); let cron_context = crate::cron::CronContext { deps: deps.clone(), @@ -1346,6 +1405,14 @@ pub(super) async fn delete_agent( })?; } + // Drop the wake-manager registration only after the fallible config write + // succeeds. Removing it earlier would leave the agent alive but unwakeable + // if the write failed mid-flight. + { + let key: crate::AgentId = std::sync::Arc::from(agent_id.as_str()); + state.wake_registry.write().await.remove(&key); + } + // Close the SQLite pool before removing state { let pools = state.agent_pools.load(); diff --git a/src/api/providers.rs b/src/api/providers.rs index 3cf4421aa..ec8f67de0 100644 --- a/src/api/providers.rs +++ b/src/api/providers.rs @@ -417,7 +417,7 @@ pub(super) async fn get_providers( let resolve_value = |value: &str| -> Option { if let Some(alias) = value.strip_prefix("secret:") { let store = secrets_store.as_ref().as_ref()?; - return match store.get(alias) { + return match store.get(&crate::secrets::store::SecretScope::shared(), alias) { Ok(secret) => Some(secret.expose().to_string()), Err(error) => { tracing::warn!(%error, alias, "failed to resolve secret reference"); diff --git a/src/api/secrets.rs b/src/api/secrets.rs index 356593459..1779d7b99 100644 --- a/src/api/secrets.rs +++ b/src/api/secrets.rs @@ -10,7 +10,8 @@ use crate::config::{ TwitchConfig, }; use crate::secrets::store::{ - ExportData, SecretCategory, SecretsStore, StoreState, SystemSecrets, auto_categorize, + ExportData, SecretCategory, SecretScope, SecretsStore, StoreState, SystemSecrets, + auto_categorize, }; use axum::Json; @@ -69,6 +70,9 @@ pub async fn secrets_status(State(state): State>) -> impl IntoResp #[derive(Serialize, utoipa::ToSchema)] struct SecretListItem { name: String, + /// Visibility scope. `InstanceShared` is the default for system / + /// admin-managed secrets; `Agent` rows are per-agent tool credentials. + scope: SecretScope, category: SecretCategory, created_at: chrono::DateTime, updated_at: chrono::DateTime, @@ -95,12 +99,18 @@ pub async fn list_secrets(State(state): State>) -> impl IntoRespon Err(e) => return e.into_response(), }; - match store.list_metadata() { + // Restrict to InstanceShared until per-agent CRUD endpoints land — the + // existing PUT/DELETE/INFO endpoints only operate on shared scope, so + // surfacing agent-scoped rows here would let the dashboard show entries + // it can't read or mutate. Agent-scoped secrets get their own listing + // through the per-agent endpoints (forthcoming). + match store.list_metadata(Some(&SecretScope::shared())) { Ok(metadata) => { let mut secrets: Vec = metadata .into_iter() - .map(|(name, meta)| SecretListItem { + .map(|((scope, name), meta)| SecretListItem { name, + scope, category: meta.category, created_at: meta.created_at, updated_at: meta.updated_at, @@ -167,7 +177,7 @@ pub async fn put_secret( let category = body.category.unwrap_or_else(|| auto_categorize(&name)); - match store.set(&name, &body.value, category) { + match store.set(&SecretScope::shared(), &name, &body.value, category) { Ok(()) => { let reload_required = category == SecretCategory::System; let message = if reload_required { @@ -230,7 +240,7 @@ pub async fn delete_secret( .into_response(); } - match store.delete(&name) { + match store.delete(&SecretScope::shared(), &name) { Ok(()) => Json(DeleteSecretResponse { deleted: name, warning: None, @@ -275,7 +285,7 @@ pub async fn secret_info( Err(e) => return e.into_response(), }; - match store.get_metadata(&name) { + match store.get_metadata(&SecretScope::shared(), &name) { Ok(meta) => Json(SecretInfoResponse { name, category: meta.category, @@ -736,7 +746,7 @@ fn try_migrate_field( } let category = auto_categorize(secret_name); - if let Err(error) = store.set(secret_name, &value_str, category) { + if let Err(error) = store.set(&SecretScope::shared(), secret_name, &value_str, category) { tracing::warn!(%error, secret_name, "failed to migrate secret"); return; } @@ -819,7 +829,9 @@ fn migrate_section_secrets( }; let category = auto_categorize(&secret_name); - if let Err(error) = store.set(&secret_name, &value_str, category) { + if let Err(error) = + store.set(&SecretScope::shared(), &secret_name, &value_str, category) + { tracing::warn!(%error, %secret_name, "failed to migrate instance secret"); continue; } diff --git a/src/api/server.rs b/src/api/server.rs index d7b52db80..a60cb7ec5 100644 --- a/src/api/server.rs +++ b/src/api/server.rs @@ -65,6 +65,7 @@ pub fn api_router() -> OpenApiRouter> { .routes(routes!(agents::reconnect_agent_mcp)) .routes(routes!(agents::get_warmup_status)) .routes(routes!(agents::trigger_warmup)) + .routes(routes!(agents::wake_agent)) .routes(routes!(agents::agent_overview)) .routes(routes!(agents::get_agent_profile)) .routes(routes!( diff --git a/src/api/state.rs b/src/api/state.rs index 79e8828b3..80d6ca604 100644 --- a/src/api/state.rs +++ b/src/api/state.rs @@ -259,6 +259,16 @@ pub struct ApiState { pub task_store: ArcSwap>>, /// Instance-wide wiki knowledge base. pub wiki_store: ArcSwap>>, + /// Wake-dispatch sender for dormant-mode agent triggers. Set at startup + /// when the wake manager spawns; consumed by tools / endpoints that + /// deliver wakes to other agents. + pub wake_tx: ArcSwap>, + /// Live registry of agent ID → deps for the wake manager. Mutated by + /// startup, runtime agent-create, and runtime agent-delete paths so + /// dynamically-added agents are wakeable and removed agents stop + /// receiving wakes. + pub wake_registry: + Arc>>, /// Instance-level shared project store. pub project_store: ArcSwap>>, /// Instance-level notification store for the dashboard inbox. @@ -530,6 +540,8 @@ impl ApiState { cron_schedulers: arc_swap::ArcSwap::from_pointee(HashMap::new()), task_store: ArcSwap::from_pointee(None), wiki_store: ArcSwap::from_pointee(None), + wake_tx: ArcSwap::from_pointee(None), + wake_registry: Arc::new(tokio::sync::RwLock::new(std::collections::HashMap::new())), project_store: ArcSwap::from_pointee(None), notification_store: ArcSwap::from_pointee(None), runtime_configs: ArcSwap::from_pointee(HashMap::new()), diff --git a/src/config/load.rs b/src/config/load.rs index 377440de4..b4c4e5df5 100644 --- a/src/config/load.rs +++ b/src/config/load.rs @@ -15,7 +15,7 @@ use super::{ CoalesceConfig, CompactionConfig, Config, CortexConfig, CronDef, DefaultsConfig, DiscordConfig, DiscordInstanceConfig, EmailConfig, EmailInstanceConfig, GroupDef, HumanDef, IngestionConfig, LinkDef, LlmConfig, MattermostConfig, MattermostInstanceConfig, McpServerConfig, McpTransport, - MemoryPersistenceConfig, MessagingConfig, MetricsConfig, OpenCodeConfig, + MemoryJanitorConfig, MemoryPersistenceConfig, MessagingConfig, MetricsConfig, OpenCodeConfig, ParticipantContextConfig, ProjectsConfig, ProviderConfig, SignalConfig, SignalInstanceConfig, SlackCommandConfig, SlackConfig, SlackInstanceConfig, TelegramConfig, TelegramInstanceConfig, TelemetryConfig, TwitchConfig, TwitchInstanceConfig, WarmupConfig, WebhookConfig, @@ -38,7 +38,7 @@ pub(crate) fn resolve_env_value(value: &str) -> Option { if let Some(alias) = value.strip_prefix("secret:") { let guard = RESOLVE_SECRETS_STORE.load(); match (*guard).as_ref() { - Some(store) => match store.get(alias) { + Some(store) => match store.get(&crate::secrets::store::SecretScope::shared(), alias) { Ok(secret) => Some(secret.expose().to_string()), Err(error) => { tracing::warn!(%error, alias, "failed to resolve secret: reference"); @@ -80,6 +80,7 @@ const KNOWN_TOP_LEVEL_KEYS: &[&str] = &[ "api", "metrics", "telemetry", + "memory_janitor", ]; /// Pre-parse check that warns about unrecognised top-level keys in a config @@ -188,13 +189,28 @@ impl CortexConfig { ); } + let worker_wall_clock_timeout_secs = overrides + .worker_wall_clock_timeout_secs + .unwrap_or(defaults.worker_wall_clock_timeout_secs); + if worker_wall_clock_timeout_secs < 1 { + return Err(ConfigError::Invalid( + "worker_wall_clock_timeout_secs must be >= 1".to_string(), + ) + .into()); + } + let config = CortexConfig { + mode: overrides.mode.unwrap_or(defaults.mode), tick_interval_secs: overrides .tick_interval_secs .unwrap_or(defaults.tick_interval_secs), worker_timeout_secs: overrides .worker_timeout_secs .unwrap_or(defaults.worker_timeout_secs), + worker_wall_clock_timeout_secs, + cron_default_timeout_secs: overrides + .cron_default_timeout_secs + .or(defaults.cron_default_timeout_secs), branch_timeout_secs: overrides .branch_timeout_secs .unwrap_or(defaults.branch_timeout_secs), @@ -977,6 +993,7 @@ impl Config { .unwrap_or_else(|_| "spacebot".into()), sample_rate: 1.0, }, + memory_janitor: MemoryJanitorConfig::default(), }) } @@ -2628,6 +2645,17 @@ impl Config { } } + let memory_janitor = MemoryJanitorConfig { + enabled: toml + .memory_janitor + .enabled + .unwrap_or_else(|| MemoryJanitorConfig::default().enabled), + interval_secs: toml + .memory_janitor + .interval_secs + .unwrap_or_else(|| MemoryJanitorConfig::default().interval_secs), + }; + Ok(Config { instance_dir, llm, @@ -2641,6 +2669,7 @@ impl Config { api, metrics, telemetry, + memory_janitor, }) } } diff --git a/src/config/toml_schema.rs b/src/config/toml_schema.rs index 839fa7fe0..b336ad354 100644 --- a/src/config/toml_schema.rs +++ b/src/config/toml_schema.rs @@ -29,6 +29,14 @@ pub(super) struct TomlConfig { pub(super) metrics: TomlMetricsConfig, #[serde(default)] pub(super) telemetry: TomlTelemetryConfig, + #[serde(default)] + pub(super) memory_janitor: TomlMemoryJanitorConfig, +} + +#[derive(Deserialize, Default)] +pub(super) struct TomlMemoryJanitorConfig { + pub(super) enabled: Option, + pub(super) interval_secs: Option, } #[derive(Deserialize)] @@ -362,8 +370,11 @@ pub(super) struct TomlCompactionConfig { #[derive(Deserialize)] pub(super) struct TomlCortexConfig { + pub(super) mode: Option, pub(super) tick_interval_secs: Option, pub(super) worker_timeout_secs: Option, + pub(super) worker_wall_clock_timeout_secs: Option, + pub(super) cron_default_timeout_secs: Option, pub(super) branch_timeout_secs: Option, pub(super) detached_worker_timeout_retry_limit: Option, pub(super) supervisor_kill_budget_per_tick: Option, diff --git a/src/config/types.rs b/src/config/types.rs index 33c1fce6a..65d646c8f 100644 --- a/src/config/types.rs +++ b/src/config/types.rs @@ -59,6 +59,31 @@ pub struct Config { pub metrics: MetricsConfig, /// OpenTelemetry export configuration. pub telemetry: TelemetryConfig, + /// Instance-wide memory janitor — required for dormant-mode agents + /// (their cortex tick never runs maintenance), additive on active-mode + /// agents. + pub memory_janitor: MemoryJanitorConfig, +} + +/// Instance-wide memory maintenance scheduler. +#[derive(Debug, Clone, Copy)] +pub struct MemoryJanitorConfig { + /// When `true`, a single background task walks every registered agent + /// on `interval_secs` and runs `memory::maintenance::run_maintenance` + /// against its memory store. Disabled by default; enable when running + /// dormant-mode agents at scale. + pub enabled: bool, + /// Seconds between full sweeps. Default `86_400` (daily). + pub interval_secs: u64, +} + +impl Default for MemoryJanitorConfig { + fn default() -> Self { + Self { + enabled: false, + interval_secs: 86_400, + } + } } impl Config { @@ -1036,11 +1061,49 @@ impl Default for OpenCodeConfig { } } +/// Whether the cortex runs its periodic loops or stays dormant until woken. +/// +/// `Active` (default) is the historical behavior — the cortex spawns +/// warmup, tick, association, and ready-task loops at agent startup. +/// +/// `Dormant` is the agentic-backend deployment mode — the cortex skips +/// all four loops and instead relies on external wake triggers +/// (`send_agent_message` post-insert hook, cron fire, admin wake API). +/// Required for deployments running thousands of mostly-idle agents on +/// shared infrastructure where periodic LLM-backed bulletin generation +/// would dominate cost. +#[derive(Debug, Clone, Copy, Default, PartialEq, Eq, Serialize, Deserialize, utoipa::ToSchema)] +#[serde(rename_all = "snake_case")] +pub enum CortexMode { + #[default] + Active, + Dormant, +} + +impl CortexMode { + pub fn is_dormant(&self) -> bool { + matches!(self, Self::Dormant) + } +} + /// Cortex configuration. #[derive(Debug, Clone, Copy)] pub struct CortexConfig { + /// Whether the cortex's periodic loops run. See `CortexMode` for the + /// active vs dormant trade-off. + pub mode: CortexMode, pub tick_interval_secs: u64, + /// Supervisor idle-kill bound: max seconds since `last_activity_at` + /// before the cortex supervisor terminates a stuck worker. pub worker_timeout_secs: u64, + /// Wall-clock budget for an entire `Worker::run` invocation. Distinct + /// from `worker_timeout_secs` (which is idle-shaped). Catches the + /// slow-drift case where a worker stays "active" but never completes. + pub worker_wall_clock_timeout_secs: u64, + /// Per-agent default for cron job timeouts. `None` falls back to the + /// system default (`crate::cron::scheduler::DEFAULT_CRON_TIMEOUT_SECS`). + /// A per-job `timeout_secs` always wins over this default. + pub cron_default_timeout_secs: Option, pub branch_timeout_secs: u64, pub detached_worker_timeout_retry_limit: u8, pub supervisor_kill_budget_per_tick: usize, @@ -1078,8 +1141,12 @@ pub struct CortexConfig { impl Default for CortexConfig { fn default() -> Self { Self { + mode: CortexMode::Active, tick_interval_secs: 30, worker_timeout_secs: 600, + worker_wall_clock_timeout_secs: + crate::agent::worker::DEFAULT_WORKER_WALL_CLOCK_TIMEOUT_SECS, + cron_default_timeout_secs: None, branch_timeout_secs: 600, detached_worker_timeout_retry_limit: 2, supervisor_kill_budget_per_tick: 8, diff --git a/src/cron/scheduler.rs b/src/cron/scheduler.rs index cd4dfb12d..dc1d42655 100644 --- a/src/cron/scheduler.rs +++ b/src/cron/scheduler.rs @@ -131,6 +131,11 @@ pub struct CronContext { const MAX_CONSECUTIVE_FAILURES: u32 = 3; +/// System-wide fallback for a cron job's wall-clock timeout when neither +/// the job nor its owning agent specifies one. Generous because cron +/// channels include an LLM synthesis step after the worker completes. +pub const DEFAULT_CRON_TIMEOUT_SECS: u64 = 1500; + /// RAII guard that clears an `AtomicBool` on drop, ensuring the flag is /// released even if the holding task panics. struct ExecutionGuard(Arc); @@ -1384,7 +1389,18 @@ async fn run_cron_job( )); } - let timeout = Duration::from_secs(job.timeout_secs.unwrap_or(1500)); + // Resolution: per-job override → per-agent default → system default. + let agent_default = context + .deps + .runtime_config + .cortex + .load() + .cron_default_timeout_secs; + let timeout_secs = job + .timeout_secs + .or(agent_default) + .unwrap_or(DEFAULT_CRON_TIMEOUT_SECS); + let timeout = Duration::from_secs(timeout_secs); // Keep channel_tx alive — on timeout we send a "synthesize now" message // so the LLM gets a direct turn to compose the final reply. @@ -1475,6 +1491,16 @@ async fn run_cron_job( drop(channel_tx); } + // Channel has fully exited. Wake the owning agent so dormant-mode cortex + // picks up any side-effect tasks the cron run created (delegations, + // follow-up tasks, etc.). Firing here — after the channel finishes — + // ensures the one-shot wake actually finds the work; firing pre-channel + // would consume the wake before any task rows existed. No-op for + // active-mode agents. + if let Some(tx) = context.deps.wake_tx.as_ref() { + crate::agent::wake::fire_wake(tx, &context.deps.agent_id); + } + // Channel has fully exited. Deliver the outcome if set_outcome() was called. let final_response = cron_outcome.take().map(OutboundResponse::Text); diff --git a/src/lib.rs b/src/lib.rs index abe246a99..87af2d3fd 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -464,6 +464,11 @@ pub struct AgentDeps { pub api_state: Option>, /// Instance-wide wiki store. pub wiki_store: Option>, + /// Wake channel for dormant-mode dispatch. When set, tools / cron + /// fire wakes by sending a target `AgentId`; the wake manager looks + /// up the receiver and runs `cortex::wake_one` for it. `None` in + /// test contexts and during config preview where no manager runs. + pub wake_tx: Option, } impl AgentDeps { diff --git a/src/main.rs b/src/main.rs index 51e386832..da784382e 100644 --- a/src/main.rs +++ b/src/main.rs @@ -2773,6 +2773,16 @@ async fn initialize_agents( .collect(), ); + // Wake-dispatch infrastructure for dormant-mode agents. Always spawned + // so active-mode agents can also participate (cron / message wake hooks + // are mode-agnostic — `cortex::wake_one` is a no-op contention with the + // active loop's pickup, harmless either way). The registry lives on + // `api_state` so runtime agent-create / agent-delete paths can keep it + // in sync without going through main. + let wake_registry = api_state.wake_registry.clone(); + let wake_tx = spacebot::agent::wake::spawn_wake_manager(wake_registry.clone()); + api_state.wake_tx.store(Arc::new(Some(wake_tx.clone()))); + for agent_config in &resolved_agents { tracing::info!(agent_id = %agent_config.id, "initializing agent"); @@ -2972,6 +2982,7 @@ async fn initialize_agents( agent_config.workspace.clone(), &config.instance_dir, agent_config.data_dir.clone(), + agent_id.clone(), ) .await, ); @@ -3010,6 +3021,7 @@ async fn initialize_agents( working_memory, api_state: Some(api_state.clone()), wiki_store: Some(global_wiki_store.clone()), + wake_tx: Some(wake_tx.clone()), }; let agent = spacebot::Agent { @@ -3019,6 +3031,12 @@ async fn initialize_agents( deps, }; + // Register with the wake manager so external triggers can reach this agent. + wake_registry + .write() + .await + .insert(agent_id.clone(), agent.deps.clone()); + tracing::info!(agent_id = %agent_config.id, "agent initialized"); agents.insert(agent_id, agent); } @@ -3124,6 +3142,13 @@ async fn initialize_agents( let mut startup_warmup = tokio::task::JoinSet::new(); for (agent_id, agent) in agents.iter() { + // Dormant agents stay cold at startup — running warmup here would + // touch model/embedding load paths the dormant-mode contract + // promises to skip until an explicit wake. + if agent.deps.runtime_config.cortex.load().mode.is_dormant() { + tracing::info!(agent_id = %agent_id, "startup warmup skipped: dormant"); + continue; + } let deps = agent.deps.clone(); let sqlite_pool = agent.db.sqlite.clone(); let agent_id = agent_id.clone(); @@ -3732,8 +3757,19 @@ async fn initialize_agents( } } - // Start cortex warmup, runtime, and association loops for each agent + // Start cortex warmup, runtime, and association loops for each agent. + // Skip every loop when the agent is in Dormant mode — those agents wake + // only on external triggers (cross-agent message, cron fire, admin API). for (agent_id, agent) in agents.iter() { + let cortex_mode = agent.deps.runtime_config.cortex.load().mode; + if cortex_mode.is_dormant() { + tracing::info!( + agent_id = %agent_id, + "cortex loops skipped: agent is in dormant mode" + ); + continue; + } + let cortex_logger = spacebot::agent::cortex::CortexLogger::new(agent.db.sqlite.clone()) .with_notifications(global_notification_store.clone(), agent_id.to_string()); let warmup_handle = @@ -3760,6 +3796,21 @@ async fn initialize_agents( tracing::info!(agent_id = %agent_id, "cortex ready-task loop started"); } + // Spawn the instance-wide memory janitor when configured. Required for + // dormant-mode agents (their cortex loop never runs maintenance); + // additive on active-mode agents (idempotent). + if config.memory_janitor.enabled { + let janitor_handle = spacebot::agent::maintenance::spawn_memory_janitor( + wake_registry.clone(), + config.memory_janitor.interval_secs, + ); + cortex_handles.push(janitor_handle); + tracing::info!( + interval_secs = config.memory_janitor.interval_secs, + "memory janitor started" + ); + } + // Create cortex chat sessions for each agent { let mut sessions = std::collections::HashMap::new(); diff --git a/src/opencode/worker.rs b/src/opencode/worker.rs index 9d5e78909..b09e20fcc 100644 --- a/src/opencode/worker.rs +++ b/src/opencode/worker.rs @@ -244,7 +244,7 @@ impl OpenCodeWorker { /// Returns the scrubbed text. If no secrets store is set, returns the input unchanged. fn scrub_text(&self, text: &str) -> String { match &self.secrets_store { - Some(store) => crate::secrets::scrub::scrub_with_store(text, store), + Some(store) => crate::secrets::scrub::scrub_with_store(text, store, &self.agent_id), None => text.to_string(), } } diff --git a/src/sandbox.rs b/src/sandbox.rs index c4f0ddd30..5eb952e2e 100644 --- a/src/sandbox.rs +++ b/src/sandbox.rs @@ -163,6 +163,10 @@ pub struct Sandbox { data_dir: PathBuf, tools_bin: PathBuf, backend: InternalBackend, + /// Owning agent for this sandbox. Used to scope tool-secret reads when + /// the secrets store is wired in (see `tool_secrets`). Sandbox is per- + /// agent in production; the test constructor uses a placeholder ID. + agent_id: crate::AgentId, /// Reference to the secrets store for injecting tool secrets into worker /// subprocesses. When set, `wrap()` reads tool secrets from the store and /// injects them as env vars via `--setenv` (bubblewrap) or `Command::env()` @@ -193,6 +197,7 @@ impl Sandbox { workspace: PathBuf, instance_dir: &Path, data_dir: PathBuf, + agent_id: crate::AgentId, ) -> Self { let tools_bin = instance_dir.join("tools/bin"); @@ -240,6 +245,7 @@ impl Sandbox { data_dir, tools_bin, backend, + agent_id, secrets_store: ArcSwap::from_pointee(None), } } @@ -252,11 +258,12 @@ impl Sandbox { self.secrets_store.store(Arc::new(Some(store))); } - /// Read tool secrets from the store for injection into subprocess environment. + /// Read tool secrets visible to this sandbox's owning agent for injection + /// into subprocess environment. fn tool_secrets(&self) -> HashMap { let guard = self.secrets_store.load(); match guard.as_ref() { - Some(store) => store.tool_env_vars(), + Some(store) => store.tool_env_vars(&self.agent_id), None => HashMap::new(), } } @@ -898,6 +905,7 @@ impl Sandbox { data_dir: PathBuf::new(), tools_bin: PathBuf::new(), backend: InternalBackend::None, + agent_id: std::sync::Arc::from("test-agent"), secrets_store: ArcSwap::from_pointee(None), } } diff --git a/src/secrets/scrub.rs b/src/secrets/scrub.rs index 7eb2dce3a..03070b238 100644 --- a/src/secrets/scrub.rs +++ b/src/secrets/scrub.rs @@ -222,11 +222,18 @@ pub fn scrub_secrets(text: &str, tool_secrets: &[(String, String)]) -> String { /// Scrub tool secret values from text using a `SecretsStore`. /// -/// Reads the current tool secrets from the store and performs exact-match -/// redaction. For use in result paths where a `SecretsStore` reference is -/// available. -pub fn scrub_with_store(text: &str, store: &crate::secrets::store::SecretsStore) -> String { - let pairs = store.tool_secret_pairs(); +/// Reads the tool secrets visible to `agent_id` (the union of +/// `InstanceShared(Tool)` and the agent's own scope) and performs +/// exact-match redaction. Scoped per agent so a worker's output is only +/// scrubbed against the secret values that worker could plausibly have +/// seen — avoids incidentally redacting another tenant's secret value +/// that happened to appear verbatim in the text. +pub fn scrub_with_store( + text: &str, + store: &crate::secrets::store::SecretsStore, + agent_id: &crate::AgentId, +) -> String { + let pairs = store.tool_secret_pairs(agent_id); scrub_secrets(text, &pairs) } diff --git a/src/secrets/store.rs b/src/secrets/store.rs index 0f34b5331..2c70b48ef 100644 --- a/src/secrets/store.rs +++ b/src/secrets/store.rs @@ -10,6 +10,7 @@ //! The master key lives in the OS credential store (Keychain / kernel keyring), //! never on disk. +use crate::AgentId; use crate::error::SecretsError; use aes_gcm::{Aes256Gcm, KeyInit, Nonce, aead::Aead}; use rand::RngCore; @@ -68,6 +69,92 @@ impl Display for DecryptedSecret { } } +/// Secret scope determines visibility across agents on a shared instance. +/// +/// Orthogonal to `SecretCategory` — `System` secrets are always +/// `InstanceShared` (singleton consumers like `LlmManager` / +/// `MessagingManager`); `Tool` secrets default to `Agent(...)` for +/// agentic-backend deployments where each tenant's worker subprocess must +/// not see another tenant's credentials, but can also be `InstanceShared` +/// when a single-tenant deployment legitimately wants every agent to share +/// the same `Tool` secret (e.g. one repo-wide `GH_TOKEN`). +#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize, utoipa::ToSchema)] +#[serde(tag = "kind", rename_all = "snake_case")] +pub enum SecretScope { + /// Visible to every agent on the instance. + InstanceShared, + /// Visible only to the named agent. + Agent { + #[serde(rename = "agent_id")] + id: String, + }, +} + +impl SecretScope { + pub fn shared() -> Self { + Self::InstanceShared + } + + pub fn agent(id: &AgentId) -> Self { + Self::Agent { + id: id.as_ref().to_string(), + } + } + + pub fn is_shared(&self) -> bool { + matches!(self, Self::InstanceShared) + } + + /// Returns true when this scope is visible to the given agent — either + /// because it's instance-shared or it's that specific agent's scope. + pub fn visible_to(&self, agent_id: &AgentId) -> bool { + match self { + Self::InstanceShared => true, + Self::Agent { id } => id == agent_id.as_ref(), + } + } +} + +impl std::fmt::Display for SecretScope { + fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { + match self { + Self::InstanceShared => write!(f, "shared"), + Self::Agent { id } => write!(f, "agent:{id}"), + } + } +} + +/// Encode `(scope, name)` into the redb key. Uses `\x00` (NUL) as the +/// delimiter — env-var names cannot contain NUL, and agent IDs are UUID- +/// shaped strings, so the encoding is unambiguous and round-trips cleanly. +/// +/// Layout: +/// - Shared: `s\x00` +/// - Agent: `a\x00\x00` +/// +/// Legacy unprefixed keys (pre-migration) match neither pattern (no NUL +/// byte) and are rewritten as `s\x00` on first store open. +fn encode_key(scope: &SecretScope, name: &str) -> String { + match scope { + SecretScope::InstanceShared => format!("s\x00{name}"), + SecretScope::Agent { id } => format!("a\x00{id}\x00{name}"), + } +} + +/// Inverse of `encode_key`. Returns `None` for legacy unprefixed keys (so +/// the migration path can detect them) and for malformed keys. +fn decode_key(key: &str) -> Option<(SecretScope, String)> { + let (tag, rest) = key.split_once('\x00')?; + match tag { + "s" => Some((SecretScope::InstanceShared, rest.to_string())), + "a" => { + let (id, name) = rest.split_once('\x00')?; + Some((SecretScope::Agent { id: id.to_string() }, name.to_string())) + } + _ => None, + } +} + /// Secret category determines subprocess exposure. /// /// All secrets are readable by Rust code via `SecretsStore::get()` regardless @@ -231,12 +318,125 @@ impl SecretsStore { } }; - Ok(Self { + let store = Self { db, cipher_state: RwLock::new(None), encrypted: RwLock::new(encrypted), mutation_guard: Mutex::new(()), - }) + }; + + store.migrate_legacy_keys_to_shared()?; + + Ok(store) + } + + /// Rewrite any pre-scope keys (no `s\x00` / `a\x00` prefix) as + /// `InstanceShared`. Idempotent: subsequent opens skip rows that already + /// decode cleanly. Preserves stored values verbatim — encryption state + /// transfers regardless of key format. + fn migrate_legacy_keys_to_shared(&self) -> Result<(), SecretsError> { + // Collect legacy keys first via a read transaction so we don't iterate + // and write under the same transaction (redb doesn't allow that). + let legacy_keys: Vec = { + let read_txn = self.db.begin_read().map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to begin read transaction: {error}")) + })?; + let metadata = read_txn.open_table(METADATA_TABLE).map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}")) + })?; + let iter = metadata.iter().map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to iterate metadata: {error}")) + })?; + let mut legacy = Vec::new(); + for entry in iter { + let (key, _) = entry.map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to read metadata entry: {error}")) + })?; + let key_str = key.value().to_string(); + if decode_key(&key_str).is_none() { + legacy.push(key_str); + } + } + legacy + }; + + if legacy_keys.is_empty() { + return Ok(()); + } + + tracing::info!( + count = legacy_keys.len(), + "migrating legacy unscoped secrets → InstanceShared scope" + ); + + let write_txn = self.db.begin_write().map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to begin migration write transaction: {error}" + )) + })?; + { + let mut secrets = write_txn.open_table(SECRETS_TABLE).map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}")) + })?; + let mut metadata = write_txn.open_table(METADATA_TABLE).map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}")) + })?; + for old_key in &legacy_keys { + let new_key = encode_key(&SecretScope::InstanceShared, old_key); + let value_bytes = secrets + .get(old_key.as_str()) + .map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to read legacy value '{old_key}': {error}" + )) + })? + .map(|v| v.value().to_vec()); + let metadata_str = metadata + .get(old_key.as_str()) + .map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to read legacy metadata '{old_key}': {error}" + )) + })? + .map(|m| m.value().to_string()); + + if let Some(bytes) = value_bytes { + secrets + .insert(new_key.as_str(), bytes.as_slice()) + .map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to write migrated value '{old_key}': {error}" + )) + })?; + secrets.remove(old_key.as_str()).map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to remove legacy value '{old_key}': {error}" + )) + })?; + } + if let Some(meta) = metadata_str { + metadata + .insert(new_key.as_str(), meta.as_str()) + .map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to write migrated metadata '{old_key}': {error}" + )) + })?; + metadata.remove(old_key.as_str()).map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to remove legacy metadata '{old_key}': {error}" + )) + })?; + } + } + } + write_txn.commit().map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to commit secret scope migration: {error}" + )) + })?; + + Ok(()) } /// Current store state. @@ -264,7 +464,7 @@ impl SecretsStore { /// Full status snapshot for the API. pub fn status(&self, platform_managed: bool) -> Result { - let all_metadata = self.list_metadata()?; + let all_metadata = self.list_metadata(None)?; let system_count = all_metadata .values() .filter(|m| m.category == SecretCategory::System) @@ -284,9 +484,11 @@ impl SecretsStore { }) } - /// Store a secret with the given category. Creates or updates. + /// Store a secret in the given scope with the given category. Creates or + /// updates. pub fn set( &self, + scope: &SecretScope, name: &str, value: &str, category: SecretCategory, @@ -301,9 +503,10 @@ impl SecretsStore { let stored_value = self.encode_value(value)?; let now = chrono::Utc::now(); + let key = encode_key(scope, name); // Check if updating an existing secret (preserve created_at). - let existing_meta = self.get_metadata(name).ok(); + let existing_meta = self.get_metadata(scope, name).ok(); let metadata = SecretMetadata { category, created_at: existing_meta.as_ref().map(|m| m.created_at).unwrap_or(now), @@ -323,31 +526,34 @@ impl SecretsStore { SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}")) })?; secrets - .insert(name, stored_value.as_slice()) + .insert(key.as_str(), stored_value.as_slice()) .map_err(|error| { SecretsError::Other(anyhow::anyhow!( - "failed to insert secret '{name}': {error}" + "failed to insert secret '{name}' ({scope}): {error}" )) })?; let mut meta = write_txn.open_table(METADATA_TABLE).map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}")) })?; - meta.insert(name, metadata_json.as_str()).map_err(|error| { - SecretsError::Other(anyhow::anyhow!( - "failed to insert metadata for '{name}': {error}" - )) - })?; + meta.insert(key.as_str(), metadata_json.as_str()) + .map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to insert metadata for '{name}' ({scope}): {error}" + )) + })?; } write_txn.commit().map_err(|error| { - SecretsError::Other(anyhow::anyhow!("failed to commit secret '{name}': {error}")) + SecretsError::Other(anyhow::anyhow!( + "failed to commit secret '{name}' ({scope}): {error}" + )) })?; Ok(()) } - /// Retrieve a decrypted secret value. - pub fn get(&self, name: &str) -> Result { + /// Retrieve a decrypted secret value from the given scope. + pub fn get(&self, scope: &SecretScope, name: &str) -> Result { let state = self.state(); if state == StoreState::Locked { return Err(SecretsError::Other(anyhow::anyhow!( @@ -362,54 +568,65 @@ impl SecretsStore { SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}")) })?; + let key = encode_key(scope, name); let value = table - .get(name) + .get(key.as_str()) .map_err(|error| { - SecretsError::Other(anyhow::anyhow!("failed to read key '{name}': {error}")) + SecretsError::Other(anyhow::anyhow!( + "failed to read key '{name}' ({scope}): {error}" + )) })? .ok_or_else(|| SecretsError::NotFound { - key: name.to_string(), + key: format!("{scope}:{name}"), })?; let raw = value.value(); self.decode_value(raw) } - /// Delete a secret. - pub fn delete(&self, name: &str) -> Result<(), SecretsError> { + /// Delete a secret from the given scope. + pub fn delete(&self, scope: &SecretScope, name: &str) -> Result<(), SecretsError> { + let _guard = self.mutation_guard.lock().expect("mutation guard poisoned"); let write_txn = self.db.begin_write().map_err(|error| { SecretsError::Other(anyhow::anyhow!( "failed to begin write transaction: {error}" )) })?; + let key = encode_key(scope, name); { let mut secrets = write_txn.open_table(SECRETS_TABLE).map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}")) })?; - secrets.remove(name).map_err(|error| { - SecretsError::Other(anyhow::anyhow!("failed to remove key '{name}': {error}")) + secrets.remove(key.as_str()).map_err(|error| { + SecretsError::Other(anyhow::anyhow!( + "failed to remove key '{name}' ({scope}): {error}" + )) })?; let mut meta = write_txn.open_table(METADATA_TABLE).map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}")) })?; - meta.remove(name).map_err(|error| { + meta.remove(key.as_str()).map_err(|error| { SecretsError::Other(anyhow::anyhow!( - "failed to remove metadata for '{name}': {error}" + "failed to remove metadata for '{name}' ({scope}): {error}" )) })?; } write_txn.commit().map_err(|error| { SecretsError::Other(anyhow::anyhow!( - "failed to commit delete for '{name}': {error}" + "failed to commit delete for '{name}' ({scope}): {error}" )) })?; Ok(()) } - /// List all secret names. - pub fn list(&self) -> Result, SecretsError> { + /// List `(scope, name)` pairs. With `scope_filter = None`, returns every + /// secret across all scopes. + pub fn list( + &self, + scope_filter: Option<&SecretScope>, + ) -> Result, SecretsError> { let read_txn = self.db.begin_read().map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to begin read transaction: {error}")) })?; @@ -417,7 +634,7 @@ impl SecretsStore { SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}")) })?; - let mut names = Vec::new(); + let mut entries = Vec::new(); let iter = table.iter().map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to iterate metadata table: {error}")) })?; @@ -425,14 +642,28 @@ impl SecretsStore { let (key, _) = entry.map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to read metadata entry: {error}")) })?; - names.push(key.value().to_string()); + let key_str = key.value(); + let Some((scope, name)) = decode_key(key_str) else { + tracing::warn!(key = key_str, "skipping malformed secret key during list"); + continue; + }; + if let Some(filter) = scope_filter + && scope != *filter + { + continue; + } + entries.push((scope, name)); } - Ok(names) + Ok(entries) } - /// Get metadata for a specific secret. - pub fn get_metadata(&self, name: &str) -> Result { + /// Get metadata for a specific secret in the given scope. + pub fn get_metadata( + &self, + scope: &SecretScope, + name: &str, + ) -> Result { let read_txn = self.db.begin_read().map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to begin read transaction: {error}")) })?; @@ -440,27 +671,32 @@ impl SecretsStore { SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}")) })?; + let key = encode_key(scope, name); let value = table - .get(name) + .get(key.as_str()) .map_err(|error| { SecretsError::Other(anyhow::anyhow!( - "failed to read metadata for '{name}': {error}" + "failed to read metadata for '{name}' ({scope}): {error}" )) })? .ok_or_else(|| SecretsError::NotFound { - key: name.to_string(), + key: format!("{scope}:{name}"), })?; serde_json::from_str(value.value()).map_err(|error| { SecretsError::Other(anyhow::anyhow!( - "failed to parse metadata for '{name}': {error}" + "failed to parse metadata for '{name}' ({scope}): {error}" )) }) } - /// List all secrets with their metadata. Works in all states (names and - /// categories are stored as unencrypted metadata). - pub fn list_metadata(&self) -> Result, SecretsError> { + /// List `(scope, name) → metadata` for every secret matching the filter. + /// `scope_filter = None` returns every secret across all scopes. Works in + /// all states (names, scopes, and categories live in unencrypted metadata). + pub fn list_metadata( + &self, + scope_filter: Option<&SecretScope>, + ) -> Result, SecretsError> { let read_txn = self.db.begin_read().map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to begin read transaction: {error}")) })?; @@ -476,13 +712,26 @@ impl SecretsStore { let (key, value) = entry.map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to read metadata entry: {error}")) })?; + let key_str = key.value(); + let Some((scope, name)) = decode_key(key_str) else { + tracing::warn!( + key = key_str, + "skipping malformed secret key during list_metadata" + ); + continue; + }; + if let Some(filter) = scope_filter + && scope != *filter + { + continue; + } match serde_json::from_str::(value.value()) { Ok(meta) => { - result.insert(key.value().to_string(), meta); + result.insert((scope, name), meta); } Err(error) => { tracing::warn!( - key = key.value(), + key = key_str, %error, "skipping secret with corrupted metadata" ); @@ -493,15 +742,18 @@ impl SecretsStore { Ok(result) } - /// Get all tool secrets as name→value pairs for `Sandbox::wrap()` injection. + /// Get all tool secrets visible to `agent_id` as name→value pairs for + /// `Sandbox::wrap()` injection. Visibility is the union of every + /// `InstanceShared(Tool)` row and every `Agent(agent_id, Tool)` row; + /// agent-scoped values shadow shared values when names collide. /// /// Returns an empty map when locked (tool secrets unavailable). - pub fn tool_env_vars(&self) -> HashMap { + pub fn tool_env_vars(&self, agent_id: &AgentId) -> HashMap { if self.state() == StoreState::Locked { return HashMap::new(); } - let metadata = match self.list_metadata() { + let metadata = match self.list_metadata(None) { Ok(m) => m, Err(error) => { tracing::warn!(%error, "failed to list secret metadata for tool env vars"); @@ -509,33 +761,60 @@ impl SecretsStore { } }; + // Two-pass insert so agent-scoped values shadow same-named shared + // values. When an agent-scoped row exists but fails to read, we + // explicitly drop any shared fallback for the same name — otherwise + // a tenant's worker could end up with another tenant's shared value + // injected under a name the dashboard says is per-agent. let mut result = HashMap::new(); - for (name, meta) in &metadata { - if meta.category == SecretCategory::Tool { - match self.get(name) { - Ok(secret) => { - result.insert(name.clone(), secret.expose().to_string()); - } - Err(error) => { - tracing::warn!(secret = %name, %error, "failed to decrypt tool secret — skipping"); - } + for ((scope, name), _meta) in metadata + .iter() + .filter(|(_, m)| m.category == SecretCategory::Tool) + { + if !scope.is_shared() { + continue; + } + if let Some(value) = self.read_value(scope, name) { + result.insert(name.clone(), value); + } + } + for ((scope, name), _meta) in metadata + .iter() + .filter(|(_, m)| m.category == SecretCategory::Tool) + { + if !matches!(scope, SecretScope::Agent { id } if id == agent_id.as_ref()) { + continue; + } + match self.read_value(scope, name) { + Some(value) => { + result.insert(name.clone(), value); + } + None => { + result.remove(name); } } } - result } - /// Get names of all tool secrets for injection into worker system prompts. + /// Get names of all tool secrets visible to `agent_id` for injection + /// into worker system prompts. /// /// Returns names even when locked (metadata is always accessible). - pub fn tool_secret_names(&self) -> Vec { - match self.list_metadata() { - Ok(metadata) => metadata - .into_iter() - .filter(|(_, meta)| meta.category == SecretCategory::Tool) - .map(|(name, _)| name) - .collect(), + pub fn tool_secret_names(&self, agent_id: &AgentId) -> Vec { + match self.list_metadata(None) { + Ok(metadata) => { + let mut names = std::collections::BTreeSet::new(); + for ((scope, name), meta) in &metadata { + if meta.category != SecretCategory::Tool { + continue; + } + if scope.visible_to(agent_id) { + names.insert(name.clone()); + } + } + names.into_iter().collect() + } Err(error) => { tracing::warn!(%error, "failed to list secret metadata for tool secret names"); Vec::new() @@ -543,11 +822,28 @@ impl SecretsStore { } } - /// Get all tool secret name→value pairs for the output scrubber. - /// - /// Returns pairs suitable for `StreamScrubber::new()`. - pub fn tool_secret_pairs(&self) -> Vec<(String, String)> { - self.tool_env_vars().into_iter().collect() + /// Get tool secret name→value pairs visible to `agent_id` for the + /// output scrubber. + pub fn tool_secret_pairs(&self, agent_id: &AgentId) -> Vec<(String, String)> { + self.tool_env_vars(agent_id).into_iter().collect() + } + + /// Read the raw value for a `(scope, name)` pair, returning `None` when + /// the value is missing or fails to decrypt. Used by the tool-secret + /// readers above where a single bad row should not poison the rest. + fn read_value(&self, scope: &SecretScope, name: &str) -> Option { + match self.get(scope, name) { + Ok(secret) => Some(secret.expose().to_string()), + Err(error) => { + tracing::warn!( + secret = %name, + %scope, + %error, + "failed to read tool secret value — skipping" + ); + None + } + } } /// Enable encryption. Generates a random master key, encrypts all existing @@ -575,25 +871,28 @@ impl SecretsStore { // Encrypt sentinel value. let encrypted_sentinel = encrypt_bytes(&cipher, SENTINEL_PLAINTEXT)?; - // Re-encrypt all existing secrets. - let names = self.list()?; - let mut plain_values: Vec<(String, Vec)> = Vec::with_capacity(names.len()); - { + // Re-encrypt every stored value. We iterate the SECRETS_TABLE + // directly with raw redb keys (the encrypted blob is opaque, scope + // structure doesn't matter for the bulk re-encrypt). + let plain_values: Vec<(String, Vec)> = { let read_txn = self.db.begin_read().map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to begin read transaction: {error}")) })?; let table = read_txn.open_table(SECRETS_TABLE).map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}")) })?; - for name in &names { - if let Some(val) = table.get(name.as_str()).map_err(|error| { - SecretsError::Other(anyhow::anyhow!("failed to read secret '{name}': {error}")) - })? { - // Current values are plaintext UTF-8. - plain_values.push((name.clone(), val.value().to_vec())); - } + let mut entries = Vec::new(); + let iter = table.iter().map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to iterate secrets table: {error}")) + })?; + for entry in iter { + let (key, value) = entry.map_err(|error| { + SecretsError::Other(anyhow::anyhow!("failed to read secret entry: {error}")) + })?; + entries.push((key.value().to_string(), value.value().to_vec())); } - } + entries + }; // Write everything in one transaction. let write_txn = self.db.begin_write().map_err(|error| { @@ -606,13 +905,13 @@ impl SecretsStore { SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}")) })?; // Re-encrypt each secret value. - for (name, plaintext_bytes) in &plain_values { + for (raw_key, plaintext_bytes) in &plain_values { let encrypted = encrypt_bytes(&cipher, plaintext_bytes)?; secrets - .insert(name.as_str(), encrypted.as_slice()) + .insert(raw_key.as_str(), encrypted.as_slice()) .map_err(|error| { SecretsError::Other(anyhow::anyhow!( - "failed to write encrypted secret '{name}': {error}" + "failed to write encrypted secret '{raw_key}': {error}" )) })?; } @@ -752,13 +1051,14 @@ impl SecretsStore { let new_cipher = derive_cipher(&new_master_key, &new_salt)?; - // Decrypt all secrets with old cipher, re-encrypt with new. - let names = self.list()?; - let mut re_encrypted: Vec<(String, Vec)> = Vec::with_capacity(names.len()); - for name in &names { - let decrypted = self.get(name)?; + // Decrypt every secret with the old cipher, re-encrypt with the new + // one, key-by-key over the raw redb keys. + let entries = self.list(None)?; + let mut re_encrypted: Vec<(String, Vec)> = Vec::with_capacity(entries.len()); + for (scope, name) in &entries { + let decrypted = self.get(scope, name)?; let encrypted = encrypt_bytes(&new_cipher, decrypted.expose().as_bytes())?; - re_encrypted.push((name.clone(), encrypted)); + re_encrypted.push((encode_key(scope, name), encrypted)); } // Re-encrypt sentinel. @@ -774,12 +1074,12 @@ impl SecretsStore { let mut secrets = write_txn.open_table(SECRETS_TABLE).map_err(|error| { SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}")) })?; - for (name, encrypted) in &re_encrypted { + for (raw_key, encrypted) in &re_encrypted { secrets - .insert(name.as_str(), encrypted.as_slice()) + .insert(raw_key.as_str(), encrypted.as_slice()) .map_err(|error| { SecretsError::Other(anyhow::anyhow!( - "failed to write re-encrypted secret '{name}': {error}" + "failed to write re-encrypted secret '{raw_key}': {error}" )) })?; } @@ -813,9 +1113,9 @@ impl SecretsStore { Ok(new_master_key) } - /// Check if a secret exists. - pub fn exists(&self, name: &str) -> bool { - self.get_metadata(name).is_ok() + /// Check if a secret exists in the given scope. + pub fn exists(&self, scope: &SecretScope, name: &str) -> bool { + self.get_metadata(scope, name).is_ok() } /// Export all secrets as a portable backup. @@ -831,12 +1131,13 @@ impl SecretsStore { ))); } - let metadata = self.list_metadata()?; + let metadata = self.list_metadata(None)?; let mut entries = Vec::with_capacity(metadata.len()); - for (name, meta) in &metadata { - let value = self.get(name)?; + for ((scope, name), meta) in &metadata { + let value = self.get(scope, name)?; entries.push(ExportEntry { + scope: scope.clone(), name: name.clone(), value: value.expose().to_string(), category: meta.category, @@ -845,7 +1146,9 @@ impl SecretsStore { }); } - entries.sort_by(|a, b| a.name.cmp(&b.name)); + entries.sort_by(|a, b| { + (a.scope.to_string(), a.name.clone()).cmp(&(b.scope.to_string(), b.name.clone())) + }); Ok(ExportData { version: 1, @@ -873,12 +1176,12 @@ impl SecretsStore { let mut skipped = Vec::new(); for entry in &data.entries { - if self.exists(&entry.name) && !overwrite { - skipped.push(entry.name.clone()); + if self.exists(&entry.scope, &entry.name) && !overwrite { + skipped.push(format!("{}:{}", entry.scope, entry.name)); continue; } - self.set(&entry.name, &entry.value, entry.category)?; + self.set(&entry.scope, &entry.name, &entry.value, entry.category)?; imported += 1; } @@ -934,6 +1237,10 @@ pub struct ExportData { /// A single secret in the export format. #[derive(Debug, Clone, Serialize, Deserialize, utoipa::ToSchema)] pub struct ExportEntry { + /// Scope this secret belongs to. Older exports without a `scope` field + /// import as `InstanceShared` so existing backups continue to load. + #[serde(default = "default_export_scope")] + pub scope: SecretScope, pub name: String, pub value: String, pub category: SecretCategory, @@ -941,6 +1248,10 @@ pub struct ExportEntry { pub updated_at: chrono::DateTime, } +fn default_export_scope() -> SecretScope { + SecretScope::InstanceShared +} + /// Result of an import operation. #[derive(Debug, Clone, Serialize, Deserialize)] pub struct ImportResult { @@ -1174,35 +1485,155 @@ mod tests { (store, file) } + fn shared() -> SecretScope { + SecretScope::InstanceShared + } + + fn test_agent() -> AgentId { + std::sync::Arc::from("test-agent") + } + #[test] fn unencrypted_set_get_delete() { let (store, _file) = temp_store(); assert_eq!(store.state(), StoreState::Unencrypted); store - .set("MY_KEY", "my_value", SecretCategory::Tool) + .set(&shared(), "MY_KEY", "my_value", SecretCategory::Tool) .expect("set"); - let secret = store.get("MY_KEY").expect("get"); + let secret = store.get(&shared(), "MY_KEY").expect("get"); assert_eq!(secret.expose(), "my_value"); - let meta = store.get_metadata("MY_KEY").expect("metadata"); + let meta = store.get_metadata(&shared(), "MY_KEY").expect("metadata"); assert_eq!(meta.category, SecretCategory::Tool); - store.delete("MY_KEY").expect("delete"); - assert!(store.get("MY_KEY").is_err()); + store.delete(&shared(), "MY_KEY").expect("delete"); + assert!(store.get(&shared(), "MY_KEY").is_err()); + } + + #[test] + fn agent_scoped_secret_invisible_to_other_agent() { + let (store, _file) = temp_store(); + let alice: AgentId = std::sync::Arc::from("alice"); + let bob: AgentId = std::sync::Arc::from("bob"); + store + .set( + &SecretScope::agent(&alice), + "DEPLOY_TOKEN", + "alice-deploy", + SecretCategory::Tool, + ) + .expect("set alice"); + store + .set( + &SecretScope::agent(&bob), + "DEPLOY_TOKEN", + "bob-deploy", + SecretCategory::Tool, + ) + .expect("set bob"); + + let alice_env = store.tool_env_vars(&alice); + assert_eq!( + alice_env.get("DEPLOY_TOKEN").map(String::as_str), + Some("alice-deploy") + ); + + let bob_env = store.tool_env_vars(&bob); + assert_eq!( + bob_env.get("DEPLOY_TOKEN").map(String::as_str), + Some("bob-deploy") + ); + + // alice should NOT see bob's value, and vice versa. + assert_ne!(alice_env.get("DEPLOY_TOKEN"), bob_env.get("DEPLOY_TOKEN")); + } + + #[test] + fn agent_scoped_value_shadows_shared_value_with_same_name() { + let (store, _file) = temp_store(); + let alice: AgentId = std::sync::Arc::from("alice"); + store + .set(&shared(), "GH_TOKEN", "shared-token", SecretCategory::Tool) + .expect("set shared"); + store + .set( + &SecretScope::agent(&alice), + "GH_TOKEN", + "alice-token", + SecretCategory::Tool, + ) + .expect("set alice"); + + let alice_env = store.tool_env_vars(&alice); + assert_eq!( + alice_env.get("GH_TOKEN").map(String::as_str), + Some("alice-token") + ); + + let bob: AgentId = std::sync::Arc::from("bob"); + let bob_env = store.tool_env_vars(&bob); + assert_eq!( + bob_env.get("GH_TOKEN").map(String::as_str), + Some("shared-token") + ); + } + + #[test] + fn legacy_unprefixed_keys_migrate_to_shared_on_open() { + let file = NamedTempFile::new().expect("create temp file"); + let path = file.path().to_path_buf(); + + // Inject a legacy unprefixed key directly into the redb store. + { + let db = redb::Database::create(&path).expect("create db"); + let txn = db.begin_write().expect("write txn"); + { + let mut secrets = txn.open_table(SECRETS_TABLE).expect("open secrets"); + secrets + .insert("LEGACY_KEY", b"legacy_value".as_slice()) + .expect("insert"); + let mut metadata = txn.open_table(METADATA_TABLE).expect("open meta"); + let meta = SecretMetadata { + category: SecretCategory::Tool, + created_at: chrono::Utc::now(), + updated_at: chrono::Utc::now(), + }; + metadata + .insert("LEGACY_KEY", serde_json::to_string(&meta).unwrap().as_str()) + .expect("insert meta"); + } + txn.commit().expect("commit"); + } + + // Reopen — migration should run. + let store = SecretsStore::new(&path).expect("reopen"); + let value = store.get(&shared(), "LEGACY_KEY").expect("get migrated"); + assert_eq!(value.expose(), "legacy_value"); + + // Listing returns one entry under InstanceShared. + let entries = store.list(None).expect("list"); + assert_eq!(entries.len(), 1); + assert_eq!(entries[0].0, SecretScope::InstanceShared); + assert_eq!(entries[0].1, "LEGACY_KEY"); } #[test] fn tool_env_vars_returns_only_tool_secrets() { let (store, _file) = temp_store(); store - .set("ANTHROPIC_API_KEY", "sk-ant-xxx", SecretCategory::System) + .set( + &shared(), + "ANTHROPIC_API_KEY", + "sk-ant-xxx", + SecretCategory::System, + ) .expect("set system"); store - .set("GH_TOKEN", "ghp_abc123", SecretCategory::Tool) + .set(&shared(), "GH_TOKEN", "ghp_abc123", SecretCategory::Tool) .expect("set tool"); - let env_vars = store.tool_env_vars(); + let env_vars = store.tool_env_vars(&test_agent()); assert_eq!(env_vars.len(), 1); assert_eq!(env_vars.get("GH_TOKEN").unwrap(), "ghp_abc123"); assert!(!env_vars.contains_key("ANTHROPIC_API_KEY")); @@ -1212,16 +1643,21 @@ mod tests { fn tool_secret_names_returns_only_tool_names() { let (store, _file) = temp_store(); store - .set("OPENAI_API_KEY", "sk-xxx", SecretCategory::System) + .set( + &shared(), + "OPENAI_API_KEY", + "sk-xxx", + SecretCategory::System, + ) .expect("set system"); store - .set("NPM_TOKEN", "npm_xxx", SecretCategory::Tool) + .set(&shared(), "NPM_TOKEN", "npm_xxx", SecretCategory::Tool) .expect("set tool"); store - .set("GH_TOKEN", "ghp_xxx", SecretCategory::Tool) + .set(&shared(), "GH_TOKEN", "ghp_xxx", SecretCategory::Tool) .expect("set tool"); - let names = store.tool_secret_names(); + let names = store.tool_secret_names(&test_agent()); assert_eq!(names.len(), 2); assert!(names.contains(&"NPM_TOKEN".to_string())); assert!(names.contains(&"GH_TOKEN".to_string())); @@ -1233,7 +1669,12 @@ mod tests { // Store a secret in unencrypted mode. store - .set("MY_SECRET", "plaintext_value", SecretCategory::System) + .set( + &shared(), + "MY_SECRET", + "plaintext_value", + SecretCategory::System, + ) .expect("set"); // Enable encryption. @@ -1241,18 +1682,20 @@ mod tests { assert_eq!(store.state(), StoreState::Unlocked); // Secret is still readable. - let secret = store.get("MY_SECRET").expect("get after encrypt"); + let secret = store + .get(&shared(), "MY_SECRET") + .expect("get after encrypt"); assert_eq!(secret.expose(), "plaintext_value"); // Lock the store. store.lock().expect("lock"); assert_eq!(store.state(), StoreState::Locked); - assert!(store.get("MY_SECRET").is_err()); + assert!(store.get(&shared(), "MY_SECRET").is_err()); // Unlock with correct key. store.unlock(&master_key).expect("unlock"); assert_eq!(store.state(), StoreState::Unlocked); - let secret = store.get("MY_SECRET").expect("get after unlock"); + let secret = store.get(&shared(), "MY_SECRET").expect("get after unlock"); assert_eq!(secret.expose(), "plaintext_value"); // Wrong key fails. @@ -1264,7 +1707,7 @@ mod tests { fn key_rotation() { let (store, _file) = temp_store(); store - .set("MY_SECRET", "value123", SecretCategory::Tool) + .set(&shared(), "MY_SECRET", "value123", SecretCategory::Tool) .expect("set"); let _old_key = store.enable_encryption().expect("encrypt"); @@ -1272,13 +1715,15 @@ mod tests { let new_key = store.rotate_key().expect("rotate"); // Secret still readable with new cipher. - let secret = store.get("MY_SECRET").expect("get after rotate"); + let secret = store.get(&shared(), "MY_SECRET").expect("get after rotate"); assert_eq!(secret.expose(), "value123"); // Lock and unlock with new key. store.lock().expect("lock"); store.unlock(&new_key).expect("unlock with new key"); - let secret = store.get("MY_SECRET").expect("get after re-unlock"); + let secret = store + .get(&shared(), "MY_SECRET") + .expect("get after re-unlock"); assert_eq!(secret.expose(), "value123"); } @@ -1394,44 +1839,44 @@ mod tests { fn list_metadata_works_in_all_states() { let (store, _file) = temp_store(); store - .set("KEY1", "val1", SecretCategory::System) + .set(&shared(), "KEY1", "val1", SecretCategory::System) .expect("set"); store - .set("KEY2", "val2", SecretCategory::Tool) + .set(&shared(), "KEY2", "val2", SecretCategory::Tool) .expect("set"); // Unencrypted: metadata readable. - let meta = store.list_metadata().expect("list"); + let meta = store.list_metadata(None).expect("list"); assert_eq!(meta.len(), 2); // Enable encryption. let master_key = store.enable_encryption().expect("encrypt"); // Unlocked: metadata readable. - let meta = store.list_metadata().expect("list"); + let meta = store.list_metadata(None).expect("list"); assert_eq!(meta.len(), 2); // Locked: metadata still readable. store.lock().expect("lock"); - let meta = store.list_metadata().expect("list"); + let meta = store.list_metadata(None).expect("list"); assert_eq!(meta.len(), 2); // But values are not. - assert!(store.get("KEY1").is_err()); + assert!(store.get(&shared(), "KEY1").is_err()); // Unlock restores access. store.unlock(&master_key).expect("unlock"); - assert_eq!(store.get("KEY1").expect("get").expose(), "val1"); + assert_eq!(store.get(&shared(), "KEY1").expect("get").expose(), "val1"); } #[test] fn export_import_roundtrip() { let (store1, _file1) = temp_store(); store1 - .set("KEY_A", "value_a", SecretCategory::System) + .set(&shared(), "KEY_A", "value_a", SecretCategory::System) .expect("set"); store1 - .set("KEY_B", "value_b", SecretCategory::Tool) + .set(&shared(), "KEY_B", "value_b", SecretCategory::Tool) .expect("set"); let export = store1.export_all().expect("export"); @@ -1444,10 +1889,19 @@ mod tests { assert_eq!(result.imported, 2); assert!(result.skipped.is_empty()); - assert_eq!(store2.get("KEY_A").expect("get").expose(), "value_a"); - assert_eq!(store2.get("KEY_B").expect("get").expose(), "value_b"); assert_eq!( - store2.get_metadata("KEY_B").expect("meta").category, + store2.get(&shared(), "KEY_A").expect("get").expose(), + "value_a" + ); + assert_eq!( + store2.get(&shared(), "KEY_B").expect("get").expose(), + "value_b" + ); + assert_eq!( + store2 + .get_metadata(&shared(), "KEY_B") + .expect("meta") + .category, SecretCategory::Tool ); } @@ -1456,7 +1910,7 @@ mod tests { fn import_skips_existing_without_overwrite() { let (store, _file) = temp_store(); store - .set("EXISTING", "original", SecretCategory::System) + .set(&shared(), "EXISTING", "original", SecretCategory::System) .expect("set"); let export = ExportData { @@ -1464,6 +1918,7 @@ mod tests { encrypted: false, entries: vec![ ExportEntry { + scope: shared(), name: "EXISTING".to_string(), value: "new_value".to_string(), category: SecretCategory::Tool, @@ -1471,6 +1926,7 @@ mod tests { updated_at: chrono::Utc::now(), }, ExportEntry { + scope: shared(), name: "FRESH".to_string(), value: "fresh_value".to_string(), category: SecretCategory::Tool, @@ -1482,25 +1938,32 @@ mod tests { let result = store.import_all(&export, false).expect("import"); assert_eq!(result.imported, 1); - assert_eq!(result.skipped, vec!["EXISTING"]); + assert_eq!(result.skipped, vec!["shared:EXISTING"]); // Original value preserved. - assert_eq!(store.get("EXISTING").expect("get").expose(), "original"); + assert_eq!( + store.get(&shared(), "EXISTING").expect("get").expose(), + "original" + ); // New secret imported. - assert_eq!(store.get("FRESH").expect("get").expose(), "fresh_value"); + assert_eq!( + store.get(&shared(), "FRESH").expect("get").expose(), + "fresh_value" + ); } #[test] fn import_overwrites_existing_when_requested() { let (store, _file) = temp_store(); store - .set("EXISTING", "original", SecretCategory::System) + .set(&shared(), "EXISTING", "original", SecretCategory::System) .expect("set"); let export = ExportData { version: 1, encrypted: false, entries: vec![ExportEntry { + scope: shared(), name: "EXISTING".to_string(), value: "overwritten".to_string(), category: SecretCategory::Tool, @@ -1513,27 +1976,30 @@ mod tests { assert_eq!(result.imported, 1); assert!(result.skipped.is_empty()); - assert_eq!(store.get("EXISTING").expect("get").expose(), "overwritten"); + assert_eq!( + store.get(&shared(), "EXISTING").expect("get").expose(), + "overwritten" + ); } #[test] fn set_updates_existing_preserves_created_at() { let (store, _file) = temp_store(); store - .set("MY_KEY", "v1", SecretCategory::Tool) + .set(&shared(), "MY_KEY", "v1", SecretCategory::Tool) .expect("set"); - let meta1 = store.get_metadata("MY_KEY").expect("meta"); + let meta1 = store.get_metadata(&shared(), "MY_KEY").expect("meta"); // Update value and category. store - .set("MY_KEY", "v2", SecretCategory::System) + .set(&shared(), "MY_KEY", "v2", SecretCategory::System) .expect("update"); - let meta2 = store.get_metadata("MY_KEY").expect("meta"); + let meta2 = store.get_metadata(&shared(), "MY_KEY").expect("meta"); assert_eq!(meta2.created_at, meta1.created_at); assert!(meta2.updated_at >= meta1.updated_at); assert_eq!(meta2.category, SecretCategory::System); - assert_eq!(store.get("MY_KEY").expect("get").expose(), "v2"); + assert_eq!(store.get(&shared(), "MY_KEY").expect("get").expose(), "v2"); } #[test] diff --git a/src/tools.rs b/src/tools.rs index 1be39f033..620df58e8 100644 --- a/src/tools.rs +++ b/src/tools.rs @@ -31,6 +31,7 @@ pub mod attachment_recall; pub mod branch_tool; pub mod browser; +pub mod browser_detection; pub mod cancel; pub mod channel_recall; pub mod config_inspect; @@ -948,6 +949,7 @@ pub fn create_worker_tool_server( memory_search: Arc, wiki_write: bool, wiki_store: Option>, + blocked_signal: Option, ) -> ToolServerHandle { let mut server = ToolServer::new() .tool( @@ -968,7 +970,7 @@ pub fn create_worker_tool_server( let mut status_tool = SetStatusTool::new(agent_id.clone(), worker_id, channel_id, event_tx.clone()); if let Some(store) = runtime_config.secrets.load().as_ref() { - status_tool = status_tool.with_tool_secrets(store.tool_secret_pairs()); + status_tool = status_tool.with_tool_secrets(store.tool_secret_pairs(&agent_id)); } status_tool }) @@ -977,11 +979,18 @@ pub fn create_worker_tool_server( server = register_file_tools(server, workspace, sandbox); if let Some(store) = runtime_config.secrets.load().as_ref() { - server = server.tool(SecretSetTool::new(store.clone())); + server = server.tool(SecretSetTool::new(store.clone(), agent_id.clone())); } if browser_config.enabled { - server = register_browser_tools(server, browser_config, screenshot_dir, &runtime_config); + server = register_browser_tools( + server, + browser_config, + screenshot_dir, + &runtime_config, + blocked_signal, + Some(agent_id.clone()), + ); } if let Some(key) = brave_search_key { @@ -1125,7 +1134,14 @@ pub fn create_cortex_chat_tool_server( server = register_file_tools(server, workspace, sandbox); if browser_config.enabled { - server = register_browser_tools(server, browser_config, screenshot_dir, &runtime_config); + server = register_browser_tools( + server, + browser_config, + screenshot_dir, + &runtime_config, + None, + Some(agent_id.clone()), + ); } if let Some(key) = brave_search_key { diff --git a/src/tools/browser.rs b/src/tools/browser.rs index bcd0ed7be..35ce3bf8b 100644 --- a/src/tools/browser.rs +++ b/src/tools/browser.rs @@ -794,6 +794,13 @@ pub(crate) struct BrowserContext { /// the `secret` parameter can look up credential values without exposing /// them in tool arguments or output. secrets: Option>, + /// Worker-shared slot for surfacing a captcha / login wall / WAF block. + /// `None` for channel-level tool invocations (no worker outcome to short- + /// circuit). + blocked_signal: Option, + /// Owning agent for secret-scope resolution. `None` for channel-level + /// invocations — those resolve secrets from instance-shared only. + agent_id: Option, } impl BrowserContext { @@ -808,9 +815,86 @@ impl BrowserContext { config, screenshot_dir, secrets, + blocked_signal: None, + agent_id: None, } } + /// Attach a worker-shared block signal slot. When the browser detects an + /// external defense, it writes the structured reason + evidence into the + /// slot and the worker short-circuits to `WorkerOutcome::Blocked` at the + /// next loop iteration. + pub(crate) fn with_block_signal(mut self, signal: crate::agent::worker::BlockSignal) -> Self { + self.blocked_signal = Some(signal); + self + } + + /// Attach the owning agent so `browser_type` secret resolution prefers + /// the agent's scope before falling back to instance-shared. + pub(crate) fn with_agent_id(mut self, agent_id: crate::AgentId) -> Self { + self.agent_id = Some(agent_id); + self + } + + /// Resolve a secret name to a value. Tries the calling agent's scope + /// first (when set), then falls back to instance-shared. + fn resolve_secret_value( + &self, + name: &str, + ) -> Result { + let store = self.secrets.as_ref().ok_or_else(|| { + BrowserError::new( + "secret store is not available — secrets cannot be resolved. \ + Add the secret via the API or use the `text` parameter instead.", + ) + })?; + if let Some(agent_id) = self.agent_id.as_ref() { + let agent_scope = crate::secrets::store::SecretScope::agent(agent_id); + // Fall back to shared only when the agent-scoped key truly + // doesn't exist. Decryption / read errors must surface — silently + // typing the shared value would inject the wrong tenant's + // credential into a `browser_type` call. + match store.get(&agent_scope, name) { + Ok(secret) => return Ok(secret), + Err(crate::error::SecretsError::NotFound { .. }) => {} + Err(error) => { + return Err(BrowserError::new(format!( + "failed to resolve agent-scoped secret '{name}': {error}" + ))); + } + } + } + store + .get(&crate::secrets::store::SecretScope::shared(), name) + .map_err(|error| { + BrowserError::new(format!("failed to resolve secret '{name}': {error}")) + }) + } + + /// Mark a positive block detection. No-op when no signal is wired. + /// First-write-wins — if the slot already holds a detection, the new + /// data is dropped (the worker has not yet read the prior detection, + /// short-circuit will pick it up first). + pub(crate) fn signal_block( + &self, + reason: crate::agent::worker::BlockReason, + url: Option, + evidence: crate::agent::worker::BlockEvidence, + ) { + let Some(signal) = self.blocked_signal.as_ref() else { + return; + }; + let Ok(mut slot) = signal.lock() else { return }; + if slot.is_some() { + return; + } + *slot = Some(crate::agent::worker::BlockSignalData { + reason, + url, + evidence, + }); + } + /// Get the active page or return an error. Does NOT hold the lock — caller /// must pass a reference to the already-locked state. fn require_active_page<'a>( @@ -1357,8 +1441,22 @@ impl Tool for BrowserNavigateTool { let title = page.get_title().await.ok().flatten(); let current_url = page.url().await.ok().flatten(); + let detection = run_block_detection(page, current_url.as_deref(), Some(&args.url)).await; + state.invalidate_snapshot(); + if let Some(detection) = detection { + self.context.signal_block( + detection.reason.clone(), + current_url.clone(), + detection.evidence, + ); + return Err(BrowserError::new(format!( + "navigation blocked: {} — surfaced to caller; agent will short-circuit", + detection.reason.describe() + ))); + } + Ok(BrowserOutput::success(format!("Navigated to {}", args.url)) .with_page_info(title, current_url)) } @@ -1517,8 +1615,23 @@ impl Tool for BrowserClickTool { let page = self.context.require_active_page(&state)?; wait_for_page_ready(page).await; + let current_url = page.url().await.ok().flatten(); + let detection = run_block_detection(page, current_url.as_deref(), None).await; + state.invalidate_snapshot(); + if let Some(detection) = detection { + self.context.signal_block( + detection.reason.clone(), + current_url.clone(), + detection.evidence, + ); + return Err(BrowserError::new(format!( + "click triggered blocked page: {} — surfaced to caller; agent will short-circuit", + detection.reason.describe() + ))); + } + Ok(BrowserOutput::success(format!( "Clicked element at {label}" ))) @@ -1593,15 +1706,7 @@ impl Tool for BrowserTypeTool { // Resolve the text to type: either from `secret` (secure) or `text` (plain). let (text_value, is_secret) = match (&args.secret, &args.text) { (Some(secret_name), None) => { - let store = self.context.secrets.as_ref().ok_or_else(|| { - BrowserError::new( - "secret store is not available — secrets cannot be resolved. \ - Add the secret via the API or use the `text` parameter instead.", - ) - })?; - let decrypted = store.get(secret_name).map_err(|error| { - BrowserError::new(format!("failed to resolve secret '{secret_name}': {error}")) - })?; + let decrypted = self.context.resolve_secret_value(secret_name)?; (decrypted.expose().to_string(), true) } (None, Some(text)) => (text.clone(), false), @@ -2145,6 +2250,8 @@ pub fn register_browser_tools( config: BrowserConfig, screenshot_dir: PathBuf, runtime_config: &crate::config::RuntimeConfig, + blocked_signal: Option, + agent_id: Option, ) -> rig::tool::server::ToolServer { let state = if let Some(shared) = runtime_config .shared_browser @@ -2158,7 +2265,13 @@ pub fn register_browser_tools( let secrets = runtime_config.secrets.load().as_ref().as_ref().cloned(); - let context = BrowserContext::new(state, config, screenshot_dir, secrets); + let mut context = BrowserContext::new(state, config, screenshot_dir, secrets); + if let Some(signal) = blocked_signal { + context = context.with_block_signal(signal); + } + if let Some(agent_id) = agent_id { + context = context.with_agent_id(agent_id); + } server .tool(BrowserLaunchTool { @@ -2199,6 +2312,20 @@ pub fn register_browser_tools( // Shared helpers +/// Run captcha / login wall / WAF detection against the current page state. +/// Returns `Some(Detection)` on positive match. Reads page HTML via CDP and +/// passes it to the pure detection module. Detection failures (e.g. CDP +/// errors) are silently treated as "no detection" — never block a navigation +/// because we couldn't read its HTML. +async fn run_block_detection( + page: &chromiumoxide::Page, + final_url: Option<&str>, + requested_url: Option<&str>, +) -> Option { + let html = page.content().await.ok()?; + crate::tools::browser_detection::classify(&html, final_url, requested_url) +} + /// Get the active page, or create a first one if the browser has no pages yet. async fn get_or_create_page<'a>( context: &BrowserContext, diff --git a/src/tools/browser_detection.rs b/src/tools/browser_detection.rs new file mode 100644 index 000000000..4aaa2c0e4 --- /dev/null +++ b/src/tools/browser_detection.rs @@ -0,0 +1,299 @@ +//! Pure detection heuristics for captcha / login wall / WAF block pages. +//! +//! Runs against post-navigation page state (HTML + final URL). DOM-only for +//! v1 — chromiumoxide response-header inspection is a separate workstream. +//! Detection is conservative: false positives on legitimate Cloudflare-hosted +//! pages would short-circuit valid worker runs, so each heuristic requires a +//! specific signal rather than a generic vendor marker. + +use crate::agent::worker::{BlockEvidence, BlockReason}; + +/// Result of running detection over a page snapshot. +#[derive(Debug, Clone)] +pub struct Detection { + pub reason: BlockReason, + pub evidence: BlockEvidence, +} + +/// Inspect the page snapshot and return a detection if any heuristic fires. +/// Caller passes the raw page HTML (full content), the final URL after any +/// redirects, and optionally the originally-requested URL so the login-wall +/// detector can distinguish a deliberate sign-in navigation from an +/// unexpected redirect into a login page. +pub fn classify( + html: &str, + final_url: Option<&str>, + requested_url: Option<&str>, +) -> Option { + if let Some(reason) = detect_captcha(html) { + return Some(Detection { + reason, + evidence: build_evidence(html, final_url), + }); + } + if let Some(reason) = detect_cloudflare_challenge(html) { + return Some(Detection { + reason, + evidence: build_evidence(html, final_url), + }); + } + if let Some(reason) = detect_login_wall(final_url, requested_url) { + return Some(Detection { + reason, + evidence: build_evidence(html, final_url), + }); + } + None +} + +/// Detect a captcha challenge by scanning for known provider iframe srcs and +/// form-input markers. Order matters — provider-specific markers come first +/// so the `provider` field in `BlockReason::Captcha` is precise. +pub fn detect_captcha(html: &str) -> Option { + if html.contains("challenges.cloudflare.com/turnstile/") { + return Some(BlockReason::Captcha { + provider: "cloudflare-turnstile".to_string(), + }); + } + if html.contains("hcaptcha.com/captcha/") || html.contains("js.hcaptcha.com/1/api.js") { + return Some(BlockReason::Captcha { + provider: "hcaptcha".to_string(), + }); + } + if html.contains("google.com/recaptcha/") + || html.contains("gstatic.com/recaptcha/") + || html.contains("g-recaptcha-response") + { + return Some(BlockReason::Captcha { + provider: "recaptcha".to_string(), + }); + } + None +} + +/// Detect a Cloudflare interstitial challenge page (the "checking your +/// browser before accessing" interstitial, distinct from sites merely fronted +/// by Cloudflare). Anchored on body text + a CF-specific cookie/header marker +/// to avoid firing on every CF-hosted site. +pub fn detect_cloudflare_challenge(html: &str) -> Option { + let interstitial_markers = [ + "Checking your browser before accessing", + "challenge-platform", + "cf-mitigated", + "cf_chl_opt", + ]; + let mut matched = 0; + for marker in interstitial_markers.iter() { + if html.contains(marker) { + matched += 1; + } + } + if matched >= 2 { + return Some(BlockReason::FraudDetect { + vendor: "cloudflare".to_string(), + }); + } + None +} + +/// Detect that the navigation landed on a login URL distinct from where the +/// caller meant to go. Conservative on purpose — a worker that intentionally +/// navigates to `/login` or clicks a "Sign in" button would otherwise be +/// classified as blocked, breaking normal auth flows. +/// +/// Heuristic: only fire when the *final* path matches a login signal AND +/// the caller's *requested* path does not. If the requested URL was itself +/// a login page, treat the navigation as intentional. If no requested URL +/// is supplied (e.g. follow-up click without context), we err toward not +/// firing — the cost of a false positive (worker dies mid-task) is higher +/// than the cost of a false negative (LLM sees the login page and decides +/// what to do). +pub fn detect_login_wall( + final_url: Option<&str>, + requested_url: Option<&str>, +) -> Option { + let final_parsed = url::Url::parse(final_url?).ok()?; + let final_path = final_parsed.path().to_lowercase(); + if !path_signals_login(&final_path) { + return None; + } + + // Need a requested URL to distinguish unexpected redirect from + // intentional login navigation. + let requested = requested_url?; + let requested_parsed = url::Url::parse(requested).ok()?; + let requested_host = requested_parsed.host_str().unwrap_or(""); + let final_host = final_parsed.host_str().unwrap_or(""); + if !final_host.eq_ignore_ascii_case(requested_host) { + // Cross-host redirect to login (e.g. SSO bounce) — not an + // unexpected block, just a normal auth handoff. + return None; + } + + let requested_path = requested_parsed.path().to_lowercase(); + if path_signals_login(&requested_path) { + // Caller asked to go to a login page — they meant it. + return None; + } + + Some(BlockReason::LoginWall) +} + +fn path_signals_login(path: &str) -> bool { + const LOGIN_PATH_SIGNALS: [&str; 5] = [ + "/login", + "/signin", + "/auth/", + "/account/login", + "/users/sign_in", + ]; + LOGIN_PATH_SIGNALS + .iter() + .any(|signal| path.contains(signal)) +} + +/// Build the captured evidence payload (truncated HTML + URL). +fn build_evidence(html: &str, final_url: Option<&str>) -> BlockEvidence { + const MAX_SNIPPET: usize = 4096; + let snippet = if html.len() > MAX_SNIPPET { + // Truncate at a char boundary to avoid splitting a UTF-8 sequence. + let mut cut = MAX_SNIPPET; + while cut > 0 && !html.is_char_boundary(cut) { + cut -= 1; + } + Some(html[..cut].to_string()) + } else { + Some(html.to_string()) + }; + BlockEvidence { + final_url: final_url.map(String::from), + html_snippet: snippet, + status: None, + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn cloudflare_turnstile_iframe_detected() { + let html = + r#""#; + match detect_captcha(html) { + Some(BlockReason::Captcha { provider }) => { + assert_eq!(provider, "cloudflare-turnstile"); + } + other => panic!("expected cloudflare-turnstile, got {other:?}"), + } + } + + #[test] + fn hcaptcha_detected() { + let html = r#""#; + match detect_captcha(html) { + Some(BlockReason::Captcha { provider }) => assert_eq!(provider, "hcaptcha"), + other => panic!("expected hcaptcha, got {other:?}"), + } + } + + #[test] + fn recaptcha_detected() { + let html = r#""#; + match detect_captcha(html) { + Some(BlockReason::Captcha { provider }) => assert_eq!(provider, "recaptcha"), + other => panic!("expected recaptcha, got {other:?}"), + } + } + + #[test] + fn vanilla_page_not_classified_as_captcha() { + let html = "

Welcome

Just a normal page.

"; + assert!(detect_captcha(html).is_none()); + } + + #[test] + fn cloudflare_hosted_site_does_not_trigger_challenge_detection() { + // Pages merely served via Cloudflare often include `cf-mitigated`-style + // markers in headers but not in body. A single body marker should not + // fire the challenge-page detector. + let html = "cf-mitigated: a-real-page"; + assert!(detect_cloudflare_challenge(html).is_none()); + } + + #[test] + fn cloudflare_interstitial_with_two_markers_detected() { + let html = r#"Just a moment... + +

Checking your browser before accessing

+ + "#; + match detect_cloudflare_challenge(html) { + Some(BlockReason::FraudDetect { vendor }) => assert_eq!(vendor, "cloudflare"), + other => panic!("expected cloudflare fraud-detect, got {other:?}"), + } + } + + #[test] + fn unexpected_redirect_to_login_detected() { + let reason = detect_login_wall( + Some("https://example.com/login?return_to=/dashboard"), + Some("https://example.com/dashboard"), + ); + assert!(matches!(reason, Some(BlockReason::LoginWall))); + } + + #[test] + fn intentional_login_navigation_not_classified() { + // Caller asked to go to /login. This is a deliberate sign-in flow, + // not a block. + let reason = detect_login_wall( + Some("https://example.com/login"), + Some("https://example.com/login"), + ); + assert!(reason.is_none()); + } + + #[test] + fn cross_host_sso_redirect_not_classified() { + // SSO bounce is a normal auth handoff, not a block. + let reason = detect_login_wall( + Some("https://identity-provider.com/login"), + Some("https://example.com/dashboard"), + ); + assert!(reason.is_none()); + } + + #[test] + fn dashboard_url_not_login_wall() { + let reason = detect_login_wall( + Some("https://example.com/dashboard"), + Some("https://example.com/dashboard"), + ); + assert!(reason.is_none()); + } + + #[test] + fn no_requested_url_treats_as_intentional() { + // Without context, err toward not firing — false positives kill + // valid worker runs; false negatives just let the LLM see a login + // page and decide. + let reason = detect_login_wall(Some("https://example.com/account/login"), None); + assert!(reason.is_none()); + } + + #[test] + fn evidence_truncates_long_html_at_char_boundary() { + let html = "x".repeat(8192); + let detection = classify( + &format!( + r#"{html}"# + ), + Some("https://example.com"), + None, + ); + let evidence = detection.expect("captcha should fire").evidence; + let snippet = evidence.html_snippet.expect("snippet present"); + assert!(snippet.len() <= 4096); + } +} diff --git a/src/tools/secret_set.rs b/src/tools/secret_set.rs index cbcb86c00..a1fc86d16 100644 --- a/src/tools/secret_set.rs +++ b/src/tools/secret_set.rs @@ -3,7 +3,8 @@ //! Useful for autonomous workflows where a worker creates accounts, generates //! API keys, or obtains credentials that should be persisted for future use. -use crate::secrets::store::{SecretCategory, SecretsStore, auto_categorize}; +use crate::AgentId; +use crate::secrets::store::{SecretCategory, SecretScope, SecretsStore, auto_categorize}; use rig::completion::ToolDefinition; use rig::tool::Tool; use schemars::JsonSchema; @@ -14,12 +15,21 @@ use std::sync::Arc; #[derive(Debug, Clone)] pub struct SecretSetTool { secrets_store: Arc, + /// Owning agent — secrets created via this tool land in + /// `SecretScope::Agent(self.agent_id)` so a tenant's worker cannot leak + /// credentials into another tenant's view. Promoting a `Tool` secret to + /// `InstanceShared` is an admin-only operation done via the dashboard. + agent_id: AgentId, } impl SecretSetTool { - /// Create a new secret set tool with access to the instance-level store. - pub fn new(secrets_store: Arc) -> Self { - Self { secrets_store } + /// Create a new secret set tool. Secrets stored via `call()` are written + /// to `agent_id`'s scope. + pub fn new(secrets_store: Arc, agent_id: AgentId) -> Self { + Self { + secrets_store, + agent_id, + } } } @@ -124,16 +134,19 @@ impl Tool for SecretSetTool { None => auto_categorize(&name), }; - // Check if this is an update to an existing secret. - let updated = self.secrets_store.get_metadata(&name).is_ok(); + let scope = SecretScope::agent(&self.agent_id); + + // Check if this is an update to an existing secret in this scope. + let updated = self.secrets_store.get_metadata(&scope, &name).is_ok(); self.secrets_store - .set(&name, &args.value, category) + .set(&scope, &name, &args.value, category) .map_err(|error| SecretSetError(format!("{error}")))?; tracing::info!( name = %name, category = %category, + scope = %scope, updated, "worker stored secret via secret_set tool" ); diff --git a/src/tools/send_agent_message.rs b/src/tools/send_agent_message.rs index 82b0267ed..c450d9336 100644 --- a/src/tools/send_agent_message.rs +++ b/src/tools/send_agent_message.rs @@ -41,6 +41,10 @@ pub struct SendAgentMessageTool { /// Set per-turn so task completion notifications route back to the right place. originating_channel: Option, working_memory: Option>, + /// Wake dispatcher for the receiving agent. Fired after task insert so + /// dormant-mode receivers wake up immediately instead of waiting for a + /// (non-existent) tick. No-op when None or when the receiver is active. + wake_tx: Option, } impl std::fmt::Debug for SendAgentMessageTool { @@ -68,9 +72,17 @@ impl SendAgentMessageTool { skip_flag: None, originating_channel: None, working_memory: None, + wake_tx: None, } } + /// Attach the instance-wide wake dispatcher so dormant receivers get + /// woken immediately on delegation. + pub fn with_wake_tx(mut self, tx: crate::agent::wake::WakeSender) -> Self { + self.wake_tx = Some(tx); + self + } + /// Set the per-turn skip flag so the channel turn ends after delegation. pub fn with_skip_flag(mut self, flag: SkipFlag) -> Self { self.skip_flag = Some(flag); @@ -246,6 +258,14 @@ impl Tool for SendAgentMessageTool { let task_number = task.task_number; + // Wake the receiving agent. No-op when the receiver is in active mode + // (its ready-task loop will pick the task up on its own); critical when + // the receiver is dormant (no loop runs). + if let Some(tx) = self.wake_tx.as_ref() { + let receiver: crate::AgentId = std::sync::Arc::from(receiving_agent_id.as_str()); + crate::agent::wake::fire_wake(tx, &receiver); + } + // Log delegation record in the link channel (system message). let sender_display = self .agent_names diff --git a/src/tools/spawn_worker.rs b/src/tools/spawn_worker.rs index 6e1d14978..d54299867 100644 --- a/src/tools/spawn_worker.rs +++ b/src/tools/spawn_worker.rs @@ -444,7 +444,7 @@ impl Tool for DetachedSpawnWorkerTool { let secrets_guard = rc.secrets.load(); let tool_secret_names = match (*secrets_guard).as_ref() { - Some(store) => store.tool_secret_names(), + Some(store) => store.tool_secret_names(&self.deps.agent_id), None => Vec::new(), }; diff --git a/tests/bulletin.rs b/tests/bulletin.rs index 12caac27a..1eacc7368 100644 --- a/tests/bulletin.rs +++ b/tests/bulletin.rs @@ -105,6 +105,7 @@ async fn bootstrap_deps() -> anyhow::Result { agent_config.workspace.clone(), &config.instance_dir, agent_config.data_dir.clone(), + agent_id.clone(), ) .await, ); @@ -137,6 +138,7 @@ async fn bootstrap_deps() -> anyhow::Result { ), api_state: None, wiki_store: None, + wake_tx: None, }) } diff --git a/tests/context_dump.rs b/tests/context_dump.rs index e96a51c08..8ec950490 100644 --- a/tests/context_dump.rs +++ b/tests/context_dump.rs @@ -104,6 +104,7 @@ async fn bootstrap_deps() -> anyhow::Result<(spacebot::AgentDeps, spacebot::conf agent_config.workspace.clone(), &config.instance_dir, agent_config.data_dir.clone(), + agent_id.clone(), ) .await, ); @@ -136,6 +137,7 @@ async fn bootstrap_deps() -> anyhow::Result<(spacebot::AgentDeps, spacebot::conf ), api_state: None, wiki_store: None, + wake_tx: None, }; Ok((deps, config)) @@ -425,6 +427,7 @@ async fn dump_worker_context() { deps.memory_search.clone(), false, None, + None, ); let tool_defs = worker_tool_server @@ -623,6 +626,7 @@ async fn dump_all_contexts() { deps.memory_search.clone(), false, None, + None, ); let worker_tool_defs = worker_tool_server.get_tool_defs(None).await.unwrap(); let worker_tools_text = format_tool_defs(&worker_tool_defs); diff --git a/tests/sandbox_initialization_test.rs b/tests/sandbox_initialization_test.rs index 5639d74cf..19bbdf113 100644 --- a/tests/sandbox_initialization_test.rs +++ b/tests/sandbox_initialization_test.rs @@ -16,6 +16,7 @@ async fn test_sandbox_creation_in_initialize() { workspace.path().to_path_buf(), instance_dir.path(), data_dir.path().to_path_buf(), + Arc::from("test-agent"), ) .await;