openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP) by brettchien · Pull Request #1190 · openabdev/openab

brettchien · 2026-06-24T19:53:32Z

What problem does this solve?

openab-agent's auth.json is a shared, multi-writer credential file with an unlocked read-modify-write, and openab-agent runs one process per Discord thread. So ordinary concurrent usage = concurrent processes refreshing the same OAuth token → refresh-token-rotation reuse → worst case OAuth 2.1 §10.4 token-family revocation = fleet-wide logout. API-key users never hit this; OAuth adoption (PR #1187, Anthropic) is what activates it.

This PR lands the ADR that sets the multi-vendor OAuth pattern (so every future provider stops hand-rolling its own flow) and the concurrency fix for the release-blocker-class storage race — for both the codex tenant and the MCP CredentialStore.

Closes #

Discord Discussion URL: https://discord.com/channels/1490643539682529300/1519372543377539173

At a Glance

auth.json  (multi-tenant: codex / anthropic-oauth / mcp:<server>×N, one process per Discord thread)

 (a) file integrity — ONE global lock          (b) refresh-token rotation — ONE lock per tenant
 ┌─ with_auth_locked(path, f) ──────────┐       ┌─ lock_tenant_refresh(path, tenant) ───────────┐
 │ flock EX  auth.json.global.lock       │       │ try-lock  auth.json.refresh.<tenant>.lock     │
 │ re-read → f(&mut map) → gc_pending →   │       │  + async backoff + 10s timeout (never wedge)  │
 │ atomic tmp+rename                      │       │ held ACROSS the network refresh               │
 └───────────────────────────────────────┘       └───────────────────────────────────────────────┘
   ▲ save_tokens (codex)                            ▲ get_valid_token / force_refresh (codex)
   ▲ McpCredentialStore::save / clear (MCP)         ▲ resolve_oauth_dial → get_access_token (MCP)
   = writers MERGE, never lost-update                = exactly ONE network refresh per tenant; the
                                                        lock-loser re-load()s the winner's token and
                                                        skips its own refresh (no RT_old reuse)

Prior Art & Industry Research

OpenClaw: API keys + subscription OAuth; stores per-profile {access,refresh,expires,accountId} and treats the profile file as a token sink refreshed under a file lock — direct corroboration for the locked-RMW decision.

Hermes Agent (NousResearch): PROVIDER_REGISTRY dataclasses declare each provider's auth type + URLs + env vars behind one resolve_runtime_provider() entry point; auth.json guarded with fcntl/msvcrt file locks — corroborates both the registry shape (OAuthVendor) and the file-lock decision.

Other references:

Pi (earendil-works/pi) — ported flow for feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent #1187; CLAUDE_CODE_OAUTH_TOKEN (#3591); Kiro/Cursor/xAI provider extensions.
rclone — rcloneEncryptedClientSecret + runtime obscure.MustReveal(): precedent for the ADR §9 Q2 encode-at-rest of a non-confidential bundled OAuth secret (scanner-evasion, explicitly not encryption).
Antigravity (agy) OAuth ecosystem — GitHub survey 2026-06-24 (≥326 repos hardcode the same client_id/secret; 99/107 sampled commit it plaintext) grounding the agy go/no-go + secret-storage decisions.

Proposed Solution

ADR (docs/adr/openab-agent-oauth.md): a two-axis model — auth (OAuthVendor descriptor + a shared driver on the official oauth2 crate) kept orthogonal to inference (one provider per wire format) — plus a 14-variant vendor feasibility matrix, the CLAUDE_CODE_OAUTH_TOKEN env route, the /auth two-step UX (per-user-keyed PKCE pending state), and the storage invariant below.

Implementation (this PR):

with_auth_locked — global flock(2) on an auth.json.global.lock sidecar across re-read → mutate → atomic write. All writers funnel through it (codex save_tokens + MCP McpCredentialStore::save/clear), so they merge onto the latest on-disk state instead of lost-updating.
lock_tenant_refresh — per-tenant flock (non-blocking try + async backoff + 10s timeout) held across the network refresh, with a double-checked re-read, so concurrent processes do exactly one real refresh per tenant. Codex uses it in get_valid_token/force_refresh; MCP uses it in resolve_oauth_dial around get_access_token().
flock(2) on sidecar files (kernel auto-releases on fd close / process death — no stale lock); libc::flock under cfg(unix), non-unix no-op.
ADR §7 pending-entry GC: created_at + a 15-min sweep inside with_auth_locked.

Why this approach?

Two locks, two hazards. The global lock fixes file lost-updates; the per-tenant lock fixes RTR revocation. A single global lock held across the network refresh would serialize one tenant's slow refresh in front of every other tenant (head-of-line blocking MCP servers / Codex), so refresh I/O is held under the per-tenant lock only.
flock(2) over a sentinel lockfile — the kernel releases it on process death, so a crashed holder never wedges the file (no orphan cleanup / manual timeout).
MCP single-flight without an rmcp fork. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in resolve_oauth_dial; rmcp's get_access_token re-load()s auth.json and skips the network refresh when the token is already fresh, so the per-tenant lock + rmcp's own fresh-load gives cross-process single-flight with no extra double-check code.
Known limitation: rustix is not in-tree (the ADR text was optimistic), so this uses libc::flock directly.

Alternatives Considered

In-process Mutex / tokio single-flight — useless across the per-thread processes.
Sentinel lockfile (create→delete) — deadlocks if a holder dies; flock(2) auto-releases.
Refresh outside the lock, re-read on commit (an earlier ADR draft) — rejected: every process has already sent the refresh with the same RT_old before it commits, so RTR still fires.
Forcing everything through rmcp CredentialStore — lossy; the shared layer sits below both stores (file RMW).

Validation

Rust changes:

cargo check passes
cargo test passes (192 passed, 0 failed, 11 ignored) — incl. new with_auth_locked_merges_concurrent_tenants_no_lost_update and with_auth_locked_gcs_stale_pending_but_keeps_fresh_and_tokens
cargo clippy -- -D warnings clean
cargo fmt --check clean

All PRs:

Manual testing — ran the full gate from openab-agent/ (the crate is built standalone, matching ci-openab-agent.yml). Reviewed end-to-end with a second agent across two ADR-review rounds and two code-review rounds; all findings (RTR refresh race, per-user PKCE keying, pending GC, portable WouldBlock, MCP cross-process refresh) addressed and re-verified.

Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR openabdev#1185) auth-trigger model. Surfaced while reviewing PR openabdev#1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… names - Build the OAuthVendor driver on the official `oauth2` crate (already in-tree via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the Anthropic JSON-token-body quirk is applied via the crate's custom http-client hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative). - Remove the project Rollout-plan section (internal sequencing, not ADR material); keep the race-window mitigation in §5.4. - Use vendor names only; drop internal fleet-agent references from the matrix. - Replace unexplained "ToS-gray" with "ToS-risk" + a definition. - Fix cross-references (crate-qualified paths, §-refs, line numbers) and move the settled model-default decision under "Decisions & open questions". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…w r1) Fold in two release-blocker-class concurrency fixes from Mira's review: - §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on commit" claim was wrong — N processes each send a refresh with the same RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family revocation. Replace with a per-tenant exclusive lock: network refresh held under the tenant lock only (not the global lock), so exactly one refresh per tenant per expiry with no head-of-line blocking across tenants. Extends to mcp:<server> tenants. - §7 /auth: key the pending PKCE verifier+state by the initiating Discord user id instead of a single global entry — prevents concurrent-user overwrite (PKCE mismatch) and session hijack. - §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX- secret. The value is non-confidential by RFC 8252 / Google docs, but GitHub now push-protects Google secrets by default (changelog 2026-03) and partner-scans them for auto-revoke, so a raw literal is not safe in a public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT a security control) or env-inject at runtime — not raw text. - §7: pending PKCE entries get created_at + a 15-min GC sweep in with_auth_locked so abandoned /auth attempts don't accumulate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rett decisions r3) - §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled client_secret; encode-at-rest (a) is the fallback for bundled zero-config binaries only, framed as scanner-evasion (not security). Cite rclone rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical precedent (§10). - §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in, ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6. - §10: add rclone + GitHub/Google secret-scanning references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

GitHub survey 2026-06-24 grounds the agy GO decision: - §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX- secret, a public constant >=20 antigravity-auth repos hardcode verbatim (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...). Literal deliberately NOT pasted into the doc — would trip the very §9 Q2 push-protection, so we dogfood the env/encode decision. Redirect confirmed localhost:51121/oauth-callback. - §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/ openclaw plugins + proxies), proving the integration, while the same ecosystem's anti-ban / quota-locking / multi-account-rotation tooling empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate. - §10: add the antigravity OAuth ecosystem reference set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… in scope (r5) Grounded by a codebase survey (2026-06-24): - §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist inference provider — it does NOT run the agy CLI. agy speaks no ACP; the existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated agy-acp adapter that shells out to the agy binary per prompt and polls its SQLite DB. The provider path sidesteps ACP entirely and supersedes the CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO. - §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also correct the stale save_tokens_for name and note with_auth_locked is new. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The §9 Q4 tail referenced 'openab-agent-mcp.md open items openabdev#1 (reqwest 0.12/0.13 split) and openabdev#8 (doctor/runtime two-store split)' — but that ADR's §10 Open Questions has only two items (mcp.json location; native-vs-broker parity), neither matching, and no such numbered items / terms exist anywhere in it. The phantom reference dated to the original draft. Replace with an accurate statement: McpCredentialStore reuses the same TokenStore/auth.json storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both; the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ive (Brett) Brett's call given the 93%-plaintext ecosystem survey: keep the bundled zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style) as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses the prior (b)-default ordering. Framing unchanged: encode-at-rest is scanner-evasion for a non-confidential value, not a security control. Record the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that the real risk mitigated is org-repo push-protection friction, not credential loss. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements the ADR §5.4 invariant. auth.json had NO lock: provider save_tokens and McpCredentialStore::save/clear each did an independent unlocked read-modify-write, so a concurrent codex refresh and MCP save last-writer-wins the whole map, and N processes (one openab-agent per Discord thread) could each refresh the codex token with the same RT_old → OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix) with a non-unix no-op: - with_auth_locked: global exclusive lock across the re-read -> mutate -> atomic-write. ALL writers funnel through it — save_tokens (codex) and McpCredentialStore::save/clear (MCP) — so writers merge onto the latest on-disk state instead of lost-updating. Held only for the fast file RMW, never across network I/O. - lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token / force_refresh take the codex tenant lock (non-blocking try + async backoff + 10s timeout, held across the network refresh), with a double-checked re-read so a process that waited adopts the token another already refreshed → exactly one real refresh per tenant per expiry, no RT_old reuse. Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was optimistic). New test asserts the locked RMW merges codex + MCP tenants without lost-update. fmt + clippy -D warnings + test (191 passed) green. Known gap (for review): the MCP *network* refresh is owned by rmcp's AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP writes get the file-integrity lock but MCP refreshes are not yet tenant- serialised. Closing that needs an rmcp-level hook — proposed as follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Round-2 code review (Mira): - §7 pending-entry GC: add created_at to PendingPasteLogin and sweep AuthEntry::Pending older than 15 min inside with_auth_locked on every write, so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is currently a legacy tombstone with no live writer (PKCE state lives in rmcp's in-memory StateStore); the field + GC land now per ADR §7 and are forward-compatible with the forthcoming /auth two-step flow, and meanwhile sweep legacy stray entries (created_at default 0 reads as ancient). New test covers stale-swept / fresh-kept / real-tenant-untouched. - flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD. - MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) — Mira concurs. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… gap) Closes the known gap from f370110: MCP refreshes now get the same per-tenant serialisation as codex. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in McpRuntimeManager::resolve_oauth_dial via client.get_access_token(). Wrap that call (per-server) with the existing auth::lock_tenant_refresh so only one process refreshes a given server at a time. No explicit double-check needed: rmcp's get_access_token re-load()s auth.json each call and returns the cached token without a network refresh when remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the race adopts the token the winner wrote to the shared file — no second RT_old presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights within one process via its AuthorizationManager Mutex; this closes the cross-process gap. - auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module. - ADR §5.4: document the resolve_oauth_dial serialisation point. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…h fd Quality-only cleanups from a 4-angle /simplify pass (no behaviour change): - Extract lock_global(path) -> Result<Option<AuthFileLock>> and route both with_auth_locked and McpCredentialStore::clear through it, so the "global" sidecar name + the cfg(unix) acquire live in one place instead of two copy-pasted blocks. clear flattens (drop the inner run closure). - lock_tenant_refresh opens the lock fd once and re-issues flock on it each retry, instead of re-opening (and re-create_dir_all-ing) the file every 100ms under contention. Removes the now-unused flock_try_exclusive helper. - Document lock_tenant_refresh's double-check contract (the lock only serialises; callers must re-check freshness after acquiring). fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Doc-only clarifications from a /code-review pass (no behaviour change): - PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending entry MUST stamp created_at, else gc_stale_pending sweeps it on the next locked write (latent footgun for the forthcoming /auth two-step writer). - lock_tenant_refresh: correct the contract — reuse-safety comes from loading the refresh token INSIDE the lock (so force_refresh, which always refreshes on a 401, is reuse-safe too); the post-lock expiry re-check is only an optimisation to skip a redundant refresh. Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP server literally named "codex" collides with the codex tenant's auth.json key and refresh lock — committed MCP creds use the bare server name, not mcp:<server> as ADR §5.4 describes) — flagged for a separate change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ci-openab-agent.yml runs cargo fmt/clippy/test/build with working-directory: openab-agent, but the crate is not a member of the parent openab workspace (members = crates/openab-core, openab-gateway) and was not excluded, so cargo errors 'current package believes it's in a workspace when it's not' and every step fails before doing any work. openab-agent is intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so the correct fix is an empty [workspace] table making it its own root. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

This reverts commit b40693c.

ci-openab-agent.yml ran cargo from working-directory: openab-agent without the [workspace] table the crate needs (it's standalone, not a parent-workspace member), so every step failed with 'believes it's in a workspace when it's not' — the workflow had been red independently of any PR. Dockerfile.unified already works around this by appending the table at build time; replicate that in the workflow rather than committing it to Cargo.toml (which would double-append in the Dockerfile and break the image build). Also harden the ACP smoke test: build the release binary in its own step and bump the response timeout 5s -> 30s so a loaded runner doesn't flake the agent's first-response window (the binary itself responds in <1s locally, verified incl. a clean-HOME release build). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…osis The smoke test produces empty stdout only in CI (the binary responds with agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs). Capture stderr + exit code so the next CI run reveals why stdout is empty.

brettchien · 2026-06-26T17:02:20Z

All review findings addressed and re-verified — CI fully green (build-builder, check, 16 smoke-test variants), chaodu-agent LGTM x2. No open items on my side; handing off for maintainer review.

Resolves the Changes-Requested review on the multi-vendor OAuth ADR + cross-process auth.json locking PR. No behavioural change beyond log level and a non-unix diagnostic. ADR (docs/adr/openab-agent-oauth.md) - Fix lock-file names to match code: auth.json.global.lock and auth.json.refresh.<tenant>.lock (were auth.json.lock / auth.json.<tenant>.lock). - Add the path parameter to the with_auth_locked pseudocode signature. - Correct the MCP refresh note: initialize_from_store() does the disk reload, not get_access_token. - Correct the crate note: libc::flock directly (rustix is not in-tree). - Document the deliberate fail-open trade-off on tenant-lock timeout. - Mark the default-model removal (Decision 1) as a follow-up; this PR ships the ADR + locking only. Code - auth.rs: escalate the tenant-lock timeout log from warn! to error! and document the fail-open trade-off on lock_tenant_refresh. - auth.rs: non-unix lock_global no-op now warns once instead of silently providing zero cross-process protection. - mcp/runtime.rs: fix the rmcp reload comment and document the cross-module invariant that the refresh lock and credential entry share the server name. CI (ci-openab-agent.yml) - ACP smoke test now fails on a non-zero/timeout exit code, and the agentInfo assertions use { } so exit fails the step rather than just a subshell. Gate: cargo fmt --check, clippy -D warnings, test (192 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replaces the fail-open tenant-lock timeout with a bounded-refresh + fail-closed design, so a contended lock can no longer reintroduce the double-refresh it exists to prevent. - Bound the refresh network call with an explicit 8s HTTP timeout on both the codex client (auth.rs) and the MCP AuthClient (mcp/runtime.rs), strictly shorter than the 10s lock-acquire deadline. With flock(2) auto-release on death, a live holder always frees the tenant lock before a waiter's deadline, so a lock-acquire timeout is genuinely abnormal. - lock_tenant_refresh now returns RefreshLock { Held, Unavailable, TimedOut }. On TimedOut callers fail closed instead of refreshing unserialised: - codex get_valid_token / force_refresh return a retryable error; - MCP resolve_oauth_dial returns a new OauthDialError::Transient that leaves the server retryable WITHOUT forcing re-login (NeedsAuth) or tripping the circuit breaker (auth-level failures still map to NeedsAuth as before). A filesystem error opening the sidecar returns Unavailable and degrades to a best-effort unserialised refresh rather than blocking every refresh. - Add a fail-closed timeout test (injectable deadline) and update ADR §5.4 to document the bounded-refresh + fail-closed invariant, superseding the earlier fail-open note. Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Mira review (PR openabdev#1190) found the fail-closed invariant held only for the codex path. The MCP path holds the tenant lock across TWO sequential bounded calls — rmcp's initialize_from_store() (AS discovery) then get_access_token() (refresh) — so the worst-case lock-hold is ~2 x REFRESH_HTTP_TIMEOUT (16s), which could exceed the fixed 10s lock deadline and fail a waiter closed while the holder is still legitimately progressing. Derive REFRESH_LOCK_TIMEOUT from the bound instead of hardcoding it: REFRESH_LOCK_TIMEOUT = MAX_REFRESH_ROUND_TRIPS (2) * REFRESH_HTTP_TIMEOUT + 4s = 20s so the deadline is always above the worst-case hold and only a genuinely stuck holder trips it. Normal-case latency is unchanged (the waiter polls every 100ms and acquires as soon as the holder releases). Update the fn doc and ADR §5.4 to state the multi-call hold explicitly. Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

brettchien · 2026-06-26T21:00:18Z

All review feedback addressed across three rounds — chaodu-agent's 8 findings (ADR/code drift, CI exit-code, fail-open→fail-closed) and the follow-up fail-closed lock-timeout bound. Second-agent review (Mira) confirms all points. CI fully green. Handing off for maintainer review.

…403) (#1219) #1218 added a pull_request trigger, but fork PRs run with a read-only GITHUB_TOKEN, so creating the 'OpenAB PR Review' status 403s and the run fails (observed on fork PR #1190). Guard the job to run on pull_request events only for same-repo PRs; forks continue to be handled by the cron poller (which runs with full permissions). schedule/workflow_dispatch always run. Fixes the failing runs introduced by #1218. Co-authored-by: chaodu-agent <chaodu-agent@users.noreply.github.com>

chaodu-agent · 2026-06-27T02:22:11Z

LGTM ✅ — Well-engineered cross-process locking for auth.json that fixes a real release-blocker-class OAuth RTR race, paired with a thorough ADR.

What This PR Does

Fixes a latent concurrency bug where multiple openab-agent processes (one per Discord thread) could concurrently refresh the same OAuth token, triggering OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Lands both the ADR documenting the multi-vendor OAuth pattern and the implementation of the two-lock file-locking scheme.

How It Works

Two complementary flock(2) locks on sidecar files:

Global lock (auth.json.global.lock) — held across the re-read → mutate → atomic-write cycle in with_auth_locked, so concurrent writers (codex save_tokens + MCP CredentialStore) merge instead of lost-updating.
Per-tenant lock (auth.json.refresh.<tenant>.lock) — held across the network refresh (lock_tenant_refresh), with non-blocking try + async backoff + bounded timeout. Fail-closed on timeout (surface retryable error rather than risk double-RT presentation).

CI workflow improvements: workspace isolation fix, separated build step, robust ACP smoke test with proper exit-code propagation.

Findings

#	Severity	Finding	Location
1	🟢	Correct two-lock separation avoids both lost-update and RTR hazards without head-of-line blocking	`auth.rs:227-485`
2	🟢	Fail-closed timeout design with bounded refresh is properly reasoned — a timeout genuinely signals an abnormal state	`auth.rs:380-420`
3	🟢	MCP integration correctly uses `OauthDialError::Transient` to avoid tripping the circuit breaker or forcing re-login on lock contention	`mcp/runtime.rs:646-680`
4	🟢	Pending-entry GC (ADR §7) is a good defensive measure against stale verifiers accumulating	`auth.rs:310-320`
5	🟢	CI smoke test hardening: exit-code capture + proper error propagation fixes silent timeout/crash failures	`ci-openab-agent.yml`

What's Good (🟢)

Two-lock design is the right call. A single global lock held across network I/O would serialize all tenants — the per-tenant lock neatly avoids that head-of-line blocking while still preventing RTR.
flock(2) on sidecars — kernel auto-release on fd close / process death eliminates stale lock concerns. Sidecar placement is correct since atomic rename swaps the inode of auth.json itself.
Double-check pattern in get_valid_token — re-reads after acquiring the tenant lock so the lock-loser adopts the winner's token. MCP path gets this free from rmcp's initialize_from_store() reload.
Fail-closed with bounded timeout — REFRESH_LOCK_TIMEOUT is derived from REFRESH_HTTP_TIMEOUT × MAX_REFRESH_ROUND_TRIPS + margin, making the invariant (lock-hold always shorter than waiter's deadline) provable and drift-resistant.
Non-unix no-op with one-time warn — pragmatic given the de-facto unix-only deployment.
Test coverage — with_auth_locked_merges_concurrent_tenants_no_lost_update, GC test, and the fail-closed contention test directly exercise the core invariants.
ADR is comprehensive — the vendor feasibility matrix, prior art survey (OpenClaw, Hermes, rclone), and §9 decisions give future contributors clear context for incremental vendor additions.
CI improvements — the workspace isolation (printf '\\n[workspace]\\n' >> Cargo.toml) matches the Dockerfile.unified pattern, and the reworked smoke test no longer silently swallows timeout/crash exits.

Baseline Check

PR opened: 2026-06-24
Main already has: auth.json with atomic tmp+rename in write_auth_file, but no cross-process locking — save_tokens and McpCredentialStore::save both do unlocked read-modify-write
Net-new value: (1) Global file-integrity lock preventing lost-updates between codex and MCP writers; (2) Per-tenant refresh serialisation preventing RTR family revocation; (3) ADR establishing the multi-vendor OAuth pattern; (4) CI robustness improvements

chaodu-agent

LGTM ✅ — Cross-process auth.json locking is well-engineered and correctly addresses the RTR race condition. ADR is thorough. Approving.

…auth tenant Brett asked to cover the new tenant the same way openabdev#1190's codex lock tests do, so single-flight / fail-closed is proven for it rather than only structurally similar: - `with_auth_locked_merges_anthropic_tenant_no_lost_update` — a concurrent codex write does not clobber a just-written `anthropic-oauth` token (the new tenant rides the same locked RMW funnel). - `lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant` — a held anthropic refresh lock makes a second anthropic acquire fail closed (`TimedOut`), and does NOT block codex (per-tenant isolation — the reason §5.4 uses a per-tenant lock, not the global one). 215 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

brettchien and others added 12 commits June 25, 2026 00:03

brettchien requested a review from thepagent as a code owner June 24, 2026 19:53