Skip to content

openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP)#1190

Merged
thepagent merged 24 commits into
openabdev:mainfrom
brettchien:docs/adr-openab-agent-oauth
Jun 27, 2026
Merged

openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP)#1190
thepagent merged 24 commits into
openabdev:mainfrom
brettchien:docs/adr-openab-agent-oauth

Conversation

@brettchien

Copy link
Copy Markdown
Contributor

What problem does this solve?

openab-agent's auth.json is a shared, multi-writer credential file with an unlocked read-modify-write, and openab-agent runs one process per Discord thread. So ordinary concurrent usage = concurrent processes refreshing the same OAuth token → refresh-token-rotation reuse → worst case OAuth 2.1 §10.4 token-family revocation = fleet-wide logout. API-key users never hit this; OAuth adoption (PR #1187, Anthropic) is what activates it.

This PR lands the ADR that sets the multi-vendor OAuth pattern (so every future provider stops hand-rolling its own flow) and the concurrency fix for the release-blocker-class storage race — for both the codex tenant and the MCP CredentialStore.

Closes #

Discord Discussion URL: https://discord.com/channels/1490643539682529300/1519372543377539173

At a Glance

auth.json  (multi-tenant: codex / anthropic-oauth / mcp:<server>×N, one process per Discord thread)

 (a) file integrity — ONE global lock          (b) refresh-token rotation — ONE lock per tenant
 ┌─ with_auth_locked(path, f) ──────────┐       ┌─ lock_tenant_refresh(path, tenant) ───────────┐
 │ flock EX  auth.json.global.lock       │       │ try-lock  auth.json.refresh.<tenant>.lock     │
 │ re-read → f(&mut map) → gc_pending →   │       │  + async backoff + 10s timeout (never wedge)  │
 │ atomic tmp+rename                      │       │ held ACROSS the network refresh               │
 └───────────────────────────────────────┘       └───────────────────────────────────────────────┘
   ▲ save_tokens (codex)                            ▲ get_valid_token / force_refresh (codex)
   ▲ McpCredentialStore::save / clear (MCP)         ▲ resolve_oauth_dial → get_access_token (MCP)
   = writers MERGE, never lost-update                = exactly ONE network refresh per tenant; the
                                                        lock-loser re-load()s the winner's token and
                                                        skips its own refresh (no RT_old reuse)

Prior Art & Industry Research

OpenClaw: API keys + subscription OAuth; stores per-profile {access,refresh,expires,accountId} and treats the profile file as a token sink refreshed under a file lock — direct corroboration for the locked-RMW decision.

Hermes Agent (NousResearch): PROVIDER_REGISTRY dataclasses declare each provider's auth type + URLs + env vars behind one resolve_runtime_provider() entry point; auth.json guarded with fcntl/msvcrt file locks — corroborates both the registry shape (OAuthVendor) and the file-lock decision.

Other references:

  • Pi (earendil-works/pi) — ported flow for feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent #1187; CLAUDE_CODE_OAUTH_TOKEN (#3591); Kiro/Cursor/xAI provider extensions.
  • rclonercloneEncryptedClientSecret + runtime obscure.MustReveal(): precedent for the ADR §9 Q2 encode-at-rest of a non-confidential bundled OAuth secret (scanner-evasion, explicitly not encryption).
  • Antigravity (agy) OAuth ecosystem — GitHub survey 2026-06-24 (≥326 repos hardcode the same client_id/secret; 99/107 sampled commit it plaintext) grounding the agy go/no-go + secret-storage decisions.

Proposed Solution

ADR (docs/adr/openab-agent-oauth.md): a two-axis model — auth (OAuthVendor descriptor + a shared driver on the official oauth2 crate) kept orthogonal to inference (one provider per wire format) — plus a 14-variant vendor feasibility matrix, the CLAUDE_CODE_OAUTH_TOKEN env route, the /auth two-step UX (per-user-keyed PKCE pending state), and the storage invariant below.

Implementation (this PR):

  • with_auth_locked — global flock(2) on an auth.json.global.lock sidecar across re-read → mutate → atomic write. All writers funnel through it (codex save_tokens + MCP McpCredentialStore::save/clear), so they merge onto the latest on-disk state instead of lost-updating.
  • lock_tenant_refresh — per-tenant flock (non-blocking try + async backoff + 10s timeout) held across the network refresh, with a double-checked re-read, so concurrent processes do exactly one real refresh per tenant. Codex uses it in get_valid_token/force_refresh; MCP uses it in resolve_oauth_dial around get_access_token().
  • flock(2) on sidecar files (kernel auto-releases on fd close / process death — no stale lock); libc::flock under cfg(unix), non-unix no-op.
  • ADR §7 pending-entry GC: created_at + a 15-min sweep inside with_auth_locked.

Why this approach?

  • Two locks, two hazards. The global lock fixes file lost-updates; the per-tenant lock fixes RTR revocation. A single global lock held across the network refresh would serialize one tenant's slow refresh in front of every other tenant (head-of-line blocking MCP servers / Codex), so refresh I/O is held under the per-tenant lock only.
  • flock(2) over a sentinel lockfile — the kernel releases it on process death, so a crashed holder never wedges the file (no orphan cleanup / manual timeout).
  • MCP single-flight without an rmcp fork. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in resolve_oauth_dial; rmcp's get_access_token re-load()s auth.json and skips the network refresh when the token is already fresh, so the per-tenant lock + rmcp's own fresh-load gives cross-process single-flight with no extra double-check code.
  • Known limitation: rustix is not in-tree (the ADR text was optimistic), so this uses libc::flock directly.

Alternatives Considered

  • In-process Mutex / tokio single-flight — useless across the per-thread processes.
  • Sentinel lockfile (create→delete) — deadlocks if a holder dies; flock(2) auto-releases.
  • Refresh outside the lock, re-read on commit (an earlier ADR draft) — rejected: every process has already sent the refresh with the same RT_old before it commits, so RTR still fires.
  • Forcing everything through rmcp CredentialStore — lossy; the shared layer sits below both stores (file RMW).

Validation

Rust changes:

  • cargo check passes
  • cargo test passes (192 passed, 0 failed, 11 ignored) — incl. new with_auth_locked_merges_concurrent_tenants_no_lost_update and with_auth_locked_gcs_stale_pending_but_keeps_fresh_and_tokens
  • cargo clippy -- -D warnings clean
  • cargo fmt --check clean

All PRs:

  • Manual testing — ran the full gate from openab-agent/ (the crate is built standalone, matching ci-openab-agent.yml). Reviewed end-to-end with a second agent across two ADR-review rounds and two code-review rounds; all findings (RTR refresh race, per-user PKCE keying, pending GC, portable WouldBlock, MCP cross-process refresh) addressed and re-verified.

brettchien and others added 12 commits June 25, 2026 00:03
Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis
OAuthVendor adapter (auth flow vs inference transport), a cross-process
flock-guarded credential-store invariant for auth.json, the
CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix,
and the /auth (PR openabdev#1185) auth-trigger model. Surfaced while reviewing
PR openabdev#1187 (first OAuth vendor).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… names

- Build the OAuthVendor driver on the official `oauth2` crate (already in-tree
  via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the
  Anthropic JSON-token-body quirk is applied via the crate's custom http-client
  hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative).
- Remove the project Rollout-plan section (internal sequencing, not ADR
  material); keep the race-window mitigation in §5.4.
- Use vendor names only; drop internal fleet-agent references from the matrix.
- Replace unexplained "ToS-gray" with "ToS-risk" + a definition.
- Fix cross-references (crate-qualified paths, §-refs, line numbers) and move
  the settled model-default decision under "Decisions & open questions".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…w r1)

Fold in two release-blocker-class concurrency fixes from Mira's review:

- §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on
  commit" claim was wrong — N processes each send a refresh with the same
  RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family
  revocation. Replace with a per-tenant exclusive lock: network refresh held
  under the tenant lock only (not the global lock), so exactly one refresh per
  tenant per expiry with no head-of-line blocking across tenants. Extends to
  mcp:<server> tenants.
- §7 /auth: key the pending PKCE verifier+state by the initiating Discord user
  id instead of a single global entry — prevents concurrent-user overwrite
  (PKCE mismatch) and session hijack.
- §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX-
  secret. The value is non-confidential by RFC 8252 / Google docs, but
  GitHub now push-protects Google secrets by default (changelog 2026-03)
  and partner-scans them for auto-revoke, so a raw literal is not safe in a
  public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT
  a security control) or env-inject at runtime — not raw text.
- §7: pending PKCE entries get created_at + a 15-min GC sweep in
  with_auth_locked so abandoned /auth attempts don't accumulate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rett decisions r3)

- §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled
  client_secret; encode-at-rest (a) is the fallback for bundled zero-config
  binaries only, framed as scanner-evasion (not security). Cite rclone
  rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical
  precedent (§10).
- §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in,
  ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not
  secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6.
- §10: add rclone + GitHub/Google secret-scanning references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GitHub survey 2026-06-24 grounds the agy GO decision:
- §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX-
  secret, a public constant >=20 antigravity-auth repos hardcode verbatim
  (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...).
  Literal deliberately NOT pasted into the doc — would trip the very §9 Q2
  push-protection, so we dogfood the env/encode decision. Redirect confirmed
  localhost:51121/oauth-callback.
- §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/
  openclaw plugins + proxies), proving the integration, while the same
  ecosystem's anti-ban / quota-locking / multi-account-rotation tooling
  empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate.
- §10: add the antigravity OAuth ecosystem reference set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… in scope (r5)

Grounded by a codebase survey (2026-06-24):

- §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist
  inference provider — it does NOT run the agy CLI. agy speaks no ACP; the
  existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated
  agy-acp adapter that shells out to the agy binary per prompt and polls its
  SQLite DB. The provider path sidesteps ACP entirely and supersedes the
  CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO.
- §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json
  has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and
  McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked
  RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also
  correct the stale save_tokens_for name and note with_auth_locked is new.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The §9 Q4 tail referenced 'openab-agent-mcp.md open items openabdev#1 (reqwest
0.12/0.13 split) and openabdev#8 (doctor/runtime two-store split)' — but that ADR's
§10 Open Questions has only two items (mcp.json location; native-vs-broker
parity), neither matching, and no such numbered items / terms exist anywhere
in it. The phantom reference dated to the original draft. Replace with an
accurate statement: McpCredentialStore reuses the same TokenStore/auth.json
storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both;
the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on
the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ive (Brett)

Brett's call given the 93%-plaintext ecosystem survey: keep the bundled
zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style)
as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses
the prior (b)-default ordering. Framing unchanged: encode-at-rest is
scanner-evasion for a non-confidential value, not a security control. Record
the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that
the real risk mitigated is org-repo push-protection friction, not credential
loss.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements the ADR §5.4 invariant. auth.json had NO lock: provider
save_tokens and McpCredentialStore::save/clear each did an independent
unlocked read-modify-write, so a concurrent codex refresh and MCP save
last-writer-wins the whole map, and N processes (one openab-agent per Discord
thread) could each refresh the codex token with the same RT_old → OAuth 2.1
§10.4 token-family revocation (fleet-wide logout).

Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix)
with a non-unix no-op:

- with_auth_locked: global exclusive lock across the re-read -> mutate ->
  atomic-write. ALL writers funnel through it — save_tokens (codex) and
  McpCredentialStore::save/clear (MCP) — so writers merge onto the latest
  on-disk state instead of lost-updating. Held only for the fast file RMW,
  never across network I/O.
- lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token /
  force_refresh take the codex tenant lock (non-blocking try + async backoff +
  10s timeout, held across the network refresh), with a double-checked re-read
  so a process that waited adopts the token another already refreshed → exactly
  one real refresh per tenant per expiry, no RT_old reuse.

Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was
optimistic). New test asserts the locked RMW merges codex + MCP tenants without
lost-update. fmt + clippy -D warnings + test (191 passed) green.

Known gap (for review): the MCP *network* refresh is owned by rmcp's
AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP
writes get the file-integrity lock but MCP refreshes are not yet tenant-
serialised. Closing that needs an rmcp-level hook — proposed as follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round-2 code review (Mira):
- §7 pending-entry GC: add created_at to PendingPasteLogin and sweep
  AuthEntry::Pending older than 15 min inside with_auth_locked on every write,
  so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is
  currently a legacy tombstone with no live writer (PKCE state lives in rmcp's
  in-memory StateStore); the field + GC land now per ADR §7 and are
  forward-compatible with the forthcoming /auth two-step flow, and meanwhile
  sweep legacy stray entries (created_at default 0 reads as ancient). New test
  covers stale-swept / fresh-kept / real-tenant-untouched.
- flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single
  raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD.
- MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) —
  Mira concurs.

fmt + clippy -D warnings + test (192 passed) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… gap)

Closes the known gap from f370110: MCP refreshes now get the same per-tenant
serialisation as codex.

rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP
refresh explicitly in McpRuntimeManager::resolve_oauth_dial via
client.get_access_token(). Wrap that call (per-server) with the existing
auth::lock_tenant_refresh so only one process refreshes a given server at a
time. No explicit double-check needed: rmcp's get_access_token re-load()s
auth.json each call and returns the cached token without a network refresh when
remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the
race adopts the token the winner wrote to the shared file — no second RT_old
presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights
within one process via its AuthorizationManager Mutex; this closes the
cross-process gap.

- auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module.
- ADR §5.4: document the resolve_oauth_dial serialisation point.

fmt + clippy -D warnings + test (192 passed) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@brettchien brettchien requested a review from thepagent as a code owner June 24, 2026 19:53
…h fd

Quality-only cleanups from a 4-angle /simplify pass (no behaviour change):
- Extract lock_global(path) -> Result<Option<AuthFileLock>> and route both
  with_auth_locked and McpCredentialStore::clear through it, so the "global"
  sidecar name + the cfg(unix) acquire live in one place instead of two
  copy-pasted blocks. clear flattens (drop the inner run closure).
- lock_tenant_refresh opens the lock fd once and re-issues flock on it each
  retry, instead of re-opening (and re-create_dir_all-ing) the file every
  100ms under contention. Removes the now-unused flock_try_exclusive helper.
- Document lock_tenant_refresh's double-check contract (the lock only
  serialises; callers must re-check freshness after acquiring).

fmt + clippy -D warnings + test (192 passed) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@chaodu-agent

This comment has been minimized.

brettchien and others added 5 commits June 25, 2026 04:07
Doc-only clarifications from a /code-review pass (no behaviour change):
- PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending
  entry MUST stamp created_at, else gc_stale_pending sweeps it on the next
  locked write (latent footgun for the forthcoming /auth two-step writer).
- lock_tenant_refresh: correct the contract — reuse-safety comes from loading
  the refresh token INSIDE the lock (so force_refresh, which always refreshes
  on a 401, is reuse-safe too); the post-lock expiry re-check is only an
  optimisation to skip a redundant refresh.

Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP
server literally named "codex" collides with the codex tenant's auth.json key
and refresh lock — committed MCP creds use the bare server name, not
mcp:<server> as ADR §5.4 describes) — flagged for a separate change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ci-openab-agent.yml runs cargo fmt/clippy/test/build with
working-directory: openab-agent, but the crate is not a member of the parent
openab workspace (members = crates/openab-core, openab-gateway) and was not
excluded, so cargo errors 'current package believes it's in a workspace when
it's not' and every step fails before doing any work. openab-agent is
intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so
the correct fix is an empty [workspace] table making it its own root.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ci-openab-agent.yml ran cargo from working-directory: openab-agent without the
[workspace] table the crate needs (it's standalone, not a parent-workspace
member), so every step failed with 'believes it's in a workspace when it's
not' — the workflow had been red independently of any PR. Dockerfile.unified
already works around this by appending the table at build time; replicate that
in the workflow rather than committing it to Cargo.toml (which would
double-append in the Dockerfile and break the image build).

Also harden the ACP smoke test: build the release binary in its own step and
bump the response timeout 5s -> 30s so a loaded runner doesn't flake the
agent's first-response window (the binary itself responds in <1s locally,
verified incl. a clean-HOME release build).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…osis

The smoke test produces empty stdout only in CI (the binary responds with
agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs).
Capture stderr + exit code so the next CI run reveals why stdout is empty.
@chaodu-agent

This comment has been minimized.

@brettchien

Copy link
Copy Markdown
Contributor Author

All review findings addressed and re-verified — CI fully green (build-builder, check, 16 smoke-test variants), chaodu-agent LGTM x2. No open items on my side; handing off for maintainer review.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

brettchien and others added 3 commits June 27, 2026 03:18
Resolves the Changes-Requested review on the multi-vendor OAuth ADR +
cross-process auth.json locking PR. No behavioural change beyond log level
and a non-unix diagnostic.

ADR (docs/adr/openab-agent-oauth.md)
- Fix lock-file names to match code: auth.json.global.lock and
  auth.json.refresh.<tenant>.lock (were auth.json.lock / auth.json.<tenant>.lock).
- Add the path parameter to the with_auth_locked pseudocode signature.
- Correct the MCP refresh note: initialize_from_store() does the disk reload,
  not get_access_token.
- Correct the crate note: libc::flock directly (rustix is not in-tree).
- Document the deliberate fail-open trade-off on tenant-lock timeout.
- Mark the default-model removal (Decision 1) as a follow-up; this PR ships
  the ADR + locking only.

Code
- auth.rs: escalate the tenant-lock timeout log from warn! to error! and
  document the fail-open trade-off on lock_tenant_refresh.
- auth.rs: non-unix lock_global no-op now warns once instead of silently
  providing zero cross-process protection.
- mcp/runtime.rs: fix the rmcp reload comment and document the cross-module
  invariant that the refresh lock and credential entry share the server name.

CI (ci-openab-agent.yml)
- ACP smoke test now fails on a non-zero/timeout exit code, and the agentInfo
  assertions use { } so exit fails the step rather than just a subshell.

Gate: cargo fmt --check, clippy -D warnings, test (192 passed) all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the fail-open tenant-lock timeout with a bounded-refresh + fail-closed
design, so a contended lock can no longer reintroduce the double-refresh it
exists to prevent.

- Bound the refresh network call with an explicit 8s HTTP timeout on both the
  codex client (auth.rs) and the MCP AuthClient (mcp/runtime.rs), strictly
  shorter than the 10s lock-acquire deadline. With flock(2) auto-release on
  death, a live holder always frees the tenant lock before a waiter's deadline,
  so a lock-acquire timeout is genuinely abnormal.

- lock_tenant_refresh now returns RefreshLock { Held, Unavailable, TimedOut }.
  On TimedOut callers fail closed instead of refreshing unserialised:
  - codex get_valid_token / force_refresh return a retryable error;
  - MCP resolve_oauth_dial returns a new OauthDialError::Transient that leaves
    the server retryable WITHOUT forcing re-login (NeedsAuth) or tripping the
    circuit breaker (auth-level failures still map to NeedsAuth as before).
  A filesystem error opening the sidecar returns Unavailable and degrades to a
  best-effort unserialised refresh rather than blocking every refresh.

- Add a fail-closed timeout test (injectable deadline) and update ADR §5.4 to
  document the bounded-refresh + fail-closed invariant, superseding the earlier
  fail-open note.

Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mira review (PR openabdev#1190) found the fail-closed invariant held only for the codex
path. The MCP path holds the tenant lock across TWO sequential bounded calls —
rmcp's initialize_from_store() (AS discovery) then get_access_token() (refresh)
— so the worst-case lock-hold is ~2 x REFRESH_HTTP_TIMEOUT (16s), which could
exceed the fixed 10s lock deadline and fail a waiter closed while the holder is
still legitimately progressing.

Derive REFRESH_LOCK_TIMEOUT from the bound instead of hardcoding it:
  REFRESH_LOCK_TIMEOUT = MAX_REFRESH_ROUND_TRIPS (2) * REFRESH_HTTP_TIMEOUT + 4s
                       = 20s
so the deadline is always above the worst-case hold and only a genuinely stuck
holder trips it. Normal-case latency is unchanged (the waiter polls every 100ms
and acquires as soon as the holder releases). Update the fn doc and ADR §5.4 to
state the multi-call hold explicitly.

Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@chaodu-agent

This comment has been minimized.

@brettchien

Copy link
Copy Markdown
Contributor Author

All review feedback addressed across three rounds — chaodu-agent's 8 findings (ADR/code drift, CI exit-code, fail-open→fail-closed) and the follow-up fail-closed lock-timeout bound. Second-agent review (Mira) confirms all points. CI fully green. Handing off for maintainer review.

@chaodu-agent

This comment has been minimized.

thepagent pushed a commit that referenced this pull request Jun 27, 2026
…403) (#1219)

#1218 added a pull_request trigger, but fork PRs run with a read-only
GITHUB_TOKEN, so creating the 'OpenAB PR Review' status 403s and the run
fails (observed on fork PR #1190). Guard the job to run on pull_request
events only for same-repo PRs; forks continue to be handled by the cron
poller (which runs with full permissions). schedule/workflow_dispatch
always run.

Fixes the failing runs introduced by #1218.

Co-authored-by: chaodu-agent <chaodu-agent@users.noreply.github.com>
@chaodu-agent

Copy link
Copy Markdown
Collaborator

LGTM ✅ — Well-engineered cross-process locking for auth.json that fixes a real release-blocker-class OAuth RTR race, paired with a thorough ADR.

What This PR Does

Fixes a latent concurrency bug where multiple openab-agent processes (one per Discord thread) could concurrently refresh the same OAuth token, triggering OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Lands both the ADR documenting the multi-vendor OAuth pattern and the implementation of the two-lock file-locking scheme.

How It Works

Two complementary flock(2) locks on sidecar files:

  1. Global lock (auth.json.global.lock) — held across the re-read → mutate → atomic-write cycle in with_auth_locked, so concurrent writers (codex save_tokens + MCP CredentialStore) merge instead of lost-updating.
  2. Per-tenant lock (auth.json.refresh.<tenant>.lock) — held across the network refresh (lock_tenant_refresh), with non-blocking try + async backoff + bounded timeout. Fail-closed on timeout (surface retryable error rather than risk double-RT presentation).

CI workflow improvements: workspace isolation fix, separated build step, robust ACP smoke test with proper exit-code propagation.

Findings

# Severity Finding Location
1 🟢 Correct two-lock separation avoids both lost-update and RTR hazards without head-of-line blocking auth.rs:227-485
2 🟢 Fail-closed timeout design with bounded refresh is properly reasoned — a timeout genuinely signals an abnormal state auth.rs:380-420
3 🟢 MCP integration correctly uses OauthDialError::Transient to avoid tripping the circuit breaker or forcing re-login on lock contention mcp/runtime.rs:646-680
4 🟢 Pending-entry GC (ADR §7) is a good defensive measure against stale verifiers accumulating auth.rs:310-320
5 🟢 CI smoke test hardening: exit-code capture + proper error propagation fixes silent timeout/crash failures ci-openab-agent.yml
What's Good (🟢)
  • Two-lock design is the right call. A single global lock held across network I/O would serialize all tenants — the per-tenant lock neatly avoids that head-of-line blocking while still preventing RTR.
  • flock(2) on sidecars — kernel auto-release on fd close / process death eliminates stale lock concerns. Sidecar placement is correct since atomic rename swaps the inode of auth.json itself.
  • Double-check pattern in get_valid_token — re-reads after acquiring the tenant lock so the lock-loser adopts the winner's token. MCP path gets this free from rmcp's initialize_from_store() reload.
  • Fail-closed with bounded timeoutREFRESH_LOCK_TIMEOUT is derived from REFRESH_HTTP_TIMEOUT × MAX_REFRESH_ROUND_TRIPS + margin, making the invariant (lock-hold always shorter than waiter's deadline) provable and drift-resistant.
  • Non-unix no-op with one-time warn — pragmatic given the de-facto unix-only deployment.
  • Test coveragewith_auth_locked_merges_concurrent_tenants_no_lost_update, GC test, and the fail-closed contention test directly exercise the core invariants.
  • ADR is comprehensive — the vendor feasibility matrix, prior art survey (OpenClaw, Hermes, rclone), and §9 decisions give future contributors clear context for incremental vendor additions.
  • CI improvements — the workspace isolation (printf '\\n[workspace]\\n' >> Cargo.toml) matches the Dockerfile.unified pattern, and the reworked smoke test no longer silently swallows timeout/crash exits.
Baseline Check
  • PR opened: 2026-06-24
  • Main already has: auth.json with atomic tmp+rename in write_auth_file, but no cross-process lockingsave_tokens and McpCredentialStore::save both do unlocked read-modify-write
  • Net-new value: (1) Global file-integrity lock preventing lost-updates between codex and MCP writers; (2) Per-tenant refresh serialisation preventing RTR family revocation; (3) ADR establishing the multi-vendor OAuth pattern; (4) CI robustness improvements

@chaodu-agent chaodu-agent left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅ — Cross-process auth.json locking is well-engineered and correctly addresses the RTR race condition. ADR is thorough. Approving.

@thepagent thepagent enabled auto-merge (squash) June 27, 2026 02:45
@thepagent thepagent disabled auto-merge June 27, 2026 02:47
@thepagent thepagent merged commit e5c1136 into openabdev:main Jun 27, 2026
21 checks passed
canyugs pushed a commit to canyugs/openab that referenced this pull request Jun 27, 2026
…auth tenant

Brett asked to cover the new tenant the same way openabdev#1190's codex lock tests do, so
single-flight / fail-closed is proven for it rather than only structurally similar:

- `with_auth_locked_merges_anthropic_tenant_no_lost_update` — a concurrent codex
  write does not clobber a just-written `anthropic-oauth` token (the new tenant
  rides the same locked RMW funnel).
- `lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant` — a held
  anthropic refresh lock makes a second anthropic acquire fail closed (`TimedOut`),
  and does NOT block codex (per-tenant isolation — the reason §5.4 uses a
  per-tenant lock, not the global one).

215 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants