Support identity-based mutual TLS — Closes #249 #250

Draft
conradbzura wants to merge 5 commits into wool-labs:main from conradbzura:249-identity-based-mtls

conradbzura commented Jun 17, 2026

Summary

Close the three gaps that block Wool's mutual TLS on dynamic-address orchestrators (Kubernetes, ECS/Fargate), while keeping the existing certificate-authority trust model and WorkerCredentials exactly as they are. The change is additive and opt-in: a bare WorkerCredentials is coerced into an identity-free static provider everywhere, so static-address mTLS, one-way TLS, and plaintext deployments behave byte-for-byte as before, and no protobuf change is required.

The approach introduces a CredentialsProviderLike seam — the unit the runtime consults for current credential material — and threads it through the Worker Subsystem (client and server) and the Load Balancing layer:

  • Identity-based verification keys client channels on a content fingerprint and adds gRPC's ssl_target_name_override, so a worker reached at a dynamically assigned address is verified against a stable logical identity (its certificate SAN) rather than the dialed address. Full chain and SAN verification are preserved — verification is strengthened, not relaxed.
  • Rotation without restart re-reads file-backed material on change: the fingerprinted client pool yields fresh channels for rotated material while in-flight dispatches finish on their existing channel, and the worker server adopts new material per connection via grpc.dynamic_ssl_server_credentials.
  • Diagnosable failures classify a failed handshake as a typed HandshakeError (with a reason enum). A handshake-failing worker is skipped without eviction — so a worker mid-rotation self-heals on a later dispatch — and each rejection is emitted as a structured log carrying the classified reason; a drained dispatch raises the plain NoWorkersAvailable. A typed aggregate of the per-worker failures is deferred to a follow-up (see Review remediation).

Two trade-offs are worth noting. First, rotation replaces cert/key/CA bytes but not the mutual-TLS mode, which grpc.dynamic_ssl_server_credentials fixes at construction. Second, the handshake classifier keys control flow on the gRPC status code plus the presence of any TLS/cert token in the error text. This gate is deliberately broad: a TLS failure whose error text carries no such token degrades to a plain transient skip rather than risking a false eviction. In either case a handshake failure is skipped without eviction, so a worker mid-rotation self-heals on a later dispatch rather than being removed until re-discovery.
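The skip-without-evict behavior can be sketched roughly as follows. This is not the RoundRobinLoadBalancer implementation: workers are modeled as plain callables, and RoundRobinSketch and the local exception classes are hypothetical stand-ins used only for illustration.

```python
from __future__ import annotations

import logging
from typing import Any, Callable, List, Sequence

logger = logging.getLogger(__name__)


class NoWorkersAvailable(Exception):
    """Local stand-in for the plain 'pool drained' condition."""


class HandshakeError(Exception):
    """Local stand-in: a reachable worker whose secure handshake failed."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


class RoundRobinSketch:
    """Hypothetical round-robin dispatcher over callable workers."""

    def __init__(self, workers: Sequence[Callable[[Any], Any]]) -> None:
        self.workers: List[Callable[[Any], Any]] = list(workers)
        self._next = 0

    def dispatch(self, task: Any) -> Any:
        # Try each worker at most once per dispatch, starting at the cursor.
        for _ in range(len(self.workers)):
            index = self._next
            self._next = (self._next + 1) % len(self.workers)
            try:
                return self.workers[index](task)
            except HandshakeError as exc:
                # Skip, but keep the worker in the pool so that it can
                # recover on a later dispatch (e.g. once rotation settles).
                logger.warning(
                    "worker rejected: handshake failed",
                    extra={"worker_index": index, "reason": exc.reason},
                )
        raise NoWorkersAvailable("every candidate failed or the pool is empty")
```

The design point is that the except branch never removes the worker from self.workers, so a worker that fails once is simply tried again on a later dispatch.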

Closes #249

Proposed changes

Credential provider abstraction (worker/auth.py, Worker Subsystem)

Add CredentialsProviderLike (a resolve() -> CredentialsSnapshot protocol with a reloadable flag) and the fingerprinted CredentialsSnapshot, with one public implementation — FileCredentialsProvider (re-reads PEM files on change, validates before caching, returns the last-good snapshot on a transient or malformed read, and is lock-guarded for the off-loop server fetcher) — plus an internal _StaticCredentialsProvider that a bare WorkerCredentials coerces into. WorkerCredentials.provider_from_files(..., identity=, reload=) is the ergonomic entry point; _coerce_provider is the single seam that normalizes a bare WorkerCredentials into a provider. A blank identity normalizes to None (address-based path). CredentialsContext is widened to carry either form. The public surface is deliberately small: pass a CredentialsProviderLike (typically a FileCredentialsProvider, or your own custom provider), or a WorkerCredentials for the trivial static case.
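As a rough illustration of what a fingerprinted snapshot could look like, the sketch below hashes the credential bytes, the identity, and the mutual flag into one digest. The class name SnapshotSketch, its field names, and the choice of SHA-256 with length-prefixed parts are all assumptions for illustration; the PR does not specify how CredentialsSnapshot computes its fingerprint.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from typing import Optional


def normalize_identity(identity: Optional[str]) -> Optional[str]:
    """Map a blank identity to None, i.e. the address-based path."""
    if identity is None or not identity.strip():
        return None
    return identity


@dataclass(frozen=True)
class SnapshotSketch:
    """Hypothetical stand-in for CredentialsSnapshot (field names assumed)."""

    ca_cert: bytes
    cert: bytes
    key: bytes
    identity: Optional[str] = None
    mutual: bool = True
    fingerprint: str = field(init=False)

    def __post_init__(self) -> None:
        digest = hashlib.sha256()
        parts = (
            self.ca_cert,
            self.cert,
            self.key,
            (self.identity or "").encode("utf-8"),
            b"\x01" if self.mutual else b"\x00",
        )
        for part in parts:
            # Length-prefix each part so adjacent fields cannot blur together.
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
        # frozen=True blocks normal assignment, so set the derived field directly.
        object.__setattr__(self, "fingerprint", digest.hexdigest())
```

Identical material yields an identical fingerprint, and any change to a byte, the identity, or the mutual flag yields a different one, which is the property a fingerprint-keyed pool relies on.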

Identity verification and fingerprint-keyed pool (worker/connection.py, Worker Subsystem)

Re-key the client channel pool on (target, credential fingerprint + identity, options) via _CredentialKey so unchanged material reuses a pooled channel and rotated material yields a fresh one. _channel_factory adds ssl_target_name_override only when an identity is configured; WorkerConnection holds a provider and resolves it per dispatch. Self-dispatch over the loopback UDS keeps its insecure key.
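A minimal sketch of the fingerprint-keyed pooling idea follows, assuming a caller-supplied channel factory in place of a real gRPC channel. CredentialKeySketch and ChannelPoolSketch are hypothetical names, not the PR's _CredentialKey or its pool. The only gRPC-specific detail used is the channel argument key grpc.ssl_target_name_override.

```python
from __future__ import annotations

from typing import Callable, Dict, NamedTuple, Optional, Tuple

Options = Tuple[Tuple[str, str], ...]


class CredentialKeySketch(NamedTuple):
    """Hypothetical analogue of _CredentialKey: fingerprint plus identity."""

    fingerprint: str
    identity: Optional[str]


def channel_options(identity: Optional[str], base: Options = ()) -> Options:
    """Append the target-name override only when an identity is configured."""
    if identity is None:
        return base
    return base + (("grpc.ssl_target_name_override", identity),)


class ChannelPoolSketch:
    """Caches one channel per (target, credential key, options)."""

    def __init__(
        self, factory: Callable[[str, CredentialKeySketch, Options], object]
    ) -> None:
        self._factory = factory
        self._channels: Dict[tuple, object] = {}

    def get(
        self, target: str, key: CredentialKeySketch, options: Options = ()
    ) -> object:
        full_options = channel_options(key.identity, options)
        pool_key = (target, key, full_options)
        if pool_key not in self._channels:
            # Unseen material (e.g. after rotation) builds a fresh channel;
            # channels for older fingerprints are left untouched here.
            self._channels[pool_key] = self._factory(target, key, full_options)
        return self._channels[pool_key]
```

This sketch never closes channels for superseded fingerprints; a real pool would also need some policy for retiring them once in-flight dispatches have finished.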

Diagnosable handshake failures (worker/connection.py + loadbalancer/, Worker Subsystem + Load Balancing)

Add HandshakeError(RpcError) with a typed Reason, classified structurally from the gRPC status code and error text (including the ssl_target_name_override hostname-verification mismatch → IDENTITY_MISMATCH and CA-verification failures → CERT_VERIFY). RoundRobinLoadBalancer skips a handshake-failing worker without evicting it (advancing to the next candidate and logging a structured warning that carries the classified reason), so a rotated worker recovers on a later dispatch. A drained dispatch raises the plain NoWorkersAvailable. HandshakeError.details is redacted to a fixed message rather than gRPC's verbose debug blob.
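The classification gate could look something like the sketch below. It takes the status code as a plain string so that it runs without gRPC installed; real code would compare against grpc.StatusCode members. Apart from the "Hostname Verification Check failed" text quoted in this PR, the token lists and the UNKNOWN member are assumptions chosen for illustration, not the strings or enum the PR actually pins.

```python
from __future__ import annotations

import enum
from typing import Optional


class Reason(enum.Enum):
    IDENTITY_MISMATCH = "identity_mismatch"
    CERT_VERIFY = "cert_verify"
    UNKNOWN = "unknown"  # assumed catch-all, not named in the description


# Lowercased substrings. Only the hostname-verification text comes from the
# PR description; the rest are illustrative guesses.
_IDENTITY_TOKENS = ("hostname verification check failed",)
_CERT_TOKENS = ("certificate verify failed", "certificate_verify_failed")
_TLS_TOKENS = _IDENTITY_TOKENS + _CERT_TOKENS + (
    "ssl",
    "tls",
    "handshake",
    "certificate",
)


def classify(code: str, details: str) -> Optional[Reason]:
    """Return a Reason for a handshake failure, or None for a transient error."""
    text = details.lower()
    if code == "UNAUTHENTICATED":
        is_handshake = True  # always treated as a handshake failure
    elif code == "UNAVAILABLE":
        # Ambiguous code: promote only when the text carries TLS evidence.
        is_handshake = any(token in text for token in _TLS_TOKENS)
    else:
        is_handshake = False
    if not is_handshake:
        return None
    if any(token in text for token in _IDENTITY_TOKENS):
        return Reason.IDENTITY_MISMATCH
    if any(token in text for token in _CERT_TOKENS):
        return Reason.CERT_VERIFY
    return Reason.UNKNOWN
```

Because the gate matches on library error text, it is sensitive to upstream wording changes, which is the kind of drift a canary test that pins the strings is meant to catch.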

Provider wiring, client and server (worker/proxy.py, pool.py, local.py, process.py, base.py)

WorkerProxy, WorkerPool, LocalWorker, WorkerProcess, and the factory protocols accept either a WorkerCredentials or a provider. The proxy resolves its provider from the argument or the ambient context and forwards it to each connection; WorkerProcess serves rotating credentials via a per-connection fetcher for a reloadable provider and the unchanged static path otherwise; LocalWorker's stop RPC applies the identity override. The insecure self-dispatch UDS socket is confined to a per-worker 0700 directory under a short base ($XDG_RUNTIME_DIR or /tmp, within the AF_UNIX path limit).
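One way to confine a Unix-domain socket to a private directory under a short base is sketched below. The function name make_socket_path, the wool- prefix, the socket file name, and the 104-byte bound (a conservative value, since sun_path is 108 bytes on Linux and 104 on macOS and the BSDs) are assumptions for illustration rather than the PR's actual code.

```python
from __future__ import annotations

import os
import tempfile
from typing import Optional

# Conservative bound across platforms (assumption; see man 7 unix).
MAX_SUN_PATH = 104


def make_socket_path(name: str = "self.sock", base: Optional[str] = None) -> str:
    """Return a socket path inside a fresh directory private to this user."""
    if base is None:
        base = os.environ.get("XDG_RUNTIME_DIR") or "/tmp"
    # mkdtemp creates the directory readable, writable, and searchable only
    # by the creating user, i.e. mode 0700.
    directory = tempfile.mkdtemp(prefix="wool-", dir=base)
    path = os.path.join(directory, name)
    if len(os.fsencode(path)) >= MAX_SUN_PATH:
        os.rmdir(directory)
        raise ValueError(f"socket path too long for AF_UNIX: {path!r}")
    return path
```

Keeping the socket inside a 0700 directory means other local users cannot traverse to it, regardless of the permissions on the socket file itself.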

Public API and docs (__init__.py, READMEs)

Export CredentialsProviderLike, CredentialsSnapshot, FileCredentialsProvider, and HandshakeError (the static wrapper stays internal as _StaticCredentialsProvider). Document identity verification, rotation, the diagnosable handshake errors, the self-dispatch socket, and the discovery-plane trust boundary in the top-level and worker READMEs.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestCredentialsSnapshot | Credential material and an identity | A snapshot is built | Its fingerprint is stable for identical material and changes for any byte, identity, or mutual-flag difference | Fingerprint determinism |
| 2 | TestStaticCredentialsProvider | A static provider over fixed material | resolve() is called repeatedly | It returns the same snapshot, reports reloadable False, and survives a pickle roundtrip | Back-compatible provider |
| 3 | TestFileCredentialsProvider | A file provider over PEM files | The files are unchanged, rotated, fail to re-read, or are replaced with malformed PEM | It reuses, re-reads, or keeps the last-good snapshot respectively; an in-place same-size rewrite is still detected; concurrent resolves stay consistent | Rotation source robustness |
| 4 | TestWorkerCredentials | PEM paths with reload and identity (incl. blank) | provider_from_files is called | It returns a static or reloading provider; a blank identity normalizes to None | Provider entry point |
| 5 | TestHandshakeError | A status code, details, and reason | A HandshakeError is constructed and pickled | It is a non-transient RpcError exposing the reason, and survives a serialization roundtrip | Error taxonomy + wire-safety |
| 6 | TestWorkerConnection | A stub raising UNAUTHENTICATED or UNAVAILABLE with TLS evidence (incl. a verbatim hostname-verification failure) | A task is dispatched | It raises HandshakeError with the matching reason (IDENTITY_MISMATCH for the hostname case); a plain UNAVAILABLE stays transient; details are redacted | Handshake classification + drift canary |
| 7 | TestWorkerConnection | A provider with or without an identity | A task is dispatched | The secure channel carries ssl_target_name_override only when an identity is set | Identity override |
| 8 | TestWorkerConnection | A reloading provider, dispatched then rotated | A second task is dispatched; a primed stream is iterated after rotation | A new channel is built for the rotated fingerprint while an in-flight stream finishes on its original channel | Fingerprint-keyed pool + in-flight survives rotation |
| 9 | TestWorkerConnection | A secure connection self-dispatching over UDS | A task is dispatched | It routes over the insecure loopback and never builds a secure channel | Loopback safety |
| 10 | TestWorkerProxy | A provider supplied directly or via the credential context | The proxy starts | It admits only secure workers and forwards the provider to each connection | Proxy provider wiring |
| 11 | TestRoundRobinLoadBalancer | A pool whose workers all fail the handshake | A task is dispatched | It raises NoWorkersAvailable with the workers left in the pool (skipped, not evicted) | Skip-without-evict on drain |
| 12 | TestRoundRobinLoadBalancer | One handshake-failing worker and one healthy worker (and, separately, a surviving transient worker) | A task is dispatched | The failing worker is skipped (not evicted) and the dispatch succeeds on the healthy one; with a transient survivor the result is a plain NoWorkersAvailable | Skip-without-evict + over-claim guard |
| 13 | TestRoundRobinLoadBalancer | A worker that fails the handshake once then its connection recovers | A second dispatch runs | It dispatches successfully without re-discovery, and each rejection is logged | Recoverability + rejection observability |
| 14 | TestWorkerProcess | A reloading versus a static provider | The worker server starts | It builds dynamic server credentials for the former and the static path for the latter; the self-dispatch socket lives in a 0700 directory within the path limit | Server-side rotation + UDS hardening |
| 15 | TestLocalWorker | A worker whose provider carries an identity | stop() is called | The stop channel carries the identity override | Stop-RPC identity |
| 16 | test_identity_mtls (integration) | A real worker with an identity certificate at an ephemeral address | A routine is dispatched with a matching identity; with an untrusted CA; with a mismatched identity; and after rotating the CA on disk | The matching dispatch succeeds; the untrusted CA and the mismatched identity each drain to NoWorkersAvailable with the classified reason (CERT_VERIFY / IDENTITY_MISMATCH) logged; and a rotated CA is adopted without restarting the worker | End-to-end identity, diagnosability, and rotation |

Review remediation (post-review hardening)

A 20-agent application-security panel reviewed this PR (0 blocking; security core verified sound). The findings are remediated in the commits from fix(loadbalancer): recover handshake-failed workers … through test+docs(mtls): …:

  • Load balancer recoverability — handshake failures skip-without-evict (a rotated worker self-heals); a drained dispatch raises the plain NoWorkersAvailable, with each rejection logged.
  • Diagnostics — dropped the in-memory rejection ledger (unbounded, keyed by per-restart uid) in favour of per-rejection structured logs carrying the classified reason; redacted gRPC's debug blob from HandshakeError.details. The dedicated AllWorkersUnauthenticated signal was removed in favour of the plain NoWorkersAvailable (its name over-claimed, and it only covered the all-or-nothing case); a typed aggregate of the per-worker failures — likely an ExceptionGroup-based NoWorkersAvailable variant — is deferred to a follow-up issue.
  • Rotation robustness — validate-before-cache (a malformed PEM never overwrites good material), log on last-good fallback, detect in-place/same-size rewrites and inode swaps (st_ino + st_ctime_ns), and a threading.Lock around the off-loop resolve.
  • Contracts — reloadable is now a protocol member; a blank identity normalizes to None.
  • UDS hardening — the insecure self-dispatch socket is confined to a per-worker 0700 directory under a short base directory (the canonical fix for the AF_UNIX path-length limit; man 7 unix), not a world-readable temp path.
  • Diagnosability end-to-end — an ssl_target_name_override mismatch now classifies as IDENTITY_MISMATCH (the real AioRpcError says "Hostname Verification Check failed", not the C-core log's "no match found for server name"); a drift canary pins the gRPC strings.
  • Docs — documented the discovery-plane trust boundary: WorkerMetadata (incl. the secure flag) is self-advertised over an unauthenticated, forgeable plane, so the security filter is a compatibility gate, not a trust boundary; trust rests on the mTLS handshake.
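The rotation-robustness items above (change detection via stat fields, validate-before-cache, last-good fallback, and a lock around resolve) might combine along the lines of the sketch below. It tracks a single file and uses a deliberately crude PEM check. FileProviderSketch, its helpers, and the exact stat tuple are hypothetical; the real FileCredentialsProvider handles several files and returns a CredentialsSnapshot.

```python
from __future__ import annotations

import logging
import os
import threading
from typing import Optional, Tuple

logger = logging.getLogger(__name__)

Signature = Tuple[int, int, int, int]


def _signature(path: str) -> Signature:
    # Inode and ctime catch atomic replaces and in-place same-size rewrites
    # that a size/mtime comparison alone could miss.
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns, st.st_ctime_ns)


def _looks_like_pem(data: bytes) -> bool:
    # Placeholder validation (assumption); real code would parse the PEM.
    return b"-----BEGIN " in data and b"-----END " in data


class FileProviderSketch:
    """Hypothetical single-file provider with last-good fallback."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._cached_signature: Optional[Signature] = None
        self._last_good: Optional[bytes] = None

    def resolve(self) -> bytes:
        with self._lock:  # resolve may be called from another thread
            try:
                signature = _signature(self._path)
                if signature != self._cached_signature:
                    with open(self._path, "rb") as handle:
                        data = handle.read()
                    if not _looks_like_pem(data):
                        raise ValueError("malformed PEM")
                    # Validate before caching: only good material lands here.
                    self._last_good = data
                    self._cached_signature = signature
            except (OSError, ValueError) as exc:
                if self._last_good is None:
                    raise  # nothing to fall back on yet
                logger.warning("keeping last-good credentials: %s", exc)
            return self._last_good
```

Note that a malformed read leaves the cached signature unchanged, so the file is re-read on every resolve until it becomes valid again.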

New e2e coverage: real rotation-without-restart, negative-identity rejection, wire-safety pickle round-trips, in-flight-survives-rotation, and the classifier drift canary. Full suite: 1368 passed, 98.10% coverage.

conradbzura self-assigned this Jun 17, 2026
conradbzura force-pushed the 249-identity-based-mtls branch 5 times, most recently from 2fa1a09 to b7afa60, Jun 19, 2026 13:15
Extend WorkerCredentials with a logical identity and add
WorkerCredentialsProvider, a callable-backed provider that resolves
credential material per use. A reloadable provider re-reads its factory
on each resolve, so a long-running pool adopts rotated certificates
without a restart; a non-reloadable one caches at construction and ships
its snapshot across the worker-spawn pickle boundary.
WorkerCredentials.as_provider and the coerce classmethod give callers
one mechanism for the static and the rotating case alike.

Add HandshakeError, a non-transient RpcError marking a reachable worker
whose secure handshake failed, and a structural classifier:
UNAUTHENTICATED is always a handshake failure, while an ambiguous
UNAVAILABLE is promoted only on broad TLS evidence in the error text.

Key the channel pool on the resolved WorkerCredentials value and, when
it carries an identity, verify the worker certificate against that
logical SAN rather than the dialed address. Credentials resolve per
dispatch so rotated material is adopted on the next connection. Expose
HandshakeError and WorkerCredentialsProvider from the package root.

Accept a WorkerCredentials value or a WorkerCredentialsProvider on the
pool, proxy, process, and local-worker entry points, coercing a bare
value to a provider and serving rotating server credentials per
connection without a restart. The worker factory protocols widen their
credentials parameter to the same union.

A worker that fails the secure handshake is skipped without eviction and
logged with its gRPC code and details, so a credential misconfiguration
is diagnosable rather than collapsing into a bare "no workers available".
The NoWorkersAvailable docstring notes that a pool draining entirely on
handshake failures still raises the bare condition.

Cover the credential provider, identity verification, and credential
rotation in the package and worker READMEs, and add the end-to-end
integration test proving identity-based mTLS dispatch, the diagnosable
rejection signals, and credential rotation without a restart.
conradbzura force-pushed the 249-identity-based-mtls branch from 67a650b to c04e336, June 30, 2026 16:04