Support identity-based mutual TLS - Closes #249 #250
Draft
conradbzura wants to merge 5 commits into
Extend WorkerCredentials with a logical identity and add WorkerCredentialsProvider, a callable-backed provider that resolves credential material per use. A reloadable provider re-reads its factory on each resolve, so a long-running pool adopts rotated certificates without a restart; a non-reloadable one caches at construction and ships its snapshot across the worker-spawn pickle boundary. WorkerCredentials.as_provider and the coerce classmethod give callers one mechanism for the static and the rotating case alike.
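The reloadable versus cached behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Credentials` and `CallableProvider` are hypothetical stand-ins for `WorkerCredentials` and `WorkerCredentialsProvider`, and the constructor signature is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Credentials:
    """Hypothetical stand-in for WorkerCredentials: PEM bytes plus a logical identity."""

    ca_cert: bytes
    cert: bytes
    key: bytes
    identity: Optional[str] = None


class CallableProvider:
    """Hypothetical provider that resolves credentials through a zero-argument factory."""

    def __init__(self, factory: Callable[[], Credentials], reloadable: bool = False):
        self.reloadable = reloadable
        # A reloadable provider keeps the factory and calls it on every resolve.
        self._factory: Optional[Callable[[], Credentials]] = factory if reloadable else None
        # A non-reloadable provider snapshots once at construction and drops the
        # factory, so only plain data would need to cross a pickle boundary.
        self._snapshot: Optional[Credentials] = None if reloadable else factory()

    def resolve(self) -> Credentials:
        if self._factory is not None:
            return self._factory()  # re-read on each use, so rotated material is seen
        assert self._snapshot is not None
        return self._snapshot
```

In this sketch a long-running caller that holds a reloadable provider sees rotated material on its next `resolve()`, while a non-reloadable one keeps returning the value captured at construction.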
Add HandshakeError, a non-transient RpcError marking a reachable worker whose secure handshake failed, and a structural classifier: UNAUTHENTICATED is always a handshake failure, while an ambiguous UNAVAILABLE is promoted only on broad TLS evidence in the error text. Key the channel pool on the resolved WorkerCredentials value and, when it carries an identity, verify the worker certificate against that logical SAN rather than the dialed address. Credentials resolve per dispatch so rotated material is adopted on the next connection. Expose HandshakeError and WorkerCredentialsProvider from the package root.
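The classification rule in this commit (UNAUTHENTICATED always counts, UNAVAILABLE only with TLS evidence) might look something like the sketch below. The `Code` enum is a hypothetical stand-in for gRPC's status codes, and the token list is an assumption; the actual tokens the PR matches are not shown in this description.

```python
from enum import Enum
from typing import Optional


class Code(Enum):
    """Hypothetical stand-in for the gRPC status codes relevant here."""

    UNAUTHENTICATED = "unauthenticated"
    UNAVAILABLE = "unavailable"
    DEADLINE_EXCEEDED = "deadline_exceeded"


# Assumed evidence tokens, chosen for illustration only.
TLS_TOKENS = ("ssl", "tls", "certificate", "handshake")


def is_handshake_failure(code: Code, details: Optional[str]) -> bool:
    """Return True when the failure should be treated as a secure-handshake failure."""
    if code is Code.UNAUTHENTICATED:
        # Unambiguous: the peer was reached but rejected the credentials.
        return True
    if code is Code.UNAVAILABLE:
        # Ambiguous: promote only when the error text carries TLS evidence.
        text = (details or "").lower()
        return any(token in text for token in TLS_TOKENS)
    return False
```

A substring gate like this is deliberately broad. A TLS failure whose text carries none of the tokens simply stays an ordinary UNAVAILABLE.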
Accept a WorkerCredentials value or a WorkerCredentialsProvider on the pool, proxy, process, and local-worker entry points, coercing a bare value to a provider and serving rotating server credentials per connection without a restart. The worker factory protocols widen their credentials parameter to the same union.
A worker that fails the secure handshake is skipped without eviction and logged with its gRPC code and details, so a credential misconfiguration is diagnosable rather than collapsing into a bare "no workers available". The NoWorkersAvailable docstring notes that a pool draining entirely on handshake failures still raises the bare condition.
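A single dispatch pass with skip-without-eviction semantics could be sketched as below. The exception classes, the `dispatch` function, and the `send` callback are hypothetical simplifications; the real load balancer also rotates its starting position and handles other error classes, which this sketch omits.

```python
import logging
from typing import Callable, List

log = logging.getLogger("sketch.loadbalancer")


class HandshakeError(Exception):
    """Hypothetical stand-in: the worker was reachable but the secure handshake failed."""

    def __init__(self, code: str, details: str):
        super().__init__(f"{code}: {details}")
        self.code = code
        self.details = details


class NoWorkersAvailable(Exception):
    """Hypothetical stand-in for the bare 'no workers available' condition."""


def dispatch(workers: List[str], send: Callable[[str], str]) -> str:
    """Try each worker in order; skip handshake failures without removing the worker."""
    for worker in list(workers):
        try:
            return send(worker)
        except HandshakeError as err:
            # Log the code and details so a misconfiguration is diagnosable,
            # then move on. The worker stays in the pool for later dispatches.
            log.warning(
                "handshake failed for %s: code=%s details=%s",
                worker, err.code, err.details,
            )
    # Every candidate failed its handshake: raise the bare condition.
    raise NoWorkersAvailable("no workers available")
```

Because the pool list is never mutated here, a worker that was mid-rotation during one dispatch is tried again on the next.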
Cover the credential provider, identity verification, and credential rotation in the package and worker READMEs, and add the end-to-end integration test proving identity-based mTLS dispatch, the diagnosable rejection signals, and credential rotation without a restart.
Summary
Close the three gaps that block Wool's mutual TLS on dynamic-address orchestrators (Kubernetes, ECS/Fargate), while keeping the existing certificate-authority trust model and `WorkerCredentials` exactly as they are. The change is additive and opt-in: a bare `WorkerCredentials` is coerced into an identity-free static provider everywhere, so static-address mTLS, one-way TLS, and plaintext deployments behave byte-for-byte as before, and no protobuf change is required.

The approach introduces a `CredentialsProviderLike` seam (the unit the runtime consults for current credential material) and threads it through the Worker Subsystem (client and server) and the Load Balancing layer:

- Identity verification via `ssl_target_name_override`, so a worker reached at a dynamically assigned address is verified against a stable logical identity (its certificate SAN) rather than the dialed address. Full chain and SAN verification are preserved; verification is strengthened, not relaxed.
- Credential rotation without a restart via `grpc.dynamic_ssl_server_credentials`.
- Diagnosable handshake failures via `HandshakeError` (with a `reason` enum). A handshake-failing worker is skipped without eviction, so a worker mid-rotation self-heals on a later dispatch, and each rejection is emitted as a structured log carrying the classified reason; a drained dispatch raises the plain `NoWorkersAvailable`. A typed aggregate of the per-worker failures is deferred to a follow-up (see Review remediation).

One trade-off worth noting: rotation replaces cert/key/CA bytes but not the mutual-TLS mode, which `grpc.dynamic_ssl_server_credentials` fixes at construction. A second: the handshake classifier keys control flow on the gRPC status code plus the presence of any TLS/cert token in the error text, a deliberately broad gate that degrades a tokenless TLS failure to plain transient-skip rather than risking a false eviction. A handshake failure is skipped without eviction, so a worker mid-rotation self-heals on a later dispatch rather than being removed until re-discovery.

Closes #249
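As a rough illustration of the identity path described in the Summary, the sketch below builds gRPC channel options that add the target-name override only when a non-blank identity is configured. The helper names are hypothetical; `grpc.ssl_target_name_override` is the standard gRPC channel-argument key, and treating whitespace-only strings as blank is an assumption.

```python
from typing import List, Optional, Tuple


def normalize_identity(identity: Optional[str]) -> Optional[str]:
    """Map a blank identity to None, which selects the address-based path."""
    if identity is None or not identity.strip():
        return None
    return identity


def channel_options(identity: Optional[str]) -> List[Tuple[str, str]]:
    """Return extra channel options for a secure channel to a worker."""
    identity = normalize_identity(identity)
    if identity is None:
        # No logical identity: the certificate is verified against the dialed address.
        return []
    # With an identity, the certificate SAN is checked against this name instead.
    return [("grpc.ssl_target_name_override", identity)]
```

A list like this would typically be passed as the `options` argument when creating a secure channel, for example with `grpc.aio.secure_channel`.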
Proposed changes
Credential provider abstraction (`worker/auth.py`, Worker Subsystem)

Add `CredentialsProviderLike` (a `resolve() -> CredentialsSnapshot` protocol with a `reloadable` flag) and the fingerprinted `CredentialsSnapshot`, with one public implementation, `FileCredentialsProvider` (re-reads PEM files on change, validates before caching, returns the last-good snapshot on a transient or malformed read, and is lock-guarded for the off-loop server fetcher), plus an internal `_StaticCredentialsProvider` that a bare `WorkerCredentials` coerces into. `WorkerCredentials.provider_from_files(..., identity=, reload=)` is the ergonomic entry point; `_coerce_provider` is the single seam that normalizes a bare `WorkerCredentials` into a provider. A blank identity normalizes to `None` (address-based path). `CredentialsContext` is widened to carry either form. The public surface is deliberately small: pass a `CredentialsProviderLike` (typically a `FileCredentialsProvider`, or your own custom provider), or a `WorkerCredentials` for the trivial static case.

Identity verification and fingerprint-keyed pool (`worker/connection.py`, Worker Subsystem)

Re-key the client channel pool on `(target, credential fingerprint + identity, options)` via `_CredentialKey` so unchanged material reuses a pooled channel and rotated material yields a fresh one. `_channel_factory` adds `ssl_target_name_override` only when an identity is configured; `WorkerConnection` holds a provider and resolves it per dispatch. Self-dispatch over the loopback UDS keeps its insecure key.

Diagnosable handshake failures (`worker/connection.py` + `loadbalancer/`, Worker Subsystem + Load Balancing)

Add `HandshakeError(RpcError)` with a typed `Reason`, classified structurally from the gRPC status code and error text (including the `ssl_target_name_override` hostname-verification mismatch → `IDENTITY_MISMATCH` and CA-verification failures → `CERT_VERIFY`). `RoundRobinLoadBalancer` skips a handshake-failing worker without evicting it (advancing to the next candidate and logging a structured warning that carries the classified reason), so a rotated worker recovers on a later dispatch. A drained dispatch raises the plain `NoWorkersAvailable`. `HandshakeError.details` is redacted to a fixed message rather than gRPC's verbose debug blob.

Provider wiring, client and server (`worker/proxy.py`, `pool.py`, `local.py`, `process.py`, `base.py`)

`WorkerProxy`, `WorkerPool`, `LocalWorker`, `WorkerProcess`, and the factory protocols accept either a `WorkerCredentials` or a provider. The proxy resolves its provider from the argument or the ambient context and forwards it to each connection; `WorkerProcess` serves rotating credentials via a per-connection fetcher for a reloadable provider and the unchanged static path otherwise; `LocalWorker`'s stop RPC applies the identity override. The insecure self-dispatch UDS socket is confined to a per-worker `0700` directory under a short base (`$XDG_RUNTIME_DIR` or `/tmp`, within the `AF_UNIX` path limit).

Public API and docs (`__init__.py`, READMEs)

Export `CredentialsProviderLike`, `CredentialsSnapshot`, `FileCredentialsProvider`, and `HandshakeError` (the static wrapper stays internal as `_StaticCredentialsProvider`). Document identity verification, rotation, the diagnosable handshake errors, the self-dispatch socket, and the discovery-plane trust boundary in the top-level and worker READMEs.

Test cases
- `TestCredentialsSnapshot`
- `TestStaticCredentialsProvider`: … `resolve()` is called repeatedly; … `reloadable` False, and survives a pickle roundtrip
- `TestFileCredentialsProvider`
- `TestWorkerCredentials`: … `reload` and `identity` (incl. blank); `provider_from_files` is called; … `None`
- `TestHandshakeError`: `HandshakeError` is constructed and pickled; … `RpcError` exposing the reason, and survives a serialization roundtrip
- `TestWorkerConnection`: … `UNAUTHENTICATED` or `UNAVAILABLE` with TLS evidence (incl. a verbatim hostname-verification failure); … `HandshakeError` with the matching reason (`IDENTITY_MISMATCH` for the hostname case); a plain `UNAVAILABLE` stays transient; details are redacted
- `TestWorkerConnection`: … `ssl_target_name_override` only when an identity is set
- `TestWorkerConnection`
- `TestWorkerConnection`
- `TestWorkerProxy`
- `TestRoundRobinLoadBalancer`: … `NoWorkersAvailable` with the workers left in the pool (skipped, not evicted)
- `TestRoundRobinLoadBalancer`: … `NoWorkersAvailable`
- `TestRoundRobinLoadBalancer`
- `TestWorkerProcess`: … `0700` directory within the path limit
- `TestLocalWorker`: … `stop()` is called
- `test_identity_mtls` (integration): … `NoWorkersAvailable` with the classified reason (`CERT_VERIFY`/`IDENTITY_MISMATCH`) logged; and a rotated CA is adopted without restarting the worker

Review remediation (post-review hardening)
A 20-agent application-security panel reviewed this PR (0 blocking; security core verified sound). The findings are remediated in the commits from `fix(loadbalancer): recover handshake-failed workers …` through `test+docs(mtls): …`:

- … `NoWorkersAvailable`, with each rejection logged.
- … `HandshakeError.details`. The dedicated `AllWorkersUnauthenticated` signal was removed in favour of the plain `NoWorkersAvailable` (its name over-claimed, and it only covered the all-or-nothing case); a typed aggregate of the per-worker failures, likely an `ExceptionGroup`-based `NoWorkersAvailable` variant, is deferred to a follow-up issue.
- … `st_ino` + `st_ctime_ns`), and a `threading.Lock` around the off-loop resolve.
- `reloadable` is now a protocol member; a blank identity normalizes to `None`.
- … `0700` directory under a short base directory (the canonical fix for the `AF_UNIX` path-length limit; `man 7 unix`), not a world-readable temp path.
- `ssl_target_name_override` mismatch now classifies as `IDENTITY_MISMATCH` (the real `AioRpcError` says "Hostname Verification Check failed", not the C-core log's "no match found for server name"); a drift canary pins the gRPC strings.
- `WorkerMetadata` (incl. the `secure` flag) is self-advertised over an unauthenticated, forgeable plane, so the security filter is a compatibility gate, not a trust boundary; trust rests on the mTLS handshake.

New e2e coverage: real rotation-without-restart, negative-identity rejection, wire-safety pickle round-trips, in-flight-survives-rotation, and the classifier drift canary. Full suite: 1368 passed, 98.10% coverage.
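The file-provider behaviour mentioned above (change detection on `st_ino` + `st_ctime_ns`, validate before caching, last-good fallback, and a `threading.Lock` around the resolve) can be sketched for a single file as follows. `ReloadingFile` is a hypothetical helper, not the PR's `FileCredentialsProvider`, and the PEM check is a placeholder for real validation.

```python
import os
import threading
from typing import Optional, Tuple


class ReloadingFile:
    """Hypothetical helper: caches a file's bytes and re-reads only when it changes."""

    def __init__(self, path: str):
        self._path = path
        self._lock = threading.Lock()  # guards state for callers on other threads
        self._stamp: Optional[Tuple[int, int]] = None
        self._good: Optional[bytes] = None

    def read(self) -> bytes:
        with self._lock:
            try:
                st = os.stat(self._path)
                stamp = (st.st_ino, st.st_ctime_ns)
                if stamp != self._stamp:
                    with open(self._path, "rb") as fh:
                        data = fh.read()
                    # Placeholder validation (an assumption): require a PEM header.
                    if not data.startswith(b"-----BEGIN "):
                        raise ValueError("not PEM")
                    # Cache only after validation succeeds.
                    self._good, self._stamp = data, stamp
            except (OSError, ValueError):
                # Transient or malformed read: keep serving the last-good bytes.
                if self._good is None:
                    raise
            return self._good
```

Because the stamp is updated only after validation, a malformed file is retried on every call until a valid one replaces it.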