feat(networking): native mTLS with subject-name authorization for fabric inter-node communication #4681
rushabhvaria wants to merge 15 commits
…nication

Add optional TLS/mTLS configuration for Restate's fabric port (5122). This enables securing inter-node communication at the application layer without relying on Kubernetes NetworkPolicy or external service meshes.

Configuration lives under [networking.tls] with support for:
- Strict mode (TLS only) and optional mode (accepts both plaintext and TLS)
- Mutual TLS with configurable client certificate requirements
- Periodic certificate hot-reload from disk (default: 1h)
- Client config inheritance from server config when not specified separately
- Scheme-based signaling (https:// in advertised-address)

Key changes:
- Add FabricTlsOptions, FabricTlsClientOptions, TlsMode config structs
- Add TlsCertResolver with ArcSwap-based lock-free cert rotation
- Modify run_hyper_server to support TLS accept and protocol sniffing
- Modify GrpcConnector to use ClientTlsConfig for https:// peers
- Extend PeerNetAddress with is_tls() and derive_from_bind_address_with_tls()
- Add tokio-rustls, rustls-pemfile workspace dependencies

Without [networking.tls] configuration, behavior is identical to today.
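To illustrate the protocol sniffing that optional mode relies on, here is a minimal sketch of classifying a connection by its first byte. A TLS ClientHello arrives in a handshake record whose content type is `0x16`, so anything else can be treated as plaintext. The names `Transport` and `sniff_transport` are hypothetical and are not taken from the PR; the actual change to `run_hyper_server` may be structured differently.

```rust
/// Transport chosen for a freshly accepted connection.
#[derive(Debug, PartialEq, Eq)]
enum Transport {
    Tls,
    Plaintext,
}

/// TLS record content type for handshake messages; a ClientHello starts with it.
const TLS_HANDSHAKE_RECORD: u8 = 0x16;

/// Classify a connection from the first bytes the client sent.
/// Returns `None` when no bytes are available yet, so the caller can keep waiting.
fn sniff_transport(peeked: &[u8]) -> Option<Transport> {
    match peeked.first() {
        Some(&TLS_HANDSHAKE_RECORD) => Some(Transport::Tls),
        Some(_) => Some(Transport::Plaintext),
        None => None,
    }
}
```

In an async server the bytes would presumably come from a non-consuming peek on the accepted socket, so that the TLS acceptor or the plaintext HTTP stack still sees the full stream afterwards.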
- Config parsing tests: TOML deserialization, defaults, mode parsing, client inheritance fallback, client override
- TLS resolver tests: cert loading from PEM, missing file errors, empty cert file errors, invalid key handling, mismatched cert/key rejection
- Address tests: is_tls() for https/http/UDS, derive_from_bind_address_with_tls()

Also restores inline comments in derive_from_bind_address_with_tls that were inadvertently dropped during refactoring.
feat(networking): native mTLS for fabric inter-node communication
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
Add cluster-level integration tests that verify multi-node Restate clusters form correctly with TLS-secured fabric communication.

Tests:
- fabric_tls_strict_cluster: 3-node cluster with strict mTLS, verifies all nodes connect and cluster becomes healthy
- fabric_tls_optional_mode: 3-node cluster with optional TLS mode, verifies nodes form cluster accepting both TLS and plaintext

Uses rcgen to generate test CA + per-node certificates at runtime. Nodes use random TCP ports (not UDS) since TLS applies to TCP only.
test(networking): add integration tests for fabric mTLS
mTLS authenticates the peer but doesn't authorize them. In environments where a shared CA issues certs to many services (e.g., SPIFFE), any service could connect to the fabric port.

This adds an optional `allowed-sans` config that checks the peer certificate's Subject Alternative Names (DNS names and URIs) against glob patterns after the TLS handshake succeeds.

Config example:

    [networking.tls]
    allowed-sans = ["spiffe://svc.pin220.com/restate-agents/*"]

Implementation:
- SanCheckingVerifier wraps WebPkiClientVerifier, adding SAN check after chain validation passes
- Uses x509-parser to extract SANs from DER certificates
- Supports * glob wildcards for flexible pattern matching
- When allowed-sans is empty (default), behavior is unchanged

Tests:
- glob_match: exact, trailing wildcard, middle wildcard, prefix, multi
- Config parsing with allowed-sans field
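For readers unfamiliar with `*` glob matching, the following is a minimal, std-only sketch of one way such a matcher can work. It reuses the name `glob_match` from the commit message, but the body is an illustration and not the PR's code. In particular, it assumes `*` matches any run of characters, including `/` and `.`, which the PR may or may not do.

```rust
/// Returns true if `text` matches `pattern`, where `*` matches any run of
/// bytes (including none) and every other byte must match literally.
fn glob_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0usize, 0usize);
    // Pattern index of the most recent `*` and the text index it currently extends to.
    let mut backtrack: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            // Record the star and first try matching it against the empty string.
            backtrack = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            // Literal byte matches: advance both cursors.
            pi += 1;
            ti += 1;
        } else if let Some((star_pi, star_ti)) = backtrack {
            // Mismatch: let the last star absorb one more byte and retry.
            pi = star_pi + 1;
            ti = star_ti + 1;
            backtrack = Some((star_pi, star_ti + 1));
        } else {
            return false;
        }
    }
    // The text is exhausted; only trailing stars may remain in the pattern.
    p[pi..].iter().all(|&b| b == b'*')
}
```

With this sketch, the pattern from the config example accepts any URI under the `restate-agents/` path and rejects URIs under a different path.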
…d add CN matching

Rename `allowed-sans` to `allowed-subject-names` to better reflect that both the Subject Common Name (CN) and Subject Alternative Names (DNS/URI) are checked against the allowed patterns.

The verifier now checks CN first, then SANs. This handles certs that use CN alone (without SANs) and provides a more complete authorization model.

Tests added:
- test_subject_verifier_accepts_matching_cn: CN-only cert accepted
- test_subject_verifier_cn_fallback_when_no_san: CN match when no SANs present
- test_subject_verifier_rejects_no_match_anywhere: neither CN nor SANs match
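The CN-first, then-SANs order described above can be sketched as follows. `PeerIdentity` and `first_authorized_name` are hypothetical names introduced for illustration; the PR extracts these fields with x509-parser inside its verifier, which is not reproduced here. The pattern matcher is passed in as a parameter so the sketch does not depend on any particular glob implementation.

```rust
/// Identity fields taken from a peer certificate (hypothetical shape).
struct PeerIdentity {
    /// Subject Common Name, if the certificate has one.
    common_name: Option<String>,
    /// DNS and URI Subject Alternative Names.
    sans: Vec<String>,
}

/// Returns the first name that satisfies any allowed pattern,
/// checking the CN before the SANs. `None` means the peer is not authorized.
fn first_authorized_name<'a>(
    identity: &'a PeerIdentity,
    allowed: &[&str],
    matches: impl Fn(&str, &str) -> bool,
) -> Option<&'a str> {
    identity
        .common_name
        .iter()
        .chain(identity.sans.iter())
        .map(String::as_str)
        .find(|&name| allowed.iter().any(|&pattern| matches(pattern, name)))
}
```

A certificate with no SANs is still authorized when its CN matches, and a certificate whose CN does not match can still be accepted through one of its SANs.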
feat(networking): add SAN-based authorization for fabric mTLS
… is enabled

Prevent accidental fail-open: when require-client-auth is true, allowed-subject-names must be explicitly set. Operators who want CA-only trust (no identity checking) set allowed-subject-names = ["*"] to make the choice explicit. An empty list with client auth enabled is now a configuration error that prevents node startup.

This addresses feedback that the previous default (empty = allow all) could lead to unintended access when using a shared CA.

Changes:
- Add FabricTlsOptions::validate() with startup-time check
- Call validate() during node initialization before TLS setup
- Treat ["*"] as explicit CA-only trust (skip SubjectNameVerifier)
- Update integration tests to use allowed-subject-names = ["*"]
- 4 new validation unit tests

Config that now fails:

    [networking.tls]
    require-client-auth = true
    # missing allowed-subject-names → startup error

Config that works:

    [networking.tls]
    require-client-auth = true
    allowed-subject-names = ["*"]  # explicit CA-only trust
    # OR
    allowed-subject-names = ["spiffe://dom/*"]  # identity-based authz
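The fail-closed rule in this commit can be sketched as a small startup-time check. `TlsOptions` and `PeerAuthorization` are hypothetical stand-ins introduced for illustration; the real `FabricTlsOptions::validate()` has more fields and may report errors differently.

```rust
/// How peers are authorized once the certificate chain has been validated.
#[derive(Debug, PartialEq, Eq)]
enum PeerAuthorization {
    /// Client certificates are not required, so no subject-name check applies.
    NotRequired,
    /// `["*"]`: any certificate issued by the trusted CA is accepted.
    CaOnly,
    /// The peer's CN or SANs must match at least one of these patterns.
    SubjectNames(Vec<String>),
}

/// Hypothetical, trimmed-down stand-in for the TLS options.
struct TlsOptions {
    require_client_auth: bool,
    allowed_subject_names: Vec<String>,
}

impl TlsOptions {
    /// Fails when client auth is on but no subject names are listed,
    /// so a forgotten setting cannot silently allow every peer.
    fn validate(&self) -> Result<PeerAuthorization, String> {
        if !self.require_client_auth {
            return Ok(PeerAuthorization::NotRequired);
        }
        match self.allowed_subject_names.as_slice() {
            [] => Err("allowed-subject-names must be set when require-client-auth is true".to_string()),
            [only] if only.as_str() == "*" => Ok(PeerAuthorization::CaOnly),
            names => Ok(PeerAuthorization::SubjectNames(names.to_vec())),
        }
    }
}
```

Returning the resolved policy from the check, rather than a bare success value, is a design choice of this sketch: it lets the caller decide whether to build a subject-name verifier at all without re-inspecting the list.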
…subject-names
feat(networking): require allowed-subject-names when mTLS client auth…
nickpan47 left a comment
Overall lgtm. Minor comment on duplicated code section.
Thanks a lot for adding mTLS support to Restate @rushabhvaria. It looks like a great contribution. Right now the team is a little bit busy with finalizing the 1.7 release and that's why we probably need a bit of time to give your PR the deserved attention. So please bear with us.
Extract serve_connection() helper to eliminate repeated connection error-handling blocks across TLS, plaintext, and UDS code paths. Also simplify the TLS/plaintext branching by resolving the TLS acceptor first, then handling the connection in two clean branches instead of five duplicated blocks. Addresses review feedback from nickpan47 on PR restatedev#4681.
…dler
Fix/mtls dedup connection handler
@AhmedSoliman can you approve the workflow so that Claude Code can review the pull request? Also, could you help provide an ETA for potentially merging this? Thank you.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0b449679a4
Fix three issues identified in PR review:

1. Auto-advertise https:// when TLS is configured (P1)
   When [networking.tls] is set but no explicit advertised-address is configured, the auto-guessed address now uses the https:// scheme. Previously it always used http://, causing peers to attempt plaintext connections that fail in strict mode.

2. Add TLS handshake timeout to prevent accept-loop blocking (P1)
   The TLS handshake was awaited inline in the listener loop, allowing a stalled client to block all new connections. It is now bounded by a 5s timeout; stalled handshakes are dropped without blocking peers.

3. Strip IPv6 brackets before constructing ServerName (P2)
   uri.host() returns bracketed IPv6 (e.g., "[::1]"), which is invalid for rustls ServerName. Brackets are now stripped before the TLS connection is made.
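Fixes 1 and 3 are simple enough to sketch with the standard library alone. The function names below are hypothetical and are not taken from the PR. The sketch uses `std::net::IpAddr` parsing as a stand-in for rustls `ServerName` construction, since the standard parser likewise rejects a bracketed literal such as `"[::1]"`.

```rust
use std::net::IpAddr;

/// Scheme to advertise for an auto-derived address (fix 1).
fn advertised_scheme(tls_configured: bool) -> &'static str {
    if tls_configured { "https" } else { "http" }
}

/// Remove the square brackets that URI syntax places around IPv6 literals,
/// so "[::1]" becomes "::1". Any other host is returned unchanged (fix 3).
fn strip_ipv6_brackets(host: &str) -> &str {
    host.strip_prefix('[')
        .and_then(|rest| rest.strip_suffix(']'))
        .unwrap_or(host)
}

/// Parse a URI host as an IP address, tolerating bracketed IPv6 literals.
/// Returns `None` for DNS names, which a caller would handle as hostnames.
fn host_as_ip(host: &str) -> Option<IpAddr> {
    strip_ipv6_brackets(host).parse().ok()
}
```

The handshake timeout (fix 2) is not sketched here because it depends on the async runtime in use.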
@rushabhvaria sorry for the slow response time. We are currently in the process of creating the v1.7 release which requires most of our attention. At the latest, we are going to review this PR once the release is created.
Closes #3306
Related: #3583
Summary
- …`allowed-subject-names`): after mTLS authentication, verify that the peer's Subject CN and SANs match allowed patterns. This prevents unauthorized services from connecting when using a shared CA
- `allowed-subject-names` is required when `require-client-auth` is true, which prevents accidental fail-open. Use `["*"]` to explicitly opt into CA-only trust
Motivation
Restate's security docs state: "You are expected to secure access to [the fabric port] using the network and proxy layers available in your deployment environment." The recommended approach is Kubernetes NetworkPolicy — but many production environments don't support it (shared clusters, certain CNI plugins, platform constraints). Most distributed systems (etcd, CockroachDB, Consul) offer built-in inter-node TLS — this brings Restate to parity, especially for enterprise environments.
The authorization layer addresses feedback that mTLS alone is insufficient when using a shared CA (e.g., SPIFFE). Without identity checking, any service holding a cert from the same CA could connect to the fabric port.
Configuration
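Without `[networking.tls]`, behavior is identical to today (plaintext).
Authorization behavior
| `require-client-auth` | `allowed-subject-names` | Behavior |
| --- | --- | --- |
| `true` | (empty) | Configuration error at startup |
| `true` | `["*"]` | CA-only trust, no identity check |
| `true` | `["spiffe://domain/*"]` | Identity-based authorization: CN or SANs must match |
| `false` | (any) | Subject-name check skipped |
Design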
Encryption and Authentication (mTLS):
- `tokio-rustls::TlsAcceptor` wraps `TcpStream` before hyper
- …`0x16` (TLS ClientHello) routes to TLS, else plaintext
- …`tower::service_fn` connector using `tokio_rustls::TlsConnector`, reads latest certs from `ArcSwap` per-connection
- …`ArcSwap` (lock-free)
- …`https://`: peers use the scheme to decide connection type
Authorization (subject-name verification):
- `SubjectNameVerifier` wraps `WebPkiClientVerifier`: delegates chain validation, then checks identity
- …`x509-parser` for DER certificate parsing
- `["*"]` explicitly skips identity checking (CA-only trust, no `SubjectNameVerifier` overhead)
- …`allowed-subject-names` when client auth is enabled
Rolling upgrade path:
- …`mode = "optional"` and TLS certs: nodes advertise `https://`, accept both
- …`mode = "strict"`: plaintext rejected
Note on restatectl compatibility (related to #3583):
Port 5122 currently serves both internal (`CoreNodeSvc`) and external (`ClusterCtrlSvc`, `NodeCtlSvc`) gRPC services. In `optional` mode, `restatectl` connects via plaintext while inter-node traffic uses TLS. Once #3583 splits these into separate ports, `strict` mode can be applied to the internal port without affecting `restatectl`.
Changes
- `crates/types/src/config/networking.rs`: `FabricTlsOptions`, `TlsMode`, `allowed-subject-names`, `validate()` + 11 unit tests
- `crates/types/src/net/address.rs`: `PeerNetAddress::is_tls()`, `derive_from_bind_address_with_tls()` + 2 tests
- `crates/core/src/network/tls.rs`: `TlsCertResolver`, `SubjectNameVerifier`, cert loading, hot-reload, `glob_match` + 19 unit tests
- `crates/core/src/network/net_util.rs`
- `crates/core/src/network/grpc/connector.rs`: …`https://` peers
- `crates/core/src/network/server_builder.rs`: …`TlsCertResolver` to listener
- `crates/core/src/network/networking.rs`
- `crates/node/src/lib.rs`
- `crates/admin/src/service.rs`: …`None` for admin port (no TLS on admin)
- `server/tests/fabric_tls.rs`
Verification
- `cargo check`: all modified crates compile
- `cargo clippy -D warnings`: zero warnings
- `cargo fmt --check`: clean
Test plan
Config and validation (11 tests):
- …`allowed-subject-names`
- …`["*"]` = OK, no client auth = skip, specific patterns = OK
TLS core (19 tests):
- `is_tls()` detection: https, http, bare host, UDS
- `derive_from_bind_address_with_tls()`: http:// vs https:// scheme
- …`rcgen`):
Integration (2 tests):
- `fabric_tls_strict_cluster`: 3-node cluster with strict mTLS
- `fabric_tls_optional_mode`: 3-node cluster with optional TLS