[CRE] [1/5] Gateway handler for confidential relay#21638
Conversation
👋 nadahalli, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!
✅ No conflicts with other open PRs targeting
Pull request overview
Risk Rating: MEDIUM
Adds a new Gateway handler type (confidential-compute-relay) to accept JSON-RPC requests from enclaves, fan them out to relay DON nodes, and aggregate node responses via an F+1 quorum.
Changes:
- Register new handler type/service name in deployment job generation and the gateway handler factory.
- Implement confidential relay gateway handler + F+1 quorum aggregator.
- Add unit tests for fan-out, quorum behavior, timeouts, duplicate IDs, and node rate limiting.
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| go.mod | Bumps deps needed for the new handler integration. |
| go.sum | Corresponding checksum updates. |
| deployment/go.mod | Aligns deployment module deps with root changes. |
| deployment/go.sum | Corresponding checksum updates for deployment module. |
| deployment/cre/jobs/pkg/gateway_job.go | Adds handler type + default config wiring for gateway job specs. |
| core/services/gateway/handler_factory.go | Registers the new handler type in the factory. |
| core/services/gateway/handlers/confidentialrelay/handler.go | New gateway handler: request tracking, fan-out, rate limiting, timeouts, metrics, callback response. |
| core/services/gateway/handlers/confidentialrelay/aggregator.go | Implements F+1 digest-based quorum selection and "quorum unobtainable" detection. |
| core/services/gateway/handlers/confidentialrelay/handler_test.go | Coverage for quorum success/divergence, timeouts, duplicate IDs, and rate limiting behavior. |
Areas warranting careful human review:
- Timeout behavior across layers (gateway request timeout vs handler requestTimeoutSec) and its impact on activeRequests growth.
- Quorum aggregation correctness (digest semantics, behavior with errors/invalid digests).
- Rate-limiting behavior (dropping node responses) and its effect on quorum attainability.
Reviewer recommendations (per CODEOWNERS):
- @smartcontractkit/core and @smartcontractkit/foundations (repo-wide owners; go.mod/go.sum also owned by these teams).
- For deployment changes under /deployment/cre: @smartcontractkit/keystone and @smartcontractkit/operations-platform.
```go
func newDefaultConfidentialRelayHandler() handler {
	return handler{
		Name:        GatewayHandlerTypeConfidentialRelay,
		ServiceName: "confidential",
		Config: confidentialRelayHandlerConfig{
			NodeRateLimiter: nodeRateLimiterConfig{
				GlobalBurst:    10,
				GlobalRPS:      50,
				PerSenderBurst: 10,
				PerSenderRPS:   10,
			},
```
newDefaultConfidentialRelayHandler doesn’t set requestTimeoutSec, so the handler will fall back to its internal default (30s). If the gateway-level request timeout is configured lower (the job enforces a minimum of 5s), the gateway can time out first while the handler keeps the request in activeRequests until its own timeout/cleanup, which can cause unnecessary memory growth under load. Consider adding a RequestTimeoutSec field (similar to vaultHandlerConfig) and setting it to g.RequestTimeoutSec - 1 when constructing the default config so the handler always times out before the gateway.
Also, this handler hardcodes ServiceName: "confidential" instead of using ServiceNameConfidential, which risks future drift if the constant changes.
CORA - Analysis Skipped. Reason: The number of code owners (4) is less than the minimum required (5) and/or the number of CODEOWNERS entries with changed files (2) is less than the minimum required (2).
|
core/scripts/go.mod (outdated):

```diff
 github.com/smartcontractkit/chainlink-automation v0.8.1
 github.com/smartcontractkit/chainlink-ccip v0.1.1-solana.0.20260317185256-d5f7db87ae70
-github.com/smartcontractkit/chainlink-common v0.11.0
+github.com/smartcontractkit/chainlink-common v0.11.1-0.20260323163826-2c5b95089478
```
Is this a canonical branch or a temporary branch?
This is a temporary branch. I have to wait for the new canonical branch to be cut - not sure when. It must be v0.12.0.
Apparently linking to commit hashes is ok.
yeah no problem with commit hash, just making sure this hash is on the main branch in chainlink-common.
Yes, this was merged as a part of chainlink-common#1903
```go
Name:        GatewayHandlerTypeConfidentialRelay,
ServiceName: ServiceNameConfidential,
Config: confidentialRelayHandlerConfig{
	RequestTimeoutSec: requestTimeoutSec - 1,
```
Copied from the vault handler code. They even have a comment explaining why:

```go
// must be lower than the overall gateway request timeout,
// so we allow for the response to be sent back.
```

I'll add the same comment here.
This seems like a problematic config generally, especially to be copied over along many services. Where did this come from? cc @bolekk
At the very least we could use a global buffer number instead of "1". Not particularly in scope for this PR.
this is very weird arithmetic - it is very confusing to have this override !inside! newDefaultConfidentialRelayHandler, which only takes the one parameter requestTimeoutSec. Could you please move this -1 adjustment to the place where the function is called? (Something like https://github.com/smartcontractkit/chainlink/pull/20087/changes did.)
Done. Moved the -1 to the call site with a comment explaining why. Also added a TODO to unify with the vault handler which does the same subtraction internally.
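To illustrate the call-site pattern agreed above, here is a hypothetical sketch (handlerTimeoutSec is an illustrative name, not the actual code): the caller applies the -1 buffer, and the constructor receives the already-adjusted value.

```go
package main

import "fmt"

// handlerTimeoutSec computes the handler-level timeout from the gateway-level
// one. Hypothetical helper sketching the pattern; the real change applies the
// subtraction at the call site of newDefaultConfidentialRelayHandler.
func handlerTimeoutSec(gatewayRequestTimeoutSec int) int {
	// Must be lower than the overall gateway request timeout,
	// so we allow for the response to be sent back.
	return gatewayRequestTimeoutSec - 1
}

func main() {
	fmt.Println(handlerTimeoutSec(30)) // gateway 30s -> handler 29s
}
```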
```go
h.mu.Lock()
defer h.mu.Unlock()
delete(h.activeRequests, userRequest.req.ID)
```
Are we cleaning up active requests that never get serviced? Otherwise there is a minor memory leak.
Yes. removeExpiredRequests removes them.
sorry, could you please say which line in removeExpiredRequests deletes them?
Line 226 in core/services/gateway/handlers/confidentialrelay/handler.go has a sendResponse call, which actually deletes them. I had added a comment above that call to explicitly call this out.
ok, so if sendResponse errs at userRequest.SendResponse(resp), what happens to these requests? will they ever be cleaned?
+1, this was my concern: there is no default garbage collection.
Done now. Renamed the method to sendResponseAndCleanup.
Sorry, I am still not sure how the overall message flow works:
Currently sendResponseAndCleanup just sends the response and, ignoring the send result, deletes it from the active requests. This applies to both success and error responses. Is this acceptable behavior for the application? (For example, http_trigger_handler uses sendWithRetries to ensure that the send operation is retried in case of errors.)
replied below in the main thread.
```go
remainingResponses := donMembersCount - len(resps)
if maxShaToCount+remainingResponses < requiredQuorum {
	l.Warnw("quorum unattainable for request", "requiredQuorum", requiredQuorum, "remainingResponses", remainingResponses, "maxShaToCount", maxShaToCount)
	return nil, errors.New(errQuorumUnobtainable.Error() + ". RequiredQuorum=" + strconv.Itoa(requiredQuorum) + ". maxShaToCount=" + strconv.Itoa(maxShaToCount) + " remainingResponses=" + strconv.Itoa(remainingResponses))
```
is there a reason you avoid fmt.Errorf ?
Fixed in an earlier commit. All errors.New with string concatenation replaced with fmt.Errorf throughout the handler and aggregator.
```go
func (a *aggregator) Aggregate(resps map[string]jsonrpc.Response[json.RawMessage], donF int, donMembersCount int, l logger.Logger) (*jsonrpc.Response[json.RawMessage], error) {
	// F+1 is sufficient: each honest node independently validates the enclave's
	// Nitro attestation, so F+1 matching responses guarantees at least one
	// honest node vouched for the result.
```
Sorry, this comment does not really explain what is going on. It just says that out of F+1 replies at least 1 is honest - this is understandable, but the surrounding logic (that the honest parties cannot disagree) is not explained.
For example, here https://github.com/smartcontractkit/libocr/blob/a03701e2c02e2331921bfa6887e2257dea4e6084/quorumhelper/quorumhelper.go#L8 we have multiple quorums that can be used. Is there a design doc where this step is explained?
Good question, and thanks for the libocr quorumhelper link. This is QuorumFPlusOne in libocr terms.
F+1 is correct here because each relay DON node calls the target DON (Vault or capability) through CRE's standard capability dispatch, which includes DON-level consensus. Every honest relay node receives the same consensus-aggregated response, then performs deterministic translation (hex to base64, JSON marshalling). So honest relay nodes produce byte-identical gateway responses. F+1 matching guarantees at least one honest node vouched for the result.
The relay DON node handler is in #21639 ([2/5]) if you want to see the request handling. Updated the comment to explain this.
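To make the F+1 reasoning concrete, here is a simplified, hypothetical sketch of digest-based F+1 selection (not the actual aggregator, which operates on jsonrpc.Response values and also detects unattainable quorum early from the remaining-responses count):

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// aggregateFPlusOne groups raw node responses by SHA-256 digest and returns
// any response whose digest was reported by at least F+1 nodes. Because honest
// relay nodes produce byte-identical responses, F+1 matching digests imply at
// least one honest node vouched for the result.
func aggregateFPlusOne(resps map[string][]byte, donF int) ([]byte, error) {
	counts := map[[32]byte]int{}
	byDigest := map[[32]byte][]byte{}
	for _, raw := range resps {
		d := sha256.Sum256(raw)
		counts[d]++
		byDigest[d] = raw
	}
	for d, n := range counts {
		if n >= donF+1 {
			return byDigest[d], nil
		}
	}
	return nil, errors.New("quorum unobtainable")
}

func main() {
	resps := map[string][]byte{
		"node1": []byte(`{"ok":true}`),
		"node2": []byte(`{"ok":true}`),
		"node3": []byte(`{"ok":false}`), // divergent (possibly faulty) node
	}
	out, err := aggregateFPlusOne(resps, 1) // F=1, so 2 matching responses suffice
	fmt.Println(string(out), err)
}
```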
```go
) gwhandlers.UserCallbackPayload {
	switch errorCode {
	case api.FatalError:
	case api.NodeReponseEncodingError:
```
sorry, what is the purpose of this error parsing? Are you saying that the error itself does not contain enough information and need to be augmented?
This mirrors the vault handler's existing pattern (vault/handler.go has the same switch). The gateway framework uses api.ErrorCode to control what goes back over JSON-RPC vs what stays in logs. For example, NodeReponseEncodingError logs the real error but sends back a generic string to the caller. It's the gateway's convention for sanitizing internal errors on the wire.
Thanks, I see that vault handler also does that. But I don't see other handlers doing that. How come? (or am I missing something)
The capabilities handler (capabilities/handler.go) inlines its error encoding at each call site rather than extracting a shared method. The vault and relay handlers have more error code variants and wanted consistent logging, so they centralized it. Same underlying logic, just different code organization.
ok, given that both sendResponseAndCleanup and errorResponse can send responses containing errors, I would propose to keep only one function - sendResponseAndCleanup.
I am very sorry, this is still not good. You have a function called sendSuccessResponseAndCleanup which can still send an error if the response does not encode successfully. Could you please make one function sendResponseAndCleanup that treats all responses (including errors) equally?
Sorry, it was still too complicated - I have made the last commit in this branch to simplify sendResponseAndCleanup. If this is fine with you please ack it here, otherwise feel free to drop it and we can check again how sendResponseAndCleanup can be refactored.
I am happy to accept the change. Thanks for taking the time to send the commit.
Who can review the code now? We need someone from core, foundations, keystone, and/or operations-platform.
```go
	return nil, fmt.Errorf("failed to unmarshal method config: %w", err)
}

if cfg.RequestTimeoutSec == 0 {
```
Why is this not managed by a CRE settings limit?
RequestTimeoutSec is intentionally tied to the enclosing gateway request timeout. It should remain a job/gateway config rather than a CRE setting. The rate limits fit better as CRE settings, so I split that out here: smartcontractkit/chainlink-common#1950 (please approve ;-))
```go
func (h *handler) fanOutToNodes(ctx context.Context, l logger.Logger, ar *activeRequest) error {
	var nodeErrors []error
	for _, node := range h.donConfig.Members {
		err := h.don.SendToNode(ctx, node.Address, &ar.req)
```
This is not really a fan-out - this is sequential. It should be async. Seems like a Go errgroup is what you want here.
ChrisAmora
left a comment
Approving @smartcontractkit/operations-platform path.
```go
func (d *barrierDON) forceRelease() {
	d.releaseOnce.Do(func() { close(d.allStarted) })
}
```
nit but this isn't necessary, the SendToNode accepts a context, which you can just cancel in the test to clean everything up.
```go
ch := d.allStarted
d.mu.Unlock()

<-ch
```
why not a select statement that waits on context to cancel or the channel to close?
```go
if nodeErrors == len(h.donConfig.Members) && nodeErrors > 0 {
	return h.sendResponseAndCleanup(ctx, ar, h.constructErrorResponse(ar.req, api.FatalError, errors.New("failed to forward user request to nodes")))
```
Can you explain the error strategy here, why do we only clean up if all nodes have errored? Should we not clean up if F+1 nodes error?
```go
nodeErrors   int
nodeErrorsMu sync.Mutex
```
nit: use sync/atomic.Uint32
Note on duplication with vault handler
This handler shares ~200 lines of structural code with the vault handler (activeRequest tracking, fanOutToNodes, sendResponse, errorResponse, metrics, HandleNodeMessage skeleton). Extracting a shared base type is a natural follow-up but out of scope here to avoid touching the vault handler.
Context
Part of #21635 (confidential workflow execution). [1/5] in the series.
Can be reviewed and merged independently.
What this does
Adds a new gateway handler type `confidential-compute-relay` that accepts
JSON-RPC requests from enclaves, fans them out to relay DON nodes, and
aggregates responses using an F+1 quorum. Supports the `secrets_get` and
`capability_exec` methods.

F+1 is sufficient because each relay DON node independently calls the
target DON (Vault or capability) through CRE's standard capability
dispatch, which includes DON-level consensus. Each honest relay node
receives the same consensus-aggregated response, then performs
deterministic translation (hex to base64 encoding, JSON marshalling)
before forwarding it back through the gateway. Since honest relay
nodes produce byte-identical responses, F+1 matching guarantees at
least one honest node vouched for the result.
See #21639 ([2/5]) for the relay DON node handler that processes these
fanned-out requests.
Dependencies
None. This PR is self-contained.