[CRE] [1/5] Gateway handler for confidential relay by nadahalli · Pull Request #21638 · smartcontractkit/chainlink

nadahalli · 2026-03-23T17:00:26Z

Note on duplication with vault handler

This handler shares ~200 lines of structural code with the vault handler (activeRequest tracking, fanOutToNodes, sendResponse, errorResponse, metrics, HandleNodeMessage skeleton). Extracting a shared base type is a natural follow-up but out of scope here to avoid touching the vault handler.

Context

Part of #21635 (confidential workflow execution). [1/5] in the series.
Can be reviewed and merged independently.

What this does

Adds a new gateway handler type confidential-compute-relay that accepts
JSON-RPC requests from enclaves, fans them out to relay DON nodes, and
aggregates responses using F+1 quorum. Supports secrets_get and
capability_exec methods.

F+1 is sufficient because each relay DON node independently calls the
target DON (Vault or capability) through CRE's standard capability
dispatch, which includes DON-level consensus. Each honest relay node
receives the same consensus-aggregated response, then performs
deterministic translation (hex to base64 encoding, JSON marshalling)
before forwarding it back through the gateway. Since honest relay
nodes produce byte-identical responses, F+1 matching guarantees at
least one honest node vouched for the result.
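
The quorum rule described above can be sketched roughly as follows. This is a simplified illustration, not the PR's `Aggregate` implementation: it works on raw byte payloads instead of `jsonrpc.Response` values, and `aggregate` and `errInsufficientResponses` are hypothetical names. Only `errQuorumUnobtainable` mirrors a name that appears in the PR.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// errQuorumUnobtainable mirrors the sentinel name used in the PR's aggregator.
var errQuorumUnobtainable = errors.New("quorum unobtainable")

// errInsufficientResponses is a hypothetical sentinel meaning "keep waiting".
var errInsufficientResponses = errors.New("insufficient matching responses")

// aggregate returns a payload that at least f+1 nodes agree on byte for byte.
// responses maps a node address to its raw response; n is the DON size.
func aggregate(responses map[string][]byte, f, n int) ([]byte, error) {
	required := f + 1
	counts := make(map[[sha256.Size]byte]int)
	maxCount := 0
	for _, payload := range responses {
		digest := sha256.Sum256(payload)
		counts[digest]++
		if counts[digest] > maxCount {
			maxCount = counts[digest]
		}
		if counts[digest] >= required {
			return payload, nil
		}
	}
	// Even if every outstanding node agreed with the current leader,
	// could the leader still reach f+1 matches?
	remaining := n - len(responses)
	if maxCount+remaining < required {
		return nil, fmt.Errorf("%w: required=%d maxCount=%d remaining=%d",
			errQuorumUnobtainable, required, maxCount, remaining)
	}
	return nil, errInsufficientResponses
}
```

With `f = 1` and four relay nodes, two byte-identical responses are enough to return a result, while four mutually different responses make the quorum unobtainable.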

See #21639 ([2/5]) for the relay DON node handler that processes these
fanned-out requests.

Dependencies

None. This PR is self-contained.

github-actions · 2026-03-23T17:00:39Z

👋 nadahalli, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

github-actions · 2026-03-23T17:01:39Z

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

#added For any new functionality added.
#breaking_change For any functionality that requires manual action for the node to boot.
#bugfix For bug fixes.
#changed For any change to the existing functionality.
#db_update For any feature that introduces updates to database schema.
#deprecation_notice For any upcoming deprecation functionality.
#internal For changesets that need to be excluded from the final changelog.
#nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
#removed For any functionality/config that is removed.
#updated For any functionality that is updated.
#wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

github-actions · 2026-03-23T17:01:40Z

✅ No conflicts with other open PRs targeting develop

Copilot

Pull request overview

Risk Rating: MEDIUM

Adds a new Gateway handler type (confidential-compute-relay) to accept JSON-RPC requests from enclaves, fan them out to relay DON nodes, and aggregate node responses via an F+1 quorum.

Changes:

- Register new handler type/service name in deployment job generation and the gateway handler factory.
- Implement confidential relay gateway handler + F+1 quorum aggregator.
- Add unit tests for fan-out, quorum behavior, timeouts, duplicate IDs, and node rate limiting.

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `go.mod` | Bumps deps needed for the new handler integration. |
| `go.sum` | Corresponding checksum updates. |
| `deployment/go.mod` | Aligns deployment module deps with root changes. |
| `deployment/go.sum` | Corresponding checksum updates for deployment module. |
| `deployment/cre/jobs/pkg/gateway_job.go` | Adds handler type + default config wiring for gateway job specs. |
| `core/services/gateway/handler_factory.go` | Registers the new handler type in the factory. |
| `core/services/gateway/handlers/confidentialrelay/handler.go` | New gateway handler: request tracking, fan-out, rate limiting, timeouts, metrics, callback response. |
| `core/services/gateway/handlers/confidentialrelay/aggregator.go` | Implements F+1 digest-based quorum selection and “quorum unobtainable” detection. |
| `core/services/gateway/handlers/confidentialrelay/handler_test.go` | Coverage for quorum success/divergence, timeouts, duplicate IDs, and rate limiting behavior. |

Scrupulous human review areas:

- Timeout behavior across layers (gateway request timeout vs handler requestTimeoutSec) and its impact on activeRequests growth.
- Quorum aggregation correctness (digest semantics, behavior with errors/invalid digests).
- Rate-limiting behavior (dropping node responses) and its effect on quorum attainability.

Reviewer recommendations (per CODEOWNERS):

- @smartcontractkit/core and @smartcontractkit/foundations (repo-wide owners; go.mod/go.sum also owned by these teams).
- For deployment changes under /deployment/cre: @smartcontractkit/keystone and @smartcontractkit/operations-platform.

Copilot · 2026-03-23T17:07:20Z

deployment/cre/jobs/pkg/gateway_job.go

+func newDefaultConfidentialRelayHandler() handler {
+	return handler{
+		Name:        GatewayHandlerTypeConfidentialRelay,
+		ServiceName: "confidential",
+		Config: confidentialRelayHandlerConfig{
+			NodeRateLimiter: nodeRateLimiterConfig{
+				GlobalBurst:    10,
+				GlobalRPS:      50,
+				PerSenderBurst: 10,
+				PerSenderRPS:   10,
+			},


newDefaultConfidentialRelayHandler doesn’t set requestTimeoutSec, so the handler will fall back to its internal default (30s). If the gateway-level request timeout is configured lower (the job enforces a minimum of 5s), the gateway can time out first while the handler keeps the request in activeRequests until its own timeout/cleanup, which can cause unnecessary memory growth under load. Consider adding a RequestTimeoutSec field (similar to vaultHandlerConfig) and setting it to g.RequestTimeoutSec - 1 when constructing the default config so the handler always times out before the gateway.

Also, this handler hardcodes ServiceName: "confidential" instead of using ServiceNameConfidential, which risks future drift if the constant changes.

github-actions · 2026-03-23T17:19:04Z

CORA - Analysis Skipped

Reason: The number of code owners (4) is less than the minimum required (5) and/or the number of CODEOWNERS entries with changed files (2) is less than the minimum required (2).

trunk-io · 2026-03-23T18:05:56Z

| Failed Test | Failure Summary | Logs |
| --- | --- | --- |
| `Test_CRE_V2_EVM_Read_StateQueries` | The test failed after waiting for a transaction to be mined, indicating a possible issue with transaction confirmation or network response. | Logs ↗︎ |
| `Test_CRE_V2_EVM_Read_StateQueries/[v2]_EVM_Read_(state-queries)_-_workflow-gateway-capabilities` | The test failed without providing specific error details. | Logs ↗︎ |
| `Test_CRE_V2_EVM_Read_StateQueries/[v2]_EVM_Read_(state-queries)_-_workflow-gateway-capabilities/Read_EVMReadBalance` | The test failed because the nonce was too low, causing authorization to fail during state query. | Logs ↗︎ |

vreff · 2026-03-24T11:22:22Z

core/scripts/go.mod

 	github.com/smartcontractkit/chainlink-automation v0.8.1
 	github.com/smartcontractkit/chainlink-ccip v0.1.1-solana.0.20260317185256-d5f7db87ae70
-	github.com/smartcontractkit/chainlink-common v0.11.0
+	github.com/smartcontractkit/chainlink-common v0.11.1-0.20260323163826-2c5b95089478


Is this a canonical branch or a temporary branch?

This is a temporary branch. I have to wait for the new canonical branch to be cut - not sure when. It must be v0.12.0.

Apparently linking to commit hashes is ok.

yeah no problem with commit hash, just making sure this hash is on the main branch in chainlink-common.

Yes, this was merged as a part of chainlink-common#1903

vreff · 2026-03-24T11:26:32Z

deployment/cre/jobs/pkg/gateway_job.go

+		Name:        GatewayHandlerTypeConfidentialRelay,
+		ServiceName: ServiceNameConfidential,
+		Config: confidentialRelayHandlerConfig{
+			RequestTimeoutSec: requestTimeoutSec - 1,


Copied from the vault handler code. They even have a comment explaining why:

// must be lower than the overall gateway request timeout.
// so we allow for the response to be sent back.

I'll add the same comment here.

This seems like a problematic config generally, especially when copied across many services. Where did this come from? cc @bolekk

At the very least we could use a global buffer number instead of "1". Not particularly in scope for this PR.

This is very odd arithmetic: it is confusing to have this override inside newDefaultConfidentialRelayHandler, which takes only one parameter, requestTimeoutSec. Could you please move the -1 adjustment to the place where the function is called? (Something like what https://github.com/smartcontractkit/chainlink/pull/20087/changes did.)

Done. Moved the -1 to the call site with a comment explaining why. Also added a TODO to unify with the vault handler which does the same subtraction internally.
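
A minimal sketch of the call-site adjustment discussed in this thread, assuming a named buffer constant in place of the bare "1". All names here (`handlerTimeoutBufferSec`, `relayHandlerConfig`, `newRelayHandlerConfig`, `handlerTimeoutFor`) are hypothetical and are not the PR's identifiers.

```go
package main

// handlerTimeoutBufferSec is a hypothetical named constant replacing the bare "1".
// The handler must time out before the gateway so the response can still be sent back.
const handlerTimeoutBufferSec = 1

// relayHandlerConfig is a simplified stand-in for the handler's job config.
type relayHandlerConfig struct {
	RequestTimeoutSec int
}

// newRelayHandlerConfig stores exactly the timeout it is given; it does no arithmetic.
func newRelayHandlerConfig(requestTimeoutSec int) relayHandlerConfig {
	return relayHandlerConfig{RequestTimeoutSec: requestTimeoutSec}
}

// handlerTimeoutFor derives the handler timeout from the gateway timeout at the call site.
func handlerTimeoutFor(gatewayTimeoutSec int) int {
	return gatewayTimeoutSec - handlerTimeoutBufferSec
}
```

A caller would then write `newRelayHandlerConfig(handlerTimeoutFor(gatewayTimeoutSec))`, which keeps the subtraction visible where the gateway timeout is known.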

vreff · 2026-03-24T11:29:57Z

core/services/gateway/handlers/confidentialrelay/handler.go

+
+	h.mu.Lock()
+	defer h.mu.Unlock()
+	delete(h.activeRequests, userRequest.req.ID)


Are we cleaning up active requests that never get serviced? Otherwise there is a minor memory leak.

Yes. removeExpiredRequests removes them.

Sorry, could you please say which line in removeExpiredRequests deletes them?

Line 226 in core/services/gateway/handlers/confidentialrelay/handler.go has a sendResponse call, which actually deletes them. I have added a comment above that call to explicitly call this out.

OK, so if sendResponse errors at userRequest.SendResponse(resp), what happens to these requests? Will they ever be cleaned up?

+1, my concern was that there is no default garbage collection.

Done now. Renamed the method to sendResponseAndCleanup.

Sorry, I am still not sure how the overall message flow works:

Currently sendResponseAndCleanup just sends the response and, ignoring the send result, deletes the request from the active requests. This applies to both success and error responses. Is this acceptable behavior for the application? (For example, http_trigger_handler uses sendWithRetries to ensure that the send operation is retried in case of errors.)
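
The cleanup concern in this thread can be illustrated with a small sketch in which the delete is deferred, so the entry is removed whether or not the send succeeds. The types are simplified stand-ins (`send` replaces the real user callback, `errSendFailed` is a hypothetical sentinel), and unlike the behavior described above this version returns the send error to the caller rather than dropping it.

```go
package main

import (
	"errors"
	"sync"
)

// errSendFailed is a hypothetical error used to simulate a failing callback.
var errSendFailed = errors.New("send failed")

// activeRequest is a simplified stand-in for the handler's tracked request.
type activeRequest struct {
	id   string
	send func(payload string) error // stands in for the user callback
}

type handler struct {
	mu             sync.Mutex
	activeRequests map[string]*activeRequest
}

// sendResponseAndCleanup sends the payload and always removes the request
// from activeRequests, even when the send itself fails.
func (h *handler) sendResponseAndCleanup(ar *activeRequest, payload string) error {
	defer func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		delete(h.activeRequests, ar.id)
	}()
	return ar.send(payload)
}
```

Whether a failed send should be retried, as http_trigger_handler does, is a separate design question that this sketch does not settle.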

replied below in the main thread.

pavel-raykov · 2026-03-24T13:28:43Z

core/services/gateway/handlers/confidentialrelay/aggregator.go

+	remainingResponses := donMembersCount - len(resps)
+	if maxShaToCount+remainingResponses < requiredQuorum {
+		l.Warnw("quorum unattainable for request", "requiredQuorum", requiredQuorum, "remainingResponses", remainingResponses, "maxShaToCount", maxShaToCount)
+		return nil, errors.New(errQuorumUnobtainable.Error() + ". RequiredQuorum=" + strconv.Itoa(requiredQuorum) + ". maxShaToCount=" + strconv.Itoa(maxShaToCount) + " remainingResponses=" + strconv.Itoa(remainingResponses))


Is there a reason you avoid fmt.Errorf?

Fixed in an earlier commit. All errors.New with string concatenation replaced with fmt.Errorf throughout the handler and aggregator.

pavel-raykov · 2026-03-24T13:28:56Z

core/services/gateway/handlers/confidentialrelay/aggregator.go

+func (a *aggregator) Aggregate(resps map[string]jsonrpc.Response[json.RawMessage], donF int, donMembersCount int, l logger.Logger) (*jsonrpc.Response[json.RawMessage], error) {
+	// F+1 is sufficient: each honest node independently validates the enclave's
+	// Nitro attestation, so F+1 matching responses guarantees at least one
+	// honest node vouched for the result.


Sorry, this comment does not really explain what is going on. It just says that out of F+1 replies at least 1 is honest. That is understandable, but the surrounding logic (that the honest parties cannot disagree) is not explained.

For example, here https://github.com/smartcontractkit/libocr/blob/a03701e2c02e2331921bfa6887e2257dea4e6084/quorumhelper/quorumhelper.go#L8 we have multiple quorums that can be used. Is there a design doc where this step is explained?

Good question, and thanks for the libocr quorumhelper link. This is QuorumFPlusOne in libocr terms.

F+1 is correct here because each relay DON node calls the target DON (Vault or capability) through CRE's standard capability dispatch, which includes DON-level consensus. Every honest relay node receives the same consensus-aggregated response, then performs deterministic translation (hex to base64, JSON marshalling). So honest relay nodes produce byte-identical gateway responses. F+1 matching guarantees at least one honest node vouched for the result.

The relay DON node handler is in #21639 ([2/5]) if you want to see the request handling. Updated the comment to explain this.

pavel-raykov · 2026-03-24T13:35:52Z

core/services/gateway/handlers/confidentialrelay/handler.go

+) gwhandlers.UserCallbackPayload {
+	switch errorCode {
+	case api.FatalError:
+	case api.NodeReponseEncodingError:


Sorry, what is the purpose of this error parsing? Are you saying that the error itself does not contain enough information and needs to be augmented?

This mirrors the vault handler's existing pattern (vault/handler.go has the same switch). The gateway framework uses api.ErrorCode to control what goes back over JSON-RPC vs what stays in logs. For example, NodeReponseEncodingError logs the real error but sends back a generic string to the caller. It's the gateway's convention for sanitizing internal errors on the wire.

Thanks, I see that vault handler also does that. But I don't see other handlers doing that. How come? (or am I missing something)

The capabilities handler (capabilities/handler.go) inlines its error encoding at each call site rather than extracting a shared method. The vault and relay handlers have more error code variants and wanted consistent logging, so they centralized it. Same underlying logic, just different code organization.
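
As a rough illustration of the sanitizing convention described above, the sketch below maps an error code to the message that goes over the wire. It does not use the real `api.ErrorCode` type: `errorCode`, the constants, `genericEncodingMessage`, `errExample`, and `wireMessage` are all hypothetical, and the treatment of codes other than the encoding error is an assumption.

```go
package main

import "errors"

// errorCode is a hypothetical stand-in for the gateway's api.ErrorCode.
type errorCode int

const (
	fatalError errorCode = iota
	nodeResponseEncodingError
)

// genericEncodingMessage is what the caller sees instead of internal details.
const genericEncodingMessage = "failed to encode node response"

// errExample is a sample internal error that should not leak to callers.
var errExample = errors.New("hex decode failed at offset 12")

// wireMessage returns the text that is safe to send back over JSON-RPC.
// The real error is expected to be logged by the caller, not returned here.
func wireMessage(code errorCode, err error) string {
	switch code {
	case nodeResponseEncodingError:
		// Internal detail stays in logs; the caller gets a generic string.
		return genericEncodingMessage
	default:
		// Assumption: other codes pass the error text through unchanged.
		return err.Error()
	}
}
```

Centralizing this switch in one function is what lets the vault and relay handlers log consistently while keeping the on-the-wire messages uniform.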

OK, given that both sendResponseAndCleanup and errorResponse can send responses with errors, I would propose keeping only one function: sendResponseAndCleanup.

I am very sorry, this is still not good. You have a function called sendSuccessResponseAndCleanup which can still send an error if the response does not encode successfully. Could you please make one function, sendResponseAndCleanup, that treats all responses (including those with errors) equally?

Sorry, it was still too complicated - I have made the last commit in this branch to simplify sendResponseAndCleanup. If this is fine with you please ack it here, otherwise feel free to drop it and we can check again how sendResponseAndCleanup can be refactored.

I am happy to accept the change. Thanks for taking the time to send the commit.

Who can review the code now? We need someone from core, foundations, keystone, and/or operations-platform.

MStreet3 · 2026-03-31T16:52:35Z

core/services/gateway/handlers/confidentialrelay/handler.go

+		return nil, fmt.Errorf("failed to unmarshal method config: %w", err)
+	}
+
+	if cfg.RequestTimeoutSec == 0 {


Why is this not managed by a CRE settings limit?

RequestTimeoutSec is intentionally tied to the enclosing gateway request timeout (chainlink/deployment/cre/jobs/pkg/gateway_job.go, line 233 in b19313f: `case GatewayHandlerTypeConfidentialRelay:`). It should remain a job/gateway config rather than a CRE setting.

The rate limits fit better as CRE settings, so I split that out here: smartcontractkit/chainlink-common#1950 (please approve ;-))

MStreet3 · 2026-03-31T16:56:38Z

core/services/gateway/handlers/confidentialrelay/handler.go

+func (h *handler) fanOutToNodes(ctx context.Context, l logger.Logger, ar *activeRequest) error {
+	var nodeErrors []error
+	for _, node := range h.donConfig.Members {
+		err := h.don.SendToNode(ctx, node.Address, &ar.req)


This is not really a fan-out if it is sequential; it should be async. A Go errgroup seems like what you want here.

ChrisAmora

Approving @smartcontractkit/operations-platform path.

MStreet3 · 2026-04-01T12:46:12Z

core/services/gateway/handlers/confidentialrelay/handler_test.go

+func (d *barrierDON) forceRelease() {
+	d.releaseOnce.Do(func() { close(d.allStarted) })
+}


Nit: this isn't necessary. SendToNode accepts a context, which you can simply cancel in the test to clean everything up.

MStreet3 · 2026-04-01T12:46:56Z

core/services/gateway/handlers/confidentialrelay/handler_test.go

+	ch := d.allStarted
+	d.mu.Unlock()
+
+	<-ch


Why not a select statement that waits for either the context to be cancelled or the channel to be closed?

cl-sonarqube-production · 2026-04-01T16:36:05Z

Quality Gate failed

Failed conditions
11.5% Duplication on New Code (required ≤ 10%)

See analysis details on SonarQube

MStreet3 · 2026-04-02T09:28:47Z

core/services/gateway/handlers/confidentialrelay/handler.go

+	if nodeErrors == len(h.donConfig.Members) && nodeErrors > 0 {
 		return h.sendResponseAndCleanup(ctx, ar, h.constructErrorResponse(ar.req, api.FatalError, errors.New("failed to forward user request to nodes")))


Can you explain the error strategy here? Why do we only clean up if all nodes have errored? Should we not also clean up if F+1 nodes error?

MStreet3 · 2026-04-02T09:36:29Z

core/services/gateway/handlers/confidentialrelay/handler.go

+		nodeErrors   int
+		nodeErrorsMu sync.Mutex


Nit: use sync/atomic.Uint32.
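
A small sketch of what the nit amounts to, assuming the counter is only ever incremented and read: an `atomic.Uint32` (available since Go 1.19) replaces the `int` plus `sync.Mutex` pair. `countFailures` and `errSend` are hypothetical names for illustration.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// errSend is a hypothetical error used to simulate a failed send.
var errSend = errors.New("send failed")

// countFailures calls send for every node concurrently and returns how many
// calls failed, using an atomic counter instead of a mutex-guarded int.
func countFailures(nodes []string, send func(addr string) error) uint32 {
	var (
		failures atomic.Uint32
		wg       sync.WaitGroup
	)
	for _, node := range nodes {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			if send(addr) != nil {
				failures.Add(1)
			}
		}(node)
	}
	wg.Wait()
	return failures.Load()
}
```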

Copilot AI review requested due to automatic review settings March 23, 2026 17:00

nadahalli requested review from a team as code owners March 23, 2026 17:00

product-security-plaid-production bot requested a review from ChrisAmora March 23, 2026 17:00

product-security-plaid-production bot requested review from MStreet3 and pavel-raykov March 23, 2026 17:00

Copilot started reviewing on behalf of nadahalli March 23, 2026 17:01

This was referenced Mar 23, 2026:

[CRE] [5/5] Wire confidential workflow execution into CRE #21642 (Open)

[CRE] Confidential workflow execution #21635 (Open)

Copilot AI reviewed Mar 23, 2026

nadahalli requested a review from a team as a code owner March 23, 2026 17:18

nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from 985dd59 to 52a1f51 Compare March 23, 2026 17:25

nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from eb015a2 to f3bd818 Compare March 23, 2026 19:08

vreff reviewed Mar 24, 2026

vreff previously approved these changes Mar 24, 2026

vreff reviewed Mar 24, 2026

pavel-raykov reviewed Mar 24, 2026

nadahalli dismissed vreff’s stale review via 0a281b8 March 24, 2026 15:20

nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from dc7c357 to 8ac0d36 Compare March 26, 2026 15:01

pavel-raykov reviewed Mar 26, 2026

Handle errQuorumUnobtainable explicitly in aggregation switch

9a879e3

vreff previously approved these changes Mar 30, 2026

Merge errorResponse into sendErrorResponseAndCleanup

f8e3779

nadahalli dismissed vreff’s stale review via f8e3779 March 30, 2026 16:00

nadahalli and others added 6 commits March 30, 2026 18:47

Move error sanitization into sendResponseAndCleanup

9517e0c

Inline send+cleanup into sendResponseAndCleanup and sendSuccessResponseAndCleanup

364acea

Unify sendResponseAndCleanup to handle both success and error paths

50d4196

Simplify sendResponseAndCleanup.

326f97e

Fix exhaustive lint: restore missing switch cases in recordMetrics and constructErrorResponse

f85ccc8

Suppress exhaustive switch warning.

b19313f

pavel-raykov previously approved these changes Mar 31, 2026

vreff previously approved these changes Mar 31, 2026

MStreet3 reviewed Mar 31, 2026

ChrisAmora previously approved these changes Mar 31, 2026

justinkaseman previously approved these changes Mar 31, 2026

MStreet3 requested changes Apr 1, 2026

fan out relay requests to don nodes concurrently

2b1f199

nadahalli dismissed stale reviews from justinkaseman, ChrisAmora, vreff, and pavel-raykov via 2b1f199 April 1, 2026 11:45

MStreet3 reviewed Apr 1, 2026

nadahalli added 2 commits April 1, 2026 15:05

Clean up confidential relay concurrency test helper

376f3d9

Remove redundant loop variable copy in relay fanout

d093aef

MStreet3 reviewed Apr 2, 2026
