Skip to content

[CRE] [1/5] Gateway handler for confidential relay#21638

Open
nadahalli wants to merge 21 commits intodevelopfrom
tejaswi/cw-1-gateway-handler
Open

[CRE] [1/5] Gateway handler for confidential relay#21638
nadahalli wants to merge 21 commits intodevelopfrom
tejaswi/cw-1-gateway-handler

Conversation

@nadahalli
Copy link
Copy Markdown
Contributor

@nadahalli nadahalli commented Mar 23, 2026

Note on duplication with vault handler

This handler shares ~200 lines of structural code with the vault handler (activeRequest tracking, fanOutToNodes, sendResponse, errorResponse, metrics, HandleNodeMessage skeleton). Extracting a shared base type is a natural follow-up but out of scope here to avoid touching the vault handler.

Context

Part of #21635 (confidential workflow execution). [1/5] in the series.
Can be reviewed and merged independently.

What this does

Adds a new gateway handler type confidential-compute-relay that accepts
JSON-RPC requests from enclaves, fans them out to relay DON nodes, and
aggregates responses using F+1 quorum. Supports secrets_get and
capability_exec methods.

F+1 is sufficient because each relay DON node independently calls the
target DON (Vault or capability) through CRE's standard capability
dispatch, which includes DON-level consensus. Each honest relay node
receives the same consensus-aggregated response, then performs
deterministic translation (hex to base64 encoding, JSON marshalling)
before forwarding it back through the gateway. Since honest relay
nodes produce byte-identical responses, F+1 matching guarantees at
least one honest node vouched for the result.

See #21639 ([2/5]) for the relay DON node handler that processes these
fanned-out requests.

Dependencies

None. This PR is self-contained.

Copilot AI review requested due to automatic review settings March 23, 2026 17:00
@nadahalli nadahalli requested review from a team as code owners March 23, 2026 17:00
@github-actions
Copy link
Copy Markdown
Contributor

👋 nadahalli, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Copy link
Copy Markdown
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

✅ No conflicts with other open PRs targeting develop

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Risk Rating: MEDIUM

Adds a new Gateway handler type (confidential-compute-relay) to accept JSON-RPC requests from enclaves, fan them out to relay DON nodes, and aggregate node responses via an F+1 quorum.

Changes:

  • Register new handler type/service name in deployment job generation and the gateway handler factory.
  • Implement confidential relay gateway handler + F+1 quorum aggregator.
  • Add unit tests for fan-out, quorum behavior, timeouts, duplicate IDs, and node rate limiting.

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
go.mod Bumps deps needed for the new handler integration.
go.sum Corresponding checksum updates.
deployment/go.mod Aligns deployment module deps with root changes.
deployment/go.sum Corresponding checksum updates for deployment module.
deployment/cre/jobs/pkg/gateway_job.go Adds handler type + default config wiring for gateway job specs.
core/services/gateway/handler_factory.go Registers the new handler type in the factory.
core/services/gateway/handlers/confidentialrelay/handler.go New gateway handler: request tracking, fan-out, rate limiting, timeouts, metrics, callback response.
core/services/gateway/handlers/confidentialrelay/aggregator.go Implements F+1 digest-based quorum selection and “quorum unobtainable” detection.
core/services/gateway/handlers/confidentialrelay/handler_test.go Coverage for quorum success/divergence, timeouts, duplicate IDs, and rate limiting behavior.

Scrupulous human review areas:

  • Timeout behavior across layers (gateway request timeout vs handler requestTimeoutSec) and its impact on activeRequests growth.
  • Quorum aggregation correctness (digest semantics, behavior with errors/invalid digests).
  • Rate-limiting behavior (dropping node responses) and its effect on quorum attainability.

Reviewer recommendations (per CODEOWNERS):

  • @smartcontractkit/core and @smartcontractkit/foundations (repo-wide owners; go.mod/go.sum also owned by these teams).
  • For deployment changes under /deployment/cre: @smartcontractkit/keystone and @smartcontractkit/operations-platform.

Comment on lines +453 to +463
func newDefaultConfidentialRelayHandler() handler {
return handler{
Name: GatewayHandlerTypeConfidentialRelay,
ServiceName: "confidential",
Config: confidentialRelayHandlerConfig{
NodeRateLimiter: nodeRateLimiterConfig{
GlobalBurst: 10,
GlobalRPS: 50,
PerSenderBurst: 10,
PerSenderRPS: 10,
},
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newDefaultConfidentialRelayHandler doesn’t set requestTimeoutSec, so the handler will fall back to its internal default (30s). If the gateway-level request timeout is configured lower (the job enforces a minimum of 5s), the gateway can time out first while the handler keeps the request in activeRequests until its own timeout/cleanup, which can cause unnecessary memory growth under load. Consider adding a RequestTimeoutSec field (similar to vaultHandlerConfig) and setting it to g.RequestTimeoutSec - 1 when constructing the default config so the handler always times out before the gateway.

Also, this handler hardcodes ServiceName: "confidential" instead of using ServiceNameConfidential, which risks future drift if the constant changes.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@nadahalli nadahalli requested a review from a team as a code owner March 23, 2026 17:18
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

CORA - Analysis Skipped

Reason: The number of code owners (4) is less than the minimum required (5) and/or the number of CODEOWNERS entries with changed files (2) is less than the minimum required (2).

@nadahalli nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from 985dd59 to 52a1f51 Compare March 23, 2026 17:25
@trunk-io
Copy link
Copy Markdown

trunk-io bot commented Mar 23, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

Failed Test Failure Summary Logs
Test_CRE_V2_EVM_Read_StateQueries The test failed after waiting for a transaction to be mined, indicating a possible issue with transaction confirmation or network response. Logs ↗︎
Test_CRE_V2_EVM_Read_StateQueries/[v2]_EVM_Read_(state-queries)_-_workflow-gateway-capabilities The test failed without providing specific error details. Logs ↗︎
Test_CRE_V2_EVM_Read_StateQueries/[v2]_EVM_Read_(state-queries)_-_workflow-gateway-capabilities/Read_EVMReadBalance The test failed because the nonce was too low, causing authorization to fail during state query. Logs ↗︎

View Full Report ↗︎Docs

@nadahalli nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from eb015a2 to f3bd818 Compare March 23, 2026 19:08
github.com/smartcontractkit/chainlink-automation v0.8.1
github.com/smartcontractkit/chainlink-ccip v0.1.1-solana.0.20260317185256-d5f7db87ae70
github.com/smartcontractkit/chainlink-common v0.11.0
github.com/smartcontractkit/chainlink-common v0.11.1-0.20260323163826-2c5b95089478
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a canonical branch or a temporary branch?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary branch. I have to wait for the new canonical branch to be cut - not sure when. It must be v0.12.0.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently linking to commit hashes is ok.

Copy link
Copy Markdown
Contributor

@vreff vreff Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah no problem with commit hash, just making sure this hash is on the main branch in chainlink-common.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was merged as a part of chainlink-common#1903

Name: GatewayHandlerTypeConfidentialRelay,
ServiceName: ServiceNameConfidential,
Config: confidentialRelayHandlerConfig{
RequestTimeoutSec: requestTimeoutSec - 1,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why - 1?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied from the vault handler code. They even have a comment explaining why:

// must be lower than the overall gateway request timeout. // so we allow for the response to be sent back.

I'll add the same comment here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a problematic config generally, especially to be copied over along many services. Where did this come from? cc @bolekk

At the very least we could use a global buffer number instead of "1". Not particularly in scope for this PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very weird mathematics - it is very confusing to get this override !inside! the newDefaultConfidentialRelayHandler which only takes one parameter requestTimeoutSec. Could you please move this -1 update to the place where the function is called? (Something like https://github.com/smartcontractkit/chainlink/pull/20087/changes did.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Moved the -1 to the call site with a comment explaining why. Also added a TODO to unify with the vault handler which does the same subtraction internally.

vreff
vreff previously approved these changes Mar 24, 2026

h.mu.Lock()
defer h.mu.Unlock()
delete(h.activeRequests, userRequest.req.ID)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we cleaning up active requests that never get serviced? Else there is a minor memory leak.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. removeExpiredRequests removes them.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, could you please say which line in removeExpiredRequests deletes them?

Copy link
Copy Markdown
Contributor Author

@nadahalli nadahalli Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 226 in core/services/gateway/handlers/confidentialrelay/handler.go has a sendResponse call, which actually deletes them. I had added a comment above that call to explicitly call this out.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so if sendResponse errs at userRequest.SendResponse(resp), what happens to these requests? will they ever be cleaned?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this was my concern is that there is no default garbage collection.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now. Renamed the method to sendResponseAndCleanup.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I am still not sure how the overall message flow works:

Currently sendResponseAndCleanup just sends the response and ignoring the send result deletes it from the active requests. This is applicable for both success and error responses. Is this acceptable behavior for the application? (For example, http_trigger_handler uses sendWithRetries to ensure that the send operation is retried in case of errors].)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replied below in the main thread.

remainingResponses := donMembersCount - len(resps)
if maxShaToCount+remainingResponses < requiredQuorum {
l.Warnw("quorum unattainable for request", "requiredQuorum", requiredQuorum, "remainingResponses", remainingResponses, "maxShaToCount", maxShaToCount)
return nil, errors.New(errQuorumUnobtainable.Error() + ". RequiredQuorum=" + strconv.Itoa(requiredQuorum) + ". maxShaToCount=" + strconv.Itoa(maxShaToCount) + " remainingResponses=" + strconv.Itoa(remainingResponses))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you avoid fmt.Errorf ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in an earlier commit. All errors.New with string concatenation replaced with fmt.Errorf throughout the handler and aggregator.

func (a *aggregator) Aggregate(resps map[string]jsonrpc.Response[json.RawMessage], donF int, donMembersCount int, l logger.Logger) (*jsonrpc.Response[json.RawMessage], error) {
// F+1 is sufficient: each honest node independently validates the enclave's
// Nitro attestation, so F+1 matching responses guarantees at least one
// honest node vouched for the result.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this comment does not really explain what is going on. It just says that out of F+1 replies at least 1 is honest - this is understandable, but the surrounding logic (that the honest parties cannot disagree is not explained).

For example, here https://github.com/smartcontractkit/libocr/blob/a03701e2c02e2331921bfa6887e2257dea4e6084/quorumhelper/quorumhelper.go#L8 we have multiple quorums that can be used. Is there a design doc where this step is explained?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, and thanks for the libocr quorumhelper link. This is QuorumFPlusOne in libocr terms.

F+1 is correct here because each relay DON node calls the target DON (Vault or capability) through CRE's standard capability dispatch, which includes DON-level consensus. Every honest relay node receives the same consensus-aggregated response, then performs deterministic translation (hex to base64, JSON marshalling). So honest relay nodes produce byte-identical gateway responses. F+1 matching guarantees at least one honest node vouched for the result.

The relay DON node handler is in #21639 ([2/5]) if you want to see the request handling. Updated the comment to explain this.

) gwhandlers.UserCallbackPayload {
switch errorCode {
case api.FatalError:
case api.NodeReponseEncodingError:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, what is the purpose of this error parsing? Are you saying that the error itself does not contain enough information and need to be augmented?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mirrors the vault handler's existing pattern (vault/handler.go has the same switch). The gateway framework uses api.ErrorCode to control what goes back over JSON-RPC vs what stays in logs. For example, NodeReponseEncodingError logs the real error but sends back a generic string to the caller. It's the gateway's convention for sanitizing internal errors on the wire.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see that vault handler also does that. But I don't see other handlers doing that. How come? (or am I missing something)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capabilities handler (capabilities/handler.go) inlines its error encoding at each call site rather than extracting a shared method. The vault and relay handlers have more error code variants and wanted consistent logging, so they centralized it. Same underlying logic, just different code organization.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, given that you have both sendResponseAndCleanup and errorResponse can send response with errors, I would propose to keep only one function - sendResponseAndCleanup.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very sorry, this is still not good. You have the function called sendSuccessResponseAndCleanup which can still send error if response does not encode successfully. Could you please make one function sendResponseAndCleanup that treats all the responses (also with errors) equally?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :-)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it was still too complicated - I have made the last commit in this branch to simplify sendResponseAndCleanup. If this is fine with you please ack it here, otherwise feel free to drop it and we can check again how sendResponseAndCleanup can be refactored.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to accept the change. Thanks for taking the time to send the commit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who can review the code now? We need someone from core, foundations, keystone, and/or operations-platform.

@nadahalli nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from dc7c357 to 8ac0d36 Compare March 26, 2026 15:01
vreff
vreff previously approved these changes Mar 30, 2026
pavel-raykov
pavel-raykov previously approved these changes Mar 31, 2026
vreff
vreff previously approved these changes Mar 31, 2026
return nil, fmt.Errorf("failed to unmarshal method config: %w", err)
}

if cfg.RequestTimeoutSec == 0 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not managed by a CRE settings limit?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RequestTimeoutSec is intentionally tied to the enclosing gateway request timeout (

case GatewayHandlerTypeConfidentialRelay:
). It should remain a job/gateway config rather than a CRE setting.

The rate limits fit better as CRE settings, so I split that out here: smartcontractkit/chainlink-common#1950 ( please approve ;-))

func (h *handler) fanOutToNodes(ctx context.Context, l logger.Logger, ar *activeRequest) error {
var nodeErrors []error
for _, node := range h.donConfig.Members {
err := h.don.SendToNode(ctx, node.Address, &ar.req)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really a fan out of this is sequential. this should be async. Seems like a go err group is what you want here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ChrisAmora
ChrisAmora previously approved these changes Mar 31, 2026
Copy link
Copy Markdown
Contributor

@ChrisAmora ChrisAmora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving @smartcontractkit/operations-platform path.

justinkaseman
justinkaseman previously approved these changes Mar 31, 2026
Comment on lines +56 to +58
func (d *barrierDON) forceRelease() {
d.releaseOnce.Do(func() { close(d.allStarted) })
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but this isn't necessary, the SendToNode accepts a context, which you can just cancel in the test to clean everything up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ch := d.allStarted
d.mu.Unlock()

<-ch
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not a select statement that waits on context to cancel or the channel to close?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@cl-sonarqube-production
Copy link
Copy Markdown

Quality Gate failed Quality Gate failed

Failed conditions
11.5% Duplication on New Code (required ≤ 10%)

See analysis details on SonarQube

Comment on lines +351 to 352
if nodeErrors == len(h.donConfig.Members) && nodeErrors > 0 {
return h.sendResponseAndCleanup(ctx, ar, h.constructErrorResponse(ar.req, api.FatalError, errors.New("failed to forward user request to nodes")))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the error strategy here, why do we only clean up if all nodes have errored? Should we not clean up if F+1 nodes error?

Comment on lines +331 to +332
nodeErrors int
nodeErrorsMu sync.Mutex
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit use sync/atomic.Uint32

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants