fix: recovery check for rate-limited endpoints#28
Conversation
WalkthroughHealth checks now POST a JSON-RPC 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/rate_limit_scheduler.go (1)
326-343:⚠️ Potential issue | 🟡 MinorUse
http.NewRequestWithContextfor proper cancellation support.The function uses
http.NewRequestwithout context, preventing graceful cancellation during shutdown. TheperformRecoveryCheckcaller has a context available that should be passed down.🔧 Proposed fix
-func (rls *RateLimitScheduler) checkEndpointHealth(endpoint config.Endpoint) bool { +func (rls *RateLimitScheduler) checkEndpointHealth(ctx context.Context, endpoint config.Endpoint) bool { if endpoint.HTTPURL == "" { return false } // Create a simple HTTP client with timeout client := &http.Client{ Timeout: 10 * time.Second, } // Create a proper JSON-RPC request (same as regular health checks) payload := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`) - req, err := http.NewRequest("POST", endpoint.HTTPURL, bytes.NewBuffer(payload)) + req, err := http.NewRequestWithContext(ctx, "POST", endpoint.HTTPURL, bytes.NewBuffer(payload))Then update the call site at line 258:
- success := rls.checkEndpointHealth(endpoint) + success := rls.checkEndpointHealth(ctx, endpoint)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/rate_limit_scheduler.go` around lines 326 - 343, checkEndpointHealth currently creates requests without a context; change its signature to accept a context (e.g., checkEndpointHealth(ctx context.Context, endpoint config.Endpoint)), update all call sites (notably performRecoveryCheck) to pass the caller's context, and replace http.NewRequest(...) with http.NewRequestWithContext(ctx, "POST", endpoint.HTTPURL, bytes.NewBuffer(payload)); ensure imports include context and adjust any tests or callers accordingly so shutdown/cancellation propagates.
🧹 Nitpick comments (2)
internal/server/rate_limit_scheduler.go (2)
19-26: Consider consolidating the duplicatedrpcResponsetype.This struct is identical to the one defined in
internal/health/checker.go(lines 28-35). Having two copies creates a maintenance burden and violates DRY.Options to consolidate:
- Export
RpcResponsefrom the health package and import it here- Move the type to a shared internal package (e.g.,
internal/typesorinternal/rpc)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/rate_limit_scheduler.go` around lines 19 - 26, The rpcResponse struct is duplicated (also defined in internal/health/checker.go); remove the local rpcResponse and reuse a single shared type: either export the existing type (e.g., rename to RpcResponse in the health package) and import health.RpcResponse here, or move the definition into a new internal package (e.g., internal/rpc or internal/types) and import that single RpcResponse type; update usages in rate_limit_scheduler.go (references to rpcResponse) to the chosen exported type and remove the duplicate struct declaration.
380-388: Log message may be misleading for non-rate-limit JSON-RPC errors.The log states "still rate limited" for any JSON-RPC error, but errors like
-32601(method not found) or-32000(internal error) aren't rate-limit related. Consider adjusting the log message or differentiating based on error codes.💡 Suggested improvement
// Check for JSON-RPC errors (rate limits could come as JSON-RPC errors with 200 HTTP status) if rpcResp.Error != nil { log.Debug(). Str("url", helpers.RedactAPIKey(endpoint.HTTPURL)). Int("code", rpcResp.Error.Code). Str("message", rpcResp.Error.Message). - Msg("Recovery check received JSON-RPC error, still rate limited") + Msg("Recovery check received JSON-RPC error") return false }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/rate_limit_scheduler.go` around lines 380 - 388, The current log unconditionally states "still rate limited" when rpcResp.Error != nil; change it to differentiate rate-limit vs other JSON-RPC errors by inspecting rpcResp.Error.Code (and/or implement a small helper like isJSONRPCRateLimitCode(code int) used by the rate-limit scheduler) and emit distinct log messages: e.g., for known rate-limit codes log "Recovery check received JSON-RPC rate-limit error" and for others log "Recovery check received JSON-RPC error (not rate-limit)"; keep using helpers.RedactAPIKey(endpoint.HTTPURL) and include rpcResp.Error.Code and rpcResp.Error.Message as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/server/rate_limit_scheduler.go`:
- Around line 326-343: checkEndpointHealth currently creates requests without a
context; change its signature to accept a context (e.g., checkEndpointHealth(ctx
context.Context, endpoint config.Endpoint)), update all call sites (notably
performRecoveryCheck) to pass the caller's context, and replace
http.NewRequest(...) with http.NewRequestWithContext(ctx, "POST",
endpoint.HTTPURL, bytes.NewBuffer(payload)); ensure imports include context and
adjust any tests or callers accordingly so shutdown/cancellation propagates.
---
Nitpick comments:
In `@internal/server/rate_limit_scheduler.go`:
- Around line 19-26: The rpcResponse struct is duplicated (also defined in
internal/health/checker.go); remove the local rpcResponse and reuse a single
shared type: either export the existing type (e.g., rename to RpcResponse in the
health package) and import health.RpcResponse here, or move the definition into
a new internal package (e.g., internal/rpc or internal/types) and import that
single RpcResponse type; update usages in rate_limit_scheduler.go (references to
rpcResponse) to the chosen exported type and remove the duplicate struct
declaration.
- Around line 380-388: The current log unconditionally states "still rate
limited" when rpcResp.Error != nil; change it to differentiate rate-limit vs
other JSON-RPC errors by inspecting rpcResp.Error.Code (and/or implement a small
helper like isJSONRPCRateLimitCode(code int) used by the rate-limit scheduler)
and emit distinct log messages: e.g., for known rate-limit codes log "Recovery
check received JSON-RPC rate-limit error" and for others log "Recovery check
received JSON-RPC error (not rate-limit)"; keep using
helpers.RedactAPIKey(endpoint.HTTPURL) and include rpcResp.Error.Code and
rpcResp.Error.Message as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 01f3b562-a327-4bfa-be9c-fd3fc1fcbbc1
📒 Files selected for processing (2)
internal/server/rate_limit_scheduler.gointernal/server/rate_limit_scheduler_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/health/checker.go (1)
38-67:⚠️ Potential issue | 🟠 MajorSurface JSON-RPC throttle codes to the rate-limit handler.
checkRPCErrornow sees the RPC error body, but it still collapses every non-eth_syncingfailure into a plainerror, andmakeRPCCallonly invokesHandleRateLimitFuncfor HTTP 429. A200 OKresponse with a throttle code inerror.codewill therefore be treated as generic unhealthy instead of rate-limited, so it stays on the normal health-check path instead of entering the backoff/recovery flow. Bubble up a sentinel/typed rate-limit error here and handle it alongside the HTTP 429 branch.Also applies to: 449-467
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/health/checker.go` around lines 38 - 67, checkRPCError currently turns every non-eth_syncing JSON-RPC error into a generic error, so RPC responses that are HTTP 200 but contain throttle/rate-limit codes in response.Error.Code never reach the rate-limit handler; update checkRPCError to detect known throttle/error codes (e.g., your RPC provider's throttle codes) and return a sentinel/typed error (create ErrRPCRateLimited or similar) instead of a plain errors.New, keep the existing ErrMethodNotFound behavior for eth_syncing; then update makeRPCCall (and any caller that checks HTTP 429) to treat that sentinel ErrRPCRateLimited the same as an HTTP 429 by invoking HandleRateLimitFunc and entering the backoff/recovery flow. Ensure logging still records response.Error.Code/Message and reference checkRPCError, makeRPCCall, HandleRateLimitFunc, ErrMethodNotFound, and the new ErrRPCRateLimited in your changes.internal/server/rate_limit_scheduler.go (1)
47-68:⚠️ Potential issue | 🟠 MajorThread the monitoring context into checkEndpointHealth to enable cancellation during graceful shutdown.
The
performRecoveryCheckmethod has access toctxfrom the monitoring goroutine, but it passes only theendpointtocheckEndpointHealth, which creates an HTTP request without context awareness. If the backend stalls during shutdown, the HTTP request will continue independently until the hardcoded 10-second client timeout fires, preventing graceful shutdown from meeting its deadline.Pass
ctxthrough tocheckEndpointHealthand usehttp.NewRequestWithContextto allow the monitoring context to cancel in-flight requests whenShutdown()is called:Suggested change
-func (rls *RateLimitScheduler) checkEndpointHealth(endpoint config.Endpoint) bool { +func (rls *RateLimitScheduler) checkEndpointHealth(ctx context.Context, endpoint config.Endpoint) bool { // ... - req, err := http.NewRequest("POST", endpoint.HTTPURL, bytes.NewBuffer(payload)) + req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint.HTTPURL, bytes.NewReader(payload))// Update the call in performRecoveryCheck: success := rls.checkEndpointHealth(ctx, endpoint)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/rate_limit_scheduler.go` around lines 47 - 68, performRecoveryCheck currently calls checkEndpointHealth(endpoint) without passing the monitoring context, so in-flight HTTP requests can't be cancelled during Shutdown (rls.shutdownCancel) and may linger until the 10s client timeout; update performRecoveryCheck to call checkEndpointHealth(ctx, endpoint) and modify checkEndpointHealth to accept a context parameter and create its HTTP request with http.NewRequestWithContext(ctx, ...) so the monitoring goroutine's context cancels in-flight requests during graceful shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/server/rate_limit_scheduler.go`:
- Around line 359-387: The recovery check currently treats a 200 JSON-RPC
response with no rpcResp.Error as success even if rpcResp.Result is missing or
null; update the recovery logic (around the JSON-RPC handling for
health.RpcResponse) to validate the eth_blockNumber result the same way
internal/health/checker.go does: ensure rpcResp.Result is non-empty, parse the
hex block number into an integer (reject null/unparseable values), and only
return true when parsing succeeds; keep existing JSON-RPC error handling and
reuse or extract the shared parsing helper used by checker.go (or call that
helper) so isJSONRPCRateLimitCode and the RpcResponse flow remain intact.
---
Outside diff comments:
In `@internal/health/checker.go`:
- Around line 38-67: checkRPCError currently turns every non-eth_syncing
JSON-RPC error into a generic error, so RPC responses that are HTTP 200 but
contain throttle/rate-limit codes in response.Error.Code never reach the
rate-limit handler; update checkRPCError to detect known throttle/error codes
(e.g., your RPC provider's throttle codes) and return a sentinel/typed error
(create ErrRPCRateLimited or similar) instead of a plain errors.New, keep the
existing ErrMethodNotFound behavior for eth_syncing; then update makeRPCCall
(and any caller that checks HTTP 429) to treat that sentinel ErrRPCRateLimited
the same as an HTTP 429 by invoking HandleRateLimitFunc and entering the
backoff/recovery flow. Ensure logging still records response.Error.Code/Message
and reference checkRPCError, makeRPCCall, HandleRateLimitFunc,
ErrMethodNotFound, and the new ErrRPCRateLimited in your changes.
In `@internal/server/rate_limit_scheduler.go`:
- Around line 47-68: performRecoveryCheck currently calls
checkEndpointHealth(endpoint) without passing the monitoring context, so
in-flight HTTP requests can't be cancelled during Shutdown (rls.shutdownCancel)
and may linger until the 10s client timeout; update performRecoveryCheck to call
checkEndpointHealth(ctx, endpoint) and modify checkEndpointHealth to accept a
context parameter and create its HTTP request with
http.NewRequestWithContext(ctx, ...) so the monitoring goroutine's context
cancels in-flight requests during graceful shutdown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9f855dac-6cac-42d2-ba09-74d9f86a119f
📒 Files selected for processing (2)
internal/health/checker.gointernal/server/rate_limit_scheduler.go
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/server/rate_limit_scheduler.go (2)
397-401: Rate limit code coverage may need extension for other providers.Currently only
-32005is detected. While this covers Infura, Alchemy, and other major providers, some RPC providers may use different error codes (e.g.,-32000range server errors with rate-limit messages). Consider extending this if you encounter providers returning rate limits via different codes.For now, the implementation correctly handles the most common case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/rate_limit_scheduler.go` around lines 397 - 401, The current isJSONRPCRateLimitCode only matches -32005; update it to also treat other common server-error codes as rate limits by expanding checks in isJSONRPCRateLimitCode to include a small range (e.g., any code in the -32000 to -32005 range) and/or explicitly include other known provider codes if discovered; optionally also accept a secondary indicator by checking the error message for rate-limit keywords (e.g., "rate limit", "request limit", "exceeded") when available, so the function first checks code inclusively (range or set) and falls back to message keyword matching when present.
319-395: Consider extracting shared JSON-RPC call logic.This function duplicates much of
health.Checker.makeRPCCall: HTTP request construction, header setting, response parsing, and JSON-RPC error handling. The differences (timeout, error handling semantics) could be parameterized.A shared helper could reduce maintenance burden and ensure consistent behavior. However, the current implementation is correct and functional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/rate_limit_scheduler.go` around lines 319 - 395, checkEndpointHealth duplicates JSON-RPC request/response logic already in health.Checker.makeRPCCall; extract a shared helper (e.g., makeJSONRPCCall or rpcClient.Call) that accepts parameters for URL, payload/method, timeout, and a flag controlling rate-limit semantics, then replace the logic in checkEndpointHealth to call this helper and reuse health.Checker.makeRPCCall where appropriate; update checkEndpointHealth to call the shared helper, then use health.RpcResponse (and health.ParseBlockNumber) and isJSONRPCRateLimitCode only for post-call validation and logging (use helpers.RedactAPIKey for logs) so behavior (10s timeout and distinct rate-limit handling) is preserved while avoiding duplicated HTTP creation, header setting, body reading, and JSON unmarshalling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/server/rate_limit_scheduler.go`:
- Around line 397-401: The current isJSONRPCRateLimitCode only matches -32005;
update it to also treat other common server-error codes as rate limits by
expanding checks in isJSONRPCRateLimitCode to include a small range (e.g., any
code in the -32000 to -32005 range) and/or explicitly include other known
provider codes if discovered; optionally also accept a secondary indicator by
checking the error message for rate-limit keywords (e.g., "rate limit", "request
limit", "exceeded") when available, so the function first checks code
inclusively (range or set) and falls back to message keyword matching when
present.
- Around line 319-395: checkEndpointHealth duplicates JSON-RPC request/response
logic already in health.Checker.makeRPCCall; extract a shared helper (e.g.,
makeJSONRPCCall or rpcClient.Call) that accepts parameters for URL,
payload/method, timeout, and a flag controlling rate-limit semantics, then
replace the logic in checkEndpointHealth to call this helper and reuse
health.Checker.makeRPCCall where appropriate; update checkEndpointHealth to call
the shared helper, then use health.RpcResponse (and health.ParseBlockNumber) and
isJSONRPCRateLimitCode only for post-call validation and logging (use
helpers.RedactAPIKey for logs) so behavior (10s timeout and distinct rate-limit
handling) is preserved while avoiding duplicated HTTP creation, header setting,
body reading, and JSON unmarshalling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f20896df-e8c1-48dd-96f6-344369b58af8
📒 Files selected for processing (3)
internal/health/checker.gointernal/server/rate_limit_scheduler.gointernal/server/rate_limit_scheduler_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/server/rate_limit_scheduler_test.go
Summary by CodeRabbit
Bug Fixes
Tests