
fix: recovery check for rate-limited endpoints #28

Merged
sanbotto merged 6 commits into main from fix/rate-limit-check
Mar 9, 2026

Conversation

Member

@sanbotto sanbotto commented Mar 6, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Improved endpoint health and recovery checks to validate JSON‑RPC responses, detect RPC-level rate limits, and handle HTTP 429 conditions so rate‑limited or malfunctioning endpoints are recovered more reliably.
  • Tests

    • Updated health check tests to use valid JSON‑RPC responses and verify RPC and rate‑limit handling.

@sanbotto sanbotto self-assigned this Mar 6, 2026

coderabbitai bot commented Mar 6, 2026

Walkthrough

Health checks now POST a JSON-RPC eth_blockNumber payload and validate HTTP status, JSON-RPC body, and parsed block number. HTTP 429 and JSON-RPC errors (including rate-limit codes) are detected. RpcResponse was exported and extended with Error.Message; checkRPCError and RPC call sites updated. ParseBlockNumber is exported. checkEndpointHealth and performRecoveryCheck now accept a context.Context. RateLimitScheduler gained shutdownCancel context.CancelFunc and wg sync.WaitGroup. Tests updated to return JSON-RPC application/json responses.
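The validation flow the walkthrough describes can be sketched as follows. This is an illustrative standalone version, not the repository's exact code: the `rpcResponse` struct, `parseBlockNumber`, and `validateRPCBody` names are assumptions standing in for the exported `RpcResponse` and `ParseBlockNumber`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// rpcResponse mirrors the JSON-RPC response shape described in the
// walkthrough; field names here are illustrative.
type rpcResponse struct {
	Result string `json:"result"`
	Error  *struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// parseBlockNumber converts an eth_blockNumber hex string (e.g. "0x10d4f")
// into a uint64, rejecting empty or malformed values.
func parseBlockNumber(hex string) (uint64, error) {
	s := strings.TrimPrefix(hex, "0x")
	if s == "" {
		return 0, fmt.Errorf("empty block number")
	}
	return strconv.ParseUint(s, 16, 64)
}

// validateRPCBody applies the checks the walkthrough lists: a well-formed
// JSON-RPC body, no error object, and a parseable block number.
func validateRPCBody(body []byte) bool {
	var resp rpcResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false
	}
	if resp.Error != nil {
		return false
	}
	_, err := parseBlockNumber(resp.Result)
	return err == nil
}

func main() {
	fmt.Println(validateRPCBody([]byte(`{"jsonrpc":"2.0","id":1,"result":"0x10d4f"}`))) // true
	fmt.Println(validateRPCBody([]byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32005,"message":"rate limited"}}`))) // false
}
```

A body that is HTTP 200 but carries a JSON-RPC error (or a null/unparseable result) fails validation, which is what lets rate-limited endpoints be detected even without a 429 status.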

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 37.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly and specifically summarizes the main change: implementing recovery checks for rate-limited endpoints, which aligns with the primary focus across all modified files.



@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/server/rate_limit_scheduler.go (1)

326-343: ⚠️ Potential issue | 🟡 Minor

Use http.NewRequestWithContext for proper cancellation support.

The function uses http.NewRequest without context, preventing graceful cancellation during shutdown. The performRecoveryCheck caller has a context available that should be passed down.

🔧 Proposed fix
-func (rls *RateLimitScheduler) checkEndpointHealth(endpoint config.Endpoint) bool {
+func (rls *RateLimitScheduler) checkEndpointHealth(ctx context.Context, endpoint config.Endpoint) bool {
 	if endpoint.HTTPURL == "" {
 		return false
 	}

 	// Create a simple HTTP client with timeout
 	client := &http.Client{
 		Timeout: 10 * time.Second,
 	}

 	// Create a proper JSON-RPC request (same as regular health checks)
 	payload := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)
-	req, err := http.NewRequest("POST", endpoint.HTTPURL, bytes.NewBuffer(payload))
+	req, err := http.NewRequestWithContext(ctx, "POST", endpoint.HTTPURL, bytes.NewBuffer(payload))

Then update the call site at line 258:

-	success := rls.checkEndpointHealth(endpoint)
+	success := rls.checkEndpointHealth(ctx, endpoint)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/rate_limit_scheduler.go` around lines 326 - 343,
checkEndpointHealth currently creates requests without a context; change its
signature to accept a context (e.g., checkEndpointHealth(ctx context.Context,
endpoint config.Endpoint)), update all call sites (notably performRecoveryCheck)
to pass the caller's context, and replace http.NewRequest(...) with
http.NewRequestWithContext(ctx, "POST", endpoint.HTTPURL,
bytes.NewBuffer(payload)); ensure imports include context and adjust any tests
or callers accordingly so shutdown/cancellation propagates.
🧹 Nitpick comments (2)
internal/server/rate_limit_scheduler.go (2)

19-26: Consider consolidating the duplicated rpcResponse type.

This struct is identical to the one defined in internal/health/checker.go (lines 28-35). Having two copies creates a maintenance burden and violates DRY.

Options to consolidate:

  1. Export RpcResponse from the health package and import it here
  2. Move the type to a shared internal package (e.g., internal/types or internal/rpc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/rate_limit_scheduler.go` around lines 19 - 26, The
rpcResponse struct is duplicated (also defined in internal/health/checker.go);
remove the local rpcResponse and reuse a single shared type: either export the
existing type (e.g., rename to RpcResponse in the health package) and import
health.RpcResponse here, or move the definition into a new internal package
(e.g., internal/rpc or internal/types) and import that single RpcResponse type;
update usages in rate_limit_scheduler.go (references to rpcResponse) to the
chosen exported type and remove the duplicate struct declaration.

380-388: Log message may be misleading for non-rate-limit JSON-RPC errors.

The log states "still rate limited" for any JSON-RPC error, but errors like -32601 (method not found) or -32000 (internal error) aren't rate-limit related. Consider adjusting the log message or differentiating based on error codes.

💡 Suggested improvement
 	// Check for JSON-RPC errors (rate limits could come as JSON-RPC errors with 200 HTTP status)
 	if rpcResp.Error != nil {
 		log.Debug().
 			Str("url", helpers.RedactAPIKey(endpoint.HTTPURL)).
 			Int("code", rpcResp.Error.Code).
 			Str("message", rpcResp.Error.Message).
-			Msg("Recovery check received JSON-RPC error, still rate limited")
+			Msg("Recovery check received JSON-RPC error")
 		return false
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/rate_limit_scheduler.go` around lines 380 - 388, The current
log unconditionally states "still rate limited" when rpcResp.Error != nil;
change it to differentiate rate-limit vs other JSON-RPC errors by inspecting
rpcResp.Error.Code (and/or implement a small helper like
isJSONRPCRateLimitCode(code int) used by the rate-limit scheduler) and emit
distinct log messages: e.g., for known rate-limit codes log "Recovery check
received JSON-RPC rate-limit error" and for others log "Recovery check received
JSON-RPC error (not rate-limit)"; keep using
helpers.RedactAPIKey(endpoint.HTTPURL) and include rpcResp.Error.Code and
rpcResp.Error.Message as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 01f3b562-a327-4bfa-be9c-fd3fc1fcbbc1

📥 Commits

Reviewing files that changed from the base of the PR and between cc768f9 and db46138.

📒 Files selected for processing (2)
  • internal/server/rate_limit_scheduler.go
  • internal/server/rate_limit_scheduler_test.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/health/checker.go (1)

38-67: ⚠️ Potential issue | 🟠 Major

Surface JSON-RPC throttle codes to the rate-limit handler.

checkRPCError now sees the RPC error body, but it still collapses every non-eth_syncing failure into a plain error, and makeRPCCall only invokes HandleRateLimitFunc for HTTP 429. A 200 OK response with a throttle code in error.code will therefore be treated as generic unhealthy instead of rate-limited, so it stays on the normal health-check path instead of entering the backoff/recovery flow. Bubble up a sentinel/typed rate-limit error here and handle it alongside the HTTP 429 branch.

Also applies to: 449-467

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/health/checker.go` around lines 38 - 67, checkRPCError currently
turns every non-eth_syncing JSON-RPC error into a generic error, so RPC
responses that are HTTP 200 but contain throttle/rate-limit codes in
response.Error.Code never reach the rate-limit handler; update checkRPCError to
detect known throttle/error codes (e.g., your RPC provider's throttle codes) and
return a sentinel/typed error (create ErrRPCRateLimited or similar) instead of a
plain errors.New, keep the existing ErrMethodNotFound behavior for eth_syncing;
then update makeRPCCall (and any caller that checks HTTP 429) to treat that
sentinel ErrRPCRateLimited the same as an HTTP 429 by invoking
HandleRateLimitFunc and entering the backoff/recovery flow. Ensure logging still
records response.Error.Code/Message and reference checkRPCError, makeRPCCall,
HandleRateLimitFunc, ErrMethodNotFound, and the new ErrRPCRateLimited in your
changes.
internal/server/rate_limit_scheduler.go (1)

47-68: ⚠️ Potential issue | 🟠 Major

Thread the monitoring context into checkEndpointHealth to enable cancellation during graceful shutdown.

The performRecoveryCheck method has access to ctx from the monitoring goroutine, but it passes only the endpoint to checkEndpointHealth, which creates an HTTP request without context awareness. If the backend stalls during shutdown, the HTTP request will continue independently until the hardcoded 10-second client timeout fires, preventing graceful shutdown from meeting its deadline.

Pass ctx through to checkEndpointHealth and use http.NewRequestWithContext to allow the monitoring context to cancel in-flight requests when Shutdown() is called:

Suggested change
-func (rls *RateLimitScheduler) checkEndpointHealth(endpoint config.Endpoint) bool {
+func (rls *RateLimitScheduler) checkEndpointHealth(ctx context.Context, endpoint config.Endpoint) bool {
 	// ...
-	req, err := http.NewRequest("POST", endpoint.HTTPURL, bytes.NewBuffer(payload))
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint.HTTPURL, bytes.NewReader(payload))
// Update the call in performRecoveryCheck:
success := rls.checkEndpointHealth(ctx, endpoint)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/rate_limit_scheduler.go` around lines 47 - 68,
performRecoveryCheck currently calls checkEndpointHealth(endpoint) without
passing the monitoring context, so in-flight HTTP requests can't be cancelled
during Shutdown (rls.shutdownCancel) and may linger until the 10s client
timeout; update performRecoveryCheck to call checkEndpointHealth(ctx, endpoint)
and modify checkEndpointHealth to accept a context parameter and create its HTTP
request with http.NewRequestWithContext(ctx, ...) so the monitoring goroutine's
context cancels in-flight requests during graceful shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/server/rate_limit_scheduler.go`:
- Around line 359-387: The recovery check currently treats a 200 JSON-RPC
response with no rpcResp.Error as success even if rpcResp.Result is missing or
null; update the recovery logic (around the JSON-RPC handling for
health.RpcResponse) to validate the eth_blockNumber result the same way
internal/health/checker.go does: ensure rpcResp.Result is non-empty, parse the
hex block number into an integer (reject null/unparseable values), and only
return true when parsing succeeds; keep existing JSON-RPC error handling and
reuse or extract the shared parsing helper used by checker.go (or call that
helper) so isJSONRPCRateLimitCode and the RpcResponse flow remain intact.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9f855dac-6cac-42d2-ba09-74d9f86a119f

📥 Commits

Reviewing files that changed from the base of the PR and between db46138 and 06ee778.

📒 Files selected for processing (2)
  • internal/health/checker.go
  • internal/server/rate_limit_scheduler.go


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
internal/server/rate_limit_scheduler.go (2)

397-401: Rate limit code coverage may need extension for other providers.

Currently only -32005 is detected. While this covers Infura, Alchemy, and other major providers, some RPC providers may use different error codes (e.g., -32000 range server errors with rate-limit messages). Consider extending this if you encounter providers returning rate limits via different codes.

For now, the implementation correctly handles the most common case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/rate_limit_scheduler.go` around lines 397 - 401, The current
isJSONRPCRateLimitCode only matches -32005; update it to also treat other common
server-error codes as rate limits by expanding checks in isJSONRPCRateLimitCode
to include a small range (e.g., any code in the -32000 to -32005 range) and/or
explicitly include other known provider codes if discovered; optionally also
accept a secondary indicator by checking the error message for rate-limit
keywords (e.g., "rate limit", "request limit", "exceeded") when available, so
the function first checks code inclusively (range or set) and falls back to
message keyword matching when present.
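The extension the comment describes could look like this. The code range and keyword list are assumptions for illustration; real providers vary and the set would need tuning:

```go
package main

import (
	"fmt"
	"strings"
)

// isJSONRPCRateLimitCode matches -32005 directly, then falls back to
// keyword matching for providers that report throttling under generic
// server-error codes in the -32000 range. The keyword list is illustrative,
// not exhaustive.
func isJSONRPCRateLimitCode(code int, message string) bool {
	if code == -32005 {
		return true
	}
	// Generic server-error range: decide based on the message text.
	if code <= -32000 && code > -32100 {
		msg := strings.ToLower(message)
		for _, kw := range []string{"rate limit", "request limit", "exceeded", "too many requests"} {
			if strings.Contains(msg, kw) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(isJSONRPCRateLimitCode(-32005, ""))                    // true
	fmt.Println(isJSONRPCRateLimitCode(-32000, "rate limit exceeded")) // true
	fmt.Println(isJSONRPCRateLimitCode(-32601, "method not found"))    // false
}
```

Checking the code first and the message only as a fallback keeps well-defined codes like -32601 (method not found) from ever being misclassified, even if their message happens to contain a keyword.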

319-395: Consider extracting shared JSON-RPC call logic.

This function duplicates much of health.Checker.makeRPCCall: HTTP request construction, header setting, response parsing, and JSON-RPC error handling. The differences (timeout, error handling semantics) could be parameterized.

A shared helper could reduce maintenance burden and ensure consistent behavior. However, the current implementation is correct and functional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/rate_limit_scheduler.go` around lines 319 - 395,
checkEndpointHealth duplicates JSON-RPC request/response logic already in
health.Checker.makeRPCCall; extract a shared helper (e.g., makeJSONRPCCall or
rpcClient.Call) that accepts parameters for URL, payload/method, timeout, and a
flag controlling rate-limit semantics, then replace the logic in
checkEndpointHealth to call this helper and reuse health.Checker.makeRPCCall
where appropriate; update checkEndpointHealth to call the shared helper, then
use health.RpcResponse (and health.ParseBlockNumber) and isJSONRPCRateLimitCode
only for post-call validation and logging (use helpers.RedactAPIKey for logs) so
behavior (10s timeout and distinct rate-limit handling) is preserved while
avoiding duplicated HTTP creation, header setting, body reading, and JSON
unmarshalling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f20896df-e8c1-48dd-96f6-344369b58af8

📥 Commits

Reviewing files that changed from the base of the PR and between 06ee778 and 7eff057.

📒 Files selected for processing (3)
  • internal/health/checker.go
  • internal/server/rate_limit_scheduler.go
  • internal/server/rate_limit_scheduler_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/server/rate_limit_scheduler_test.go

@sanbotto sanbotto merged commit 7b7c94a into main Mar 9, 2026
6 checks passed
@sanbotto sanbotto deleted the fix/rate-limit-check branch March 9, 2026 19:16