fix: set restart limits to 0 to prevent being marked as failed #1952
Conversation
samrose left a comment
We'll need to create a testing AMI to thoroughly test these changes out. Will request @LGUG2Z to perform these tests as he's also going to be helping us find ways to automate these testing approaches.
samrose left a comment
When we ultimately merge this, we should bump the versions in ansible/vars.yml to create a release for these changes. This way, it will be a distinct change instead of bundled with other changes.
Hi @samrose - I've just updated the branch. Any updates on this?
3ef31ba to c89c805
samrose left a comment
Just needs a rebase
samrose left a comment
I would like @Crispy1975 or @delgado3d from infra/data to review when they have some time; just being defensive about changes which could impact stability, and we need more eyes on these changes.
The systemd defaults for these values are 10s / 5, with a DefaultRestartUSec of 100ms. Most services set RestartSec=3; under most circumstances it takes about 15s to restart 5 times, so the burst limit is not reached within the 10s window. However, if other system processes (salt, cloud-init) restart it explicitly, or recovering system services within the --before chain trigger a restart, the limit can be exceeded, causing the unit to be marked as failed. Since no services mark gotrue.service as required, it will remain offline until the next explicit restart is issued. Setting these values to 0 with Restart=always and RestartSec=3 will prevent gotrue from being marked as failed.
I've noticed all non-oneshot services set a `RestartSec` of `3s`, and we use the systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together this means that under typical conditions a service will be restarted indefinitely until it comes back up, because `(3s * 5) > 10s`, but it is still possible for a service to enter a failed state under some scenarios. This change defensively sets both limits to 0/0 to keep services in restart loops.
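For illustration, here is a minimal sketch of the directives this change describes; the placement is representative, and the actual Jinja2 templates under `ansible/files/` contain additional settings:

```ini
[Unit]
# systemd defaults are StartLimitIntervalSec=10s / StartLimitBurst=5.
# Setting both to 0 disables start-rate limiting, so the unit can no
# longer be put into the "failed" state for restarting too often.
StartLimitIntervalSec=0
StartLimitBurst=0

[Service]
# Keep restarting on any exit, waiting 3 seconds between attempts.
Restart=always
RestartSec=3
```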
c89c805 to a0f7be8
Walkthrough

Seven systemd unit templates had start-rate limits disabled (StartLimitIntervalSec and StartLimitBurst set to 0). The gotrue unit was expanded with a comprehensive [Service] block (working dir, exec, reload, user, restart, memory controls, environment files and reload-related env vars). Postgres version strings were bumped.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
📜 Recent review details

Configuration used: Repository UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ansible/files/gotrue.service.j2 (2)
69-70: MemoryMax=50% could cause OOM-triggered restart loops.

Setting `MemoryMax=50%` (of total system memory) will cause the kernel OOM killer to terminate gotrue if it exceeds this limit. Combined with `Restart=always` and disabled start limiting, this could create a restart loop if gotrue's memory usage grows beyond 50%. While 50% is generous, consider:
- Is this limit appropriate for the expected workload and instance types?
- Does gotrue have memory leaks or scenarios where it might grow beyond 50%?
- Should there be monitoring/alerting specifically for memory-based restarts?
If gotrue's memory requirements are well-understood and 50% is intentionally generous, this configuration is acceptable. Otherwise, consider adjusting the limit or adding memory usage monitoring.
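If a softer boundary is preferred, one possible adjustment (illustrative only, not what this PR ships) is to pair a throttling `MemoryHigh=` threshold with the hard `MemoryMax=` cap:

```ini
[Service]
# Soft limit: above this, memory is reclaimed aggressively and the
# process is throttled rather than killed (40% is an illustrative value).
MemoryHigh=40%
# Hard limit: beyond this, the kernel OOM killer terminates gotrue,
# which combined with Restart=always triggers another restart.
MemoryMax=50%
```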
74-76: Create `/etc/gotrue.env` or reconsider the EnvironmentFile requirement.

The EnvironmentFile directive on line 75 specifies `/etc/gotrue.env` without the optional minus prefix, making it a required file for service startup. However, `ansible/tasks/setup-gotrue.yml` does not create this file; only the service template itself is deployed. The service will fail to start if this file is missing. Either add a task to create or copy this file in the deployment process, or change the prefix to `-/etc/gotrue.env` to make it optional.
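For reference, the optional-file form suggested above is a one-character change in the unit template; a minimal sketch:

```ini
[Service]
# Required: the unit fails to start if /etc/gotrue.env is missing.
#EnvironmentFile=/etc/gotrue.env

# Optional: the leading "-" tells systemd to silently skip a missing file.
EnvironmentFile=-/etc/gotrue.env
```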
🧹 Nitpick comments (7)
ansible/files/vector.service.j2 (1)
7-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. While this addresses the external restart issue documented in the PR, it could mask genuine service failures or cause resource exhaustion if Vector has a startup bug. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when Vector enters a persistent restart loop so operators are notified of genuine failures rather than services silently restarting forever.
ansible/files/nginx.service.j2 (1)
6-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. Since nginx is a critical gateway service (proxies postgrest, gotrue, adminapi per line 3), a persistent restart loop could impact availability. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when nginx enters a persistent restart loop so operators are notified of genuine failures.
ansible/files/adminapi.service.j2 (1)
6-7: Removing existing rate limiting protection.

This changes from relatively generous limits (60s/10 bursts) to completely disabled (0/0). While this aligns with the PR's goal to prevent false failures from external restarts, it removes existing protection against rapid restart loops. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when AdminAPI enters a persistent restart loop so operators are notified of genuine failures.
ansible/files/pg_egress_collect.service.j2 (1)
4-5: Disabling rate limiting for a root service: higher risk of resource exhaustion.

Setting both limits to 0 allows indefinite restarts. Since this service runs as `User=root` (line 10) and executes tcpdump (packet capture), a bug causing rapid restarts could consume significant system resources. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when pg_egress_collect enters a persistent restart loop. Consider whether this service truly needs unlimited restart attempts or if a higher (but non-zero) limit would be safer given root execution.
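If unlimited restarts for a root-owned capture process are considered too permissive, a non-zero middle ground could look like the sketch below (the 300s/20 values are purely illustrative and not part of this PR):

```ini
[Unit]
# Tolerate bursts of external restarts, but still give up if the
# service fails to start 20 times within a 5-minute window.
StartLimitIntervalSec=300
StartLimitBurst=20

[Service]
Restart=always
RestartSec=3
```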
ansible/files/postgres_exporter.service.j2 (1)
4-5: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. While postgres_exporter is observability infrastructure (less critical than the services it monitors), a persistent restart loop could impact metrics collection and visibility into database health. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when postgres_exporter enters a persistent restart loop.
ansible/files/postgrest.service.j2 (1)
6-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. Since PostgREST is a critical API service (nginx depends on it per ansible/files/nginx.service.j2 line 3), a persistent restart loop could impact API availability. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when postgrest enters a persistent restart loop so operators are notified of genuine failures.
ansible/files/gotrue.service.j2 (1)
43-55: Excellent documentation of the rationale for disabling rate limiting.

The detailed comment clearly explains the problem (external/chained restarts exceeding systemd defaults) and the solution (disabling limits with 0/0 values). This will help future maintainers understand the architectural decision.
As with the other services, ensure monitoring/alerting can detect when gotrue enters a persistent restart loop so operators are notified of genuine failures rather than services silently restarting forever.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- ansible/files/adminapi.service.j2
- ansible/files/gotrue.service.j2
- ansible/files/nginx.service.j2
- ansible/files/pg_egress_collect.service.j2
- ansible/files/postgres_exporter.service.j2
- ansible/files/postgrest.service.j2
- ansible/files/vector.service.j2
🔇 Additional comments (2)
ansible/files/gotrue.service.j2 (2)
59-63: The configuration is already documented and properly set up.

The service file itself explicitly states (lines 61 and 78) that both v2 and v3 of gotrue support the `--config-dir` flag for reloading and signal-based reloading via SIGUSR1 (signal 10). The related environment variables are correctly configured: `GOTRUE_RELOADING_SIGNAL_ENABLED=true`, `GOTRUE_RELOADING_SIGNAL_NUMBER=10`, and supporting parameters for graceful reload behavior. No verification issues found.
79-98: Configuration correctly implements reload signaling with appropriate version-specific handling.

The environment variables are properly supported; the code itself documents that both gotrue v2 and v3 support reloading via signals. Signal number 10 (SIGUSR1) correctly matches the `ExecReload=/bin/kill -10 $MAINPID` directive. The conditional `GOTRUE_RELOADING_NOTIFY_ENABLED` logic appropriately reflects the difference between versions: v3 disables filesystem notifications (enabled only in qemu_mode for v2 compatibility), while the grace period and signal settings apply consistently across both versions.
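Pulling the reload-related pieces mentioned in these comments together, a simplified sketch follows; the real template sources these variables from environment files and Jinja2 conditionals rather than hard-coding them:

```ini
[Service]
# SIGUSR1 (signal 10) asks gotrue to reload its configuration.
ExecReload=/bin/kill -10 $MAINPID
# Enable signal-based reloading and tell gotrue which signal to expect.
Environment=GOTRUE_RELOADING_SIGNAL_ENABLED=true
Environment=GOTRUE_RELOADING_SIGNAL_NUMBER=10
```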
samrose left a comment
Switching review back to changes requested to prevent merge until testing is complete.
ansible/vars.yml (Outdated)
```yaml
postgresorioledb-17: "17.6.0.024-orioledb-auth-1"
postgres17: "17.6.1.067-auth-1"
postgres15: "15.14.1.067-auth-1"
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
# First, let's examine the full context of ansible/vars.yml
wc -l ansible/vars.yml

# Get full content of ansible/vars.yml to understand its purpose and scope
cat -n ansible/vars.yml

# Find all systemd service files to understand the PR's actual changes
find . -name "*.service.j2" -type f

# Search for gotrue-related files and their content
find . -name "*gotrue*" -type f

# Check if postgres versions are referenced in service files or gotrue-related configurations
rg "17\.6\.0\.024|17\.6\.1\.067|15\.14\.1\.067|postgresorioledb|postgres17|postgres15" --type-add 'service:*.j2' -n

# Look for postgres-related configurations in ansible directory
find ansible -type f \( -name "*.yml" -o -name "*.j2" \) | head -20

# Check gotrue.service.j2 to see restart limit configuration
cat -n ansible/files/gotrue.service.j2

# Check if postgres_release variable is used in ansible tasks
rg "postgres_release" --type-add 'yaml:*.yml' --type yaml -n

# Search for StartLimitBurst/StartLimitIntervalSec in all service files
rg "StartLimit|Restart=" ansible/files/ -n

# Look for recent git changes to understand what this PR actually modifies
git diff HEAD~1..HEAD --stat 2>/dev/null | head -20 || echo "Git history not available"

# Check if postgres_release is used in any setup tasks or deployment scripts
rg "postgres_release" --type-add 'yaml:*.yml' --type yaml --type shell -n

# Check Dockerfile and other potential deployment files
find . -name "Dockerfile*" -o -name "*.sh" | xargs grep -l "postgres_release" 2>/dev/null || echo "No matches in Dockerfile or shell scripts"
```
Unrelated changes mixed with the PR objective.
These Postgres version updates (lines 13-15) are unrelated to the PR's stated objective of fixing gotrue restart limits. The actual restart limit changes are in systemd service files (gotrue.service.j2 lines 54-55 with StartLimitIntervalSec=0 and StartLimitBurst=0), not in this vars file.
Mixing unrelated changes in a single PR complicates testing, rollback, and change tracking. Additionally, these version strings do not appear to be referenced anywhere else in the codebase, suggesting they may be unused or incomplete.
Separate these Postgres version updates into a dedicated PR.
🤖 Prompt for AI Agents
In ansible/vars.yml around lines 13 to 15, the Postgres version bumps
(postgresorioledb-17/postgres17/postgres15) are unrelated to the gotrue
restart-limit change and must be removed from this PR: revert those three lines
to their previous values (or delete them if they were newly added), move any
intentional Postgres version updates to a separate branch/PR, and if these
variables are intended to be kept, confirm they are actually referenced
elsewhere (remove unused entries). Ensure the current PR only contains the
systemd/gotrue service changes and open a distinct PR for Postgres version
updates with proper testing notes.
Local infra test result, engines test:

upgrade test pg 15
samrose left a comment
tested in local infra
* fix: set restart limits to 0 to prevent being marked as failed

  The systemd default is 10s / 5 for these values with a DefaultRestartUSec of 100ms. Most services set a RestartSec limit of 3, under most circumstances it takes 15s to restart 5 times so the limit of 10s is not exceeded. However if other system processes (salt, cloud init) restart it explicitly, or recovering system services within the --before chain trigger a restart the limit can be exceeded causing it to be marked as failed. Since no services mark gotrue.service as required it will remain offline until the next explicit restart is issued. Setting these values to 0 with Restart=always and RestartSec=3 will prevent gotrue from being marked as failed.

* chore: set StartLimits for persistent services.

  I've noticed all !oneshot services set a `RestartSec` of `3s` and we use the systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together this forms a property that under typical conditions a service will be restarted indefinitely until it comes back up due to `(3s * 5) > 10s`, but it is still possible for a service to enter a failed state under some scenarios. This change defensively sets them to 0/0 to keep them in restart loops.

* chore: suffix to test

* chore: bump to release

---------

Co-authored-by: Chris Stockton <chris.stockton@supabase.io>
Co-authored-by: Sam Rose <samuel@supabase.io>

chore: bump version to correct new version for potential release