fix: set restart limits to 0 to prevent being marked as failed #1952
Conversation
samrose left a comment
We'll need to create a testing AMI to thoroughly test these changes out. Will request @LGUG2Z to perform these tests as he's also going to be helping us find ways to automate these testing approaches.
samrose left a comment
When we ultimately merge this, we should bump the versions in ansible/vars.yml to create a release for these changes. This way, it will be a distinct change instead of bundled with other changes.
Hi @samrose - I've just updated the branch. Any updates on this?
3ef31ba to c89c805
samrose left a comment
Just needs a rebase
samrose left a comment
I would like @Crispy1975 or @delgado3d from infra/data to review when they have some time; just being defensive about changes which could impact stability, and we need more eyes on these changes.
The systemd defaults for these values are 10s / 5, with a DefaultRestartUSec of 100ms. Most services set RestartSec=3; under most circumstances it takes about 15s to restart 5 times, so the burst limit is not reached within the 10s window. However, if other system processes (salt, cloud-init) restart it explicitly, or recovering system services within the --before chain trigger a restart, the limit can be exceeded, causing the unit to be marked as failed. Since no services mark gotrue.service as required, it will remain offline until the next explicit restart is issued. Setting these values to 0 with Restart=always and RestartSec=3 will prevent gotrue from being marked as failed.
I've noticed all non-oneshot services set a `RestartSec` of `3s`, and we use the systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together this means that under typical conditions a service will be restarted indefinitely until it comes back up, because `(3s * 5) > 10s`, but it is still possible for a service to enter a failed state under some scenarios. This change defensively sets both limits to 0/0 to keep services in restart loops.
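For illustration, here is a minimal sketch of the directives this change describes; the placement is representative, and the actual Jinja2 templates under `ansible/files/` contain additional settings:

```ini
[Unit]
# systemd defaults are StartLimitIntervalSec=10s / StartLimitBurst=5.
# Setting both to 0 disables start-rate limiting, so the unit can no
# longer be put into the "failed" state for restarting too often.
StartLimitIntervalSec=0
StartLimitBurst=0

[Service]
# Keep restarting on any exit, waiting 3 seconds between attempts.
Restart=always
RestartSec=3
```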
c89c805 to a0f7be8
Walkthrough

Seven systemd unit templates had start-rate limits disabled (StartLimitIntervalSec and StartLimitBurst set to 0). The gotrue unit was expanded with a comprehensive [Service] block (working dir, exec, reload, user, restart, memory controls, environment files and reload-related env vars). Postgres version strings were bumped.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
📜 Recent review details

Configuration used: Repository UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ansible/files/gotrue.service.j2 (2)
69-70: MemoryMax=50% could cause OOM-triggered restart loops.

Setting `MemoryMax=50%` (of total system memory) will cause the kernel OOM killer to terminate gotrue if it exceeds this limit. Combined with `Restart=always` and disabled start limiting, this could create a restart loop if gotrue's memory usage grows beyond 50%. While 50% is generous, consider:
- Is this limit appropriate for the expected workload and instance types?
- Does gotrue have memory leaks or scenarios where it might grow beyond 50%?
- Should there be monitoring/alerting specifically for memory-based restarts?
If gotrue's memory requirements are well-understood and 50% is intentionally generous, this configuration is acceptable. Otherwise, consider adjusting the limit or adding memory usage monitoring.
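If a softer boundary is preferred, one possible adjustment (illustrative only, not what this PR ships) is to pair a throttling `MemoryHigh=` threshold with the hard `MemoryMax=` cap:

```ini
[Service]
# Soft limit: above this, memory is reclaimed aggressively and the
# process is throttled rather than killed (40% is an illustrative value).
MemoryHigh=40%
# Hard limit: beyond this, the kernel OOM killer terminates gotrue,
# which combined with Restart=always triggers another restart.
MemoryMax=50%
```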
74-76: Create `/etc/gotrue.env` or reconsider the EnvironmentFile requirement.

The EnvironmentFile directive on line 75 specifies `/etc/gotrue.env` without the optional minus prefix, making it a required file for service startup. However, `ansible/tasks/setup-gotrue.yml` does not create this file; only the service template itself is deployed. The service will fail to start if this file is missing. Either add a task to create or copy this file in the deployment process, or change the prefix to `-/etc/gotrue.env` to make it optional.
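For reference, the optional-file form suggested above is a one-character change in the unit template; a minimal sketch:

```ini
[Service]
# Required: the unit fails to start if /etc/gotrue.env is missing.
#EnvironmentFile=/etc/gotrue.env

# Optional: the leading "-" tells systemd to silently skip a missing file.
EnvironmentFile=-/etc/gotrue.env
```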
🧹 Nitpick comments (7)
ansible/files/vector.service.j2 (1)
7-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. While this addresses the external restart issue documented in the PR, it could mask genuine service failures or cause resource exhaustion if Vector has a startup bug. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when Vector enters a persistent restart loop so operators are notified of genuine failures rather than services silently restarting forever.
ansible/files/nginx.service.j2 (1)
6-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. Since nginx is a critical gateway service (proxies postgrest, gotrue, adminapi per line 3), a persistent restart loop could impact availability. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when nginx enters a persistent restart loop so operators are notified of genuine failures.
ansible/files/adminapi.service.j2 (1)
6-7: Removing existing rate limiting protection.

This changes from relatively generous limits (60s/10 bursts) to completely disabled (0/0). While this aligns with the PR's goal to prevent false failures from external restarts, it removes existing protection against rapid restart loops. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when AdminAPI enters a persistent restart loop so operators are notified of genuine failures.
ansible/files/pg_egress_collect.service.j2 (1)
4-5: Disabling rate limiting for a root service: higher risk of resource exhaustion.

Setting both limits to 0 allows indefinite restarts. Since this service runs as `User=root` (line 10) and executes tcpdump (packet capture), a bug causing rapid restarts could consume significant system resources. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when pg_egress_collect enters a persistent restart loop. Consider whether this service truly needs unlimited restart attempts or if a higher (but non-zero) limit would be safer given root execution.
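If unlimited restarts for a root-owned capture process are considered too permissive, a non-zero middle ground could look like the sketch below (the 300s/20 values are purely illustrative and not part of this PR):

```ini
[Unit]
# Tolerate bursts of external restarts, but still give up if the
# service fails to start 20 times within a 5-minute window.
StartLimitIntervalSec=300
StartLimitBurst=20

[Service]
Restart=always
RestartSec=3
```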
ansible/files/postgres_exporter.service.j2 (1)
4-5: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. While postgres_exporter is observability infrastructure (less critical than the services it monitors), a persistent restart loop could impact metrics collection and visibility into database health. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when postgres_exporter enters a persistent restart loop.
ansible/files/postgrest.service.j2 (1)
6-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. Since PostgREST is a critical API service (nginx depends on it per ansible/files/nginx.service.j2 line 3), a persistent restart loop could impact API availability. With `RestartSec=3`, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when postgrest enters a persistent restart loop so operators are notified of genuine failures.
ansible/files/gotrue.service.j2 (1)
43-55: Excellent documentation of the rationale for disabling rate limiting.

The detailed comment clearly explains the problem (external/chained restarts exceeding systemd defaults) and the solution (disabling limits with 0/0 values). This will help future maintainers understand the architectural decision.
As with the other services, ensure monitoring/alerting can detect when gotrue enters a persistent restart loop so operators are notified of genuine failures rather than services silently restarting forever.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- ansible/files/adminapi.service.j2
- ansible/files/gotrue.service.j2
- ansible/files/nginx.service.j2
- ansible/files/pg_egress_collect.service.j2
- ansible/files/postgres_exporter.service.j2
- ansible/files/postgrest.service.j2
- ansible/files/vector.service.j2
🔇 Additional comments (2)
ansible/files/gotrue.service.j2 (2)
59-63: The configuration is already documented and properly set up.

The service file itself explicitly states (lines 61 and 78) that both v2 and v3 of gotrue support the `--config-dir` flag for reloading and signal-based reloading via SIGUSR1 (signal 10). The related environment variables are correctly configured: `GOTRUE_RELOADING_SIGNAL_ENABLED=true`, `GOTRUE_RELOADING_SIGNAL_NUMBER=10`, and supporting parameters for graceful reload behavior. No verification issues found.
79-98: Configuration correctly implements reload signaling with appropriate version-specific handling.

The environment variables are properly supported; the code itself documents that both gotrue v2 and v3 support reloading via signals. Signal number 10 (SIGUSR1) correctly matches the `ExecReload=/bin/kill -10 $MAINPID` directive. The conditional `GOTRUE_RELOADING_NOTIFY_ENABLED` logic appropriately reflects the difference between versions: v3 disables filesystem notifications (enabled only in qemu_mode for v2 compatibility), while the grace period and signal settings apply consistently across both versions.
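Pulling the reload-related pieces mentioned in these comments together, a simplified sketch follows; the real template sources these variables from environment files and Jinja2 conditionals rather than hard-coding them:

```ini
[Service]
# SIGUSR1 (signal 10) asks gotrue to reload its configuration.
ExecReload=/bin/kill -10 $MAINPID
# Enable signal-based reloading and tell gotrue which signal to expect.
Environment=GOTRUE_RELOADING_SIGNAL_ENABLED=true
Environment=GOTRUE_RELOADING_SIGNAL_NUMBER=10
```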
samrose left a comment
Switching review back to changes requested to prevent merge until testing is complete.
ansible/vars.yml (Outdated)
```yaml
postgresorioledb-17: "17.6.0.024-orioledb-auth-1"
postgres17: "17.6.1.067-auth-1"
postgres15: "15.14.1.067-auth-1"
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
# First, let's examine the full context of ansible/vars.yml
wc -l ansible/vars.yml

# Get full content of ansible/vars.yml to understand its purpose and scope
cat -n ansible/vars.yml

# Find all systemd service files to understand the PR's actual changes
find . -name "*.service.j2" -type f

# Search for gotrue-related files and their content
find . -name "*gotrue*" -type f

# Check if postgres versions are referenced in service files or gotrue-related configurations
rg "17\.6\.0\.024|17\.6\.1\.067|15\.14\.1\.067|postgresorioledb|postgres17|postgres15" --type-add 'service:*.j2' -n

# Look for postgres-related configurations in ansible directory
find ansible -type f \( -name "*.yml" -o -name "*.j2" \) | head -20

# Check gotrue.service.j2 to see restart limit configuration
cat -n ansible/files/gotrue.service.j2

# Check if postgres_release variable is used in ansible tasks
rg "postgres_release" --type-add 'yaml:*.yml' --type yaml -n

# Search for StartLimitBurst/StartLimitIntervalSec in all service files
rg "StartLimit|Restart=" ansible/files/ -n

# Look for recent git changes to understand what this PR actually modifies
git diff HEAD~1..HEAD --stat 2>/dev/null | head -20 || echo "Git history not available"

# Check if postgres_release is used in any setup tasks or deployment scripts
rg "postgres_release" --type-add 'yaml:*.yml' --type yaml --type shell -n

# Check Dockerfile and other potential deployment files
find . -name "Dockerfile*" -o -name "*.sh" | xargs grep -l "postgres_release" 2>/dev/null || echo "No matches in Dockerfile or shell scripts"
```
Unrelated changes mixed with the PR objective.
These Postgres version updates (lines 13-15) are unrelated to the PR's stated objective of fixing gotrue restart limits. The actual restart limit changes are in systemd service files (gotrue.service.j2 lines 54-55 with StartLimitIntervalSec=0 and StartLimitBurst=0), not in this vars file.
Mixing unrelated changes in a single PR complicates testing, rollback, and change tracking. Additionally, these version strings do not appear to be referenced anywhere else in the codebase, suggesting they may be unused or incomplete.
Separate these Postgres version updates into a dedicated PR.
🤖 Prompt for AI Agents
In ansible/vars.yml around lines 13 to 15, the Postgres version bumps
(postgresorioledb-17/postgres17/postgres15) are unrelated to the gotrue
restart-limit change and must be removed from this PR: revert those three lines
to their previous values (or delete them if they were newly added), move any
intentional Postgres version updates to a separate branch/PR, and if these
variables are intended to be kept, confirm they are actually referenced
elsewhere (remove unused entries). Ensure the current PR only contains the
systemd/gotrue service changes and open a distinct PR for Postgres version
updates with proper testing notes.
Local infra test result, engines test:

upgrade test pg 15
samrose left a comment
tested in local infra
* fix: set restart limits to 0 to prevent being marked as failed

  The systemd default is 10s / 5 for these values with a DefaultRestartUSec of 100ms. Most services set a RestartSec limit of 3, under most circumstances it takes 15s to restart 5 times so the limit of 10s is not exceeded. However if other system processes (salt, cloud init) restart it explicitly, or recovering system services within the --before chain trigger a restart the limit can be exceeded causing it to be marked as failed. Since no services mark gotrue.service as required it will remain offline until the next explicit restart is issued. Setting these values to 0 with Restart=always and RestartSec=3 will prevent gotrue from being marked as failed.

* chore: set StartLimits for persistent services.

  I've noticed all !oneshot services set a `RestartSec` of `3s` and we use the systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together this forms a property that under typical conditions a service will be restarted indefinitely until it comes back up due to `(3s * 5) > 10s`, but it is still possible for a service to enter a failed state under some scenarios. This change defensively sets them to 0/0 to keep them in restart loops.

* chore: suffix to test

* chore: bump to release

---------

Co-authored-by: Chris Stockton <chris.stockton@supabase.io>
Co-authored-by: Sam Rose <samuel@supabase.io>

chore: bump version to correct new version for potential release