@cstockton
Contributor

@cstockton cstockton commented Nov 28, 2025

The systemd defaults for StartLimitIntervalSec and StartLimitBurst are 10s and 5, with a DefaultRestartUSec of 100ms. Most services set RestartSec to 3s, so under most circumstances it takes 15s to restart 5 times and the limit of 5 starts within 10s is not exceeded. However, if other system processes (salt, cloud-init) restart the service explicitly, or recovering system services within the --before chain trigger a restart, the limit can be exceeded, causing the service to be marked as failed. Since no services mark gotrue.service as required, it will remain offline until the next explicit restart is issued.

Setting these values to 0 with Restart=always and RestartSec=3 will prevent gotrue from being marked as failed.
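
The timing argument above can be checked with a small simulation. This is a simplified sketch, not systemd's implementation: it assumes a fixed window that opens at the first start and refuses a start once `burst` starts have already happened inside that window. The function name `hits_start_limit` and the example timestamps are hypothetical.

```python
def hits_start_limit(start_times, interval_s=10.0, burst=5):
    """Return True if a start would be refused under a simplified start limit.

    Assumed model (not systemd's exact accounting): a window opens at the
    first start, lasts interval_s seconds, and permits at most `burst` starts.
    """
    if interval_s == 0 or burst == 0:
        return False  # 0/0 is treated as "rate limiting disabled"
    window_begin = None
    count = 0
    for t in sorted(start_times):
        if window_begin is None or t - window_begin > interval_s:
            window_begin, count = t, 1  # open a new window
        elif count < burst:
            count += 1
        else:
            return True  # one start too many inside the window
    return False


# Automatic restarts only, spaced by RestartSec=3.
automatic = [0, 3, 6, 9, 12, 15]
# The same crash loop plus two hypothetical explicit restarts at t=1s and t=2s.
with_explicit = [0, 1, 2, 3, 6, 9]

print(hits_start_limit(automatic))
print(hits_start_limit(with_explicit))
print(hits_start_limit(with_explicit, interval_s=0, burst=0))
```

Under this model the automatic-only schedule never reaches the limit, the schedule with extra explicit restarts does, and the 0/0 setting never refuses a start.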

Summary by CodeRabbit

  • Chores
    • Disabled start-rate limiting across multiple system services to avoid false failure marks and improve restart behavior.
    • Enhanced the authentication service with reload support, restart/reload behavior, environment-file support, user/run-directory defaults, and memory controls.
    • Bumped Postgres-related package versions for three Postgres releases.


@cstockton cstockton requested review from a team as code owners November 28, 2025 18:14
@cstockton cstockton enabled auto-merge December 2, 2025 13:18
@samrose samrose requested review from darora and pcnc December 2, 2025 13:20
Collaborator

@samrose samrose left a comment

We'll need to create a testing AMI to thoroughly test these changes out. Will request @LGUG2Z to perform these tests as he's also going to be helping us find ways to automate these testing approaches.

@samrose samrose requested a review from LGUG2Z December 2, 2025 13:52
Collaborator

@samrose samrose left a comment

When we ultimately merge this, we should bump the versions in ansible/vars.yml to create a release for these changes. This way, it will be a distinct change instead of bundled with other changes.

@cstockton
Contributor Author

Hi @samrose - I've just updated the branch. Any updates on this?

@cstockton cstockton force-pushed the cs/gotrue-start-limit-fix branch 3 times, most recently from 3ef31ba to c89c805 Compare December 8, 2025 17:28
@samrose samrose self-requested a review December 11, 2025 05:15
Collaborator

@samrose samrose left a comment

Just needs a rebase

Collaborator

@samrose samrose left a comment

I would like @Crispy1975 or @delgado3d from infra data to review when they have some time. I'm being defensive about changes that could impact stability, and we need more eyes on these changes.

Chris Stockton added 2 commits December 30, 2025 20:46
The systemd default is 10s / 5 for these values with a DefaultRestartUSec of
100ms. Most services set a RestartSec limit of 3, under most circumstances it
takes 15s to restart 5 times so the limit of 10s is not exceeded. However if
other system processes (salt, cloud init) restart it explicitly, or recovering
system services within the --before chain trigger a restart the limit can be
exceeded causing it to be marked as failed. Since no services mark
gotrue.service as required it will remain offline until the next explicit
restart is issued.

Setting these values to 0 with Restart=always and RestartSec=3 will prevent
gotrue from being marked as failed.

I've noticed all !oneshot services set a `RestartSec` of `3s` and we use the
systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together
this forms a property that under typical conditions a service will be restarted
indefinitely until it comes back up due to `(3s * 5) > 10s`, but it is still
possible for a service to enter a failed state under some scenarios. This change
defensively sets them to 0/0 to keep them in restart loops.
@samrose samrose force-pushed the cs/gotrue-start-limit-fix branch from c89c805 to a0f7be8 Compare December 31, 2025 01:46
@coderabbitai

coderabbitai bot commented Dec 31, 2025

Walkthrough

Seven systemd unit templates had start-rate limits disabled (StartLimitIntervalSec and StartLimitBurst set to 0). The gotrue unit was expanded with a comprehensive [Service] block (working dir, exec, reload, user, restart, memory controls, environment files and reload-related env vars). Postgres version strings were bumped.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Rate limiting disabled (bulk): ansible/files/adminapi.service.j2, ansible/files/nginx.service.j2, ansible/files/pg_egress_collect.service.j2, ansible/files/postgres_exporter.service.j2, ansible/files/postgrest.service.j2, ansible/files/vector.service.j2 | Added StartLimitIntervalSec=0 and StartLimitBurst=0 in [Unit] to disable systemd start-rate limiting. |
| Gotrue service expanded: ansible/files/gotrue.service.j2 | Changed StartLimitIntervalSec/StartLimitBurst (10/5 → 0/0) and added a full [Service] configuration: WorkingDirectory, ExecStart, ExecReload, User, Restart, RestartSec, MemoryAccounting, MemoryMax, multiple EnvironmentFile entries, several Environment vars for reload behavior, and a conditional GOTRUE_RELOADING_NOTIFY_ENABLED block. |
| Postgres version bumps: ansible/vars.yml | Updated version strings: postgresorioledb-17 17.6.0.023→17.6.0.024-orioledb, postgres17 17.6.1.066→17.6.1.067, postgres15 15.14.1.066→15.14.1.067. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through unit files late and neat,

zeros where the start-limits used to meet,
GoTrue now wakes with env and guarded RAM,
reloads that whisper, restarts that calm,
I twitch my whiskers — services hum sweet.

Pre-merge checks

❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The description explains the technical rationale and systemd behavior, but does not follow the repository's pull request template structure with defined sections. | Expand the description to follow the repository template (with Default or Extension Upgrade sections as appropriate) to ensure consistency with contribution guidelines. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: disabling restart limits across multiple systemd service files to prevent services from being marked as failed. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f3e133 and 0646ba3.

📒 Files selected for processing (1)
  • ansible/vars.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • ansible/vars.yml


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
ansible/files/gotrue.service.j2 (2)

69-70: MemoryMax=50% could cause OOM-triggered restart loops.

Setting MemoryMax=50% (of total system memory) will cause the kernel OOM killer to terminate gotrue if it exceeds this limit. Combined with Restart=always and disabled start limiting, this could create a restart loop if gotrue's memory usage grows beyond 50%.

While 50% is generous, consider:

  • Is this limit appropriate for the expected workload and instance types?
  • Does gotrue have memory leaks or scenarios where it might grow beyond 50%?
  • Should there be monitoring/alerting specifically for memory-based restarts?

If gotrue's memory requirements are well-understood and 50% is intentionally generous, this configuration is acceptable. Otherwise, consider adjusting the limit or adding memory usage monitoring.


74-76: Create /etc/gotrue.env or reconsider the EnvironmentFile requirement.

The EnvironmentFile directive on line 75 specifies /etc/gotrue.env without the optional minus prefix, making it a required file for service startup. However, ansible/tasks/setup-gotrue.yml does not create this file—only the service template itself is deployed. The service will fail to start if this file is missing. Either add a task to create or copy this file in the deployment process, or change the prefix to -/etc/gotrue.env to make it optional.
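
The required-versus-optional distinction described above can be checked mechanically. The following is a minimal sketch, assuming the unit text is available as a string; the helper name `classify_environment_files` and the second path in the sample are hypothetical, and only `/etc/gotrue.env` comes from the review comment.

```python
def classify_environment_files(unit_text):
    """Split EnvironmentFile= paths into (required, optional) lists.

    A leading "-" on the path marks the file as optional: a missing
    file is then not treated as a startup error.
    """
    required, optional = [], []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith("EnvironmentFile="):
            continue
        value = line[len("EnvironmentFile="):].strip()
        if value.startswith("-"):
            optional.append(value[1:])
        elif value:
            required.append(value)
    return required, optional


# Hypothetical excerpt; the second path is invented for illustration.
sample_unit = """
[Service]
EnvironmentFile=/etc/gotrue.env
EnvironmentFile=-/etc/gotrue.overrides.env
"""

required, optional = classify_environment_files(sample_unit)
print("required:", required)
print("optional:", optional)
```

A deployment check could then confirm that every path in `required` is actually created by a provisioning task before the service is enabled.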

🧹 Nitpick comments (7)
ansible/files/vector.service.j2 (1)

7-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. While this addresses the external restart issue documented in the PR, it could mask genuine service failures or cause resource exhaustion if Vector has a startup bug. With RestartSec=3, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when Vector enters a persistent restart loop so operators are notified of genuine failures rather than services silently restarting forever.
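
One way to approach the monitoring suggested here is to sample the unit's restart counter twice and compare the rate against a threshold. This is a sketch only: it assumes the counter is read from the `NRestarts` property printed by `systemctl show`, and the helper names and the threshold of 10 restarts per minute are hypothetical choices, not values from this PR.

```python
def parse_n_restarts(show_output):
    """Extract the integer from an 'NRestarts=<n>' line of systemctl show output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "NRestarts":
            return int(value)
    raise ValueError("NRestarts not found in output")


def restarts_per_minute(before, after, elapsed_s):
    """Average restart rate between two counter samples."""
    return (after - before) * 60.0 / elapsed_s


def in_restart_loop(before, after, elapsed_s, threshold_per_min=10.0):
    """Flag a unit whose restart rate meets the (hypothetical) threshold."""
    return restarts_per_minute(before, after, elapsed_s) >= threshold_per_min


# Two hypothetical samples taken 60 seconds apart.
first = parse_n_restarts("NRestarts=40\n")
second = parse_n_restarts("NRestarts=60\n")

print(restarts_per_minute(first, second, 60))
print(in_restart_loop(first, second, 60))
```

With RestartSec=3 the rate tops out near 60 / 3 = 20 restarts per minute, and in practice sits somewhat lower because each attempt also spends time starting and failing, so a threshold well below 20 is a reasonable starting point.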

ansible/files/nginx.service.j2 (1)

6-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. Since nginx is a critical gateway service (proxies postgrest, gotrue, adminapi per line 3), a persistent restart loop could impact availability. With RestartSec=3, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when nginx enters a persistent restart loop so operators are notified of genuine failures.

ansible/files/adminapi.service.j2 (1)

6-7: Removing existing rate limiting protection.

This changes from relatively generous limits (60s/10 bursts) to completely disabled (0/0). While this aligns with the PR's goal to prevent false failures from external restarts, it removes existing protection against rapid restart loops. With RestartSec=3, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when AdminAPI enters a persistent restart loop so operators are notified of genuine failures.

ansible/files/pg_egress_collect.service.j2 (1)

4-5: Disabling rate limiting for a root service: higher risk of resource exhaustion.

Setting both limits to 0 allows indefinite restarts. Since this service runs as User=root (line 10) and executes tcpdump (packet capture), a bug causing rapid restarts could consume significant system resources. With RestartSec=3, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when pg_egress_collect enters a persistent restart loop. Consider whether this service truly needs unlimited restart attempts or if a higher (but non-zero) limit would be safer given root execution.

ansible/files/postgres_exporter.service.j2 (1)

4-5: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. While postgres_exporter is observability infrastructure (less critical than the services it monitors), a persistent restart loop could impact metrics collection and visibility into database health. With RestartSec=3, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when postgres_exporter enters a persistent restart loop.

ansible/files/postgrest.service.j2 (1)

6-8: Disabling rate limiting: ensure monitoring detects restart loops.

Setting both limits to 0 allows indefinite restarts. Since PostgREST is a critical API service (nginx depends on it per ansible/files/nginx.service.j2 line 3), a persistent restart loop could impact API availability. With RestartSec=3, a failing service will restart ~20 times per minute indefinitely.

Ensure monitoring/alerting can detect when postgrest enters a persistent restart loop so operators are notified of genuine failures.

ansible/files/gotrue.service.j2 (1)

43-55: Excellent documentation of the rationale for disabling rate limiting.

The detailed comment clearly explains the problem (external/chained restarts exceeding systemd defaults) and the solution (disabling limits with 0/0 values). This will help future maintainers understand the architectural decision.

As with the other services, ensure monitoring/alerting can detect when gotrue enters a persistent restart loop so operators are notified of genuine failures rather than services silently restarting forever.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e50974 and a0f7be8.

📒 Files selected for processing (7)
  • ansible/files/adminapi.service.j2
  • ansible/files/gotrue.service.j2
  • ansible/files/nginx.service.j2
  • ansible/files/pg_egress_collect.service.j2
  • ansible/files/postgres_exporter.service.j2
  • ansible/files/postgrest.service.j2
  • ansible/files/vector.service.j2
🔇 Additional comments (2)
ansible/files/gotrue.service.j2 (2)

59-63: The configuration is already documented and properly set up. The service file itself explicitly states (lines 61 and 78) that both v2 and v3 of gotrue support the --config-dir flag for reloading and signal-based reloading via SIGUSR1 (signal 10). The related environment variables are correctly configured: GOTRUE_RELOADING_SIGNAL_ENABLED=true, GOTRUE_RELOADING_SIGNAL_NUMBER=10, and supporting parameters for graceful reload behavior. No verification issues found.


79-98: Configuration correctly implements reload signaling with appropriate version-specific handling.

The environment variables are properly supported—the code itself documents that both gotrue v2 and v3 support reloading via signals. Signal number 10 (SIGUSR1) correctly matches the ExecReload=/bin/kill -10 $MAINPID directive. The conditional GOTRUE_RELOADING_NOTIFY_ENABLED logic appropriately reflects the difference between versions: v3 disables filesystem notifications (enabled only in qemu_mode for v2 compatibility), while the grace period and signal settings apply consistently across both versions.
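
The signal-based reload pattern referenced here can be illustrated in general terms. The sketch below is not gotrue's implementation (gotrue is a separate service with its own reload logic); it only shows a process registering a SIGUSR1 handler and receiving the signal, which is what `ExecReload=/bin/kill -10 $MAINPID` relies on. The `state` dictionary and handler name are hypothetical.

```python
import os
import signal

# Hypothetical process state: counts reload requests received.
state = {"reloads": 0}


def handle_reload(signum, frame):
    """Record a reload request; a real service would re-read its config here."""
    state["reloads"] += 1


# Register the handler, then send SIGUSR1 to this same process,
# standing in for what `kill -10 <pid>` does from outside.
signal.signal(signal.SIGUSR1, handle_reload)
os.kill(os.getpid(), signal.SIGUSR1)

print("SIGUSR1 number on this platform:", int(signal.SIGUSR1))
print("reload requests handled:", state["reloads"])
```

The numeric value of SIGUSR1 is platform dependent, which is one reason to confirm that the hard-coded 10 matches the target systems.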

@samrose samrose self-requested a review December 31, 2025 01:57
Collaborator

@samrose samrose left a comment

Switching the review back to "changes requested" to prevent a merge until testing is complete.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0f7be8 and 3f3e133.

📒 Files selected for processing (1)
  • ansible/vars.yml

ansible/vars.yml Outdated
Comment on lines 13 to 15
postgresorioledb-17: "17.6.0.024-orioledb-auth-1"
postgres17: "17.6.1.067-auth-1"
postgres15: "15.14.1.067-auth-1"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the full context of ansible/vars.yml
wc -l ansible/vars.yml

Repository: supabase/postgres

Length of output: 80


🏁 Script executed:

# Get full content of ansible/vars.yml to understand its purpose and scope
cat -n ansible/vars.yml

Repository: supabase/postgres

Length of output: 2663


🏁 Script executed:

# Find all systemd service files to understand the PR's actual changes
find . -name "*.service.j2" -type f

Repository: supabase/postgres

Length of output: 645


🏁 Script executed:

# Search for gotrue-related files and their content
find . -name "*gotrue*" -type f

Repository: supabase/postgres

Length of output: 299


🏁 Script executed:

# Check if postgres versions are referenced in service files or gotrue-related configurations
rg "17\.6\.0\.024|17\.6\.1\.067|15\.14\.1\.067|postgresorioledb|postgres17|postgres15" --type-add 'service:*.j2' -n

Repository: supabase/postgres

Length of output: 241


🏁 Script executed:

# Look for postgres-related configurations in ansible directory
find ansible -type f \( -name "*.yml" -o -name "*.j2" \) | head -20

Repository: supabase/postgres

Length of output: 1017


🏁 Script executed:

# Check gotrue.service.j2 to see restart limit configuration
cat -n ansible/files/gotrue.service.j2

Repository: supabase/postgres

Length of output: 4946


🏁 Script executed:

# Check if postgres_release variable is used in ansible tasks
rg "postgres_release" --type-add 'yaml:*.yml' --type yaml -n

Repository: supabase/postgres

Length of output: 98


🏁 Script executed:

# Search for StartLimitBurst/StartLimitIntervalSec in all service files
rg "StartLimit|Restart=" ansible/files/ -n

Repository: supabase/postgres

Length of output: 1672


🏁 Script executed:

# Look for recent git changes to understand what this PR actually modifies
git diff HEAD~1..HEAD --stat 2>/dev/null | head -20 || echo "Git history not available"

Repository: supabase/postgres

Length of output: 43


🏁 Script executed:

# Check if postgres_release is used in any setup tasks or deployment scripts
rg "postgres_release" --type-add 'yaml:*.yml' --type yaml --type shell -n

Repository: supabase/postgres

Length of output: 90


🏁 Script executed:

# Check Dockerfile and other potential deployment files
find . -name "Dockerfile*" -o -name "*.sh" | xargs grep -l "postgres_release" 2>/dev/null || echo "No matches in Dockerfile or shell scripts"

Repository: supabase/postgres

Length of output: 102


Unrelated changes mixed with the PR objective.

These Postgres version updates (lines 13-15) are unrelated to the PR's stated objective of fixing gotrue restart limits. The actual restart limit changes are in systemd service files (gotrue.service.j2 lines 54-55 with StartLimitIntervalSec=0 and StartLimitBurst=0), not in this vars file.

Mixing unrelated changes in a single PR complicates testing, rollback, and change tracking. Additionally, these version strings do not appear to be referenced anywhere else in the codebase, suggesting they may be unused or incomplete.

Separate these Postgres version updates into a dedicated PR.

🤖 Prompt for AI Agents
In ansible/vars.yml around lines 13 to 15, the Postgres version bumps
(postgresorioledb-17/postgres17/postgres15) are unrelated to the gotrue
restart-limit change and must be removed from this PR: revert those three lines
to their previous values (or delete them if they were newly added), move any
intentional Postgres version updates to a separate branch/PR, and if these
variables are intended to be kept, confirm they are actually referenced
elsewhere (remove unused entries). Ensure the current PR only contains the
systemd/gotrue service changes and open a distinct PR for Postgres version
updates with proper testing notes.

@samrose
Collaborator

samrose commented Dec 31, 2025

Local infra test results (engines test):

============================================================
Supadev PostgreSQL Engine Test Suite
============================================================
[INFO] Testing PostgresEngines: 15 17 17-oriole
[INFO] Using region: us-east-1
[INFO] Project prefix: supadev-test-1767153906
[INFO] GitHub PR URL: https://github.com/supabase/postgres/pull/1952


============================================================
Testing PostgreSQL Engine: 15
============================================================
[INFO] Project name: supadev-test-1767153906-pg15
[INFO] Region: us-east-1
[INFO] Start time: 2025-12-30 23:05:06
[INFO] Using version from PR: https://github.com/supabase/postgres/pull/1952
[INFO] Resolved version: 15.14.1.067-auth-1
[STEP] Creating project with PostgresEngine 15...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

http://localhost:8082/project/rrmnmpneoaxcfkaefnzc
Database status is not yet available, waiting...
Database status is currently UNKNOWN, waiting...
Database status is currently COMING_UP, waiting...
Database status is currently COMING_UP, waiting...
Database status is now ACTIVE_HEALTHY
[SUCCESS] Project supadev-test-1767153906-pg15 created and reached ACTIVE_HEALTHY status
[INFO] Time to ACTIVE_HEALTHY: 193 seconds
[INFO] Sleeping for 10 seconds before checking project health
[STEP] Checking project health for supadev-test-1767153906-pg15...
error: unrecognized subcommand 'project-health'

  tip: some similar subcommands exist: 'project-upgrade-eligibility', 'project-information', 'project-id'

Usage: supadev <COMMAND>

For more information, try '--help'.
[STEP] Printing project information for supadev-test-1767153906-pg15
{
  "cloud_provider": "AWS",
  "db_host": "db.rrmnmpneoaxcfkaefnzc.supabase.red",
  "id": 13,
  "inserted_at": "2025-12-31T04:05:09.091067",
  "name": "supadev-test-1767153906-pg15",
  "organization_id": 1,
  "ref": "rrmnmpneoaxcfkaefnzc",
  "region": "us-east-1",
  "status": "ACTIVE_HEALTHY",
  "subscription_id": "4UBPqvTF5gpecaCB",
  "connectionString": "U2FsdGVkX1+1NiojgmGlxgEBR+V1YvGtg9fSlKDgvcYy9hf6UjePP71Li/4io/8RsJwJzP+6pwC3CrCmlq9XGRpI22UoOdHzkU0Tx6aQ7PTxweTaDI+4x79vM9flrW93VycuGlt1K/abkPQgYhiTOU+um9Leewjk+wIT5fntUxQXwBq0+XLxSNX89ulhozU7",
  "restUrl": "https://rrmnmpneoaxcfkaefnzc.supabase.red/rest/v1/",
  "volumeSizeGb": 8,
  "maxDatabasePreprovisionGb": null,
  "lastDatabaseResizeAt": null,
  "is_branch_enabled": false,
  "is_physical_backups_enabled": false,
  "infra_compute_size": "nano",
  "dbVersion": "15.14.1.067-auth-1"
}
[STEP] Deleting project supadev-test-1767153906-pg15...
Project deleted
[SUCCESS] Project supadev-test-1767153906-pg15 deleted successfully
[SUCCESS] PostgresEngine 15 test completed successfully
[INFO] Total test duration: 204 seconds


============================================================
Testing PostgreSQL Engine: 17
============================================================
[INFO] Project name: supadev-test-1767153906-pg17
[INFO] Region: us-east-1
[INFO] Start time: 2025-12-30 23:08:30
[INFO] Using version from PR: https://github.com/supabase/postgres/pull/1952
[INFO] Resolved version: 17.6.1.067-auth-1
[STEP] Creating project with PostgresEngine 17...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

http://localhost:8082/project/bdxdawfkkpfwsthwqsls
Database status is not yet available, waiting...
Database status is currently COMING_UP, waiting...
Database status is currently COMING_UP, waiting...
Database status is currently COMING_UP, waiting...
Database status is now ACTIVE_HEALTHY
[SUCCESS] Project supadev-test-1767153906-pg17 created and reached ACTIVE_HEALTHY status
[INFO] Time to ACTIVE_HEALTHY: 192 seconds
[INFO] Sleeping for 10 seconds before checking project health
[STEP] Checking project health for supadev-test-1767153906-pg17...
error: unrecognized subcommand 'project-health'

  tip: some similar subcommands exist: 'project-upgrade-eligibility', 'project-information', 'project-id'

Usage: supadev <COMMAND>

For more information, try '--help'.
[STEP] Printing project information for supadev-test-1767153906-pg17
{
  "cloud_provider": "AWS",
  "db_host": "db.bdxdawfkkpfwsthwqsls.supabase.red",
  "id": 14,
  "inserted_at": "2025-12-31T04:08:32.36092",
  "name": "supadev-test-1767153906-pg17",
  "organization_id": 1,
  "ref": "bdxdawfkkpfwsthwqsls",
  "region": "us-east-1",
  "status": "ACTIVE_HEALTHY",
  "subscription_id": "4UBPqvTF5gpecaCB",
  "connectionString": "U2FsdGVkX1/wCsIlhYgTYYylUstkaq1/5/hQJGPFyU/bLlaCg4cF4dV/cQFZlMa/5mUs/uLZ2iEi0B4RdK7ZgbOAIQFEyjMWuf/2s2zG0co4qA3zZ3p8CrBQpNIYntJ3Ys1Gind20Pbc++Ly4zuVCuQ7rdp/9tR3ztVhQ9+cBooG1T3O5xxxEyy783ja55Uv",
  "restUrl": "https://bdxdawfkkpfwsthwqsls.supabase.red/rest/v1/",
  "volumeSizeGb": 8,
  "maxDatabasePreprovisionGb": null,
  "lastDatabaseResizeAt": null,
  "is_branch_enabled": false,
  "is_physical_backups_enabled": false,
  "infra_compute_size": "nano",
  "dbVersion": "17.6.1.067-auth-1"
}
[STEP] Deleting project supadev-test-1767153906-pg17...
Project deleted
[SUCCESS] Project supadev-test-1767153906-pg17 deleted successfully
[SUCCESS] PostgresEngine 17 test completed successfully
[INFO] Total test duration: 203 seconds


============================================================
Testing PostgreSQL Engine: 17-oriole
============================================================
[INFO] Project name: supadev-test-1767153906-pg17_oriole
[INFO] Region: us-east-1
[INFO] Start time: 2025-12-30 23:11:53
[INFO] Using version from PR: https://github.com/supabase/postgres/pull/1952
[INFO] Resolved version: 17.6.0.024-orioledb-auth-1
[STEP] Creating project with PostgresEngine 17-oriole...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

http://localhost:8082/project/prntdbfyqyyluiwrfnxr
Database status is not yet available, waiting...
Database status is currently COMING_UP, waiting...
Database status is currently COMING_UP, waiting...
Database status is currently COMING_UP, waiting...
Database status is currently COMING_UP, waiting...
Database status is now ACTIVE_HEALTHY
[SUCCESS] Project supadev-test-1767153906-pg17_oriole created and reached ACTIVE_HEALTHY status
[INFO] Time to ACTIVE_HEALTHY: 253 seconds
[INFO] Sleeping for 10 seconds before checking project health
[STEP] Checking project health for supadev-test-1767153906-pg17_oriole...
error: unrecognized subcommand 'project-health'

  tip: some similar subcommands exist: 'project-upgrade-eligibility', 'project-information', 'project-id'

Usage: supadev <COMMAND>

For more information, try '--help'.
[STEP] Printing project information for supadev-test-1767153906-pg17_oriole
{
  "cloud_provider": "AWS",
  "db_host": "db.prntdbfyqyyluiwrfnxr.supabase.red",
  "id": 15,
  "inserted_at": "2025-12-31T04:11:56.140192",
  "name": "supadev-test-1767153906-pg17_oriole",
  "organization_id": 1,
  "ref": "prntdbfyqyyluiwrfnxr",
  "region": "us-east-1",
  "status": "ACTIVE_HEALTHY",
  "subscription_id": "4UBPqvTF5gpecaCB",
  "connectionString": "U2FsdGVkX1+LlTFpe4KhCtxm+vTehP2M1cxnj0z4s55x7c/2TtQi+1AJMOEdMrcsf8L7FonM78rmbIPVyYM2HiPyuwbW7lFtmrwpvi+lcFbILFiBu8CMRmjyOLwo9dGtKR4704ubGDWedf20unsePP9FDKcoYPYfIXJ6VNriLFjArPU5AqASOoqSndaBXaDy",
  "restUrl": "https://prntdbfyqyyluiwrfnxr.supabase.red/rest/v1/",
  "volumeSizeGb": 8,
  "maxDatabasePreprovisionGb": null,
  "lastDatabaseResizeAt": null,
  "is_branch_enabled": false,
  "is_physical_backups_enabled": false,
  "infra_compute_size": "nano",
  "dbVersion": "17.6.0.024-orioledb-auth-1"
}
[STEP] Deleting project supadev-test-1767153906-pg17_oriole...
Project deleted
[SUCCESS] Project supadev-test-1767153906-pg17_oriole deleted successfully
[SUCCESS] PostgresEngine 17-oriole test completed successfully
[INFO] Total test duration: 264 seconds


============================================================
Cleaning Up: Withdrawing Inserted Versions
============================================================
[STEP] Withdrawing version 15.14.1.067-auth-1 for engine 15...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

[SUCCESS] Version 15.14.1.067-auth-1 marked as withdrawn
[STEP] Withdrawing version 17.6.1.067-auth-1 for engine 17...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

[SUCCESS] Version 17.6.1.067-auth-1 marked as withdrawn
[STEP] Withdrawing version 17.6.0.024-orioledb-auth-1 for engine 17-oriole...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

[SUCCESS] Version 17.6.0.024-orioledb-auth-1 marked as withdrawn

============================================================
Test Summary
============================================================

ENGINE          RESULT
------          ------
15              PASSED (204s)
17              PASSED (203s)
17-oriole       PASSED (264s)

============================================================
Total: 3 | Passed: 3 | Failed: 0
============================================================
[SUCCESS] All tests passed!

@samrose
Collaborator

samrose commented Dec 31, 2025

Upgrade test (PG 15):

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
Run with RUST_BACKTRACE=full to include source snippets.
[STEP] Printing project information before upgrade...
{
  "cloud_provider": "AWS",
  "db_host": "db.blcnyuntamewocrvwqih.supabase.red",
  "id": 5,
  "inserted_at": "2025-12-31T05:20:29.009218",
  "name": "supadev-upgrade-1767158427-pg15",
  "organization_id": 1,
  "ref": "blcnyuntamewocrvwqih",
  "region": "us-east-1",
  "status": "ACTIVE_HEALTHY",
  "subscription_id": "fXuXpU6NwJPotS9Q",
  "connectionString": "U2FsdGVkX1/B/Xx0Mp9LmeVbaY1BbJyMundI/LkCQr8p3NjGw3c6cBH7M+0lgtMdsOl9WHhlG0sxsW6SU48z1IPB9pz/Tq+WWgpYw8xveSxRLRNpPUoV8f/NVGCA4MreFhXhixv4+RqPlAh/Jd/Cn+B4k3cWuQHMnfYkNbSmv4M9qQQ9gbejEHy99bPpGmzk",
  "restUrl": "https://blcnyuntamewocrvwqih.supabase.red/rest/v1/",
  "volumeSizeGb": 8,
  "maxDatabasePreprovisionGb": null,
  "lastDatabaseResizeAt": null,
  "is_branch_enabled": false,
  "is_physical_backups_enabled": false,
  "infra_compute_size": "nano",
  "dbVersion": "supabase-postgres-15.14.1.063"
}
[STEP] Inserting version 17.6.1.067-auth-1 as GA (with max id for upgrade eligibility)...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

[SUCCESS] Version 17.6.1.067-auth-1 inserted as GA
[STEP] Printing project upgrade eligibility...
{
  "eligible": true,
  "current_app_version": "supabase-postgres-15.14.1.063",
  "current_app_version_release_channel": "ga",
  "latest_app_version": "supabase-postgres-17.6.1.063",
  "target_upgrade_versions": [
    {
      "postgres_version": "17",
      "release_channel": "ga",
      "app_version": "supabase-postgres-17.6.1.063"
    },
    {
      "postgres_version": "17",
      "release_channel": "internal",
      "app_version": "supabase-postgres-17.4.1.077"
    }
  ],
  "duration_estimate_hours": 1,
  "legacy_auth_custom_roles": [],
  "objects_to_be_dropped": [],
  "unsupported_extensions": [],
  "user_defined_objects_in_internal_schemas": []
}
[STEP] Upgrading project to PostgresEngine 17 version 17.6.1.067-auth-1...
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu

Upgrade status is not yet available, waiting...
Upgrade status is currently 1_started, waiting...
Upgrade status is currently 1_started, waiting...
Upgrade status is currently 1_started, waiting...
Upgrade status is currently 5_initiated_data_upgrade, waiting...
Upgrade status is currently 5_initiated_data_upgrade, waiting...
Upgrade status is currently 5_initiated_data_upgrade, waiting...
Upgrade status is currently 8_attached_volume_to_upgraded_instance, waiting...
Upgrade status is now 9_completed_upgrade
[SUCCESS] Project supadev-upgrade-1767158427-pg15 upgraded successfully

Collaborator

@samrose samrose left a comment

tested in local infra

@samrose samrose disabled auto-merge December 31, 2025 06:35
@samrose samrose enabled auto-merge December 31, 2025 06:35
@samrose samrose added this pull request to the merge queue Dec 31, 2025
Merged via the queue into develop with commit 963c42d Dec 31, 2025
25 checks passed
@samrose samrose deleted the cs/gotrue-start-limit-fix branch December 31, 2025 07:54
samrose pushed a commit that referenced this pull request Dec 31, 2025
* fix: set restart limits to 0 to prevent being marked as failed

The systemd default is 10s / 5 for these values with a DefaultRestartUSec of
100ms. Most services set a RestartSec limit of 3, under most circumstances it
takes 15s to restart 5 times so the limit of 10s is not exceeded. However if
other system processes (salt, cloud init) restart it explicitly, or recovering
system services within the --before chain trigger a restart the limit can be
exceeded causing it to be marked as failed. Since no services mark
gotrue.service as required it will remain offline until the next explicit
restart is issued.

Setting these values to 0 with Restart=always and RestartSec=3 will prevent
gotrue from being marked as failed.

* chore: set StartLimits for persistent services.

I've noticed all !oneshot services set a `RestartSec` of `3s` and we use the
systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together
this forms a property that under typical conditions a service will be restarted
indefinitely until it comes back up due to `(3s * 5) > 10s`, but it is still
possible for a service to enter a failed state under some scenarios. This change
defensively sets them to 0/0 to keep them in restart loops.

* chore: suffix to test

* chore: bump to release

---------

Co-authored-by: Chris Stockton <chris.stockton@supabase.io>
Co-authored-by: Sam Rose <samuel@supabase.io>

chore: bump version to correct new version for potential release
This was referenced Dec 31, 2025
3 participants