[codex] Add deploy fallback runbook preflight#667
Conversation
17c2023 to
c7fa7ea
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7fa7eac15
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| awsCheck( | ||
| "Secrets Manager secret name", | ||
| [ | ||
| "secretsmanager", | ||
| "describe-secret", | ||
| "--secret-id", | ||
| webhookSecretId, |
There was a problem hiding this comment.
Check all required deploy secrets in preflight
In the fallback all path, scripts/deploy.sh also resolves TRACKING_SECRET_SECRET_ID and INGESTER_JOB_TOKEN_SECRET_ID before registering the app/ingester/scheduler task definitions, but this preflight only describes webhookSecretId. A machine with the other required secret names missing or misconfigured will still print “Preflight passed” and then fail after deploy_app has already built and pushed images, so include all deploy-required secret IDs in the read-only checks.
Useful? React with 👍 / 👎.
| const appRepo = env.APP_REPO || `${product}-app`; | ||
| const ingesterRepo = env.ING_REPO || `${product}-ingester`; | ||
| const appService = env.APP_SERVICE || `${product}-app`; | ||
| const ingesterService = env.ING_SERVICE || `${product}-ingester`; |
There was a problem hiding this comment.
Avoid validating overrides deploy.sh ignores
These APP_REPO/ING_REPO/APP_SERVICE/ING_SERVICE overrides are only consumed by the new preflight/runbook; scripts/deploy.sh still assigns those names from PRODUCT unconditionally. If an operator follows the fallback contract and sets one of these for a nonstandard production target, the preflight validates the override but bash scripts/deploy.sh all deploys the default ${PRODUCT}-* resources instead, so either remove these overrides here/docs or make the deploy script honor them.
Useful? React with 👍 / 👎.
jaeyunha
left a comment
There was a problem hiding this comment.
Fixed both Codex P2 review comments in 6c745b8.
What changed:
- Preflight now validates all deploy-required Secrets Manager metadata resolved by
scripts/deploy.sh all:WEBHOOK_SECRET_ENCRYPTION_KEY_SECRET_ID/ARN,TRACKING_SECRET_SECRET_ID/ARN, andINGESTER_JOB_TOKEN_SECRET_ID/ARN. - Optional inbound token metadata is checked only when explicitly configured, matching deploy.sh's non-fatal inbound-token behavior.
- Removed the unsafe
APP_REPO/ING_REPO/APP_SERVICE/ING_SERVICEfallback contract from the preflight/runbook; repo/service targets now derive fromPRODUCT, matching deploy.sh.
Validation:
bun test tests/deploy.test.ts✅bun run deploy:fallback:preflight✅ (non-mutating metadata checks only; no secret values fetched/printed)make check✅git diff --check✅- Push guardrails during
git push✅
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6c745b813e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| const env = process.env; | ||
| const region = env.AWS_REGION || env.AWS_DEFAULT_REGION || "us-east-1"; |
There was a problem hiding this comment.
Use only AWS_REGION for the deploy region
When a workstation has AWS_DEFAULT_REGION set but not AWS_REGION, this preflight validates all AWS resources in AWS_DEFAULT_REGION and only passes that region to its child aws commands; the documented follow-up bash scripts/deploy.sh all does not read AWS_DEFAULT_REGION and defaults AWS_REGION to us-east-1, so a passing preflight can be followed by a deploy against a different region. Match scripts/deploy.sh here or fail until AWS_REGION is explicitly exported.
Useful? React with 👍 / 👎.
| ```bash | ||
| docker --version | ||
| docker buildx version | ||
| aws --version |
There was a problem hiding this comment.
Require python3 before fallback deploy
On a fallback workstation that has Docker/AWS/Bun but no python3, this checklist and the new preflight can pass, but scripts/deploy.sh all calls python3 while generating ECS task definition JSON after deploy_app has already pushed the image, so the break-glass deploy fails mid-flight. Add python3 --version to the required tools/preflight before directing operators to run the deploy.
Useful? React with 👍 / 👎.
|
Addressed the P2 deploy fallback review comments in commit Fixes:
Validation:
Leaving the PR open for controller review/merge. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0cd9afe7c8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| awsGlobalCheck("AWS CLI", ["--version"], (stdout) => stdout), | ||
| identityCheck(), | ||
| awsNameListCheck( | ||
| "ECR repositories", |
There was a problem hiding this comment.
Validate Docker can push to ECR
On a fresh fallback workstation that has AWS CLI permissions but no Docker ECR credential helper/login, this ECR check still passes because it only calls aws ecr describe-repositories; the subsequent scripts/deploy.sh build_and_push path runs docker buildx build ... --push, so the failure occurs after the image build starts instead of during preflight. Add a non-printing ECR Docker login/credential check, or document that login as a required preflight step, before reporting success.
Useful? React with 👍 / 👎.
|
Correction to my previous comment: the branch is pushed and the intended update is below (posting with shell-safe quoting so inline code is preserved). Addressed the Codex P2 Docker-to-ECR readiness comment in commit a1d52c0. What changed:
Validation:
No images were pushed and no production deploy/mutation commands were run. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a1d52c0265
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "ecs", | ||
| "describe-services", |
There was a problem hiding this comment.
Validate task definitions before the first image push
In a fallback account/role that can ecs:DescribeServices but not ecs:DescribeTaskDefinition (or when the current service points at a task definition missing the expected container), this preflight still passes here because it only checks service names. The checked scripts/deploy.sh path then runs deploy_app first, pushes the app image, and only afterward register_app_task_definition/write_app_task_definition call aws ecs describe-task-definition, so the failure happens mid-deploy after ECR has already been mutated. Add read-only task-definition lookups for the app and ingester base task definitions before reporting success.
Useful? React with 👍 / 👎.
|
Addressed the latest Codex P2 task-definition preflight comment. Changed:
Validation:
No live deploy was run; no ECR/ECS mutations were performed. |
|
Fixed the remaining deploy-fallback review blockers. Files changed:
Validation:
Head SHA: No live deploy was run; no ECR/ECS mutations were performed; no secret values were fetched or printed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 057d85f575
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| [appService, ingesterService], | ||
| ), | ||
| ecsTaskDefinitionMetadataCheck(appService, appContainerName), | ||
| ecsTaskDefinitionMetadataCheck(ingesterService, ingesterContainerName), |
There was a problem hiding this comment.
Validate scheduler base-task secrets before passing preflight
In the all fallback path I checked scripts/deploy.sh: after this preflight allows image pushes, deploy_scheduler calls write_scheduler_task_definition, whose Python exits unless the current ingester task definition contains DATABASE_URL and BETTER_AUTH_SECRET secrets. This ingester task-definition check only verifies the container name, so a base task missing either secret still reports “Preflight passed” and then fails mid-deploy after ECR has been mutated; include these scheduler base-secret checks in the read-only task-definition metadata validation.
Useful? React with 👍 / 👎.
|
Addressed the latest Codex P2 scheduler base-secret preflight blocker. Files changed:
Validation:
Head SHA: No live deploy was run; no ECR/ECS mutations were performed; no secret values were fetched or printed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e80c15038
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ], | ||
| [appService, ingesterService], | ||
| ), | ||
| ecsTaskDefinitionMetadataCheck(appService, appContainerName), |
There was a problem hiding this comment.
Require DATABASE_URL on the app base task
In a fallback account where the current app task definition is missing or has renamed DATABASE_URL, this app metadata check still passes because it only verifies the container name. bash scripts/deploy.sh all then pushes/registers images before run_migrations clones the app task into the migrator, and src/lib/db/migrate.ts exits immediately when DATABASE_URL is absent, so the preflight can report success and still fail mid-deploy after ECR/ECS have been mutated. Check that the app container exposes DATABASE_URL (secret or environment name) before passing preflight.
Useful? React with 👍 / 👎.
|
Fixed the latest Codex P2 app Files changed:
Validation:
Head SHA: No live deploy was run; no ECR/ECS mutations were performed; no secret values were fetched or printed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ede1bdc435
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| docker --version | ||
| docker buildx version | ||
| python3 --version | ||
| aws --version | ||
| ``` |
There was a problem hiding this comment.
Include Bun in fallback prerequisites
When this runbook is followed on a fresh authorized workstation that has every tool listed here but not Bun, step 4 fails immediately because the documented preflight command is bun run deploy:fallback:preflight; scripts/deploy.sh itself does not install Bun or provide a non-Bun entrypoint. Add bun --version/installation guidance to the required tools list or document an invocation that works without Bun, otherwise the break-glass path can block during the runner outage it is meant to handle.
Useful? React with 👍 / 👎.
|
Fixed the Bun prerequisite blocker for the fallback deploy runbook/preflight. Files changed:
Validation:
Head SHA: Safety confirmation: no live deploy run; no ECR/ECS mutation; no Secrets Manager value fetch or secret values printed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7997f2244e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ecsTaskDefinitionMetadataCheck(ingesterService, ingesterContainerName, [ | ||
| "DATABASE_URL", | ||
| "BETTER_AUTH_SECRET", | ||
| ]), |
There was a problem hiding this comment.
Check ingester boot secrets before passing preflight
In fallback accounts where the current ingester task definition is missing production-required boot secrets such as WEBHOOK_SECRET_ENCRYPTION_KEY or UNSUBSCRIBE_SECRET, this still passes because the ingester metadata check only requires DATABASE_URL/BETTER_AUTH_SECRET. write_ingester_task_definition does not add those names, while packages/ingester/src/startup-checks.ts validates the production ingester environment and refuses to boot without them, so bash scripts/deploy.sh all can push/register images and only fail when the ingester rolls; include the ingester startup-required metadata here before reporting success.
Useful? React with 👍 / 👎.
|
Fixed the latest Codex P2 ingester startup metadata preflight blocker. Files changed:
Validation:
Head SHA: Safety confirmation: no live deploy run; no ECR/ECS mutation; no Secrets Manager value fetch or secret values printed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42f99d7277
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| appService, | ||
| appContainerName, | ||
| [], | ||
| ["DATABASE_URL"], |
There was a problem hiding this comment.
Check app startup env metadata before passing preflight
In a fallback account where the current app task is missing production app settings such as BETTER_AUTH_SECRET, BETTER_AUTH_URL, NEXT_PUBLIC_APP_URL, BETTER_AUTH_TRUSTED_ORIGINS, UNSUBSCRIBE_SECRET, or DKIM_ENCRYPTION_KEY, this app-task check still passes because it only requires DATABASE_URL. Fresh evidence is src/lib/startup-checks.ts invoking validateOpenSendEnv(..., { service: "app" }), while scripts/deploy.sh clones the app task and only injects webhook/tracking secrets, so the deploy can push/register images and then roll an app task that refuses to boot; include the production app metadata names here just like the ingester startup metadata check.
Useful? React with 👍 / 👎.
|
Fixed the latest Codex P2 blocker: the fallback preflight now checks the current app base task-definition metadata for all production app startup-required env/secret names before success. Files changed:
Validation:
Head SHA: Confirmed: no live deploy run, no ECR/ECS/Secrets Manager mutation, and no secret values fetched or printed; this was metadata-name validation only. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bac8aa3633
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| appService, | ||
| ingesterService, | ||
| "--query", | ||
| "services[].serviceName", |
There was a problem hiding this comment.
Require ECS services to be active before passing preflight
For a fallback account where ${PRODUCT}-app or ${PRODUCT}-ingester was deleted recently, this query still passes because it only checks serviceName; AWS docs note that DRAINING/INACTIVE services remain visible to DescribeServices, while UpdateService rejects inactive services. bash scripts/deploy.sh all then pushes/registers images before the later redeploy call fails on that inactive service, so include status in this read-only check and fail unless each service is ACTIVE.
Useful? React with 👍 / 👎.
|
Fixed the latest P2 ECS service-status blocker. Files changed:
Validation:
Head SHA: Confirmed: no live deploy run, no ECR/ECS/Secrets Manager mutation, and no secret values fetched or printed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2168545a40
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| | `PRODUCT` | `opensend` | Prefix for repositories and services. | | ||
| | `ECS_CLUSTER` | `namuh` | ECS cluster name. | | ||
| | `IMAGE_TAG` | `latest` | Image tag pushed to ECR. | | ||
| | `PLATFORM` | `linux/amd64` | Docker build target platform. | |
There was a problem hiding this comment.
Validate PLATFORM before approving fallback deploy
When a fallback workstation has PLATFORM inherited as linux/arm64 (common on Apple Silicon shells), this runbook allows that override and the preflight never validates or even prints it, but scripts/deploy.sh passes ${PLATFORM} directly to docker buildx build --platform. That can produce and push images incompatible with the production linux/amd64 Fargate task definitions after preflight has reported success, so require linux/amd64 in the preflight or remove this override from the fallback contract.
Useful? React with 👍 / 👎.
| ], | ||
| [appRepo, ingesterRepo], | ||
| ), | ||
| ecsServicesActiveCheck([appService, ingesterService]), |
There was a problem hiding this comment.
Check scheduler service state before preflight passes
Checked the scripts/deploy.sh all path: after image pushes and task registration, it calls create_or_redeploy_scheduler_service, and scheduler_service_exists treats a DRAINING scheduler service as present before calling redeploy. This preflight only checks ${PRODUCT}-app and ${PRODUCT}-ingester, so a recently deleted or draining ${PRODUCT}-scheduler can still pass preflight and then fail late during the scheduler update/create step; include the scheduler service as active-or-absent metadata before reporting success.
Useful? React with 👍 / 👎.
|
Fixed the latest current-head P2 blockers. Files changed:
Validation:
Head SHA: Confirmation: no live deploy run; no ECR/ECS/Secrets Manager mutation commands run; no secret values fetched or printed. |
Summary
Refs #645.
Adds the low-risk deploy fallback slice for the Mac mini runner SPOF:
agent_docs/runbooks/deploy-fallback.md.bun run deploy:fallback:preflight, a non-mutating preflight that checks Docker/buildx, AWS identity, ECR/ECS reachability, and the required Secrets Manager secret name without reading secret values or changing production.tests/deploy.test.tswith static checks proving the runbook/preflight wiring and the remaining closure gate.Important gate
Final issue closure remains gated on one real no-op fallback deploy from a non-Mac-mini machine. This PR deliberately does not run or claim that production exercise; it only provides the documented fallback path and preflight.
Validation
bun test tests/deploy.test.tsbun run deploy:fallback:preflightmake checkgit diff --check