feat(BA-3436): Implement Blue-Green deployment strategy#9568
feat(BA-3436): Implement Blue-Green deployment strategy#9568jopemachine wants to merge 8 commits intoBA-4821from
Conversation
jopemachine
left a comment
There was a problem hiding this comment.
Code Review: Blue-Green Deployment Strategy (BEP-1049)
Reviewed the FSM implementation, evaluator changes, DB layer modifications, and test coverage. The overall design is clean and well-structured -- the pure function approach for blue_green_evaluate() is excellent for testability. However, I identified several issues that need attention before merging.
Summary of Findings
| # | Severity | File | Issue |
|---|---|---|---|
| 1 | HIGH | blue_green.py / evaluator.py |
fetch_active_routes_by_endpoint_ids excludes FAILED_TO_START and TERMINATED routes, making the rollback path (step 4) unreachable in production |
| 2 | HIGH | blue_green.py:134 |
promote_delay_seconds > 0 always returns PROGRESSING without tracking when the delay started -- the delay will never expire |
| 3 | MEDIUM | blue_green.py:114 |
Mixed healthy+failed green routes (step 5) return PROGRESSING indefinitely with no recovery path |
| 4 | MEDIUM | blue_green.py:69-70 |
UNHEALTHY and DEGRADED green routes are silently classified as "healthy" via the is_active() fallback |
| 5 | MEDIUM | db_source.py:1425-1430 |
Promote and scale-in order in DB transaction: scale-in (blue -> TERMINATING) executes before promote (green -> ACTIVE), creating a brief window where no routes are ACTIVE |
| 6 | LOW | test_blue_green.py |
No test coverage for UNHEALTHY or DEGRADED green routes |
Details are in the individual file comments below.
|
|
||
| for r in routes: | ||
| is_green = r.revision_id == deploying_rev | ||
| if not is_green: |
There was a problem hiding this comment.
[HIGH] Finding #1: Rollback path (step 4) is unreachable in production
The evaluator calls fetch_active_routes_by_endpoint_ids() which filters by RouteStatus.active_route_statuses() = {PROVISIONING, HEALTHY, UNHEALTHY, DEGRADED}. The statuses FAILED_TO_START and TERMINATED are excluded from the query results.
This means the FSM will never see green_failed routes in production. Step 4 (all green failed -> rollback) is dead code in the integrated path. When all green routes fail, they disappear from the route list, and the FSM falls into step 2 (no green routes -> create new ones), causing an infinite retry loop of creating new green routes instead of a proper rollback.
The unit tests pass because they call blue_green_evaluate() directly with manually constructed failed routes, bypassing the data fetching layer.
Suggested fix: Either:
- Create a new fetch method (or add a parameter) that includes
FAILED_TO_START/TERMINATEDstatuses for the deployment strategy evaluator, or - Handle the "no green routes but previously had green routes" case in the FSM (requires tracking deployment state across cycles).
|
|
||
| # ── 4. All green failed → rollback ── | ||
| if total_green_live == 0 and green_failed: | ||
| log.warning( |
There was a problem hiding this comment.
[HIGH] Finding #2: promote_delay_seconds > 0 creates an infinite wait -- promotion will never happen automatically
When auto_promote=True and promote_delay_seconds > 0, this code always returns PROGRESSING without tracking when the delay period started. Since blue_green_evaluate is a pure function called on each evaluation cycle with no external state, there is no mechanism to determine when the delay has elapsed.
Every subsequent cycle will re-evaluate, see promote_delay_seconds > 0 is still true, and return PROGRESSING again -- indefinitely. The auto-promotion will never trigger.
Suggested fix: Either:
- Track the timestamp when all green routes first became healthy (e.g., in deployment state or a separate field), then compare
now - first_all_healthy_time >= promote_delay_secondsto decide whether to promote or keep waiting. - Use an external timer/scheduler that triggers promotion after the delay.
- If this is a known limitation for the initial implementation, document it clearly and consider validating that
promote_delay_seconds == 0whenauto_promote=Trueuntil the delay tracking is implemented.
| deployment.id, | ||
| desired, | ||
| ) | ||
| route_changes = RouteChanges( |
There was a problem hiding this comment.
[MEDIUM] Finding #3: Mixed healthy+failed green routes return PROGRESSING indefinitely with no recovery path
When some green routes are healthy and some have failed (but none are provisioning), the FSM reaches step 5 (len(green_healthy) < desired) and returns PROGRESSING. However, since:
- Failed routes cannot recover on their own (they are in terminal states
FAILED_TO_START/TERMINATED) - No new green routes are created to replace them
- The desired count will never be met
This creates a stuck deployment that will return PROGRESSING forever.
Suggested fix: Consider one of:
- If the ratio of failed to total green exceeds a threshold, trigger a rollback.
- Create replacement green routes for the failed ones (retry semantics).
- At minimum, add a
max_failuresorfailure_thresholdfield toBlueGreenSpecto control this behavior.
Note: This finding is partially related to Finding #1 -- if failed routes are not fetched from the DB, this scenario manifests as fewer green routes than expected, but the FSM still gets stuck at step 5.
| elif r.status in (RouteStatus.FAILED_TO_START, RouteStatus.TERMINATED): | ||
| green_failed.append(r) | ||
| elif r.status.is_active(): | ||
| green_healthy.append(r) |
There was a problem hiding this comment.
[MEDIUM] Finding #4: UNHEALTHY and DEGRADED green routes silently classified as "healthy"
The is_active() fallback at line 70 catches RouteStatus.UNHEALTHY and RouteStatus.DEGRADED and puts them into green_healthy. This means the FSM could promote green routes that are actively unhealthy or degraded to receive production traffic.
While PROVISIONING is explicitly handled before this point, UNHEALTHY and DEGRADED routes reaching the promotion step would result in switching production traffic to degraded service.
Suggested fix: Consider adding explicit handling for UNHEALTHY and DEGRADED:
- Either treat them like
PROVISIONING(wait for them to become healthy) - Or treat them like failures (count toward a failure threshold)
- At minimum, add a comment explaining why treating them as healthy is intentional if that is the design choice
| """Scale out/in/promote routes based on provided creators and updaters.""" | ||
| async with self._begin_session_read_committed() as db_sess: | ||
| # Scale out routes | ||
| for creator in scale_out_creators: |
There was a problem hiding this comment.
[MEDIUM] Finding #5: Scale-in executes before promote in the DB transaction
In scale_routes(), the execution order is:
- Scale out (create new routes)
- Scale in (blue routes -> TERMINATING, traffic_ratio=0.0, INACTIVE)
- Promote (green routes -> ACTIVE, traffic_ratio=1.0)
During step 2, the blue routes are set to TERMINATING/INACTIVE but the green routes are not yet promoted to ACTIVE. If the transaction is read by a concurrent load balancer query between steps 2 and 3 (READ COMMITTED isolation allows this), there could be a brief moment where no routes are ACTIVE for this deployment.
While this is within a single transaction and the commit is atomic, with READ COMMITTED isolation, other transactions reading during this window could see intermediate state.
Suggested fix: Consider reordering to promote first, then scale-in. This way, there is a brief overlap period where both blue and green are ACTIVE (which is safer for availability than having neither active). Alternatively, consider using SERIALIZABLE isolation for this specific operation.
| result = blue_green_evaluate(deployment, blues + greens, spec) | ||
|
|
||
| expected_scale_in = [r.route_id for r in blues] | ||
| assert result.route_changes.scale_in_route_ids == expected_scale_in |
There was a problem hiding this comment.
[LOW] Finding #6: Missing test coverage for UNHEALTHY and DEGRADED green routes
The tests do not cover scenarios where green routes have RouteStatus.UNHEALTHY or RouteStatus.DEGRADED status. These are important edge cases because the current code classifies them as "healthy" via the is_active() fallback (see Finding #4). Tests should verify this behavior is intentional.
Additionally, there is no test for the case where desired=0 (zero replicas), which could cause edge cases in the green route creation logic.
Suggested additions:
def test_unhealthy_green_treated_as_healthy(self) -> None:
deployment = _make_deployment(desired=3)
blues = _blue_routes(deployment, 3)
greens = _green_routes(deployment, 3, status=RouteStatus.UNHEALTHY)
spec = BlueGreenSpec(auto_promote=True, promote_delay_seconds=0)
result = blue_green_evaluate(deployment, blues + greens, spec)
# Should this promote unhealthy routes? Document the expected behavior.
assert result.completed # or assert not result.completed
def test_degraded_green_treated_as_healthy(self) -> None:
deployment = _make_deployment(desired=3)
blues = _blue_routes(deployment, 3)
greens = _green_routes(deployment, 3, status=RouteStatus.DEGRADED)
spec = BlueGreenSpec(auto_promote=True, promote_delay_seconds=0)
result = blue_green_evaluate(deployment, blues + greens, spec)
assert result.completed # or assert not result.completed
Security & Performance Review CompleteReviewed 9 changed files (+557/-11 lines) implementing the Blue-Green deployment strategy (BEP-1049). Findings Summary
Positive Observations
RecommendationThe two HIGH findings (unreachable rollback and infinite promote delay) represent logic bugs that would cause deployment failures in production. These should be addressed before merging. The MEDIUM findings are design considerations that could be documented as known limitations if they are intentional tradeoffs. |
1497006 to
f726733
Compare
| async def scale_routes( | ||
| self, | ||
| scale_out_creators: Sequence[Creator[RoutingRow]], | ||
| scale_in_updater: BatchUpdater[RoutingRow] | None, | ||
| promote_updater: BatchUpdater[RoutingRow] | None = None, | ||
| ) -> None: | ||
| """Scale out/in routes based on provided creators and updater.""" | ||
| """Scale out/in/promote routes based on provided creators and updaters.""" |
There was a problem hiding this comment.
Scaling in/out and promoting should be considered separately.
cb54845 to
19fe5c6
Compare
…-green strategy Add status_updated_at field to RouteInfo and RoutingRow for tracking when route status last changed. Implement promote_delay_seconds time calculation in blue_green_evaluate() using _latest_status_updated_at helper. Add fetch_routes_by_endpoint_ids to repository for fetching all routes including failed/terminated (needed for rollback detection). Update tests with status_updated_at parameter and add promote delay test scenarios. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
resolves #7383 (BA-3436)
Overview
Implements the Blue-Green deployment strategy (BEP-1049) — creates all new-revision routes as
INACTIVE(Green), validates health, then atomically switches traffic from old (Blue) to new (Green).Architecture
Key Difference from Rolling Update
Cycle-by-Cycle Example (
desired=3, auto_promote=True, delay=0)Rollback Example (all green failed)
Completion Flow
promote_delay_seconds
Key Types
CycleEvaluationResultstrategy/types.pyRouteChangesstrategy/types.pyEvaluationResultstrategy/types.pyBlueGreenSpecmodels/deployment_policy.pyRouteInfo.status_updated_atdata/deployment/types.pyChanged Files
strategy/blue_green.pystrategy/types.pyRouteChanges.promote_route_idsaddedstrategy/evaluator.py_apply_route_changesdata/deployment/types.pyRouteInfo.status_updated_atfield addedmodels/routing/row.pyRoutingRow.status_updated_atcolumn +to_route_info()mappingrepositories/.../creators/route.pyRouteBatchUpdaterSpecsetsstatus_updated_aton status changerepositories/.../db_source.pyfetch_routes_by_endpoint_ids()(no status filter) +scale_routespromote paramrepositories/.../repository.pytest_blue_green.pyChecklist: (if applicable)
ai.backend.testdocsdirectory