
feat(BA-3436): Implement Blue-Green deployment strategy #9568

Draft
jopemachine wants to merge 8 commits into BA-4821 from BA-3436_2


@jopemachine jopemachine commented Mar 2, 2026

resolves #7383 (BA-3436)

Overview

Implements the Blue-Green deployment strategy (BEP-1049): all new-revision routes are created as INACTIVE (Green), their health is validated, and traffic is then switched atomically from old (Blue) to new (Green).

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│  Periodic Scheduler (deploying: 5s / 30s)                                   │
│    → DoDeploymentLifecycleEvent(lifecycle_type="deploying")                  │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DeploymentCoordinator._process_with_evaluator()                            │
│                                                                             │
│    1. Acquire distributed lock (LOCKID_DEPLOYMENT_DEPLOYING)                │
│    2. Query all DEPLOYING-state endpoints                                   │
│    3. evaluator.evaluate(deployments) → EvaluationResult                    │
│    4. For each sub-step group:                                              │
│         handler = handlers[(DEPLOYING, sub_step)]                           │
│         handler.execute(group) → _handle_status_transitions()               │
│    5. handler.post_process() for each group                                 │
│    6. _transition_completed_deployments() for completed                     │
│                                                                             │
│  Handler map (DeploymentHandlerKey):                                        │
│    (DEPLOYING, PROVISIONING) → DeployingProvisioningHandler                 │
│    (DEPLOYING, PROGRESSING)  → DeployingProgressingHandler                  │
│    (DEPLOYING, ROLLED_BACK)  → DeployingRolledBackHandler                   │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DeploymentStrategyEvaluator              (strategy/evaluator.py)           │
│                                                                             │
│  evaluate(deployments) → EvaluationResult                                   │
│    1. Bulk-load policies:  fetch_deployment_policies_by_endpoint_ids()      │
│       Bulk-load routes:    fetch_routes_by_endpoint_ids()                   │
│                            ↑ includes FAILED/TERMINATED for rollback check  │
│    2. Per deployment → dispatch by policy.strategy:                         │
│         BLUE_GREEN → blue_green_evaluate(deployment, routes, spec)          │
│    3. Collect route mutations from all CycleEvaluationResults               │
│    4. Apply in one batch:  _apply_route_changes(scale_out, scale_in,       │
│                                                  promote)                   │
│         scale_out → Creator[RoutingRow] (green routes, INACTIVE)            │
│         scale_in  → BatchUpdater (blue → TERMINATING + INACTIVE)            │
│         promote   → BatchUpdater (green → ACTIVE + ratio=1.0)     ← NEW    │
│    5. Group deployments by sub-step → EvaluationResult                      │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  blue_green_evaluate()                    (strategy/blue_green.py)          │
│                                                                             │
│  Pure function: (DeploymentInfo, routes, BlueGreenSpec) → CycleResult       │
│                                                                             │
│  FSM:                                                                       │
│    1. Classify routes by revision_id:                                       │
│         blue_active:        revision != deploying_revision, is_active()     │
│         green_provisioning: revision == deploying_revision, PROVISIONING    │
│         green_healthy:      revision == deploying_revision, HEALTHY         │
│         green_failed:       revision == deploying_revision, FAILED/TERM     │
│                                                                             │
│    2. no green at all? ──→ create desired × green (INACTIVE) → PROVISIONING │
│    3. green PROVISIONING? ────────────────────────→ PROVISIONING (wait)     │
│    4. all green failed? ──→ scale_in green ───────→ ROLLED_BACK            │
│    5. green healthy < desired? ───────────────────→ PROGRESSING (wait)      │
│    6. all healthy + auto_promote=False? ──────────→ PROGRESSING (manual)    │
│    7. all healthy + delay not elapsed? ───────────→ PROGRESSING (delay)     │
│    8. all healthy + promote ready? ──→ promote green + terminate blue       │
│       ────────────────────────────────────────────→ PROGRESSING             │
│                                                      (completed=True)       │
│       RouteChanges:                                                         │
│         promote_route_ids = green healthy IDs (INACTIVE → ACTIVE)           │
│         scale_in_route_ids = blue active IDs (ACTIVE → TERMINATING)         │
└─────────────────────────────────────────────────────────────────────────────┘
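
The FSM above can be read as a small pure function. The following is a minimal sketch under stated assumptions, not the code in strategy/blue_green.py: `Status`, `Route`, `Decision`, and `evaluate` are hypothetical stand-ins, and the promote-delay check (step 7) is reduced to a `delay_elapsed` boolean.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):  # hypothetical, reduced status set
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    FAILED_TO_START = "failed_to_start"
    TERMINATED = "terminated"


@dataclass(frozen=True)
class Route:  # hypothetical stand-in for RouteInfo
    route_id: str
    revision: str
    status: Status


@dataclass
class Decision:  # hypothetical stand-in for CycleEvaluationResult
    sub_step: str
    completed: bool = False
    create_green: int = 0
    promote: list[str] = field(default_factory=list)
    scale_in: list[str] = field(default_factory=list)


FAILED = (Status.FAILED_TO_START, Status.TERMINATED)


def evaluate(routes, deploying_rev, desired, auto_promote=True, delay_elapsed=True):
    # Step 1: classify routes by revision and status.
    green = [r for r in routes if r.revision == deploying_rev]
    blue_active = [
        r for r in routes if r.revision != deploying_rev and r.status not in FAILED
    ]
    provisioning = [r for r in green if r.status is Status.PROVISIONING]
    healthy = [r for r in green if r.status is Status.HEALTHY]
    failed = [r for r in green if r.status in FAILED]

    if not green:  # step 2: nothing created yet
        return Decision("PROVISIONING", create_green=desired)
    if provisioning:  # step 3: still starting up
        return Decision("PROVISIONING")
    if not healthy and failed:  # step 4: every green route failed
        return Decision("ROLLED_BACK", scale_in=[r.route_id for r in failed])
    if len(healthy) < desired:  # step 5: not enough healthy routes
        return Decision("PROGRESSING")
    if not auto_promote or not delay_elapsed:  # steps 6 and 7
        return Decision("PROGRESSING")
    return Decision(  # step 8: promote green and terminate blue in one cycle
        "PROGRESSING",
        completed=True,
        promote=[r.route_id for r in healthy],
        scale_in=[r.route_id for r in blue_active],
    )
```

Because the sketch takes all of its inputs as arguments, each branch can be exercised by constructing a route list by hand, which mirrors how the unit tests call blue_green_evaluate() directly.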

Key Difference from Rolling Update

                Rolling Update                     Blue-Green
              ─────────────────                ──────────────────
  Creation:   gradual (max_surge)              all at once
  New route:  traffic_status=ACTIVE            traffic_status=INACTIVE
              traffic immediately               invisible to proxy
  Old route:  gradual termination              all at once on promotion
  Switch:     concurrent with creation         atomic (promote green +
                                                terminate blue in 1 cycle)
  Config:     max_surge, max_unavailable       auto_promote,
                                                promote_delay_seconds
  Rollback:   recovery from partial state      just terminate green
                                                (blue still running)

Cycle-by-Cycle Example (desired=3, auto_promote=True, delay=0)

Cycle 0 (initial)
Blue:  [■ ■ ■]  (3 healthy, ACTIVE)
Green: []
→ Create 3 green routes (INACTIVE, traffic_ratio=0.0)
        │
        ▼
Cycle 1 (green provisioning)
Blue:  [■ ■ ■]  (3 healthy, ACTIVE — still serving all traffic)
Green: [◇ ◇ ◇]  (3 provisioning, INACTIVE — invisible to proxy)
→ wait (PROVISIONING)
        │
        ▼
Cycle 2 (some green healthy)
Blue:  [■ ■ ■]  (ACTIVE)
Green: [■ ◇ ◇]  (1 healthy, 2 provisioning, INACTIVE)
→ wait (PROVISIONING)
        │
        ▼
Cycle 3 (all green healthy — promotion!)
Blue:  [■ ■ ■]  (ACTIVE)
Green: [■ ■ ■]  (3 healthy, INACTIVE)
→ auto_promote=True, delay=0
→ Green: INACTIVE → ACTIVE   (promote_route_ids)
→ Blue:  ACTIVE → TERMINATING (scale_in_route_ids)
→ completed!
        │
        ▼
Final state
Blue:  []
Green: [■ ■ ■]  (3 healthy, ACTIVE)
→ revision swap + DEPLOYING → READY

Legend: ■ = healthy, ◇ = provisioning

Rollback Example (all green failed)

Blue:  [■ ■ ■]  (ACTIVE — untouched)
Green: [✗ ✗ ✗]  (all FAILED_TO_START)
→ scale_in green (TERMINATING)
→ ROLLED_BACK: deploying_revision = NULL, blue keeps serving

Legend: ✗ = failed

Completion Flow

blue_green_evaluate() returns completed=True
  + RouteChanges(promote_route_ids, scale_in_route_ids)
    │
    ▼
DeploymentStrategyEvaluator._apply_route_changes()
  → promote_updater: green → ACTIVE, traffic_ratio=1.0
  → scale_in_updater: blue → TERMINATING, INACTIVE
  → repo.scale_routes(scale_out, scale_in, promote)   ← single DB tx
    │
    ▼
DeploymentCoordinator._transition_completed_deployments()
  → Atomic transaction:
    1. complete_deployment_revision_swap(endpoint_ids)
       current_revision = deploying_revision, deploying_revision = NULL
    2. Lifecycle: DEPLOYING → READY
    3. History recording + Notification events

promote_delay_seconds

  Each cycle (all green healthy + auto_promote=True):
    latest_healthy_at = max(green.status_updated_at where status == HEALTHY)
    elapsed = now() - latest_healthy_at

    elapsed >= promote_delay_seconds?  → promote
    elapsed <  promote_delay_seconds?  → PROGRESSING (wait)

  status_updated_at is set on every route status change,
  so if a route flaps unhealthy→healthy, the delay timer resets.
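
The delay rule above can be sketched as follows. `promote_delay_elapsed` and its arguments are hypothetical names; the function assumes the caller passes the status_updated_at values of the green routes that are currently HEALTHY.

```python
from datetime import datetime, timedelta, timezone


def promote_delay_elapsed(healthy_since, promote_delay_seconds, now):
    """Return True once the delay has passed since the latest HEALTHY transition."""
    if not healthy_since:
        return False  # no healthy green routes, nothing to promote
    latest_healthy_at = max(healthy_since)
    return (now - latest_healthy_at) >= timedelta(seconds=promote_delay_seconds)


now = datetime(2026, 3, 4, 12, 0, 0, tzinfo=timezone.utc)
# Every route has been healthy for at least 75 seconds.
stable = [now - timedelta(seconds=90), now - timedelta(seconds=75)]
# One route flapped and re-entered HEALTHY 10 seconds ago, resetting the timer.
flapped = stable + [now - timedelta(seconds=10)]
```

With promote_delay_seconds=60, the `stable` set is ready to promote and the `flapped` set is not, because the most recent transition drives the timer.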

Key Types

| Type | Location | Purpose |
| --- | --- | --- |
| CycleEvaluationResult | strategy/types.py | Single deployment's FSM result (sub_step + completed + route_changes) |
| RouteChanges | strategy/types.py | Route mutations: scale_out + scale_in + promote_route_ids (NEW) |
| EvaluationResult | strategy/types.py | Aggregate: groups by sub-step + completed + skipped + errors |
| BlueGreenSpec | models/deployment_policy.py | Config: auto_promote, promote_delay_seconds |
| RouteInfo.status_updated_at | data/deployment/types.py | Tracks last status change time for promote delay |

Changed Files

| File | Change |
| --- | --- |
| strategy/blue_green.py | Blue-green FSM (stub → full 8-step implementation) |
| strategy/types.py | RouteChanges.promote_route_ids added |
| strategy/evaluator.py | promote ID collection + promote_updater in _apply_route_changes |
| data/deployment/types.py | RouteInfo.status_updated_at field added |
| models/routing/row.py | RoutingRow.status_updated_at column + to_route_info() mapping |
| repositories/.../creators/route.py | RouteBatchUpdaterSpec sets status_updated_at on status change |
| repositories/.../db_source.py | fetch_routes_by_endpoint_ids() (no status filter) + scale_routes promote param |
| repositories/.../repository.py | Mirror db_source signature changes |
| test_blue_green.py | 35+ test scenarios covering all FSM branches |

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@github-actions github-actions bot added the size:XL 500~ LoC label Mar 2, 2026
@jopemachine jopemachine added this to the 26.3 milestone Mar 2, 2026
@github-actions github-actions bot added the comp:manager Related to Manager component label Mar 2, 2026

@jopemachine jopemachine left a comment


Code Review: Blue-Green Deployment Strategy (BEP-1049)

Reviewed the FSM implementation, evaluator changes, DB layer modifications, and test coverage. The overall design is clean and well-structured -- the pure function approach for blue_green_evaluate() is excellent for testability. However, I identified several issues that need attention before merging.

Summary of Findings

| # | Severity | File | Issue |
| --- | --- | --- | --- |
| 1 | HIGH | blue_green.py / evaluator.py | fetch_active_routes_by_endpoint_ids excludes FAILED_TO_START and TERMINATED routes, making the rollback path (step 4) unreachable in production |
| 2 | HIGH | blue_green.py:134 | promote_delay_seconds > 0 always returns PROGRESSING without tracking when the delay started -- the delay will never expire |
| 3 | MEDIUM | blue_green.py:114 | Mixed healthy+failed green routes (step 5) return PROGRESSING indefinitely with no recovery path |
| 4 | MEDIUM | blue_green.py:69-70 | UNHEALTHY and DEGRADED green routes are silently classified as "healthy" via the is_active() fallback |
| 5 | MEDIUM | db_source.py:1425-1430 | Promote and scale-in order in DB transaction: scale-in (blue -> TERMINATING) executes before promote (green -> ACTIVE), creating a brief window where no routes are ACTIVE |
| 6 | LOW | test_blue_green.py | No test coverage for UNHEALTHY or DEGRADED green routes |

Details are in the individual file comments below.


for r in routes:
    is_green = r.revision_id == deploying_rev
    if not is_green:

[HIGH] Finding #1: Rollback path (step 4) is unreachable in production

The evaluator calls fetch_active_routes_by_endpoint_ids() which filters by RouteStatus.active_route_statuses() = {PROVISIONING, HEALTHY, UNHEALTHY, DEGRADED}. The statuses FAILED_TO_START and TERMINATED are excluded from the query results.

This means the FSM will never see green_failed routes in production. Step 4 (all green failed -> rollback) is dead code in the integrated path. When all green routes fail, they disappear from the route list, and the FSM falls into step 2 (no green routes -> create new ones), causing an infinite retry loop of creating new green routes instead of a proper rollback.

The unit tests pass because they call blue_green_evaluate() directly with manually constructed failed routes, bypassing the data fetching layer.

Suggested fix: Either:

  1. Create a new fetch method (or add a parameter) that includes FAILED_TO_START / TERMINATED statuses for the deployment strategy evaluator, or
  2. Handle the "no green routes but previously had green routes" case in the FSM (requires tracking deployment state across cycles).
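
The effect of the status filter can be reproduced with plain dictionaries. This is an illustrative sketch: `fetch_routes` and `next_step` are hypothetical helpers that mimic the filter and a reduced version of FSM steps 2 and 4, not the repository or FSM code.

```python
ACTIVE_STATUSES = {"PROVISIONING", "HEALTHY", "UNHEALTHY", "DEGRADED"}
TERMINAL_STATUSES = {"FAILED_TO_START", "TERMINATED"}


def fetch_routes(rows, include_terminal):
    """Mimic the two fetch variants: with and without the active-status filter."""
    if include_terminal:
        return list(rows)
    return [r for r in rows if r["status"] in ACTIVE_STATUSES]


def next_step(routes, deploying_rev):
    """Reduced FSM: only the branches relevant to this finding."""
    green = [r for r in routes if r["revision"] == deploying_rev]
    if not green:
        return "create_green"  # step 2
    if all(r["status"] in TERMINAL_STATUSES for r in green):
        return "rollback"  # step 4
    return "wait"


rows = [
    {"revision": "v1", "status": "HEALTHY"},
    {"revision": "v2", "status": "FAILED_TO_START"},
    {"revision": "v2", "status": "FAILED_TO_START"},
]
filtered = next_step(fetch_routes(rows, include_terminal=False), "v2")
unfiltered = next_step(fetch_routes(rows, include_terminal=True), "v2")
```

With the filter applied, the failed green routes vanish and the reduced FSM asks for new green routes again; without it, the same data leads to a rollback.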


# ── 4. All green failed → rollback ──
if total_green_live == 0 and green_failed:
    log.warning(

[HIGH] Finding #2: promote_delay_seconds > 0 creates an infinite wait -- promotion will never happen automatically

When auto_promote=True and promote_delay_seconds > 0, this code always returns PROGRESSING without tracking when the delay period started. Since blue_green_evaluate is a pure function called on each evaluation cycle with no external state, there is no mechanism to determine when the delay has elapsed.

Every subsequent cycle will re-evaluate, see promote_delay_seconds > 0 is still true, and return PROGRESSING again -- indefinitely. The auto-promotion will never trigger.

Suggested fix: Either:

  1. Track the timestamp when all green routes first became healthy (e.g., in deployment state or a separate field), then compare now - first_all_healthy_time >= promote_delay_seconds to decide whether to promote or keep waiting.
  2. Use an external timer/scheduler that triggers promotion after the delay.
  3. If this is a known limitation for the initial implementation, document it clearly and consider validating that promote_delay_seconds == 0 when auto_promote=True until the delay tracking is implemented.
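
Option 3 could be enforced at construction time. The sketch below uses a hypothetical `BlueGreenSpecSketch` dataclass; the real BlueGreenSpec in models/deployment_policy.py may be defined differently, so treat the validation hook as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlueGreenSpecSketch:  # hypothetical stand-in for BlueGreenSpec
    auto_promote: bool = False
    promote_delay_seconds: int = 0

    def __post_init__(self) -> None:
        if self.promote_delay_seconds < 0:
            raise ValueError("promote_delay_seconds must be non-negative")
        # Interim guard: reject a delay that the FSM cannot yet honor.
        if self.auto_promote and self.promote_delay_seconds != 0:
            raise ValueError(
                "promote_delay_seconds must be 0 with auto_promote=True "
                "until delay tracking is implemented"
            )


def is_valid(**kwargs) -> bool:
    """Return True if the spec can be constructed with the given fields."""
    try:
        BlueGreenSpecSketch(**kwargs)
    except ValueError:
        return False
    return True
```

Rejecting the unsupported combination up front turns a silent, permanent PROGRESSING state into an immediate configuration error.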

    deployment.id,
    desired,
)
route_changes = RouteChanges(

[MEDIUM] Finding #3: Mixed healthy+failed green routes return PROGRESSING indefinitely with no recovery path

When some green routes are healthy and some have failed (but none are provisioning), the FSM reaches step 5 (len(green_healthy) < desired) and returns PROGRESSING. However, since:

  • Failed routes cannot recover on their own (they are in terminal states FAILED_TO_START / TERMINATED)
  • No new green routes are created to replace them
  • The desired count will never be met

This creates a stuck deployment that will return PROGRESSING forever.

Suggested fix: Consider one of:

  1. If the ratio of failed to total green exceeds a threshold, trigger a rollback.
  2. Create replacement green routes for the failed ones (retry semantics).
  3. At minimum, add a max_failures or failure_threshold field to BlueGreenSpec to control this behavior.

Note: This finding is partially related to Finding #1 -- if failed routes are not fetched from the DB, this scenario manifests as fewer green routes than expected, but the FSM still gets stuck at step 5.
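
Options 1 to 3 can be combined into a single decision helper. This is a sketch under assumptions: `mixed_green_action` and the `max_failures` parameter are hypothetical, and the counts are assumed to be computed from the classified green routes.

```python
def mixed_green_action(healthy, provisioning, failed, desired, max_failures):
    """Decide how to proceed when some green routes have failed.

    Returns (action, replacements), where replacements is the number of new
    green routes to create.
    """
    if failed > max_failures:
        return ("rollback", 0)  # too many failures: give up on green
    missing = desired - healthy - provisioning
    if missing > 0:
        return ("replace", missing)  # retry semantics: top up to desired
    if healthy < desired:
        return ("wait", 0)  # replacements are still provisioning
    return ("ready", 0)  # desired count reached, continue to the promote checks
```

For example, with desired=3 and max_failures=1, two healthy routes plus one failed route yields a single replacement, while one healthy route plus two failed routes triggers a rollback.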

elif r.status in (RouteStatus.FAILED_TO_START, RouteStatus.TERMINATED):
    green_failed.append(r)
elif r.status.is_active():
    green_healthy.append(r)

[MEDIUM] Finding #4: UNHEALTHY and DEGRADED green routes silently classified as "healthy"

The is_active() fallback at line 70 catches RouteStatus.UNHEALTHY and RouteStatus.DEGRADED and puts them into green_healthy. This means the FSM could promote green routes that are actively unhealthy or degraded to receive production traffic.

While PROVISIONING is explicitly handled before this point, UNHEALTHY and DEGRADED routes reaching the promotion step would result in switching production traffic to degraded service.

Suggested fix: Consider adding explicit handling for UNHEALTHY and DEGRADED:

  • Either treat them like PROVISIONING (wait for them to become healthy)
  • Or treat them like failures (count toward a failure threshold)
  • At minimum, add a comment explaining why treating them as healthy is intentional if that is the design choice
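
The first option (treat UNHEALTHY and DEGRADED like PROVISIONING) could look like the sketch below. `RouteStatusSketch` and `classify_green` are hypothetical; the enum lists only the six statuses named in this review, and the real RouteStatus may contain more.

```python
from enum import Enum


class RouteStatusSketch(Enum):  # hypothetical mirror of the statuses named above
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED_TO_START = "failed_to_start"
    TERMINATED = "terminated"


_WAITING = {
    RouteStatusSketch.PROVISIONING,
    RouteStatusSketch.UNHEALTHY,
    RouteStatusSketch.DEGRADED,
}
_FAILED = {RouteStatusSketch.FAILED_TO_START, RouteStatusSketch.TERMINATED}


def classify_green(status: RouteStatusSketch) -> str:
    """Map a green route status to an FSM bucket with no implicit fallback."""
    if status is RouteStatusSketch.HEALTHY:
        return "healthy"
    if status in _WAITING:
        return "waiting"
    if status in _FAILED:
        return "failed"
    raise ValueError(f"unhandled route status: {status}")
```

Matching each status explicitly, and raising on anything unknown, means a newly added status cannot silently land in the healthy bucket.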

"""Scale out/in/promote routes based on provided creators and updaters."""
async with self._begin_session_read_committed() as db_sess:
    # Scale out routes
    for creator in scale_out_creators:

[MEDIUM] Finding #5: Scale-in executes before promote in the DB transaction

In scale_routes(), the execution order is:

  1. Scale out (create new routes)
  2. Scale in (blue routes -> TERMINATING, traffic_ratio=0.0, INACTIVE)
  3. Promote (green routes -> ACTIVE, traffic_ratio=1.0)

During step 2, the blue routes are set to TERMINATING/INACTIVE while the green routes are not yet promoted to ACTIVE, so the transaction passes through a state in which no routes are ACTIVE for this deployment.

As long as all three steps commit together, other READ COMMITTED transactions see only committed rows and cannot observe that intermediate state. The ordering becomes a real availability risk if the steps are ever committed separately, for example after splitting scale-in and promote into different transactions: a load balancer query landing between the two commits would find no ACTIVE routes.

Suggested fix: Consider reordering to promote first, then scale-in. This way, there is a brief overlap period where both blue and green are ACTIVE (which is safer for availability than having neither active). Alternatively, consider using SERIALIZABLE isolation for this specific operation.
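
To compare the two orderings, the sketch below counts ACTIVE routes after each step. It models only the order of in-memory updates, not transaction visibility; `apply_in_order` and the route IDs are hypothetical.

```python
def apply_in_order(routes, steps):
    """Apply (route_ids, new_traffic_status) steps in order.

    Returns the number of ACTIVE routes observed after each step.
    """
    state = dict(routes)  # copy so the caller's mapping is not mutated
    active_counts = []
    for route_ids, traffic_status in steps:
        for route_id in route_ids:
            state[route_id] = traffic_status
        active_counts.append(sum(1 for s in state.values() if s == "ACTIVE"))
    return active_counts


routes = {"b0": "ACTIVE", "b1": "ACTIVE", "g0": "INACTIVE", "g1": "INACTIVE"}
scale_in = (["b0", "b1"], "INACTIVE")
promote = (["g0", "g1"], "ACTIVE")

current_order = apply_in_order(routes, [scale_in, promote])  # dips to 0 ACTIVE
promote_first = apply_in_order(routes, [promote, scale_in])  # never below 2 ACTIVE
```

Promote-first keeps at least the desired number of ACTIVE routes at every intermediate point, at the cost of a short overlap in which both revisions are ACTIVE.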

result = blue_green_evaluate(deployment, blues + greens, spec)

expected_scale_in = [r.route_id for r in blues]
assert result.route_changes.scale_in_route_ids == expected_scale_in

[LOW] Finding #6: Missing test coverage for UNHEALTHY and DEGRADED green routes

The tests do not cover scenarios where green routes have RouteStatus.UNHEALTHY or RouteStatus.DEGRADED status. These are important edge cases because the current code classifies them as "healthy" via the is_active() fallback (see Finding #4). Tests should verify this behavior is intentional.

Additionally, there is no test for the case where desired=0 (zero replicas), which could cause edge cases in the green route creation logic.

Suggested additions:

def test_unhealthy_green_treated_as_healthy(self) -> None:
    deployment = _make_deployment(desired=3)
    blues = _blue_routes(deployment, 3)
    greens = _green_routes(deployment, 3, status=RouteStatus.UNHEALTHY)
    spec = BlueGreenSpec(auto_promote=True, promote_delay_seconds=0)
    result = blue_green_evaluate(deployment, blues + greens, spec)
    # Should this promote unhealthy routes? Document the expected behavior.
    assert result.completed  # or assert not result.completed

def test_degraded_green_treated_as_healthy(self) -> None:
    deployment = _make_deployment(desired=3)
    blues = _blue_routes(deployment, 3)
    greens = _green_routes(deployment, 3, status=RouteStatus.DEGRADED)
    spec = BlueGreenSpec(auto_promote=True, promote_delay_seconds=0)
    result = blue_green_evaluate(deployment, blues + greens, spec)
    assert result.completed  # or assert not result.completed

@jopemachine

Security & Performance Review Complete

Reviewed 9 changed files (+557/-11 lines) implementing the Blue-Green deployment strategy (BEP-1049).

Findings Summary

| # | Severity | Location | Issue | Status |
| --- | --- | --- | --- | --- |
| 1 | HIGH | blue_green.py (step 4, rollback) | fetch_active_routes_by_endpoint_ids excludes FAILED_TO_START/TERMINATED routes -- rollback path is unreachable in production, causing infinite retry loops instead of rollback | Open |
| 2 | HIGH | blue_green.py:134 (step 7, delay) | promote_delay_seconds > 0 always returns PROGRESSING without tracking delay start time -- auto-promotion with delay will never trigger | Open |
| 3 | MEDIUM | blue_green.py:114 (step 5, mixed) | Mixed healthy+failed green routes stuck in PROGRESSING forever with no recovery path | Open |
| 4 | MEDIUM | blue_green.py:69-70 (classification) | UNHEALTHY/DEGRADED green routes classified as "healthy" via is_active() fallback -- could promote degraded routes to production | Open |
| 5 | MEDIUM | db_source.py:1425-1430 (transaction order) | Scale-in (blue -> TERMINATING) executes before promote (green -> ACTIVE) in the same transaction; concurrent reads could see no ACTIVE routes if the two steps are ever committed separately | Open |
| 6 | LOW | test_blue_green.py | No test coverage for UNHEALTHY/DEGRADED green routes or desired=0 edge case | Open |

Positive Observations

  • The pure function design of blue_green_evaluate() is excellent for testability and reasoning about state transitions
  • Clean FSM flow with well-documented steps in the docstring
  • Good use of the batch evaluator pattern -- aggregating all route changes across deployments and applying them in a single DB transaction
  • The promote_updater parameter uses a default value (None) to maintain backward compatibility with existing callers (e.g., executor.py)
  • Test coverage for the happy path and basic edge cases (single replica, many replicas, fresh deployment) is solid
  • The RouteChanges dataclass cleanly separates the three types of mutations (scale-out, scale-in, promote)

Recommendation

The two HIGH findings (unreachable rollback and infinite promote delay) represent logic bugs that would cause deployment failures in production. These should be addressed before merging. The MEDIUM findings are design considerations that could be documented as known limitations if they are intentional tradeoffs.

Comment on lines 1440 to +1446
 async def scale_routes(
     self,
     scale_out_creators: Sequence[Creator[RoutingRow]],
     scale_in_updater: BatchUpdater[RoutingRow] | None,
+    promote_updater: BatchUpdater[RoutingRow] | None = None,
 ) -> None:
-    """Scale out/in routes based on provided creators and updater."""
+    """Scale out/in/promote routes based on provided creators and updaters."""
Collaborator

Scaling in/out and promoting should be considered separately.

@jopemachine jopemachine force-pushed the BA-4821 branch 4 times, most recently from cb54845 to 19fe5c6 March 4, 2026 02:48
jopemachine and others added 7 commits March 4, 2026 02:51
…-green strategy

Add status_updated_at field to RouteInfo and RoutingRow for tracking when
route status last changed. Implement promote_delay_seconds time calculation
in blue_green_evaluate() using _latest_status_updated_at helper. Add
fetch_routes_by_endpoint_ids to repository for fetching all routes including
failed/terminated (needed for rollback detection). Update tests with
status_updated_at parameter and add promote delay test scenarios.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Labels

comp:manager Related to Manager component
size:XL 500~ LoC


2 participants