Canary deployment broken #4745

@lukas-w

Description

The way flyctl implements the canary deployment strategy appears to be broken, or at least behaves unexpectedly. It does create a canary machine, but destroys it immediately after health checks pass, so it's more akin to a smoke check than a deployment strategy. What it appears to do is:

  1. Create a new canary machine
  2. Wait for the canary to be healthy
  3. Destroy the canary
  4. Continue with a rolling deployment
Example log:
> Creating canary machine for group app
> Machine 2876e0db563318 [app] was created
✔ Machine 2876e0db563318 [app] update finished: success
machine 2876e0db563318 was found and is currently in created state, attempting to destroy...
Updating existing machines in '[app name redacted]' with canary strategy
> Acquiring lease for d894d55fe3edd8
> Acquired lease for d894d55fe3edd8
> Updating machine config for d894d55fe3edd8
> Updating d894d55fe3edd8 [app]
> Updated machine config for d894d55fe3edd8
> Waiting for machine d894d55fe3edd8 to reach a good state
> Machine d894d55fe3edd8 reached started state
> Running smoke checks on machine d894d55fe3edd8
> Running machine checks on machine d894d55fe3edd8
> Checking health of machine d894d55fe3edd8
✔ Machine d894d55fe3edd8 is now in a good state
> Clearing lease for d894d55fe3edd8
✔ Cleared lease for d894d55fe3edd8
> Acquiring lease for d894d55fe3edd8
> Acquired lease for d894d55fe3edd8
> Updating machine config for d894d55fe3edd8
> Updating d894d55fe3edd8 [app]
> Updated machine config for d894d55fe3edd8
✔ Machine d894d55fe3edd8 is now in a good state
> Clearing lease for d894d55fe3edd8
✔ Cleared lease for d894d55fe3edd8
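The observed ordering can be sketched as a toy simulation. This is not flyctl's actual code, only a count of healthy machines over time, under the simplifying assumption that each machine is briefly down while its update is applied:

```python
def canary_then_rolling(n):
    """Simulate the observed behavior for n machines; return the
    healthy-machine count after each step."""
    healthy = n
    timeline = [healthy]
    # 1.-2. Create a canary machine and wait until it's healthy
    healthy += 1
    timeline.append(healthy)
    # 3. Destroy the canary immediately
    healthy -= 1
    timeline.append(healthy)
    # 4. Rolling update: each existing machine goes down, then comes back
    for _ in range(n):
        healthy -= 1          # machine restarts with the new config
        timeline.append(healthy)
        healthy += 1          # machine passes health checks again
        timeline.append(healthy)
    return timeline

print(min(canary_then_rolling(1)))  # 0 -> a window with no healthy machine
print(min(canary_then_rolling(3)))  # 2 -> degraded, though not fully down
```

The canary briefly raises the count to n + 1, but because it is destroyed before the rolling update starts, the update still dips to n − 1 healthy machines, which is zero for a single-machine app.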

For a simple one-machine deployment, this means there is downtime during the deployment (this is only for illustration; I'm aware there should be at least two machines for high availability). Instead, the canary should be destroyed last:

  1. Create a new canary machine
  2. Wait for the canary to be healthy
  3. Continue with a rolling deployment
  4. Destroy the canary

Or, alternatively and mostly equivalently:

  1. Create a new canary machine
  2. Wait for the canary to be healthy
  3. Destroy a non-canary
  4. Repeat 1.-3. until all machines are updated
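Either fixed ordering can be sketched with the same kind of toy model; here the rotate variant (create a canary, destroy an old machine, repeat), again only counting healthy machines rather than modeling flyctl itself:

```python
def canary_rotate(n):
    """Simulate the proposed rotate ordering for n machines; return the
    healthy-machine count after each step."""
    healthy = n
    timeline = [healthy]
    for _ in range(n):
        healthy += 1          # 1.-2. create a new canary, wait until healthy
        timeline.append(healthy)
        healthy -= 1          # 3. destroy one old (non-canary) machine
        timeline.append(healthy)
    return timeline

print(min(canary_rotate(1)))  # 1 -> never below n healthy machines
print(max(canary_rotate(3)))  # 4 -> at most n + 1 machines at once
```

For any n ≥ 1 the healthy count never drops below n and at most n + 1 machines exist at any moment, which is exactly the property the current implementation gives up.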

In my understanding, the purpose of this is that with $n$ machines before the deployment, there are always at least $n$ healthy machines running (and up to $n + 1$ machines in total), meaning there should be no degraded performance during deployment and no downtime even when $n = 1$. The current implementation defeats this purpose and is a regression from the Nomad-based deployments previously offered.

While the documentation's description technically matches the implementation, and even includes a deployment log showing the actual behavior, other public communication implies that the canary strategy is intended to ensure there is always a healthy instance running [1, 2, 3], and several forum posts report unexpected downtime with canary deployments.

Labels: bug