Canary deployment broken #4745

@lukas-w

Description

The way flyctl implements the canary deployment strategy appears to be broken, or at least behaves unexpectedly. It does create a canary machine, but destroys it immediately after health checks pass, so it's more akin to a smoke check than a deployment strategy. What it appears to do is:

  1. Create a new canary machine
  2. Wait for the canary to be healthy
  3. Destroy the canary
  4. Continue with a rolling deployment
Example log:
> Creating canary machine for group app
> Machine 2876e0db563318 [app] was created
✔ Machine 2876e0db563318 [app] update finished: success
machine 2876e0db563318 was found and is currently in created state, attempting to destroy...
Updating existing machines in '[app name redacted]' with canary strategy
> Acquiring lease for d894d55fe3edd8
> Acquired lease for d894d55fe3edd8
> Updating machine config for d894d55fe3edd8
> Updating d894d55fe3edd8 [app]
> Updated machine config for d894d55fe3edd8
> Waiting for machine d894d55fe3edd8 to reach a good state
> Machine d894d55fe3edd8 reached started state
> Running smoke checks on machine d894d55fe3edd8
> Running machine checks on machine d894d55fe3edd8
> Checking health of machine d894d55fe3edd8
✔ Machine d894d55fe3edd8 is now in a good state
> Clearing lease for d894d55fe3edd8
✔ Cleared lease for d894d55fe3edd8
> Acquiring lease for d894d55fe3edd8
> Acquired lease for d894d55fe3edd8
> Updating machine config for d894d55fe3edd8
> Updating d894d55fe3edd8 [app]
> Updated machine config for d894d55fe3edd8
✔ Machine d894d55fe3edd8 is now in a good state
> Clearing lease for d894d55fe3edd8
✔ Cleared lease for d894d55fe3edd8
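The observed ordering can be sketched as a toy simulation. This is not flyctl's actual code, only a count of healthy machines over time, under the simplifying assumption that each machine is briefly down while its update is applied:

```python
def canary_then_rolling(n):
    """Simulate the observed behavior for n machines; return the
    healthy-machine count after each step."""
    healthy = n
    timeline = [healthy]
    # 1.-2. Create a canary machine and wait until it's healthy
    healthy += 1
    timeline.append(healthy)
    # 3. Destroy the canary immediately
    healthy -= 1
    timeline.append(healthy)
    # 4. Rolling update: each existing machine goes down, then comes back
    for _ in range(n):
        healthy -= 1          # machine restarts with the new config
        timeline.append(healthy)
        healthy += 1          # machine passes health checks again
        timeline.append(healthy)
    return timeline

print(min(canary_then_rolling(1)))  # 0 -> a window with no healthy machine
print(min(canary_then_rolling(3)))  # 2 -> degraded, though not fully down
```

The canary briefly raises the count to n + 1, but because it is destroyed before the rolling update starts, the update still dips to n − 1 healthy machines, which is zero for a single-machine app.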

For a simple one-machine deployment, this means there is downtime during the deployment (this is only for illustration; I'm aware there should be at least two machines for high availability). Instead, the canary should be destroyed last:

  1. Create a new canary machine
  2. Wait for the canary to be healthy
  3. Continue with a rolling deployment
  4. Destroy the canary

Or, alternatively and mostly equivalently:

  1. Create a new canary machine
  2. Wait for the canary to be healthy
  3. Destroy a non-canary
  4. Repeat 1.-3. until all machines are updated
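Either fixed ordering can be sketched with the same kind of toy model; here the rotate variant (create a canary, destroy an old machine, repeat), again only counting healthy machines rather than modeling flyctl itself:

```python
def canary_rotate(n):
    """Simulate the proposed rotate ordering for n machines; return the
    healthy-machine count after each step."""
    healthy = n
    timeline = [healthy]
    for _ in range(n):
        healthy += 1          # 1.-2. create a new canary, wait until healthy
        timeline.append(healthy)
        healthy -= 1          # 3. destroy one old (non-canary) machine
        timeline.append(healthy)
    return timeline

print(min(canary_rotate(1)))  # 1 -> never below n healthy machines
print(max(canary_rotate(3)))  # 4 -> at most n + 1 machines at once
```

For any n ≥ 1 the healthy count never drops below n and at most n + 1 machines exist at any moment, which is exactly the property the current implementation gives up.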

In my understanding, the purpose of this is that with $n$ machines before the deployment, there are always at least $n$ healthy machines running (and up to $n + 1$ machines in total), meaning there should be no degraded performance during deployment and no downtime even when $n = 1$. The current implementation defeats this purpose and is a regression from the Nomad-based deployments previously offered.

While the documentation's description technically matches the implementation, and even includes a deployment log showing the actual behavior, other public communication implies that the canary strategy is intended to ensure there is always a healthy instance running [1, 2, 3], and several forum posts report unexpected downtime with canary deployments.

Labels: bug