Description
The way flyctl implements the canary deployment strategy appears to be broken or at least unexpected. It does create a canary machine, but removes it immediately after health checks pass, so it's more akin to a smoke check than a deployment strategy. What it appears to do is:
- Create a new canary machine
- Wait for the canary to be healthy
- Destroy the canary
- Continue with a rolling deployment
Example log:

```
> Creating canary machine for group app
> Machine 2876e0db563318 [app] was created
✔ Machine 2876e0db563318 [app] update finished: success
machine 2876e0db563318 was found and is currently in created state, attempting to destroy...
Updating existing machines in '[app name redacted]' with canary strategy
> Acquiring lease for d894d55fe3edd8
> Acquired lease for d894d55fe3edd8
> Updating machine config for d894d55fe3edd8
> Updating d894d55fe3edd8 [app]
> Updated machine config for d894d55fe3edd8
> Waiting for machine d894d55fe3edd8 to reach a good state
> Machine d894d55fe3edd8 reached started state
> Running smoke checks on machine d894d55fe3edd8
> Running machine checks on machine d894d55fe3edd8
> Checking health of machine d894d55fe3edd8
✔ Machine d894d55fe3edd8 is now in a good state
> Clearing lease for d894d55fe3edd8
✔ Cleared lease for d894d55fe3edd8
> Acquiring lease for d894d55fe3edd8
> Acquired lease for d894d55fe3edd8
> Updating machine config for d894d55fe3edd8
> Updating d894d55fe3edd8 [app]
> Updated machine config for d894d55fe3edd8
✔ Machine d894d55fe3edd8 is now in a good state
> Clearing lease for d894d55fe3edd8
✔ Cleared lease for d894d55fe3edd8
```
For a single-machine app, this means downtime on every deploy (this example is for illustration; I'm aware there should be at least two machines for high availability). Instead, the canary should be destroyed last:
- Create a new canary machine
- Wait for the canary to be healthy
- Continue with a rolling deployment
- Destroy the canary
Or alternatively, and mostly equivalently:
1. Create a new canary machine
2. Wait for the canary to be healthy
3. Destroy a non-canary machine
4. Repeat steps 1.-3. until all machines are updated
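The difference between the two orderings can be sketched with a toy model in Go (flyctl's language). This is purely illustrative, not flyctl's actual code: the `pool` type and the step sequences are my own simplification, tracking only the number of healthy machines at each point of a one-machine deploy.

```go
package main

import "fmt"

// pool is a toy model of an app's machines: it tracks the current
// number of healthy machines and the minimum ever observed.
type pool struct {
	healthy int
	minSeen int
}

func (p *pool) add()    { p.healthy++; p.track() }
func (p *pool) remove() { p.healthy--; p.track() }
func (p *pool) track() {
	if p.healthy < p.minSeen {
		p.minSeen = p.healthy
	}
}

// currentStrategy models the observed behavior for a one-machine app:
// the canary is destroyed right after its health checks pass, and only
// then is the existing machine rolled (taken down, brought back up).
func currentStrategy() int {
	p := &pool{healthy: 1, minSeen: 1}
	p.add()    // create canary
	p.remove() // destroy canary after health checks
	p.remove() // rolling update takes the old machine down...
	p.add()    // ...and brings the updated one up
	return p.minSeen
}

// proposedStrategy destroys the canary only after the rolling
// update has finished, so the canary covers the gap.
func proposedStrategy() int {
	p := &pool{healthy: 1, minSeen: 1}
	p.add()    // create canary
	p.remove() // rolling update: old machine down
	p.add()    // updated machine up
	p.remove() // destroy canary last
	return p.minSeen
}

func main() {
	fmt.Println(currentStrategy(), proposedStrategy()) // 0 1
}
```

The minimum healthy count drops to 0 under the current ordering (a window with no machine serving traffic) but never falls below 1 when the canary is destroyed last, which is the whole point of the reordering proposed above.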
In my understanding, the purpose of the canary strategy is to ensure there is always a healthy instance running during a deployment. While the documentation's description technically matches the implementation and includes a deployment log showing the actual behavior, other public communication implies that the canary strategy is intended to guarantee exactly that [1, 2, 3], and there are several forum posts reporting unexpected downtime with canary deployments.