Losing all spot nodes at once across multiple pools and sizes #5074
Replies: 3 comments
- We are configured for one node type, but are seeing the same behavior. It started about two weeks ago on a 24-hour cycle, and today it's just constant for us.
- We also noticed this behaviour on AKS clusters - all spot machines are evicted across three availability zones at the same time.
- We've developed a workaround that's more complex than I'd like. Instead of using the default scalable node pools, we switched to Karpenter. Rather than putting everything in a VMSS, Karpenter provisions VMs directly and joins them to the cluster as nodes, which lets it span more instance types and availability zones. It recovers faster, spreads better, and doesn't all go down at once. Azure has an official repo with the Helm chart and deployment instructions for Karpenter.

  Alongside Karpenter, we keep some nodes warm with essentially nothing running on them: roughly 25% of our capacity consists of pause containers that do nothing but request a full node's worth of resources. They are set to the lowest preemption level, so if our workloads are kicked off a spot node, they bump the placeholders off the warm nodes, forcing Karpenter to create new nodes for the evicted placeholders. When the spot nodes go down, it's over the course of a few minutes, so Karpenter can spin up new warm nodes faster than the spot nodes are killed. The result is minimal disruption. A sketch of the placeholder setup is below.

  That said, we only started with Karpenter this week, and it's only in our Dev environment. We've had good success with it there, but we're giving it some time and observation before promoting it to Prod. I know that "just use a different tool" isn't a viable answer for questions like this, and it doesn't explain why this behavior is happening with VMSS and standard node pools, but this is how we are getting around the problem.
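  For anyone who wants to try the same thing, here is a minimal sketch of that overprovisioning pattern: a low-priority PriorityClass plus a Deployment of pause pods sized to roughly one node each. The names, priority value, replica count, image tag, and resource requests are all illustrative assumptions; size the requests to the SKU your node pool actually uses.

  ```yaml
  # Placeholder pods run at negative priority, so any real workload can
  # preempt them; preemptionPolicy: Never means they never evict others.
  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: overprovisioning          # hypothetical name
  value: -10
  preemptionPolicy: Never
  globalDefault: false
  description: "Warm-node placeholders that real workloads may preempt."
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: warm-node-placeholder     # hypothetical name
  spec:
    replicas: 4                     # assumption: tune to ~25% of capacity
    selector:
      matchLabels:
        app: warm-node-placeholder
    template:
      metadata:
        labels:
          app: warm-node-placeholder
      spec:
        priorityClassName: overprovisioning
        containers:
          - name: pause
            image: registry.k8s.io/pause:3.9
            resources:
              requests:
                cpu: "3500m"        # assumption: sized to fill one 4-vCPU node
                memory: "12Gi"      # leaving headroom for kubelet/system reserves
  ```

  When a spot eviction displaces a real pod, the scheduler preempts one of these pause pods to make room on a warm node, and the now-pending placeholder is what triggers Karpenter to provision a replacement node in the background.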
- Currently, we have three different spot node pools configured with different machine sizes: d4as_v5, e4as_v5, and f4as_v6. We have enough redundancy built into our applications to handle an outage on a single node. However, every node is evicted simultaneously across all pools. Based on my understanding of spot nodes, it seems unlikely that all three machine sizes would be fully evicted at the same time. Is this expected behavior? Has anyone dealt with this before, and if so, how did you resolve it?