Added proposal for auto-rebalance on imbalanced cluster feature in operator#211
ShubhamRwt wants to merge 9 commits into
Conversation
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@ShubhamRwt So, is this a draft? Or is it ready for review? It is not completely clear, as the PR is not a draft, but you did not request a review from all maintainers, only from two specific people.

@scholzj It is ready for review. Sorry, I didn't tag everyone.
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
scholzj left a comment
- It is not clear what the lifecycle of the KafkaRebalance resources is
- What is the impact of maintenance time windows?
- What is the testing strategy?
> #### What happens if some rebalance fails:
>
> With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`
Why do we need an additional state for a failed rebalance? IIRC there is no such specific state for failures with auto-rebalancing on scale up/down. Or am I wrong?

I think it was me overthinking it: if some failure happens, it might need human intervention, and we might not want the auto-rebalance to happen again and again. But I think this is not really required; I will remove it.

So what state will it move to if there is an error? Idle?

Will the generated KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

The generated KR CR would move to NotReady, yes, but I think the user can check the logs; I don't think there is anything else.

Is the answer clear based on my previous reply?
> To resolve this issue, we will only make use of Cruise Control's anomaly detection ability; the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
> To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
> We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
> We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level, such as disk or broker failures.
I think it would be nice if we made events or WARN logs in Strimzi when there are other types of anomalies around the Kafka cluster which cannot be automatically fixed. WDYT?

AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels, so I am not sure having CC also report that is particularly useful?

If it is reported in "other channels" then it is fine to me!

I guess that depends on what disk and broker failures really are. I do not think disk failures are necessarily something the operator would notice. Broker failures, if that means the broker is unresponsive, would actually fail the reconciliation, I think. But if the broker accepts connections but doesn't, for example, replicate any data, we would not notice it.

I think all those situations are better surfaced through monitoring the appropriate metrics; that is all CC is doing anyway.
> * `remove-brokers` - auto-rebalancing on scale down
>
> To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
> The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.
nit: I think we should call it `rebalance` or `rebalance-broker` instead of `imbalance`.
I prefer `rebalance` or `rebalance-broker`, since it will match CC terminology; also, `add-brokers` and `remove-brokers` contain a verb, while "imbalance" is just an adjective. So it would match the current convention better.

I disagree; to me, the mode is the reason for the auto-rebalance being triggered.

In which case we might want to make it past-tense: "imbalanced".

> I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

I also disagree with you: if it is the reason, then why do we call it a 'mode'? It should be called a 'reason' then!
In its current form, for me, `rebalance-brokers` would match the convention created by `add-brokers` and `remove-brokers`...

But all the modes are rebalancing the brokers?

The "mode" terminology comes from the KafkaRebalance spec, where you specify which mode you want to use for rebalancing: `full`, `add-brokers`, `remove-brokers`, or `remove-disks`. All these modes map to the corresponding CC endpoints. Maybe we had a discussion a long time ago about good naming for auto-rebalancing and ended up with the same names because the mapping was one to one. I can see @egyedt's point that, within auto-rebalancing, we are actually specifying "when should I trigger an auto-rebalance?": on adding brokers, on removing brokers, or whenever there is an anomaly. What the current proposal is going to do underneath, in terms of KafkaRebalance, is use the `full` mode, which uses the /rebalance endpoint (full because it uses ALL brokers within the cluster). Maybe we can stick with `full` instead of `imbalance` for now? Or we change the field name, but that requires deprecating this one and adding a new one.

I still see this is an ongoing discussion - do we plan to use `imbalance` or `full` mode?
> This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
> To resolve this issue, we will only make use of Cruise Control's anomaly detection ability; the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
> To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
> We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
Maybe for some anomalies, calling the fix-offline-replicas endpoint could be useful, besides the rebalance endpoint!

How would we decide which anomalies to call the different endpoint for?

There would have to be some logic about this in Strimzi if we decide to use the fix-offline-replicas endpoint too. Otherwise we can use only the rebalance endpoint; that is totally fine with me. I just mentioned it here as an interesting idea.

I didn't know about this /fix_offline_replicas endpoint at all. If there is a specific anomaly raised about it, maybe it could make sense to use it, but it's not trivial. The way the rebalance operator works today, each CC endpoint call is mapped to a corresponding "mode" within the KafkaRebalance, so hitting two endpoints would mean creating two KafkaRebalance resources: one using the /rebalance endpoint and another one using /fix_offline_replicas (which is anyway not supported now).
tomncooper left a comment
I have had a pass. I left a few comments.
As other reviewers have identified, the main point is checking whether a rebalance (manual or automatic) is ongoing before you do anything else.
For auto rebalances it seems you can check the state machine; for manual ones you could check CC directly or the status of the KR CR. If a manual KR CR already exists, then something is going to check its status with CC; depending on the ordering of those operations, it may happen before you need to do your checks, in which case you could use that. If it happens after, or the ordering can't be guaranteed, then you might be best querying CC directly.
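As an illustration of the KR CR route, the operator could inspect the status of any existing KafkaRebalance before triggering an anomaly-driven one. The condition types below follow the existing KafkaRebalance state machine (`ProposalReady`, `Rebalancing`, `Ready`, `NotReady`); the resource name and the exact check are hypothetical.

```yaml
# A manually created KafkaRebalance whose status shows a rebalance is
# already in progress; an anomaly-driven rebalance should wait for it.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: manual-rebalance        # hypothetical name
  labels:
    strimzi.io/cluster: my-cluster
spec:
  mode: full
status:
  conditions:
    - type: Rebalancing         # in-flight: skip/queue the automatic one
      status: "True"
```

If the ordering of the operator's reconciliation loops can't guarantee this status is fresh, querying CC's own state directly, as suggested above, avoids acting on stale information.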
Is there an estimate of when this PR will be ready?

@egyedt Hi, I pushed the suggestions yesterday. I think this proposal is ready for the next set of reviews. I hope the new suggestions resolve the open questions.

@scholzj @ppatierno @tomncooper can you have another pass at this, please? Thank you.
This PR aims to introduce the self-healing feature in Strimzi. The proposal incorporates all the comments and suggestions left on the old proposal, and it aims to utilize Strimzi's auto-rebalancing feature to introduce self-healing.