Added proposal for auto-rebalance on imbalanced cluster feature in operator by ShubhamRwt · Pull Request #211 · strimzi/proposals

ShubhamRwt · 2026-03-30T07:48:19Z

This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal. It aims to use Strimzi's auto-rebalancing feature to introduce self-healing.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj · 2026-04-01T12:38:27Z

@ShubhamRwt So, is this a draft? Or is it ready for review? It is not completely clear, as the PR is not a Draft but you did not request a review from all maintainers, only from two specific people.

ShubhamRwt · 2026-04-01T12:53:29Z

@scholzj It is ready for review. Sorry, I didn't tag everyone.

scholzj

It is not clear what the lifecycle of the KafkaRebalance resources is
What is the impact of maintenance time windows?
What is the testing strategy?

ppatierno · 2026-04-07T13:09:32Z

+
+#### What happens if some rebalance fails:
+
+With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`


Why do we need an additional state for a failed rebalance? IIRC there is no such specific state for failures with auto-rebalancing on scale up/down. Or am I wrong?

I think I was overthinking this: if a failure happens, it might need human intervention, and we might not want the auto-rebalance to be triggered again and again. But I think this is not really required. I will remove it.

So what state will it move to if there is an error? Idle?

Will the generated KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

The generated KR CR would move to NotReady, yes. The user can check the logs, but I don't think there is any other way to find out.

Is the answer clear based on my previous reply?

egyedt · 2026-04-16T12:12:58Z

+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
+We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.


I think it would be nice if Strimzi emitted events or WARN logs when there are other types of anomalies in the Kafka cluster that cannot be fixed automatically. WDYT?

AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

If it is reported in "other channels" then it is fine with me!

I guess that depends on what disk and broker failures really are. I do not think disk failures are necessarily something the operator would notice? Broker failures - if that means the broker is unresponsive - then it would actually fail the reconciliation I think. But if the broker accepts connections but doesn't for example replicate any data, we would not notice it.

I think all those situations are better surfaced through monitoring the appropriate metrics, that is all CC is doing anyway.

egyedt · 2026-04-16T12:17:17Z

+* `remove-brokers` - auto-rebalancing on scale down
+
+To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
+The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.


nit: I think we should call it rebalance or rebalance-broker instead of imbalance.
I prefer rebalance or rebalance-broker, since it matches CC terminology. Also, add-brokers and remove-brokers contain a verb, while "imbalance" is just an adjective, so this would match the current convention better.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

In which case we might want to make it past-tense: "imbalanced"

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

I also disagree with you: if it is the reason, then why do we call it 'mode'? It should be called 'reason' then!
In its current form, to me, rebalance-brokers would match the convention created by add-brokers and remove-brokers...

But all the modes are rebalancing the brokers?

The "mode" terminology comes from the KafkaRebalance spec, where you specify which mode you want to use for rebalancing: full, add-brokers, remove-brokers or remove-disks. All these modes map to the corresponding CC endpoints. Maybe we had a discussion about naming for auto-rebalancing a long time ago and ended up with the same names because the mapping was one to one. I can see @egyedt's point: within auto-rebalancing, we are actually specifying "when should I trigger an auto-rebalance?", that is, on adding brokers, on removing brokers or ... whenever there is an anomaly. Underneath, in terms of KafkaRebalance, the current proposal uses the "full" mode, which calls the /rebalance endpoint (full because it uses ALL brokers in the cluster). Maybe we can stick with "full" instead of "imbalance" for now? Otherwise we would have to change the field name, which means deprecating this one and adding a new one.

I still see this is an ongoing discussion - Do we plan to use imbalance or full mode?
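To make the naming discussion in this thread concrete, here is a minimal Python sketch of the two-level mapping being described: each KafkaRebalance mode corresponds to one Cruise Control endpoint, and the proposed goal-violation trigger would reuse the `full` mode underneath. Only `full` -> `/rebalance` is stated in the thread. The other endpoint paths are assumed from Cruise Control's REST API naming, the trigger name `imbalance` is still under discussion, and `endpoint_for_trigger` is a hypothetical helper, not Strimzi code.

```python
# KafkaRebalance modes and the Cruise Control endpoint each one maps to.
# Only "full" -> "/rebalance" is stated in the thread; the other paths are
# assumptions based on Cruise Control's REST API naming.
MODE_TO_ENDPOINT = {
    "full": "/rebalance",
    "add-brokers": "/add_broker",
    "remove-brokers": "/remove_broker",
    "remove-disks": "/remove_disks",
}

# Auto-rebalance triggers and the KafkaRebalance mode used underneath.
# "imbalance" is the name proposed in the PR and may still change.
TRIGGER_TO_MODE = {
    "add-brokers": "add-brokers",
    "remove-brokers": "remove-brokers",
    "imbalance": "full",
}


def endpoint_for_trigger(trigger: str) -> str:
    """Return the Cruise Control endpoint an auto-rebalance trigger ends up calling."""
    try:
        mode = TRIGGER_TO_MODE[trigger]
    except KeyError:
        raise ValueError(f"unknown auto-rebalance trigger: {trigger!r}") from None
    return MODE_TO_ENDPOINT[mode]
```

Seen this way, the disagreement is about which level the auto-rebalance field names: the trigger (the "reason", left-hand column of `TRIGGER_TO_MODE`) or the KafkaRebalance mode it resolves to.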

egyedt · 2026-04-16T12:27:54Z

+This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.


Maybe for some anomalies, calling the fix-offline-replicas endpoint could be useful, in addition to using the rebalance endpoint.

How would we decide which anomalies to call the different endpoint for?

There should be some logic about this in Strimzi if we decide to use the fix-offline-replicas endpoint too. Otherwise we can use only the rebalance endpoint, which is totally fine with me. I just mentioned it here as an interesting idea.

I didn't know about this /fix_offline_replicas endpoint at all. If a specific anomaly is raised for it, maybe it could make sense to use it, but it's not trivial. With how the rebalance operator works today, each CC endpoint call is mapped to a corresponding "mode" within the KafkaRebalance, so hitting two endpoints would mean creating two KafkaRebalance resources: one using the /rebalance endpoint and another using /fix_offline_replicas (which is not supported at the moment anyway).

tomncooper

I have had a pass. I left a few comments.

As other reviewers have identified, the main point is checking if a rebalance (manual or automatic) is ongoing before you do anything else.

For auto rebalances it seems you can check the state machine; for manual ones you could check CC directly or the status of the KR CR. If a manual KR CR already exists then something is going to check its status with CC. Depending on the ordering of those operations, that check may happen before you need to do yours, in which case you could use its result. If it happens after, or ordering can't be guaranteed, then you might be best querying CC directly.
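The ordering of checks suggested above could be sketched as follows. This is an illustrative Python sketch, not Strimzi code: `KafkaRebalanceView`, `rebalance_in_progress` and the `cc_has_active_task` callback are hypothetical names, and treating only the `Rebalancing` state of a KR CR as "ongoing" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# KafkaRebalance states treated as "a rebalance is running" in this sketch.
# Limiting this to "Rebalancing" is an assumption for illustration.
ACTIVE_KR_STATES = {"Rebalancing"}


@dataclass
class KafkaRebalanceView:
    """Hypothetical, simplified view of a KafkaRebalance (KR) custom resource."""
    name: str
    state: str


def rebalance_in_progress(
    auto_rebalance_state: str,
    kafka_rebalances: Iterable[KafkaRebalanceView],
    cc_has_active_task: Optional[Callable[[], bool]] = None,
) -> bool:
    """Decide whether any rebalance (automatic or manual) is already ongoing."""
    # 1. Auto-rebalance: any state other than Idle means one is already running.
    if auto_rebalance_state != "Idle":
        return True
    # 2. Manual rebalances: look at the status of existing KR CRs.
    if any(kr.state in ACTIVE_KR_STATES for kr in kafka_rebalances):
        return True
    # 3. The CR status may be stale if ordering is not guaranteed, so
    #    optionally fall back to asking Cruise Control directly.
    if cc_has_active_task is not None:
        return cc_has_active_task()
    return False
```

A caller that cannot rely on the KR CR status being up to date would always pass the `cc_has_active_task` probe, at the cost of one extra request to Cruise Control per check.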

tomncooper · 2026-04-16T15:57:02Z

+This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.


How would we decide which anomalies to call the different endpoint for?

tomncooper · 2026-04-16T15:58:32Z

+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
+We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.


AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

tomncooper · 2026-04-16T16:01:03Z

+* `remove-brokers` - auto-rebalancing on scale down
+
+To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
+The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.


I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

tomncooper · 2026-04-16T16:33:04Z

+
+#### What happens if some rebalance fails:
+
+With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`


So what state will it move to if there is an error? Idle?

tomncooper · 2026-04-16T16:33:55Z

+
+#### What happens if some rebalance fails:
+
+With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`


Will the generated KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

egyedt · 2026-05-06T13:50:02Z

Is there an estimate of when this PR will be ready?
Are there many open questions?
If help is needed with the questions, feel free to reply to me in this thread and I will try to help!

ShubhamRwt · 2026-05-06T13:56:34Z

@egyedt Hi, I pushed the suggestions yesterday. I think this proposal is ready for the next round of reviews. I hope the new changes resolve the open questions.

ShubhamRwt · 2026-05-06T13:57:17Z

@scholzj @ppatierno @tomncooper could you have another pass at this, please? Thank you.

ShubhamRwt added 3 commits March 30, 2026 12:42

proposal for auto-rebalance on imbalance

3c034bb

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Fixing minot issues

9d215a2

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Remove unnecessory lines

4f3d6e5

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

ShubhamRwt requested review from ppatierno and tomncooper March 30, 2026 08:33

Fix broken link

d8965cf

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

ShubhamRwt requested review from a team and scholzj April 1, 2026 12:53

ppatierno requested review from Frawless, im-konge, katheris, see-quick and tinaselenge and removed request for a team April 1, 2026 12:55

Refined the proposal

7bfa09f

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj reviewed Apr 7, 2026

ppatierno reviewed Apr 7, 2026

Added suggestions from PP and JS

50b8936

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

egyedt reviewed Apr 16, 2026

tomncooper reviewed Apr 16, 2026

ShubhamRwt added 2 commits May 5, 2026 17:53

Added suggestions by TC, PP, JS and EG

813f57d

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

update diagram

60ae9e6

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Add minor edits

86d1681

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj mentioned this pull request May 10, 2026

Add automatic offline log directory detection and rolling restart strimzi/strimzi-kafka-operator#12726

Closed

acgtun mentioned this pull request May 10, 2026

Proposal 142: Automatic detection and recovery of offline log directories #221

Closed

