Skip to content

Added proposal for auto-rebalance on imbalanced cluster feature in operator#211

Open
ShubhamRwt wants to merge 9 commits into
strimzi:mainfrom
ShubhamRwt:autoRebalanceImbalance
Open

Added proposal for auto-rebalance on imbalanced cluster feature in operator#211
ShubhamRwt wants to merge 9 commits into
strimzi:mainfrom
ShubhamRwt:autoRebalanceImbalance

Conversation

@ShubhamRwt
Copy link
Copy Markdown
Contributor

This PR aims to introduce the self-healing feature in Strimzi. This proposal contains all the comments and suggestion left on the old proposal . This proposal aim to utilize the auto-rebalancing feature of Strimzi to introduce the self healing.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@scholzj
Copy link
Copy Markdown
Member

scholzj commented Apr 1, 2026

@ShubhamRwt So, is this a draft? Or is it ready for review? It is not completely clear as the PR is not a Draft but you did not requested the review from all maintainers but only from two specific people.

@ShubhamRwt
Copy link
Copy Markdown
Contributor Author

@scholzj It is ready for review, Sorry I didn't tagged everyone

@ShubhamRwt ShubhamRwt requested review from a team and scholzj April 1, 2026 12:53
@ppatierno ppatierno requested review from Frawless, im-konge, katheris, see-quick and tinaselenge and removed request for a team April 1, 2026 12:55
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Copy link
Copy Markdown
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is not clear what the lifecycle of the KafkaRebalance resources is
  • What is the impact of maintenance time windows?
  • What is the testing strategy?

Comment thread images/136-auto-rebalance-imbalance.png
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated

#### What happens if some rebalance fails:

With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an additional state for failed rebalance? IIRC there is no such a specific state for failures with auto-rebalancing on scale up/down. Or am I wrong?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was me overthinking that in case some failure happens and it might need a human intervention and we might not need the auto rebalance to happen again and again but I think this is not really required. I will remove it

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what state will it move to if there is an error? Idle?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the generate KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated KR Cr would move to NotReady, yes but I think the user can check the logs but I dont think there is any other thing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the answer clear based on my previous reply?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice, if we make events or WARN logs in Strimzi if there are other type of anomalies around Kafka cluster, which cannot be automatically fixed. WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is reported in "other channels" then it is fine to me!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that depends on what disk and broker failures really are. I do not think disk failures are necessarily something the operator would notice? Broker failures - if that means the broker is unresponsive - then it would actually fail the reconciliation I think. But if the broker accepts connections but doesn't for example replicate any data, we would not notice it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all those situations are better surfaced through monitoring the appropriate metrics, that is all CC is doing anyway.

* `remove-brokers` - auto-rebalancing on scale down

To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we should call it rebalance or rebalance-broker instead of imbalance
I prefer rebalance or rebalance-broker, since it will match with CC terminology and also add-brokers, remove-brokers have a verb, while "imbalance" is just an adjective. So to match the current convention better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case we might want to make it past-tense: "imbalanced"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

I also disagree with you, if it is the reason, then why we call it 'mode'? It should be called 'reason' then!
In this current form, for me, the rebalance-brokers would match the convention created by add-brokers and remove-brokers...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But all the modes are rebalancing the brokers?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "mode" terminology is coming from the KafkaRebalance spec where you specify which mode you want to use for rebalancing: full, add-brokers, remove-brokers or remove-disks. All these modes maps to the corresponding CC endpoints. Maybe we had a discussion about a good naming long time ago about what to use in the auto-rebalancing and we ending with the same because the mapping was one to one. I can see @egyedt point where, within the auto-rebalancing, we are actually specifying "when I should trigger an auto-rebalancing?" on adding brokers, on removing brokers or ... whenever there is an anomaly. What the current proposal is going to do underneath in terms of KafkaRebalance is using the "full" mode which is using the /rebalance endpoint (full because using ALL brokers within the cluster). Maybe we can stick with "full" instead of "imbalance" for now? Or we should change the field name but it needs deprecating this and adding a new one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see this is an ongoing discussion - Do we plan to use imbalance or full mode?

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for some anomalies, the call of the fix-offline-replicas endpoint can be useful!
Beside the usage of rebalance endpoint.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we decide which anomalies to call the different endpoint for?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be some logic about this in Strimzi if we decide to use fix-offline-replicas endpoint too. Otherwise we can use only the rebalance endpoint, it is totally fine to me. I just mentioned here as an interesting idea.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about this /fix_offline_replicas endpoint at all. If there is a specific anomaly raised about it, maybe it could make sense to use it but it's not trivial. For how the rebalance operator works today, Each CC endpoint call is mappend on a corresponding "mode" within the KafkaRebalance, so hitting two endpoints would mean creating two KafkaRebalance: one using the /rebalance endpoint and another one using the /fix_offline_replicas (which is anyway not supported now).

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Copy link
Copy Markdown
Member

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had a pass. I left a few comments.

As other reviewers have identified, the main point is checking if a rebalance (manual or automatic) is ongoing before you do anything else.

For auto rebalances if seems you can check the state machine, for manual you could check CC directly or the status of the KR CR. If a manual KR CR already exists then something is going to check it's status with CC, depending on the ordering of those operations it may happen before you need to do your checks, in which case you could use that. If it is after or ordering can't be guaranteed then you might be best querying CC directly.

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we decide which anomalies to call the different endpoint for?

To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

* `remove-brokers` - auto-rebalancing on scale down

To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.


#### What happens if some rebalance fails:

With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what state will it move to if there is an error? Idle?


#### What happens if some rebalance fails:

With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the generate KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
ShubhamRwt added 2 commits May 5, 2026 17:53
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@egyedt
Copy link
Copy Markdown

egyedt commented May 6, 2026

Is there an estimate when this PR will be ready?
Is there a lot of open questions?
If help is needed with the questions then feel free to answer me in this thread and I will try to help!

@ShubhamRwt
Copy link
Copy Markdown
Contributor Author

@egyedt Hi, I pushed the suggestions yesterday. I think this proposal is ready for next set of reviews. I hope the new suggestions fixes the open questions

@ShubhamRwt
Copy link
Copy Markdown
Contributor Author

@scholzj @ppatierno @tomncooper can you guys have another pass at this, please. Thankyou

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants