feat: Add cluster maintenance mode #40

Merged
rdettai-sk merged 8 commits into sekoia from feat/maintenance_mode
May 7, 2026

Conversation

Collaborator

@Darkheir Darkheir commented May 5, 2026

Description

Add a cluster maintenance mode.

While the cluster is in maintenance mode, the indexing plan is frozen, along with all related operations (index creation, ...).

How was this PR tested?

Describe how you tested this PR.

Copilot AI review requested due to automatic review settings May 5, 2026 08:33

Copilot AI left a comment

Pull request overview

This PR introduces a cluster-wide maintenance mode that freezes the control plane’s indexing plan and rejects metadata mutations while maintenance is active. It adds persistence for the maintenance flag + frozen plan via the metastore KV API, exposes new REST endpoints (and a REST client), and provides a CLI surface to manage the mode.
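The "rejects metadata mutations while maintenance is active" behavior can be pictured with a minimal sketch. Only the `MaintenanceMode` variant name and `is_active()` appear in this PR; `MaintenanceFlag`, `set`, and `guard` are hypothetical names used for illustration, and the real control plane is an actor rather than a shared flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical error type; only the `MaintenanceMode` variant name comes from the PR.
#[derive(Debug, PartialEq)]
pub enum ControlPlaneError {
    MaintenanceMode,
}

/// Hypothetical stand-in for the control plane's maintenance flag.
#[derive(Default)]
pub struct MaintenanceFlag {
    active: AtomicBool,
}

impl MaintenanceFlag {
    /// Turns maintenance mode on or off.
    pub fn set(&self, active: bool) {
        self.active.store(active, Ordering::SeqCst);
    }

    /// Returns true while maintenance mode is on.
    pub fn is_active(&self) -> bool {
        self.active.load(Ordering::SeqCst)
    }

    /// Runs `mutation` only when maintenance is off; otherwise rejects it
    /// without running it.
    pub fn guard<T>(&self, mutation: impl FnOnce() -> T) -> Result<T, ControlPlaneError> {
        if self.is_active() {
            return Err(ControlPlaneError::MaintenanceMode);
        }
        Ok(mutation())
    }
}
```

Wrapping each mutating handler (index creation, for instance) in a single guard like this keeps the rejection logic in one place.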

Changes:

  • Add Control Plane maintenance mode state + persistence (metastore-backed) and metrics.
  • Add REST API endpoints (/api/v1/cluster/maintenance) plus REST client support.
  • Add metastore KV RPCs (proto/codegen + metastore backends) and a quickwit-cli maintenance command.
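As a rough sketch of how a key-value API can back the maintenance flag and frozen plan: the trait, the key name, and the helper functions below are all assumptions made for illustration. The PR's actual RPC names, message types, and serialization format are not shown in this conversation.

```rust
use std::collections::HashMap;

/// Hypothetical key-value interface; the PR's real metastore KV RPCs may differ.
pub trait KvStore {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn delete(&mut self, key: &str);
}

/// In-memory backend, loosely comparable to a file-backed store's in-memory map.
#[derive(Default)]
pub struct InMemoryKv {
    entries: HashMap<String, Vec<u8>>,
}

impl KvStore for InMemoryKv {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }

    fn delete(&mut self, key: &str) {
        self.entries.remove(key);
    }
}

/// Hypothetical key under which the maintenance state is stored.
pub const MAINTENANCE_KEY: &str = "maintenance_state";

/// Enabling maintenance stores the frozen plan; the key's presence is the flag.
pub fn enable_maintenance(kv: &mut dyn KvStore, frozen_plan: &str) {
    kv.put(MAINTENANCE_KEY, frozen_plan.as_bytes().to_vec());
}

/// Disabling maintenance removes the key.
pub fn disable_maintenance(kv: &mut dyn KvStore) {
    kv.delete(MAINTENANCE_KEY);
}

/// Returns the frozen plan if maintenance is active and the value is valid UTF-8.
pub fn frozen_plan(kv: &dyn KvStore) -> Option<String> {
    kv.get(MAINTENANCE_KEY)
        .and_then(|bytes| String::from_utf8(bytes).ok())
}
```

Storing the flag and the plan under one key means a restarted control plane can recover both with a single read.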

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| quickwit/quickwit-serve/src/rest.rs | Wires the new maintenance REST handler into /api/v1 routes. |
| quickwit/quickwit-serve/src/lib.rs | Spawns the control plane with metastore-backed maintenance persistence. |
| quickwit/quickwit-serve/src/cluster_api/rest_handler.rs | Adds REST endpoints + OpenAPI paths for maintenance mode. |
| quickwit/quickwit-serve/src/cluster_api/mod.rs | Re-exports maintenance_handler. |
| quickwit/quickwit-rest-client/src/rest_client.rs | Adds a MaintenanceClient with status/enable/disable calls. |
| quickwit/quickwit-proto/src/control_plane/mod.rs | Introduces MaintenanceMode error + RPC name bindings. |
| quickwit/quickwit-proto/src/codegen/quickwit/quickwit.metastore.rs | Codegen updates for metastore KV RPCs/types and tower wiring. |
| quickwit/quickwit-proto/src/codegen/quickwit/quickwit.control_plane.rs | Codegen updates for maintenance RPCs/types and tower wiring. |
| quickwit/quickwit-proto/protos/quickwit/metastore.proto | Adds KV RPCs/messages to the metastore service. |
| quickwit/quickwit-proto/protos/quickwit/control_plane.proto | Adds maintenance mode RPCs/messages to the control plane service. |
| quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs | Implements KV operations for PostgreSQL metastore backend. |
| quickwit/quickwit-metastore/src/metastore/file_backed/state.rs | Adds an in-memory KV store field to file-backed metastore state. |
| quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs | Implements KV operations for the file-backed metastore backend. |
| quickwit/quickwit-metastore/src/metastore/control_plane_metastore.rs | Proxies KV operations through MetastoreServiceClient. |
| quickwit/quickwit-control-plane/src/metrics.rs | Adds a maintenance_mode gauge metric. |
| quickwit/quickwit-control-plane/src/maintenance.rs | New maintenance mode persistence/state utilities (incl. metastore KV persistence). |
| quickwit/quickwit-control-plane/src/lib.rs | Exposes the new maintenance module. |
| quickwit/quickwit-control-plane/src/indexing_scheduler/mod.rs | Adds a method to load a frozen plan into scheduler state. |
| quickwit/quickwit-control-plane/src/indexing_plan.rs | Makes PhysicalIndexingPlan deserializable for persistence. |
| quickwit/quickwit-control-plane/src/control_plane.rs | Core maintenance mode behavior: guards mutations, freezes plan, adds RPC handlers. |
| quickwit/quickwit-control-plane/Cargo.toml | Adds time dependency for RFC3339 timestamps. |
| quickwit/quickwit-cli/src/maintenance.rs | New CLI command group to enable/disable/query maintenance mode. |
| quickwit/quickwit-cli/src/lib.rs | Exposes the new maintenance CLI module. |
| quickwit/quickwit-cli/src/cli.rs | Registers the maintenance CLI command. |
| quickwit/Cargo.lock | Locks the added time dependency. |
Comments suppressed due to low confidence (1)

quickwit/quickwit-control-plane/src/control_plane.rs:639

  • In maintenance mode, ControlPlanLoop returns early and skips indexing_scheduler.control_running_plan(&self.model). This prevents re-applying the frozen plan to indexers that restart during maintenance, contradicting the intended behavior described in IndexingScheduler::load_frozen_plan docs and potentially leaving restarted indexers without tasks. Consider still calling control_running_plan while skipping shard rebalancing and plan rebuilds, so the frozen plan continues to be enforced during maintenance windows.
        if self.maintenance.is_active() {
            // In maintenance mode: skip shard rebalancing and plan control.
            ctx.schedule_self_msg(CONTROL_PLAN_LOOP_INTERVAL, ControlPlanLoop);
            return Ok(());
        }
        if let Err(metastore_error) = self
            .ingest_controller
            .rebalance_shards(&mut self.model, ctx.mailbox(), ctx.progress())
            .await
        {
            return convert_metastore_error::<()>(metastore_error).map(|_| ());
        }
        self.indexing_scheduler.control_running_plan(&self.model);
        ctx.schedule_self_msg(CONTROL_PLAN_LOOP_INTERVAL, ControlPlanLoop);
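
The suggestion above (keep enforcing the frozen plan, skip only rebalancing) could look roughly like the following. This is a simplified, synchronous sketch: `Calls` and `control_plan_tick` are hypothetical stand-ins that only count invocations, and the error handling and rescheduling from the real handler are left out.

```rust
/// Hypothetical stand-in for the ingest controller and indexing scheduler;
/// it only counts how often each operation would have been invoked.
#[derive(Default)]
pub struct Calls {
    pub rebalance_shards: u32,
    pub control_running_plan: u32,
}

/// One tick of the control loop, following the review suggestion:
/// skip shard rebalancing during maintenance, but keep applying the
/// (frozen) running plan so restarted indexers get their tasks back.
pub fn control_plan_tick(maintenance_active: bool, calls: &mut Calls) {
    if !maintenance_active {
        // Stand-in for `ingest_controller.rebalance_shards(..)`.
        calls.rebalance_shards += 1;
    }
    // Stand-in for `indexing_scheduler.control_running_plan(..)`,
    // which runs in both modes in this sketch.
    calls.control_running_plan += 1;
}
```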

Comment thread quickwit/quickwit-metastore/src/metastore/file_backed/state.rs
Comment thread quickwit/quickwit-control-plane/src/maintenance.rs Outdated
@Darkheir Darkheir force-pushed the feat/maintenance_mode branch 2 times, most recently from ced1764 to e22ba25 Compare May 5, 2026 09:50
@Darkheir Darkheir requested a review from Copilot May 5, 2026 10:12

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 26 changed files in this pull request and generated 3 comments.


Comment thread quickwit/quickwit-control-plane/src/control_plane.rs Outdated
Comment thread quickwit/quickwit-rest-client/src/rest_client.rs
ControlPlaneError::TooManyRequests => MetastoreError::TooManyRequests,
ControlPlaneError::Unavailable(message) => MetastoreError::Unavailable(message),
ControlPlaneError::MaintenanceMode => {
MetastoreError::Unavailable("cluster is in maintenance mode".to_string())
Collaborator Author

I'm not sure "Precondition Failed" is an appropriate HTTP error status.
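
For comparison, one alternative is to surface maintenance mode as 503 Service Unavailable, which would line up with the `MetastoreError::Unavailable` mapping in the diff above. The sketch below is an assumption for discussion, not what the PR settled on; `http_status` is a hypothetical helper, and the enum is a trimmed copy containing only the variants quoted in this thread.

```rust
/// Trimmed, hypothetical copy of the error enum with only the variants
/// quoted in this thread.
#[derive(Debug)]
pub enum ControlPlaneError {
    TooManyRequests,
    Unavailable(String),
    MaintenanceMode,
}

/// Hypothetical mapping from control plane errors to HTTP status codes.
pub fn http_status(error: &ControlPlaneError) -> u16 {
    match error {
        // 429 Too Many Requests
        ControlPlaneError::TooManyRequests => 429,
        // 503 Service Unavailable
        ControlPlaneError::Unavailable(_) => 503,
        // Assumption: treat maintenance as a temporary unavailability (503)
        // rather than 412 Precondition Failed.
        ControlPlaneError::MaintenanceMode => 503,
    }
}
```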

@Darkheir Darkheir force-pushed the feat/maintenance_mode branch from e22ba25 to 5789f5a Compare May 5, 2026 10:31
@Darkheir Darkheir requested a review from rdettai-sk May 5, 2026 10:49
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
@Darkheir Darkheir force-pushed the feat/maintenance_mode branch from 5789f5a to 90c0066 Compare May 5, 2026 12:47
Comment thread quickwit/quickwit-control-plane/Cargo.toml
Comment on lines +242 to +246
warn!(
error = %err,
"failed to deserialize maintenance state; clearing corrupted key and \
starting in normal mode"
);
Collaborator

I think clearing the maintenance state is too aggressive here. If deserialization fails for some reason, it might be better to offer the choice to either clean the DB entry manually or roll back.
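
A minimal sketch of the less aggressive behavior, assuming a hypothetical string encoding of the persisted state: a value that cannot be parsed is reported as an error and left in place, so the operator can decide whether to clean the entry or roll back. `MaintenanceState`, `LoadError`, and `load_state` are illustrative names, not the PR's types.

```rust
/// Hypothetical decoded maintenance state.
#[derive(Debug, PartialEq)]
pub enum MaintenanceState {
    Normal,
    Active,
}

/// Hypothetical load error carrying the raw value for operator inspection.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    Corrupted(String),
}

/// Decodes the persisted value without mutating the store.
/// A missing key means normal mode; an unreadable value is an error,
/// not a silent fallback to normal mode.
pub fn load_state(raw: Option<&str>) -> Result<MaintenanceState, LoadError> {
    match raw {
        None => Ok(MaintenanceState::Normal),
        Some("active") => Ok(MaintenanceState::Active),
        Some(other) => Err(LoadError::Corrupted(other.to_string())),
    }
}
```

The caller can then refuse to start (or start in a degraded state) until the entry is fixed.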

Comment on lines +340 to +344
warn!(
error = %err,
"failed to load maintenance state from persistence, starting in normal mode"
);
}
Collaborator

Same here: an error (which might be transient) has us start outside of maintenance mode. We are then in an inconsistent state where the control plane is not in maintenance mode, but if it restarts it might jump back into it.
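
One way to avoid that inconsistency is to retry the load and surface the error if it keeps failing, instead of defaulting to normal mode. A minimal sketch, with a hypothetical `load_with_retries` helper; a real version would presumably be async and wait between attempts, which is omitted here.

```rust
/// Calls `load` until it succeeds or `max_attempts` attempts have been made,
/// then returns the last error instead of falling back to a default state.
/// `load` is always called at least once, even if `max_attempts` is 0.
pub fn load_with_retries<T, E>(
    max_attempts: u32,
    mut load: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match load() {
            Ok(value) => return Ok(value),
            // Out of attempts: propagate the error to the caller.
            Err(error) if attempt >= max_attempts => return Err(error),
            // Possibly transient: try again.
            Err(_) => attempt += 1,
        }
    }
}
```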

Darkheir added 2 commits May 5, 2026 17:35
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
@Darkheir Darkheir force-pushed the feat/maintenance_mode branch from 94d5010 to 2d2ee33 Compare May 5, 2026 16:02

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 6 comments.

Comment thread quickwit/quickwit-control-plane/src/maintenance.rs
Comment thread quickwit/quickwit-control-plane/src/maintenance.rs Outdated
Comment thread quickwit/quickwit-control-plane/src/maintenance.rs Outdated
Comment thread quickwit/quickwit-control-plane/src/maintenance.rs Outdated
Comment thread quickwit/quickwit-cli/src/maintenance.rs
Comment thread quickwit/quickwit-proto/protos/quickwit/control_plane.proto
@rdettai-sk rdettai-sk merged commit 9b2bf17 into sekoia May 7, 2026
3 checks passed

3 participants