diff --git a/docs/operator.md b/docs/operator.md
index 1f612dc..ceec0fa 100644
--- a/docs/operator.md
+++ b/docs/operator.md
@@ -365,6 +365,16 @@ Turns [Automatic Crash Recovery](recovery.md#automatic-crash-recovery) on or off
 | ----------- | ---------- |
 | :material-toggle-switch-outline: boolean | `true` |
 
+### `pxc.sstRetryCount`
+
+Limits how many State Snapshot Transfer (SST) retries a joining node can perform before it stops retrying and remains running but unready.
+
+Use this option to avoid endless SST retry loops and make recovery behavior predictable. For details about behavior and recovery steps, see [Limit SST retries](sst-retry-limit.md).
+
+| Value type | Example |
+| ----------- | ---------- |
+| :material-numeric-1-box: int (minimum `1`) | `3` |
+
 ### `pxc.expose.enabled`
 
 Enable or disable exposing Percona XtraDB Cluster instances with dedicated IP addresses.
diff --git a/docs/sst-retry-limit.md b/docs/sst-retry-limit.md
new file mode 100644
index 0000000..3330b90
--- /dev/null
+++ b/docs/sst-retry-limit.md
@@ -0,0 +1,41 @@
+# Limit SST retries
+
+When a Percona XtraDB Cluster node joins or rejoins the cluster, it receives data from an existing cluster member using the State Snapshot Transfer (SST) method. If SST fails repeatedly, the node can quickly enter an endless retry loop, consuming resources such as network bandwidth and degrading overall cluster performance.
+
+To prevent excessive and ineffective SST retry loops, you can set a limit on SST attempts for each joining node using the `spec.pxc.sstRetryCount` option in the Custom Resource. The Operator counts SST retries and records them in the `/var/lib/mysql/sst_retry_count` file inside the Pod.
+
+When the number of SST attempts exceeds the specified threshold, the following occurs:
+
+* The Operator creates the `/var/lib/mysql/sst_retry_limit_reached` marker file, and further SST attempts are stopped.
+* Liveness checks on the Pod continue to pass.
+* Readiness checks fail.
+* The Pod stays running, but remains unready.
+* The `SST retry limit reached` message is written to the container logs.
+
+This behavior lets you inspect the Pod and decide when to resume retries.
+
+## Configure the retry limit
+
+Set `spec.pxc.sstRetryCount` in your Custom Resource:
+
+```yaml
+apiVersion: pxc.percona.com/v1
+kind: PerconaXtraDBCluster
+metadata:
+  name: cluster1
+spec:
+  pxc:
+    sstRetryCount: 3
+```
+
+The value must be an integer greater than or equal to `1`.
+
+## Resume SST retries
+
+To allow retries again, remove the marker file inside the affected Pod:
+
+```bash
+kubectl exec -it cluster1-pxc-2 -c pxc -- rm -f /var/lib/mysql/sst_retry_limit_reached
+```
+
+The retry state is cleared automatically after the node successfully reaches the `joined` or `synced` state.
diff --git a/mkdocs-base.yml b/mkdocs-base.yml
index 8bd4765..32f6223 100644
--- a/mkdocs-base.yml
+++ b/mkdocs-base.yml
@@ -209,6 +209,7 @@ nav:
     - "Application and system users": users.md
     - "Exposing the cluster": expose.md
     - "Changing MySQL Options": options.md
+    - "Limit SST retries": sst-retry-limit.md
     - "Control Pod scheduling": constraints.md
     - "Labels and annotations": annotations.md
     - "Local Storage support": storage.md