From f5fe9acb3bbf9b590956ac118d29cf01d3b7dddd Mon Sep 17 00:00:00 2001
From: Pavel Tcholakov
Date: Fri, 30 Jan 2026 15:27:07 +0200
Subject: [PATCH 1/2] Add a snapshot storage location migration guide

---
 docs/guides/snapshot-storage-migration.mdx | 245 +++++++++++++++++++++
 1 file changed, 245 insertions(+)
 create mode 100644 docs/guides/snapshot-storage-migration.mdx

diff --git a/docs/guides/snapshot-storage-migration.mdx b/docs/guides/snapshot-storage-migration.mdx
new file mode 100644
index 00000000..2d68cf60
--- /dev/null
+++ b/docs/guides/snapshot-storage-migration.mdx
@@ -0,0 +1,245 @@
+---
+title: "Migrating Snapshot Storage Backend"
+description: "Move a multi-node Restate cluster from one snapshot destination to another."
+tags: ["deployment", "snapshots", "migration"]
+---
+
+This guide describes how to migrate a multi-node Restate cluster from one snapshot storage backend to another (for example, between different buckets/prefixes, or from MinIO to GCS).
+
+The migration temporarily increases partition replication to ensure every node hosts every partition before snapshots are disabled. This prevents trim-gap failures during rolling restarts. The migration leverages Restate's `worker.durability-mode` configuration option to prevent any log trimming during the transition, ensuring no data loss even if the old snapshots become unavailable before new ones are created.
+
+**Prerequisites:**
+- Restate server version **1.6 or later** (required for `worker.durability-mode` configuration)
+- Rolling restart capability for your cluster
+- Access to both old and new snapshot storage backends during migration
+- Capacity to temporarily run each partition on every worker node (partition replication = cluster size)
+- [`restatectl`](/server/clusters#controlling-clusters-with-restatectl) CLI configured to communicate with your cluster
+
+
+
+
+Capture the current cluster replication settings so you can restore them later:
+
+```shell
+restatectl config get
+```
+
+Note the current **Partition replication** value (for example `{node: 2}`).
+
+
+
+
+Set partition replication to your cluster size `N` (the number of worker nodes). This ensures every node has a local copy of every partition before you disable snapshots.
+
+```shell
+restatectl config set --partition-replication N
+```
+
+Do **not** use `--replication` here unless you also want to increase log replication.
+
+:::note[Why increase partition replication?]
+Without this step, when the cluster controller reconfigures partition replica sets during rolling restarts, some nodes may be unable to serve a given partition depending on their prior local partition store state, the log trim point, and available snapshots. With partition replication matching cluster size, every node will maintain a warm replica of every partition and is able to resume without the need for a new snapshot on restart, provided the log was not trimmed during its downtime.
+:::
+
+
+
+
+Wait until every partition has a replica on every node and followers have no lag:
+
+```shell
+restatectl partitions list
+```
+
+Verify that:
+- Each partition ID appears `N` times (once per node)
+- All rows show `LSN-LAG` of `0` (or consistently near `0`)
+
+For example, with 8 partitions and 3 nodes, you should see 24 rows total.
+
+
+
+
+Roll out a configuration update to disable automatic snapshots and switch to conservative durability mode:
+
+```toml restate.toml
+[worker.snapshots]
+# Disable automatic snapshots by removing/commenting destination
+# destination = "s3://old-bucket/prefix"
+
+[worker]
+# Use the strictest mode - requires BOTH replicas AND snapshots for trim
+# When snapshot destination is not set, this halts all log trimming
+durability-mode = "snapshot-and-replica-set"
+```
+
+This effectively disables both snapshotting and log trimming. The system will log a warning every 60 seconds: *"Detected cluster environment with no snapshot repository configured. Automatic log trimming is disabled..."* This is expected during the migration.
+
+Perform a rolling restart of all cluster nodes with the new configuration. Restart one node at a time, waiting for it to rejoin and partitions to become active before proceeding to the next node.
+
+:::tip[Live traffic during migration]
+With partition replication matching cluster size, rolling restarts have minimal impact on live traffic. Requests in-flight on a restarting node may fail; use [idempotency keys](/develop/ts/service-communication#idempotent-invocations) to make retries safe.
+:::
+
+
+
+
+Check the cluster status to confirm all partitions are active:
+
+```shell
+restatectl partitions list
+```
+
+You should see all partitions with the `ARCHIVED` column empty or unchanged:
+
+```
+ID  NODE  MODE    STATUS  EPOCH  APPLIED  DURABLE  ARCHIVED  LSN-LAG  UPDATED
+0   N1:1  Leader  Active  5      1234     1234     -         0        2s ago
+1   N2:1  Leader  Active  5      5678     5678     -         0        1s ago
+...
+```
+
+The `ARCHIVED` column shows `-` because no snapshot is known yet. This is expected.
+
+The applied LSN should increase over time if there is cluster activity, but the archived LSN should remain `-`.
+
+
+
+
+Roll out a configuration update with the new snapshot destination:
+
+```toml restate.toml
+[worker.snapshots]
+destination = "s3://new-bucket/prefix" # New repository
+
+[worker]
+# Use conservative settings
+durability-mode = "snapshot-and-replica-set"
+trim-delay-interval = "24h"
+```
+
+Perform a rolling restart of all cluster nodes (one at a time, verifying health between each).
+
+
+
+
+Trigger manual snapshots for all partitions to populate the new repository immediately:
+
+```shell
+restatectl snapshot create
+```
+
+You should see output confirming each partition was snapshotted:
+
+```
+Snapshot created for partition 0: snap_15GSJBOfxk3x8k1CfPwfxrb (log 0 @ LSN >= 49622035)
+Snapshot created for partition 1: snap_2xHJKLMnop4y9z2DgQwgAbc (log 1 @ LSN >= 49622040)
+...
+```
+
+
+
+
+Check that snapshots exist in the new storage backend. For S3:
+
+```shell
+aws s3 ls s3://new-bucket/prefix/ --recursive | head -20
+```
+
+Each partition should have a `latest.json` file and a snapshot directory:
+
+```
+prefix/0/latest.json
+prefix/0/lsn_00000000000000860864-snap_13yBpep1H1jKGAzHhqkmCyt/...
+prefix/1/latest.json
+prefix/1/lsn_00000000000000860870-snap_2xHJKLMnop4y9z2DgQwgAbc/...
+...
+```
+
+Confirm the archived LSN column now shows the snapshot LSN values:
+
+```shell
+restatectl partitions list
+```
+
+Expected output:
+
+```
+ID  NODE  MODE    STATUS  EPOCH  APPLIED  DURABLE  ARCHIVED  LSN-LAG  UPDATED
+0   N1:1  Leader  Active  5      1250     1250     1234      0        2s ago
+1   N2:1  Leader  Active  5      5700     5700     5678      0        1s ago
+```
+
+
+
+
+After the new snapshot repository is verified, restore the original partition replication value you recorded earlier:
+
+```shell
+restatectl config set --partition-replication
+```
+
+
+
+
+Roll out a configuration update with production settings:
+
+```toml restate.toml
+[worker]
+# Return to balanced mode (recommended for production)
+durability-mode = "balanced"
+```
+
+Perform a rolling restart of all cluster nodes (one at a time, verifying health between each).
+
+
+
+
+Check that the cluster status is healthy:
+
+```shell
+restatectl status --extra
+```
+
+All nodes should be healthy and all partitions active with no warnings.
+
+Confirm log trimming has resumed:
+
+```shell
+restatectl log list
+```
+
+The trim point should gradually increase as durability conditions are met.
+
+
+
+
+After confirming the cluster is migrated to the new snapshot backend:
+
+1. Remove old snapshots
+2. Revoke access to the old storage backend
+
+
+
+
+## Durability mode reference
+
+| Mode | Description | Use case |
+|------|-------------|----------|
+| `balanced` | Requires snapshot AND at least one replica flushed | Production default (when snapshots configured) |
+| `snapshot-and-replica-set` | Requires snapshot AND all replicas flushed | Migration phase (strictest) |
+| `snapshot-only` | Requires only snapshot, ignores replicas | Special cases |
+| `replica-set-only` | Requires all replicas flushed, ignores snapshots | Default without snapshots |
+| `none` | Disables automatic durability tracking | Testing only |
+
+## Rollback plan
+
+If you encounter issues during migration:
+
+- **Before snapshots exist in the new repository**: Restore the original configuration pointing to the old snapshot repository, perform a rolling restart, and resume normal operations.
+
+- **After snapshots exist in the new repository**: If no log trimming has occurred since the original repository was disabled, you can safely discard the new repository and revert the original configuration. If logs have been trimmed based on snapshot LSNs published to the new repository, follow the same process from the start to migrate to the original destination: 1) disable log trimming; 2) update snapshot destination; 3) create and verify snapshots; 4) re-enable log trimming.
+
+## See also
+
+- [Configuring automatic snapshotting](/server/snapshots#configuring-automatic-snapshotting)
+- [Controlling clusters with restatectl](/server/clusters#controlling-clusters-with-restatectl)

From fadd91b6e05f5382c02d2a1c797e4ea1d7412400 Mon Sep 17 00:00:00 2001
From: Pavel Tcholakov
Date: Wed, 4 Feb 2026 15:52:54 +0200
Subject: [PATCH 2/2] Improve rollback section with step-by-step guidance

---
 docs/guides/snapshot-storage-migration.mdx | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/docs/guides/snapshot-storage-migration.mdx b/docs/guides/snapshot-storage-migration.mdx
index 2d68cf60..4d0fe0a5 100644
--- a/docs/guides/snapshot-storage-migration.mdx
+++ b/docs/guides/snapshot-storage-migration.mdx
@@ -233,11 +233,31 @@ After confirming the cluster is migrated to the new snapshot backend:
 
 ## Rollback plan
 
-If you encounter issues during migration:
+If you encounter issues during migration, the rollback procedure depends on how far you've progressed:
 
-- **Before snapshots exist in the new repository**: Restore the original configuration pointing to the old snapshot repository, perform a rolling restart, and resume normal operations.
+**During steps 1-3** (before log trimming is disabled):
 
-- **After snapshots exist in the new repository**: If no log trimming has occurred since the original repository was disabled, you can safely discard the new repository and revert the original configuration. If logs have been trimmed based on snapshot LSNs published to the new repository, follow the same process from the start to migrate to the original destination: 1) disable log trimming; 2) update snapshot destination; 3) create and verify snapshots; 4) re-enable log trimming.
+No destructive changes have been made. Simply restore partition replication to the original value:
+
+```shell
+restatectl config set --partition-replication
+```
+
+**During steps 4-5** (log trimming disabled, no new snapshots yet):
+
+Restore the original configuration pointing to the old snapshot repository, perform a rolling restart, then restore partition replication:
+
+```shell
+restatectl config set --partition-replication
+```
+
+**During steps 6-8** (configuring new repository, creating snapshots):
+
+If no log trimming has occurred since the original repository was disabled, you can safely discard the new repository and revert to the original configuration. Restore partition replication after the rollback.
+
+**After step 9** (partition replication restored, normal operations):
+
+If logs have been trimmed based on snapshot LSNs published to the new repository, you must follow the same migration process to return to the original destination: disable log trimming, update snapshot destination, create and verify snapshots, then re-enable log trimming.
 
 ## See also
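
Note on the guide's replica verification step (each partition ID appears `N` times and every row shows an `LSN-LAG` of `0`): the check could be scripted rather than read by eye. The following is a minimal sketch and not part of the patch. The function name `check_partitions` is hypothetical, and it assumes the `restatectl partitions list` output has exactly the column layout shown in the guide's sample, with a numeric partition ID first and `LSN-LAG` as the ninth whitespace-separated field. If the real output differs (for example, an empty column shifts the fields), the field index would need adjusting.

```shell
#!/bin/sh
# check_partitions N   (hypothetical helper, not part of restatectl)
# Reads `restatectl partitions list` output on stdin and succeeds only if
# every partition ID appears N times and every row reports LSN-LAG 0.
# Assumption: columns match the sample in the guide, so LSN-LAG is field 9.
check_partitions() {
  awk -v n="$1" '
    $1 ~ /^[0-9]+$/ {            # data rows start with a numeric partition ID
      seen[$1]++
      if ($9 != "0") lagging++   # any non-zero (or missing) lag counts as lagging
    }
    END {
      for (id in seen) {
        parts++
        if (seen[id] != n) wrong++
      }
      printf "partitions=%d under_replicated=%d lagging_rows=%d\n", parts, wrong, lagging
      exit ((parts == 0 || wrong > 0 || lagging > 0) ? 1 : 0)
    }
  '
}
```

Possible usage on a three-node cluster: `restatectl partitions list | check_partitions 3`. The function exits non-zero while any partition is missing a replica or any row still reports lag, so it can gate the next step of a rollout script. It treats lag strictly as `0`, whereas the guide also accepts values that stay consistently near `0`.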