Summary
Commit 0d8279e7 (first included in v2.7.1) introduced a stale znode detection check in checkReplicaAlreadyExistsAndChangeReplicationPath that breaks HA schema restores. When restoring schema on a second replica after the first replica has already been restored and registered in Keeper, the new check finds the first replica's children at the table-level ZK path and incorrectly treats them as stale leftovers, rebinding the second replica to default_replica_path.
Steps to reproduce
- Two-replica ClickHouse cluster with ReplicatedMergeTree tables
- Fresh Keeper (no prior state)
- Restore schema on replica 0 (succeeds, registers tables in Keeper):
  clickhouse-backup restore_remote --schema --rm <backup>
- Restore schema on replica 1 (tables are created under default_replica_path instead of the backup's original ZK path):
  clickhouse-backup restore_remote --schema --rm <backup>
Expected behavior
Replica 1 should create tables under the same ZK path as replica 0, joining the existing replication group (total_replicas=2).
Actual behavior
Replica 1's tables are created under a different ZK path (default_replica_path), resulting in total_replicas=1 on both nodes — a split-brain where neither replica replicates to the other.
The log shows:
zookeeper path /path/to/table still has 18 children after table drop, will rebind to fresh replica path
replica /path/to/table/replicas/replica-1 already exists in system.zookeeper will replace to /clickhouse/tables/{cluster}/{shard}/{database}/{table}/replicas/{replica}
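To find which tables were affected, the restore logs can be scanned for the first warning above. The following is a minimal sketch, assuming the message text matches the line quoted in this report; `parseRebindWarning` is a hypothetical helper, not part of clickhouse-backup.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// rebindWarning matches the warning text quoted in this report. The pattern is
// unanchored, so any prefix the logger adds (timestamp, level) is ignored.
var rebindWarning = regexp.MustCompile(
	`zookeeper path (\S+) still has (\d+) children after table drop, will rebind to fresh replica path`)

// parseRebindWarning (hypothetical helper) extracts the ZK path and the child
// count from a log line. ok is false when the line is not the rebind warning.
func parseRebindWarning(line string) (path string, children int, ok bool) {
	m := rebindWarning.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], n, true
}

func main() {
	line := "zookeeper path /path/to/table still has 18 children after table drop, will rebind to fresh replica path"
	if path, n, ok := parseRebindWarning(line); ok {
		fmt.Printf("rebound table path=%s children=%d\n", path, n)
	}
}
```

Any table path reported this way on the second replica is a candidate for having been split from its replication group.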
Root cause
In restore.go:1839-1848, after confirming that the specific replica entry does NOT exist at the ZK path, the code checks if the table-level ZK path has any children:
    if err = b.ch.SelectSingleRow(ctx, &isTablePathStale, "SELECT count() FROM system.zookeeper WHERE path=?", resolvedReplicaPath); err != nil {
        return
    }
    if isTablePathStale == 0 {
        return
    }
    log.Warn().Msgf("zookeeper path %s still has %d children after table drop, will rebind to fresh replica path", resolvedReplicaPath, isTablePathStale)
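This finds children at the table-level ZK path (log/, queue/, columns/, replicas/<replica-0>, etc.) belonging to the active first replica. The check cannot distinguish between:
- stale znodes left behind after a table drop (the case #849 fixed)
- live znodes that belong to an active replica of the same table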
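A minimal sketch of how stale leftovers and a live replication group might be told apart, offered as an illustration rather than the project's fix. It assumes the caller has already listed the entries under the table's replicas/ znode and checked each one for the ephemeral is_active child that a running ReplicatedMergeTree replica holds. The names `replicaState`, `pathVerdict` and `classifyTablePath` are hypothetical. Treating "registered but inactive" as stale is itself an assumption, since a replica may simply be down temporarily.

```go
package main

import "fmt"

// replicaState is a hypothetical snapshot of one entry under <zk_path>/replicas.
type replicaState struct {
	Name     string
	IsActive bool // true when <zk_path>/replicas/<Name>/is_active exists
}

// pathVerdict is the hypothetical outcome of inspecting the table-level path.
type pathVerdict string

const (
	pathEmpty pathVerdict = "empty" // no other replica registered
	pathLive  pathVerdict = "live"  // another replica is active: join its group
	pathStale pathVerdict = "stale" // others registered, none active: rebind candidate
)

// classifyTablePath ignores the restoring replica itself and reports whether
// the remaining entries look like a live group or like leftovers.
func classifyTablePath(replicas []replicaState, self string) pathVerdict {
	others := 0
	for _, r := range replicas {
		if r.Name == self {
			continue
		}
		others++
		if r.IsActive {
			return pathLive
		}
	}
	if others == 0 {
		return pathEmpty
	}
	return pathStale
}

func main() {
	// The scenario from this report: replica-0 is restored and running,
	// replica-1 is about to restore its schema.
	replicas := []replicaState{{Name: "replica-0", IsActive: true}}
	fmt.Println(classifyTablePath(replicas, "replica-1")) // prints: live
}
```

Under this sketch the second replica would keep the backup's original ZK path, and rebinding to default_replica_path would be reserved for the stale case.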
Workaround
Pin to v2.7.0 (before the regression). The v2.7.0 logic only checks for the specific replica entry and proceeds normally if it doesn't exist, allowing CREATE TABLE to join the existing replication group.
Environment
- clickhouse-backup v2.7.2
- ClickHouse 25.8.x
- Kubernetes, 2-replica ReplicatedMergeTree cluster
- Verified that v2.7.0 does NOT exhibit this behavior