restore --schema on second replica rebinds ZK path when first replica is active (regression in v2.7.1 via commit 0d8279e7) #1428

Description

@billboggs

Summary

Commit 0d8279e7 (first included in v2.7.1) introduced a stale znode detection check in checkReplicaAlreadyExistsAndChangeReplicationPath that breaks HA schema restores. When restoring schema on a second replica after the first replica has already been restored and registered in Keeper, the new check finds the first replica's children at the table-level ZK path and incorrectly treats them as stale leftovers, rebinding the second replica to default_replica_path.

Steps to reproduce

  1. Two-replica ClickHouse cluster with ReplicatedMergeTree tables
  2. Fresh Keeper (no prior state)
  3. Restore schema on replica 0: clickhouse-backup restore_remote --schema --rm <backup> — succeeds, registers tables in Keeper
  4. Restore schema on replica 1: clickhouse-backup restore_remote --schema --rm <backup> — tables are created under default_replica_path instead of the backup's original ZK path

Expected behavior

Replica 1 should create tables under the same ZK path as replica 0, joining the existing replication group (total_replicas=2).

Actual behavior

Replica 1's tables are created under a different ZK path (default_replica_path), resulting in total_replicas=1 on both nodes — a split-brain where neither replica replicates to the other.

The log shows:

zookeeper path /path/to/table still has 18 children after table drop, will rebind to fresh replica path
replica /path/to/table/replicas/replica-1 already exists in system.zookeeper will replace to /clickhouse/tables/{cluster}/{shard}/{database}/{table}/replicas/{replica}

Root cause

In restore.go:1839-1848, after confirming that the specific replica entry does NOT exist at the ZK path, the code checks if the table-level ZK path has any children:

if err = b.ch.SelectSingleRow(ctx, &isTablePathStale, "SELECT count() FROM system.zookeeper WHERE path=?", resolvedReplicaPath); err != nil {
    return
}
if isTablePathStale == 0 {
    return
}
log.Warn().Msgf("zookeeper path %s still has %d children after table drop, will rebind to fresh replica path", resolvedReplicaPath, isTablePathStale)

This finds children at the table-level ZK path (log/, queue/, columns/, replicas/<replica-0>, etc.) belonging to the active first replica. The check cannot distinguish between stale znodes left behind by a dropped table and live znodes that belong to another active replica.
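One possible way to separate the two cases is to look at replica liveness rather than a bare child count. The sketch below is not the project's fix and is not taken from restore.go. It assumes the caller has already read, for each entry under the table's replicas/ node, whether that replica's ephemeral is_active znode exists. The type replicaInfo and the function tablePathLooksStale are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// replicaInfo is a hypothetical summary of one entry under <table path>/replicas.
type replicaInfo struct {
	name     string
	isActive bool // true when the replica's ephemeral is_active znode is present
}

// tablePathLooksStale reports whether children found under the table-level
// path may be treated as leftovers. It returns false when the path is empty
// or when any replica other than self is currently active.
func tablePathLooksStale(childCount int, replicas []replicaInfo, self string) bool {
	if childCount == 0 {
		return false // nothing under the path, so nothing can be stale
	}
	for _, r := range replicas {
		if r.name != self && r.isActive {
			return false // a live peer owns these znodes
		}
	}
	return true
}

func main() {
	// Scenario from this issue: replica-0 is registered and active,
	// replica-1 is being restored and sees 18 children at the table path.
	peers := []replicaInfo{{name: "replica-0", isActive: true}}
	fmt.Println("stale with active peer:", tablePathLooksStale(18, peers, "replica-1"))

	// Same children, but no peer is active: plausibly leftovers.
	dead := []replicaInfo{{name: "replica-0", isActive: false}}
	fmt.Println("stale without active peer:", tablePathLooksStale(18, dead, "replica-1"))
}
```

A liveness check like this is only a heuristic: a valid replica that is temporarily offline has no is_active znode either, so a real fix would likely need a stronger signal than this sketch uses.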

Workaround

Pin to v2.7.0 (before the regression). The v2.7.0 logic only checks for the specific replica entry and proceeds normally if it doesn't exist, allowing CREATE TABLE to join the existing replication group.
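The difference between the two versions can be modelled as two small decision rules. This is a simplified reading of the behavior described in this issue, not code from clickhouse-backup. The type zkState and both functions are hypothetical, and the assumption that an existing replica entry leads to a rebind in both versions is inferred from the second log line above.

```go
package main

import "fmt"

// zkState is a hypothetical, simplified view of what the restore sees in Keeper.
type zkState struct {
	replicaEntryExists bool // does <table path>/replicas/<this replica> exist?
	tablePathChildren  int  // number of children directly under the table-level path
}

// keepsOriginalPathV270 models the pre-regression rule as described in this
// issue: only the restoring replica's own entry is checked.
func keepsOriginalPathV270(s zkState) bool {
	return !s.replicaEntryExists
}

// keepsOriginalPathV271 models the rule after commit 0d8279e7: even when the
// replica's own entry is absent, any children under the table path trigger
// a rebind to default_replica_path.
func keepsOriginalPathV271(s zkState) bool {
	if s.replicaEntryExists {
		return false
	}
	return s.tablePathChildren == 0
}

func main() {
	// Replica 1 restoring after replica 0 has registered: its own entry is
	// absent, and the table path has 18 children (the count from the log).
	second := zkState{replicaEntryExists: false, tablePathChildren: 18}
	fmt.Println("v2.7.0 keeps original path:", keepsOriginalPathV270(second))
	fmt.Println("v2.7.1 keeps original path:", keepsOriginalPathV271(second))
}
```

Under this model the two rules agree on a fresh Keeper (zero children) and diverge only once another replica has registered, which matches the reproduction steps.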

Environment

  • clickhouse-backup v2.7.2
  • ClickHouse 25.8.x
  • Kubernetes, 2-replica ReplicatedMergeTree cluster
  • Verified that v2.7.0 does NOT exhibit this behavior
