Split master and replicas into two StatefulSets (decouple replica persistence) #178

@dragoangel

Summary

The chart currently deploys both the master and its replicas under a single StatefulSet. This couples the replicas' persistence model to the master and forces every pod to carry a PVC. I'd propose splitting this into two separate StatefulSets.

Problem

A StatefulSet has exactly one pod template and one set of volumeClaimTemplates. Every ordinal pod is therefore identical, storage included. There is no way to express "ordinal 0 is the master with a PVC, ordinals 1..N are replicas with no storage."
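To make the constraint concrete, here is a minimal sketch of a single-StatefulSet layout. It is not the chart's actual template: the resource names, labels, image tag and storage size are assumptions chosen for illustration.

```yaml
# Illustrative only: names, labels, image tag and size are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: valkey
spec:
  serviceName: valkey-headless   # assumed headless Service
  replicas: 3                    # ordinals 0, 1 and 2 all use the template below
  selector:
    matchLabels:
      app.kubernetes.io/name: valkey
  template:                      # the single pod template shared by every ordinal
    metadata:
      labels:
        app.kubernetes.io/name: valkey
    spec:
      containers:
        - name: valkey
          image: valkey/valkey:8
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:          # stamped out once per ordinal, so 3 PVCs here
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 8Gi
```

With `replicas: 3` this produces the claims `data-valkey-0`, `data-valkey-1` and `data-valkey-2`; there is no per-ordinal switch to skip the last two.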

Consequences

  • Replicas get PVCs they don't need (provisioning, attach/detach latency, storage cost — all for nothing).
  • Role handling has to be pushed into the container entrypoint: bash that inspects the pod ordinal to decide master vs replica and to set the replicaof target, plus extra complexity in Service selection, etc.

This design is fragile (which is exactly why #160 exists) and hard to follow.

Tradeoff (arguable)

The only thing the current single-STS design buys you is a unified endpoint spanning master and replicas together. In real-world production that's rarely what you want: the whole point of read replicas is to keep read traffic off an already busy master. You deliberately want separate write (master) and read (replica) endpoints rather than one combined RW service, and two StatefulSets give you that split for free.
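As a rough sketch of that split, assuming each StatefulSet labels its pods with a distinct component label, the two endpoints could be plain Services like the ones below. The Service names and label keys are hypothetical, not taken from the chart.

```yaml
# Sketch only: Service names and label keys/values are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: valkey-master            # write endpoint
spec:
  selector:
    app.kubernetes.io/name: valkey
    app.kubernetes.io/component: master
  ports:
    - name: tcp-valkey
      port: 6379
      targetPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: valkey-replica           # read endpoint
spec:
  selector:
    app.kubernetes.io/name: valkey
    app.kubernetes.io/component: replica
  ports:
    - name: tcp-valkey
      port: 6379
      targetPort: 6379
```

Because each selector matches the pods of only one StatefulSet, no ordinal-based logic is needed to route writes and reads.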

Why replicas don't need persistence

  • With reliable distributed storage (e.g. Ceph RBD, NFS, Longhorn, etc.) the master can be rescheduled onto any node and reattach its volume, so the master alone covers durability.
  • Replicas gain nothing from their own storage: on startup a replica does a full sync from the master. With repl-diskless-sync, the master forks and streams the dataset (RDB) straight from RAM over the socket, and the replica loads it via repl-diskless-load without ever touching disk.
  • Loading the dataset from network storage at startup is not faster than a diskless sync from the master's RAM. In practice it is slower on most CSI drivers, and it adds PVC attach latency on top.
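The diskless path described above might be configured roughly as follows. This is a sketch under assumptions: the ConfigMap name, the file keys and the master hostname `valkey-master` are hypothetical, while the directives themselves (repl-diskless-sync, repl-diskless-sync-delay, repl-diskless-load, replicaof, save, appendonly) are standard Valkey/Redis configuration options.

```yaml
# Sketch only: ConfigMap name, keys and the master hostname are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: valkey-config
data:
  master.conf: |
    # Stream the RDB to replicas over the socket instead of writing it to disk first.
    repl-diskless-sync yes
    # Wait a few seconds so several replicas can share one transfer.
    repl-diskless-sync-delay 5
  replica.conf: |
    # Follow the master through its (assumed) Service name.
    replicaof valkey-master 6379
    # Load the incoming RDB directly from the socket when the replica starts empty.
    repl-diskless-load on-empty-db
    # Ephemeral replica: no RDB snapshots and no AOF on local disk.
    save ""
    appendonly no
```

`on-empty-db` fits an ephemeral replica because it always starts with an empty dataset; `swapdb` is the other non-disabled option but keeps two copies of the data in memory during the load.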

So replicas are best run fully ephemeral. Of course, some users may want persistence for replicas and may not want diskless sync, but with two StatefulSets they would have that choice, which they don't have now.

Proposed design (two StatefulSets)

Master STS (replicas: 1): persistence enabled → exactly one PVC. If Sentinel is not enabled and replicas > 1 is set for the master, the chart could fail rendering with an error.
Replica STS (replicas: N, N ≥ 2): no volumeClaimTemplates, diskless sync from the master.
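A minimal sketch of what the two StatefulSets could look like is shown below. It is an illustration under assumptions, not a proposed final template: the resource names, labels, image tag, storage size and the use of an emptyDir for the replicas' data directory are all hypothetical choices, and the Services referenced by serviceName are assumed to exist.

```yaml
# Sketch only: names, labels, image tag and sizes are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: valkey-master
spec:
  serviceName: valkey-master
  replicas: 1                              # exactly one master, exactly one PVC
  selector:
    matchLabels:
      app.kubernetes.io/name: valkey
      app.kubernetes.io/component: master
  template:
    metadata:
      labels:
        app.kubernetes.io/name: valkey
        app.kubernetes.io/component: master
    spec:
      containers:
        - name: valkey
          image: valkey/valkey:8
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:                    # persistence lives only here
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 8Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: valkey-replica
spec:
  serviceName: valkey-replica
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: valkey
      app.kubernetes.io/component: replica
  template:
    metadata:
      labels:
        app.kubernetes.io/name: valkey
        app.kubernetes.io/component: replica
    spec:
      containers:
        - name: valkey
          image: valkey/valkey:8
          # Role is fixed by the StatefulSet, not derived from the pod ordinal.
          args: ["valkey-server", "--replicaof", "valkey-master", "6379"]
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}                     # no volumeClaimTemplates: storage is ephemeral
```

Users who do want persistent replicas could add volumeClaimTemplates to the second StatefulSet without touching the master.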

This is already how several alternative Valkey/Redis Helm charts model it.

Benefits

  • Replicas are stateless: faster (re)scheduling, no PVC attach/detach, lower storage cost.
  • Each STS has a single, clear purpose → far less entrypoint bash and conditional init logic.
  • Cleaner, more readable values; persistence settings live only on the master.
  • The master can still float across nodes thanks to its distributed volume (e.g. Ceph RBD).
  • Removes a footgun: with no combined RW endpoint, less experienced users can't inadvertently point read traffic at the master and overload it under heavy load. The separate write (master) and read (replica) endpoints make correct usage the default.

Important Notes

It would be useful to take this into consideration when the Valkey Operator gets its design too.
