Skip to content

System User Password Rotation with CredentialRotation CRD#903

Draft
shunki-fujita wants to merge 16 commits into
mainfrom
credential-rotation-crd
Draft

System User Password Rotation with CredentialRotation CRD#903
shunki-fujita wants to merge 16 commits into
mainfrom
credential-rotation-crd

Conversation

@shunki-fujita

@shunki-fujita shunki-fujita commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

Close: #849

Propose a dedicated CredentialRotation CRD with its own controller as an alternative to embedding password rotation into MySQLCluster.

@shunki-fujita shunki-fujita changed the title Add design doc for CredentialRotation CRD approach System User Password Rotation with CredentialRotation CRD Mar 19, 2026
@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch 3 times, most recently from 2720d8d to 0f3bf1c Compare March 19, 2026 04:14
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 05:38

This comment was marked as resolved.

@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 05:49

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from 32efe48 to 8de6492 Compare March 19, 2026 06:08
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 06:08

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from 8de6492 to 420a50b Compare March 19, 2026 06:19
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 06:20

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from 420a50b to fa7c877 Compare March 19, 2026 06:33
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 06:34

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from fa7c877 to f75f60c Compare March 19, 2026 06:50
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 06:50

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from f75f60c to ec27644 Compare March 19, 2026 07:06
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 07:06

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from ec27644 to bb28b21 Compare March 19, 2026 07:28
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 07:28

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from bb28b21 to 423fd02 Compare March 19, 2026 07:49
@shunki-fujita shunki-fujita requested a review from Copilot March 19, 2026 07:49

This comment was marked as resolved.

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from 423fd02 to b5c6f0f Compare March 19, 2026 07:57
Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from 2332dd7 to ea2d967 Compare May 13, 2026 08:03
- Replace stale MySQLCluster controller ownerReference (different UID,
  e.g., cluster deleted and recreated) instead of requeueing forever.
- Refuse to advance to Discarding phase when cluster replicas=0;
  emit DiscardRefused Warning Event and stay in Rotated.
- Emit RotationBlocked Warning Event when Rotating is blocked by
  replicas=0, instead of silently logging.
- Migrate CredentialRotation webhook to controller-runtime's generic
  admission.Validator[*CredentialRotation] API to fix the build.
- Use Go 1.26 new(true) in place of ptr.To(true) to satisfy
  modernize/newexpr.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch 2 times, most recently from 531bef4 to ef37741 Compare May 15, 2026 04:58
- Self-heal per-namespace user/my.cnf Secrets during active rotation by
  re-applying pending passwords in Rotated/Discarding/Discarded phases.
  Fall back to current passwords during the brief Discarded→Completed
  window where pending keys have been promoted.
- Treat CR Get errors strictly: only NotFound/NoMatch are "no active
  rotation"; transient errors abort reconciliation instead of falling
  back to current passwords.
- Refuse to adopt or act on a stale CredentialRotation (ownerRef UID
  does not match the live cluster). Apply this consistently in the
  CredentialRotation reconciler, ClusterManager, MySQLClusterReconciler,
  and the kubectl-moco CLI.
- Validate CR delete: forbid mid-rotation deletes, but always allow
  garbage collection (owner NotFound, owner Terminating, stale UID).
- Make DISCARD OLD PASSWORD per-user idempotent via HasDualPassword,
  matching the existing RETAIN behavior; required because MySQL rejects
  DISCARD when no retained password remains.
- Reject discard while replicas=0 in handleStartDiscard and emit a
  RotationBlocked Warning Event when handleRotatingPhase is stalled by
  replicas=0.
- Replace spec.discardOldPassword (bool) with spec.discardGeneration
  (int64) to eliminate edge-trigger semantics and the webhook
  transition contortions it required (false->true reject, forced reset
  on rotation bump). discardGeneration is monotonic and must satisfy
  0 <= discardGeneration <= rotationGeneration; bumping it to match
  rotationGeneration triggers discard. Adds status.observedDiscardGeneration
  to mirror the existing rotation pattern, and updates the webhook to
  enforce monotonicity and the upper-bound invariant (phase guards
  remain: rotation bump requires \"\"/Completed, discard bump requires
  Rotated). kubectl-moco credential discard now patches
  discardGeneration to match rotationGeneration instead of toggling a
  bool.
- Update the design doc to match the implementation (MySQLClusterReconciler
  code, DISCARD idempotency, stale-CR handling, MySQLCluster deletion,
  validation webhook, discardGeneration semantics, GitOps-vs-CLI
  guidance).
- Add unit and envtest coverage for the new paths.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Comment thread .gitignore Outdated
@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from ef37741 to 4a125b3 Compare May 15, 2026 07:39
The dedicated ignore entry existed only to hide envtest binaries
accidentally downloaded under clustering/ when tests were run directly
without ENVTEST_ASSETS_DIR. The proper flow goes through `make
clustering-envtest`, which places assets in repo-root bin/ (already
ignored via /bin), so the per-package entry is no longer needed.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Reflect three implementation changes that had drifted from the doc:

- Retained → Rotated step 2: the restart annotation is written via
  Server-Side Apply under a dedicated field manager with ForceOwnership,
  not a merge Patch.
- passwordForDistribution: a partial pending state now surfaces as an
  error instead of silently falling back to the current password.
- StatefulSet Rolling Restart section: describe the dedicated field
  manager (moco-credential-rotation) and ForceOwnership, including the
  case where a user pre-sets the same annotation key in
  cluster.Spec.PodTemplate.Annotations.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Comment thread api/v1beta2/credentialrotation_types.go Outdated
Comment thread api/v1beta2/credentialrotation_types.go Outdated
@arosh

arosh commented May 15, 2026

Copy link
Copy Markdown
Member

@shunki-fujita
Why don't you ask to Claude Code to adhere to Kubernetes API Convention? Please the following into CLAUDE.md.

When you edit any file matching the glob `api/**/*_types.go`, you MUST comply with the
[Kubernetes API Convention](https://github.com/kubernetes/community/blob/main/contributors/devel/sig-architecture/api-conventions.md).

- Mark RotationGeneration as required with Minimum=1. The webhook
  already rejected RotationGeneration <= 0, so keeping it as an
  optional int64 with omitempty made "0" indistinguishable from
  "omitted" while neither was a valid expression of intent.
- Add Minimum=0 (and default=0 for spec) on DiscardGeneration and on
  the observed* counters in Status, so the optional/required intent is
  expressed at the CRD schema layer.
- Mark Spec as required and add explicit +optional markers on
  ObjectMeta and Status so every field is unambiguously optional or
  required.
- Constrain RotationPhase to its enum values via
  +kubebuilder:validation:Enum, matching the pattern used elsewhere
  in moco (backuppolicy, mysqlcluster, job) and surfacing the allowed
  values in the generated OpenAPI schema.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Comment thread api/v1beta2/credentialrotation_types.go Outdated
Comment thread api/v1beta2/credentialrotation_types.go Outdated
Comment thread api/v1beta2/credentialrotation_types.go Outdated
Comment thread api/v1beta2/credentialrotation_types.go
Address review feedback on api/v1beta2/credentialrotation_types.go:

- Drop omitempty from int64 generation counters so the meaningful
  zero value is preserved across PATCH and matches the Kubernetes
  API convention for optional fields with significant zero values.
- Add a UUID pattern validation to status.rotationID so the field
  shape is enforced at the API boundary.
- Add explicit +required markers on the Spec field and on
  RotationGeneration to satisfy the "fields must be either optional
  or required" rule.

Test rotationID literals in clustering/manager_test.go are updated
to valid UUIDs to satisfy the new pattern validation.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Convert the CredentialRotation API from a single status.phase enum
to three orthogonal metav1.Condition entries, aligning the new CRD
with the Kubernetes API convention that discourages phase-style
state machines.

API changes:
- Remove RotationPhase type and the Phase status field.
- Add status.conditions[] with three Types:
  Rotating (in-flight cycle, current sub-step in Reason),
  OldPasswordRetained (MySQL dual-password state),
  Ready (latest spec generations fully reconciled).
- Add ReasonXxx constants for the in-flight sub-steps
  (ApplyingRetain, DistributingPassword, AwaitingDiscard,
  WaitingForRollout, ApplyingDiscard, Finalizing,
  RotationBlocked, StalePending) and the idle reasons
  (NotStarted, Completed, RotationRefused).
- Add CurrentStep, RotatingReason, IsIdle, IsDeletable, and
  SetRotating/SetOldPasswordRetained/SetReady helpers.
- Express the spec invariant
  discardGeneration <= rotationGeneration as a CEL XValidation
  rule so the API server enforces it declaratively.
- Document every Condition Type and Reason constant in godoc so
  the OpenAPI schema (and kubectl explain) surface workflow
  semantics.

Behavioral changes:
- Validating webhook gates rotationGeneration bumps on IsIdle()
  and discardGeneration bumps on CurrentStep == AwaitingDiscard.
- ValidateDelete gains an escape hatch for RotationBlocked and
  StalePending so the documented recovery procedure (which begins
  with kubectl delete credentialrotation) succeeds.
- CredentialRotationReconciler drives the cycle via CurrentStep()
  rather than a Phase enum; the rollout wait that was previously
  inlined in handleStartDiscard is now its own WaitingForRollout
  sub-step.
- ClusterManager's clustering loop transitions Rotating and
  OldPasswordRetained together.
- MySQLClusterReconciler.reconcileV1Secret chooses current vs
  pending passwords based on the Rotating condition's Reason.
- kubectl moco credential rotate/discard use IsIdle() and
  CurrentStep() instead of Phase comparisons.
- Repeated RotationRefused / StalePending events are suppressed
  so only condition transitions emit Warning events.

CRD printer columns expose Step (Rotating.Reason), Retained
(OldPasswordRetained.Status), and Ready (Ready.Status) instead of
the single Phase column.

The design doc (docs/designdoc/credential_rotation_crd.md) is
rewritten end-to-end: state diagram, CRD example, Go type
definitions, validation webhook, kubectl UI, rotate/discard step
tables, ClusterManager and MySQLClusterReconciler code samples,
crash safety table, deletion handling, and impact summary.

Tests drive condition transitions via SetRotating /
simulateClusterManagerAdvance helpers; assertions check the
Rotating Reason via crRotatingReason. All envtest suites
(api/controller/clustering/backup) pass and make lint is clean.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Add a top-level status.observedGeneration field and watches on
MySQLCluster and the controller-namespace Secret so the rotation
loop reacts immediately to external changes instead of relying on
the 15s requeue.

API:
- Add status.observedGeneration to surface the metadata.generation
  that the controller has most recently reconciled. Standard
  observed-generation pattern that kstatus / ArgoCD / Flux read.
- New helper StampObservedGeneration; the Reconciler and the
  ClusterManager both call it immediately before every
  Status().Update so the field stays in sync with each condition
  transition.

Watches:
- Watch MySQLCluster with a predicate that only enqueues on
  Spec.Replicas or DeletionTimestamp changes. A RotationRefused /
  RotationBlocked CR now resumes the cycle the moment the cluster
  scales back up, without waiting for the periodic requeue.
- Watch Secrets in the controller's system namespace and
  reverse-parse the ControllerSecretName ("mysql-<ns>.<name>")
  back to the CredentialRotation that owns it. Operators who
  manually clean up a StalePending source Secret no longer have
  to wait for the requeue tick before the controller retries.
- mapClusterToCR / mapSourceSecretToCR are pure mapping helpers;
  clusterReplicasChangedPredicate avoids noise from unrelated
  cluster updates.

All envtest suites (api/controller/clustering/backup) pass and
make lint is clean.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Surface a couple of guardrails that kube-api-linter (KAL) flagged
on the new types:

- Add an explicit MaxLength=36 on status.rotationID. The existing
  Pattern constrains the value to a canonical UUID, but MaxLength
  is the convention KAL expects and also limits the OpenAPI/CRD
  schema explicitly.
- Promote the Conditions field's SSA tags (patchStrategy=merge,
  patchMergeKey=type) into godoc markers in addition to the
  struct tag. This matches the modern style upstream Kubernetes
  uses on metav1.Condition slices and what KAL recommends.

Remaining KAL findings on the rest of the v1beta2 API (godoc
should start with the json field name, +default vs
+kubebuilder:default, etc.) are pre-existing project-wide and
are not addressed here to avoid churn unrelated to this PR.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
@yamatcha

Copy link
Copy Markdown
Contributor

@shunki-fujita
The current Status.Conditions shape violates the Kubernetes API conventions. Conditions are supposed to be used as follows:

  • Type names are "an adjective ('Ready', 'OutOfDisk') or a past-tense verb ('Succeeded', 'Failed') rather than a present-tense verb ('Rotating ')"
  • "conditions are observations and not, themselves, state machines" — workflow steps don't belong in conditions
  • Each condition is one of the "individual conditions that clients need to monitor" — i.e. orthogonal, independent observations
  • Reason is "a programmatic identifier" — the same identifier should not mean different things in different conditions

Proposal: replace with truly independent observations

For example: RotationReady, DiscardReady, OldPasswordRetained, etc...

@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from 8f6219b to eea952b Compare May 27, 2026 05:56
Move RotationReady / DiscardReady from "generation tracking"
(observedX == specX) to "action-availability guard" semantics:

- RotationReady=True ⟺ Idle: operator may run `credential rotate`.
- DiscardReady=True  ⟺ AwaitingDiscard: operator may run `credential discard`.
- DualPassword=True  ⟺ MySQL holds a dual-password set (unchanged).

The two Ready conditions are now mutually exclusive — at most one
is True at a time — which removes the "both True" ambiguity that
made the printer columns hard to read.

Also rename OldPasswordRetained → DualPassword (noun form parallel
to MemoryPressure; the past-participle "Retained" was easily misread
as "retain completed").

Add a StepAwaitingRollout sub-step between DistributingPassword and
AwaitingDiscard so DiscardReady=True flips only after the rolling
restart has settled. The post-distribute rollout wait moves out of
ClusterManager's DISCARD handler into the Reconciler, matching the
"DiscardReady=True means discard is genuinely safe now" semantics.

Step() trusts DualPassword as the authoritative physical-state
signal in the steady-state branch so legacy CRs persisted by the
previous controller (which could carry both Ready conditions True)
still classify correctly after upgrade.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
@shunki-fujita shunki-fujita force-pushed the credential-rotation-crd branch from eea952b to c12cae9 Compare May 27, 2026 08:57
Per-namespace Secrets are populated with pending passwords during
DistributingPassword and must stay that way through every post-
distribution step. DiscardRefused/DiscardBlocked were falling through
to the default arm and reverting Secrets to current, which would
leave restarted Pods with stale credentials after DISCARD eventually
completes.

Also tighten the ObservedRotationGeneration/ObservedDiscardGeneration
godoc to state that spec/observed equality is necessary but not
sufficient for the corresponding Ready condition.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Trim from 1136 to 620 lines: drop the Go type/webhook/controller
code snippets and fold scattered prose into tables (Responsibility
Split, MySQLClusterReconciler step→password map). Keep the step
matrix, Reason values, crash-safety table, and runnable recovery
shell commands. Fix two inconsistencies surfaced during the rewrite:
the reconcileV1Secret table now lists every post-distribution step,
and the Conditions section no longer claims RotationReady/DiscardReady
are equivalent to IsIdle/IsAwaitingDiscard.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Tighten ValidateCreate so a CR can only be created in the canonical
starting state. Without this, applying a manifest with discardGeneration
already set to a non-zero value would short-circuit straight to
ApplyingDiscard, skipping AwaitingRollout and running DISCARD before
the post-distribute rolling restart settled — leaving Pods that still
held the old password without a valid credential after DISCARD.

Both fields stay +required (no kubebuilder default) so the intent is
visible in the manifest itself.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
…phase

The boolean columns were derived from generation comparisons, not from
anything actually persisted on the CR — placing them next to the three
Conditions in the same table blurred the line between stored status and
runtime-computed predicates. Collapse them into a single "Outstanding
phase" column with values rotation / discard / —, and add a paragraph
that names the column as derived and explains how it disambiguates the
three (False, False, True) rows.

Signed-off-by: shunki-fujita <shunki-fujita@cybozu.co.jp>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

System users password management

4 participants