
Conversation


@fonta-rh fonta-rh commented Dec 17, 2025

On Hold until PR #1523 is merged.

NOTE
This PR is based on #1523. The only commit specific to this PR is 3295f0b.

Context
During MCE assisted-installer installation of Two Nodes with Fencing, a race condition occurs where Pacemaker tries to start etcd on a node before etcd has been brought back up as a static pod after the node's bootstrap reset, so the start fails. See more details here: https://docs.google.com/document/d/1NNXtxBiPzr93jNn2yuri1vMh79l_-6EBTFE_dy_iUWs/edit?tab=t.0#heading=h.u3jtluqz5el2

Summary
This PR adds a new check after the existing waitForEtcdBootstrapCompleted, reusing the WaitForUpdatedRevision function introduced in PR #1523 mentioned above. With this change, the handover to Pacemaker does not start until all members of the static etcd cluster are at the latest revision.
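
For illustration, a minimal sketch of the intended ordering follows (the helper names waitForEtcdBootstrapCompleted and startHandoverToPacemaker are placeholders and imports are omitted; only etcd.WaitForUpdatedRevision(ctx, operatorClient) is the real function reused from PR #1523):

// Sketch only: gate the Pacemaker handover on bootstrap completion AND on all
// static-pod etcd members having reached the latest revision.
func waitBeforePacemakerHandover(ctx context.Context, operatorClient v1helpers.StaticPodOperatorClient) error {
	// existing step: wait until the etcd bootstrap member has been removed
	if err := waitForEtcdBootstrapCompleted(ctx); err != nil { // placeholder for the existing check
		return err
	}
	// new step: reuse WaitForUpdatedRevision so every node runs the latest static etcd revision
	if err := etcd.WaitForUpdatedRevision(ctx, operatorClient); err != nil {
		return err
	}
	// only now start the handover to Pacemaker
	return startHandoverToPacemaker(ctx) // placeholder for the existing handover logic
}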

Validation
Tested manually on MCE assisted-installer for TNF. A cluster configured with both this PR and the one mentioned above completes installation successfully; WaitForUpdatedRevision handles the wait properly.

slintes and others added 8 commits December 15, 2025 19:20
For better overview, no actual changes yet

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
by Carlo

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Fixes race condition where Pacemaker tries to start etcd before
/var/lib/etcd exists. The installer creates this directory, so we
must wait for all nodes to reach the latest revision after bootstrap.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Dec 17, 2025

Walkthrough

This PR introduces Two-Node Fencing (TNF) orchestration for etcd clusters in Kubernetes/OpenShift. It adds job controllers for node authentication, cluster setup, fencing configuration, and post-setup operations. The implementation includes operator-level node handling with retry mechanisms, configuration refactoring for Pacemaker integration, and utility libraries for Job lifecycle management.

Changes

Cohort / File(s) / Summary

Kubernetes Manifest
  • bindata/tnfdeployment/job.yaml
  Added app.kubernetes.io/component: two-node-fencing-setup label to Job metadata; added MY_NODE_NAME environment variable via downward API fieldRef to tnf-job container (see the sketch after this list).

Command Setup
  • cmd/tnf-setup-runner/main.go
  Added import for tnf/update-setup; introduced NewUpdateSetupCommand() to register update-setup as a Cobra subcommand with error propagation via klog.Fatal.

Runner Implementations
  • pkg/tnf/auth/runner.go
  Replaced token-based authentication flow with single pcs.Authenticate(ctx, configClient, cfg) call; removed manual token generation and exec-based sequence; added error logging for config retrieval and auth failures.
  • pkg/tnf/setup/runner.go
  • pkg/tnf/after-setup/runner.go
  • pkg/tnf/fencing/runner.go
  • pkg/tnf/update-setup/runner.go

Operator Node Handling
  • pkg/tnf/operator/nodehandler.go
  New file introducing TNF node handler orchestration with handleNodesWithRetry, handleNodes, startTnfJobcontrollers, waitForEtcdBootstrapCompleted, updateSetup, and helper functions; adds concurrency guards via mutex; validates control plane nodes (exactly 2 required); manages TNF job lifecycle with status degradation tracking.
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/operator/starter.go
  • pkg/tnf/operator/starter_test.go

Job Management
  • pkg/tnf/pkg/jobs/tnf.go
  New module with RunTNFJobController and RestartJobOrRunController for managing TNF Job lifecycle; adds concurrency-safe state tracking via mutexes; prevents duplicate controllers; handles Job restart/deletion with coordinated waits.
  • pkg/tnf/pkg/jobs/tnf_test.go
  • pkg/tnf/pkg/jobs/utils.go
  • pkg/tnf/pkg/jobs/utils_test.go
  • pkg/tnf/pkg/jobs/jobcontroller.go

Configuration & Cluster
  • pkg/tnf/pkg/config/cluster.go
  Added public helpers GetClusterConfig and GetClusterConfigIgnoreMissingNode delegating to unexported getClusterConfig; relaxed node-count validation to allow ignoreMissingNode flag; replaced IP extraction with tools.GetNodeIPForPacemaker(node) for both nodes.
  • pkg/tnf/pkg/etcd/etcd.go

PCS Integration
  • pkg/tnf/pkg/pcs/auth.go
  New module with Authenticate(ctx, configClient, cfg) retrieving cluster ID from version API, writing token to /var/lib/pcsd/token (0600 perms), executing pcsd accept_token and PCS host authentication.
  • pkg/tnf/pkg/pcs/cluster.go
  • pkg/tnf/pkg/pcs/etcd.go
  • pkg/tnf/pkg/pcs/fencing.go
  • pkg/tnf/pkg/pcs/fencing_test.go

Node & Utilities
  • pkg/tnf/pkg/tools/nodes.go
  New module with IsNodeReady (checks NodeReady condition) and GetNodeIPForPacemaker (selects internal or fallback IP for Pacemaker config).
  • pkg/tnf/pkg/tools/nodes_test.go
  • pkg/tnf/pkg/kubelet/kubelet.go
  • pkg/tnf/pkg/tools/jobs.go

Removed Utilities
  • pkg/tnf/pkg/tools/conditions.go
  Removed file containing IsConditionTrue and IsConditionPresentAndEqual predicates; functionality migrated to pkg/tnf/pkg/jobs/utils.go.
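
As a quick illustration of the downward API wiring mentioned for bindata/tnfdeployment/job.yaml, expressed here in Go client types rather than the manifest itself (a sketch; the manifest in the PR is the source of truth):

package sketch

import corev1 "k8s.io/api/core/v1"

// myNodeNameEnv mirrors the MY_NODE_NAME variable added to the tnf-job
// container: the kubelet resolves it from the pod's spec.nodeName at runtime.
var myNodeNameEnv = corev1.EnvVar{
	Name: "MY_NODE_NAME",
	ValueFrom: &corev1.EnvVarSource{
		FieldRef: &corev1.ObjectFieldSelector{FieldPath: "spec.nodeName"},
	},
}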

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • pkg/tnf/operator/nodehandler.go: Dense orchestration logic with concurrency guards, retry mechanisms, and multi-step TNF job lifecycle; validate state transitions and error propagation paths.
  • pkg/tnf/pkg/jobs/tnf.go: Concurrency-safe state management with mutex protection and duplicate prevention; review lock acquisition order and goroutine lifecycle to prevent deadlocks.
  • pkg/tnf/pkg/pcs/fencing.go: Significant signature change (config object → node names slice) and addition of delay base differentiation; verify all callers updated and fencing delay behavior is correct.
  • pkg/tnf/operator/nodehandler_test.go and pkg/tnf/pkg/jobs/tnf_test.go: New comprehensive test coverage; verify mock injection and error scenario handling are complete.
  • Cross-module integration: Verify consistency in job type handling across tools.JobType, jobs.tnf, operator handlers, and runner implementations.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@fonta-rh fonta-rh changed the title Ocpbugs 68371 fix bootstrap race condition OCPBUGS-68371: fix bootstrap race condition Dec 17, 2025
@openshift-ci openshift-ci bot requested review from clobrano and eggfoobar December 17, 2025 10:09
@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. label Dec 17, 2025
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 17, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 17, 2025
@openshift-ci-robot

@fonta-rh: This pull request references Jira Issue OCPBUGS-68371, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

NOTE
This PR is based on #1523. The only commit specific to this PR is 3295f0b.

During MCE assisted-installer installation of Two Nodes with Fencing, a race condition occurs where Pacemaker tries to start etcd on a node before etcd has been brought back up as a static pod after the node's bootstrap reset, so the start fails. See more details here: https://docs.google.com/document/d/1NNXtxBiPzr93jNn2yuri1vMh79l_-6EBTFE_dy_iUWs/edit?tab=t.0#heading=h.u3jtluqz5el2

This PR adds a new check after the existing waitForEtcdBootstrapCompleted, reusing the WaitForUpdatedRevision function introduced in PR #1523 mentioned above. With this change, the handover to Pacemaker does not start until all members of the static etcd cluster are at the latest revision.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 17, 2025
@openshift-ci-robot

@fonta-rh: This pull request references Jira Issue OCPBUGS-68371, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @frajamomo

Details

In response to this:

NOTE
This PR is based on #1523. The only commit specific to this PR is 3295f0b.

Context
During MCE assisted-installer installation of Two Nodes with Fencing, a race condition occurs where Pacemaker tries to start etcd on a node before etcd has been brought back up as a static pod after the node's bootstrap reset, so the start fails. See more details here: https://docs.google.com/document/d/1NNXtxBiPzr93jNn2yuri1vMh79l_-6EBTFE_dy_iUWs/edit?tab=t.0#heading=h.u3jtluqz5el2

Summary
This PR adds a new check after the existing waitForEtcdBootstrapCompleted, reusing the WaitForUpdatedRevision function introduced in PR #1523 mentioned above. With this change, the handover to Pacemaker does not start until all members of the static etcd cluster are at the latest revision.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from frajamomo December 17, 2025 10:11
@fonta-rh fonta-rh marked this pull request as draft December 17, 2025 10:11
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (12)
pkg/tnf/pkg/kubelet/kubelet.go (1)

11-16: Consider checking stderr for consistency with other exec.Execute usages.

Other usages of exec.Execute in this codebase (e.g., in pkg/tnf/pkg/pcs/etcd.go) check both err and stderr:

stdOut, stdErr, err := exec.Execute(ctx, cmd)
if err != nil || len(stdErr) > 0 {
    // handle error
}

While systemctl disable typically doesn't write critical errors to stderr on success, checking stderr would maintain consistency and could catch edge cases.

pkg/tnf/pkg/pcs/auth.go (2)

49-53: The --debug flag may log sensitive authentication details.

The --debug flag on pcs host auth will produce verbose output that may include tokens or credentials in logs. Consider whether this is appropriate for production use, or if it should be conditional/removed.


19-56: Missing stderr checks for exec.Execute calls.

Similar to the pattern in pkg/tnf/pkg/pcs/etcd.go, consider checking stderr in addition to the error return to catch cases where the command succeeds but reports issues:

_, stdErr, err := exec.Execute(ctx, command)
if err != nil || len(stdErr) > 0 {
    return false, fmt.Errorf("failed to ...: err=%v, stderr=%s", err, stdErr)
}
pkg/tnf/auth/runner.go (1)

57-62: Consider using the boolean return value from Authenticate.

The pcs.Authenticate function returns (bool, error) but the boolean is discarded. Based on the relevant code snippet, this boolean indicates success. If the boolean conveys additional state (e.g., whether authentication was already done), consider logging or acting on it.

If the boolean indicates "already authenticated" or similar state, consider:

-	_, err = pcs.Authenticate(ctx, configClient, cfg)
+	authenticated, err := pcs.Authenticate(ctx, configClient, cfg)
 	if err != nil {
 		klog.Errorf("Failed to authenticate: %v", err)
 		return err
 	}
+	klog.Infof("Authentication completed: %v", authenticated)
pkg/tnf/pkg/tools/nodes.go (1)

21-43: Fallback may return non-IP address (e.g., hostname).

The fallback at line 42 returns addresses[0].Address without validation. If the first address is a hostname (e.g., NodeHostName type), this could cause issues downstream in Pacemaker configuration expecting an IP. Consider validating the fallback address or documenting this behavior.

If strict IP validation is desired for the fallback:

 	// fallback
-	return addresses[0].Address, nil
+	fallbackIP := net.ParseIP(addresses[0].Address)
+	if fallbackIP != nil {
+		return fallbackIP.String(), nil
+	}
+	return "", fmt.Errorf("node %q has no valid IP address", node.Name)
pkg/tnf/update-setup/runner.go (2)

146-160: Logic flaw: proceeds with undefined unstartedMemberID after error.

When finding the unstarted member fails (line 149), the error is only logged but execution continues. However, the unstartedMemberID variable is only defined in the else branch, so if the condition at line 149-150 is true and the code somehow falls through, you'd attempt to use an undefined variable. The current structure is correct due to the else, but the asymmetric error handling is concerning - a failure to list members is logged but swallowed, while a failure to remove a member returns an error.

Consider returning the error or at least documenting why this is intentional:

 	command = "podman exec etcd /usr/bin/etcdctl member list | grep unstarted | awk -F, '{ print $1 }'"
 	stdOut, stdErr, err = exec.Execute(ctx, command)
 	if err != nil {
 		klog.Errorf("Failed to find unstarted etcd member: %s, stdout: %s, stderr: %s, err: %v", command, stdOut, stdErr, err)
+		// Continue without removing unstarted member - this may be acceptable if none exists
 	} else {

162-164: Hardcoded sleep is fragile and undocumented.

A 10-second hardcoded sleep with a vague comment ("for some reason") indicates an unresolved timing issue. While this may work, it's brittle and could mask underlying race conditions.

Consider documenting the root cause or implementing a proper wait condition if possible.
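
One possible direction, sketched as a fragment under the assumption that the condition can be probed with etcdctl and that exec.Execute returns (stdout, stderr, error) as elsewhere in this package; the probe command and thresholds are illustrative, not the runner's actual code:

// Poll for a concrete condition instead of sleeping a fixed 10 seconds.
err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
	func(ctx context.Context) (bool, error) {
		stdOut, _, execErr := exec.Execute(ctx, "podman exec etcd /usr/bin/etcdctl member list")
		if execErr != nil {
			return false, nil // transient failure: keep polling
		}
		return !strings.Contains(stdOut, "unstarted"), nil // done once no unstarted member remains
	})
if err != nil {
	return fmt.Errorf("timed out waiting for etcd members to start: %w", err)
}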

pkg/tnf/pkg/jobs/utils_test.go (2)

154-163: errorContains field is unused in all test cases.

The errorContains field is defined in the test struct but never populated in any test case. Either remove it or add specific error message assertions.

 	tests := []struct {
 		name          string
 		setupClient   func() *fake.Clientset
 		jobName       string
 		jobNamespace  string
 		timeout       time.Duration
 		expectError   bool
-		errorContains string
 	}{

And remove the corresponding check at lines 299-301.


385-407: Test delay: The "Job deletion times out" test waits the full 1-minute hardcoded timeout in DeleteAndWait before failing.

Since DeleteAndWait has no configurable timeout parameter, consider wrapping the wait.PollUntilContextTimeout call in a testable layer or using dependency injection to allow faster timeout for tests.
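
A common Go pattern for this, sketched with hypothetical names (not the package's actual API): hoist the timeout into a package-level variable that tests can shorten and restore.

// production code (sketch): replace the literal 1-minute timeout
var deleteWaitTimeout = time.Minute

// test code (sketch): shorten it so the timeout path fails fast
func TestDeleteAndWaitTimesOutQuickly(t *testing.T) {
	orig := deleteWaitTimeout
	deleteWaitTimeout = 200 * time.Millisecond
	defer func() { deleteWaitTimeout = orig }()
	// ... exercise DeleteAndWait against a fake client whose Job never disappears ...
}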

pkg/tnf/pkg/jobs/tnf_test.go (3)

86-94: Tests mutate global state - not safe for parallel execution.

The tests directly modify package-level variables (runningControllers, restartJobLocks). This prevents running tests in parallel and could cause flaky behavior if tests are run concurrently. Consider refactoring the production code to use dependency injection or accepting this as a known limitation.


135-137: Timing-based assertion may cause flaky tests.

The 100ms sleep to "give the goroutine a moment to start" is timing-sensitive and could be flaky on slow CI systems. Consider using a more deterministic synchronization mechanism.
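
One less timing-sensitive option, sketched as a fragment (the condition shown is only a placeholder for whatever observable effect the test actually needs): poll with require.Eventually instead of a fixed sleep.

// instead of: time.Sleep(100 * time.Millisecond)
require.Eventually(t, func() bool {
	runningControllersMutex.Lock()
	defer runningControllersMutex.Unlock()
	return runningControllers[controllerKey] // placeholder: observable "controller started" signal
}, 5*time.Second, 10*time.Millisecond)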


478-484: Parallel test assertion may be flaky.

The assertion that "Delete should only be called once" relies on the locking mechanism working correctly. However, if the first goroutine completes the delete before others acquire the lock, subsequent goroutines will see jobExists=false (due to NotFound from Get) and skip deletion entirely. This is actually the expected behavior, but the comment and assertion could be clearer.

Consider clarifying the comment:

-	// Verify delete was only called once due to locking
+	// Verify delete was only called once - either due to locking or because subsequent calls see NotFound
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2e46bfd and 3295f0b.

📒 Files selected for processing (28)
  • bindata/tnfdeployment/job.yaml (2 hunks)
  • cmd/tnf-setup-runner/main.go (3 hunks)
  • pkg/tnf/after-setup/runner.go (4 hunks)
  • pkg/tnf/auth/runner.go (2 hunks)
  • pkg/tnf/fencing/runner.go (3 hunks)
  • pkg/tnf/operator/nodehandler.go (1 hunks)
  • pkg/tnf/operator/nodehandler_test.go (1 hunks)
  • pkg/tnf/operator/starter.go (3 hunks)
  • pkg/tnf/operator/starter_test.go (6 hunks)
  • pkg/tnf/pkg/config/cluster.go (4 hunks)
  • pkg/tnf/pkg/etcd/etcd.go (2 hunks)
  • pkg/tnf/pkg/jobs/jobcontroller.go (3 hunks)
  • pkg/tnf/pkg/jobs/tnf.go (1 hunks)
  • pkg/tnf/pkg/jobs/tnf_test.go (1 hunks)
  • pkg/tnf/pkg/jobs/utils.go (1 hunks)
  • pkg/tnf/pkg/jobs/utils_test.go (1 hunks)
  • pkg/tnf/pkg/kubelet/kubelet.go (1 hunks)
  • pkg/tnf/pkg/pcs/auth.go (1 hunks)
  • pkg/tnf/pkg/pcs/cluster.go (1 hunks)
  • pkg/tnf/pkg/pcs/etcd.go (1 hunks)
  • pkg/tnf/pkg/pcs/fencing.go (4 hunks)
  • pkg/tnf/pkg/pcs/fencing_test.go (6 hunks)
  • pkg/tnf/pkg/tools/conditions.go (0 hunks)
  • pkg/tnf/pkg/tools/jobs.go (3 hunks)
  • pkg/tnf/pkg/tools/nodes.go (1 hunks)
  • pkg/tnf/pkg/tools/nodes_test.go (1 hunks)
  • pkg/tnf/setup/runner.go (3 hunks)
  • pkg/tnf/update-setup/runner.go (1 hunks)
💤 Files with no reviewable changes (1)
  • pkg/tnf/pkg/tools/conditions.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/tnfdeployment/job.yaml
  • cmd/tnf-setup-runner/main.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/pkg/tools/jobs.go
  • pkg/tnf/pkg/etcd/etcd.go
  • pkg/tnf/pkg/pcs/auth.go
  • pkg/tnf/pkg/tools/nodes_test.go
  • pkg/tnf/pkg/config/cluster.go
  • pkg/tnf/pkg/tools/nodes.go
  • pkg/tnf/pkg/jobs/utils_test.go
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/pkg/pcs/fencing_test.go
  • pkg/tnf/fencing/runner.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/pkg/jobs/tnf.go
  • pkg/tnf/after-setup/runner.go
  • pkg/tnf/auth/runner.go
  • pkg/tnf/pkg/pcs/etcd.go
  • pkg/tnf/pkg/pcs/cluster.go
  • pkg/tnf/pkg/jobs/utils.go
  • pkg/tnf/operator/starter_test.go
  • pkg/tnf/setup/runner.go
  • pkg/tnf/operator/starter.go
  • pkg/tnf/pkg/jobs/tnf_test.go
  • pkg/tnf/pkg/jobs/jobcontroller.go
  • pkg/tnf/pkg/pcs/fencing.go
  • pkg/tnf/pkg/kubelet/kubelet.go
🧬 Code graph analysis (13)
pkg/tnf/pkg/etcd/etcd.go (2)
pkg/etcdcli/interfaces.go (1)
  • Status (45-47)
pkg/operator/ceohelpers/common.go (1)
  • CurrentRevision (220-238)
pkg/tnf/pkg/pcs/auth.go (2)
pkg/tnf/pkg/config/cluster.go (1)
  • ClusterConfig (15-20)
pkg/tnf/pkg/exec/exec.go (1)
  • Execute (14-47)
pkg/tnf/pkg/tools/nodes_test.go (1)
pkg/tnf/pkg/tools/nodes.go (2)
  • IsNodeReady (12-19)
  • GetNodeIPForPacemaker (24-43)
pkg/tnf/pkg/config/cluster.go (1)
pkg/tnf/pkg/tools/nodes.go (1)
  • GetNodeIPForPacemaker (24-43)
pkg/tnf/pkg/jobs/utils_test.go (2)
pkg/etcdcli/interfaces.go (1)
  • Status (45-47)
pkg/tnf/pkg/jobs/utils.go (4)
  • IsComplete (90-92)
  • IsFailed (94-96)
  • WaitForCompletion (23-26)
  • DeleteAndWait (50-84)
pkg/tnf/operator/nodehandler.go (7)
pkg/tnf/pkg/tools/nodes.go (1)
  • IsNodeReady (12-19)
pkg/tnf/pkg/etcd/etcd.go (1)
  • WaitForUpdatedRevision (69-98)
pkg/tnf/pkg/jobs/tnf.go (2)
  • RunTNFJobController (37-86)
  • RestartJobOrRunController (88-144)
pkg/tnf/pkg/jobs/jobcontroller.go (2)
  • DefaultConditions (30-30)
  • AllConditions (31-31)
pkg/operator/bootstrapteardown/waitforceo.go (1)
  • WaitForEtcdBootstrap (17-32)
pkg/tnf/pkg/jobs/utils.go (1)
  • WaitForCompletion (23-26)
pkg/operator/operatorclient/interfaces.go (1)
  • TargetNamespace (7-7)
pkg/tnf/fencing/runner.go (3)
pkg/tnf/pkg/jobs/utils.go (1)
  • IsConditionTrue (98-100)
pkg/tnf/pkg/config/cluster.go (1)
  • GetClusterConfigIgnoreMissingNode (28-30)
pkg/tnf/pkg/pcs/fencing.go (1)
  • ConfigureFencing (64-144)
pkg/tnf/after-setup/runner.go (4)
pkg/tnf/pkg/tools/jobs.go (1)
  • JobTypeSetup (28-28)
pkg/tnf/pkg/jobs/utils.go (1)
  • IsConditionTrue (98-100)
pkg/etcdcli/interfaces.go (1)
  • Status (45-47)
pkg/tnf/pkg/kubelet/kubelet.go (1)
  • Disable (11-16)
pkg/tnf/auth/runner.go (1)
pkg/tnf/pkg/pcs/auth.go (1)
  • Authenticate (20-56)
pkg/tnf/setup/runner.go (3)
pkg/tnf/pkg/tools/jobs.go (1)
  • JobTypeAuth (27-27)
pkg/tnf/pkg/jobs/utils.go (1)
  • IsConditionTrue (98-100)
pkg/tnf/pkg/pcs/fencing.go (1)
  • ConfigureFencing (64-144)
pkg/tnf/pkg/jobs/tnf_test.go (5)
pkg/tnf/pkg/tools/jobs.go (5)
  • JobType (24-24)
  • JobTypeAuth (27-27)
  • JobTypeSetup (28-28)
  • JobTypeFencing (30-30)
  • JobTypeAfterSetup (29-29)
pkg/testutils/testutils.go (1)
  • StaticPodOperatorStatus (211-222)
pkg/operator/operatorclient/interfaces.go (1)
  • TargetNamespace (7-7)
pkg/tnf/pkg/jobs/tnf.go (2)
  • RunTNFJobController (37-86)
  • RestartJobOrRunController (88-144)
pkg/tnf/pkg/jobs/jobcontroller.go (1)
  • DefaultConditions (30-30)
pkg/tnf/pkg/jobs/jobcontroller.go (1)
pkg/tnf/pkg/jobs/utils.go (2)
  • IsComplete (90-92)
  • IsFailed (94-96)
pkg/tnf/pkg/kubelet/kubelet.go (1)
pkg/tnf/pkg/exec/exec.go (1)
  • Execute (14-47)
🔇 Additional comments (47)
pkg/tnf/pkg/pcs/etcd.go (1)

25-26: LGTM!

Adding meta migration-threshold=5 is a sensible configuration to prevent excessive resource migrations due to transient failures, which helps address the race condition mentioned in the PR objectives.

pkg/tnf/pkg/pcs/fencing_test.go (1)

275-276: Test expectations align with implementation changes.

The updated test cases correctly reflect the new command format with pcmk_delay_base, explicit ssl_insecure values, and the increased wait timeout of 120 seconds.

Also applies to: 292-293, 309-310, 326-327, 344-345, 366-367

pkg/tnf/pkg/jobs/jobcontroller.go (1)

251-256: LGTM!

The refactor to use centralized IsComplete and IsFailed helpers from utils.go improves code reuse. The dereference pattern IsComplete(*job) is correct given the helper signatures accept value types.

Also applies to: 276-281, 307-308

pkg/tnf/pkg/pcs/auth.go (1)

30-34: The current implementation is correct and necessary for this use case. The exec.Execute function runs all commands through nsenter to execute them in the host's root namespace (as documented in the exec package). Using os.WriteFile would attempt to write to the container's filesystem instead of the host's /var/lib/pcsd/token path, breaking the functionality. The shell echo with %q quoting is appropriate here since the token value is a Kubernetes ClusterID (UUID format) and the command properly executes in the host context via nsenter.

Likely an incorrect or invalid review comment.

pkg/tnf/pkg/etcd/etcd.go (1)

49-66: LGTM!

The refactored waitForStaticContainerRemoved correctly delegates to WaitForUpdatedRevision and only updates the transition status after all nodes have the latest revision. This properly addresses the race condition described in the PR objectives.

pkg/tnf/pkg/pcs/fencing.go (2)

64-93: LGTM!

The iteration over nodeNames with empty entry filtering, combined with differentiated pcmk_delay_base values for fence devices, properly prevents fencing race conditions as described. The first device gets a longer delay (10s) while subsequent devices get 1s.
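
Roughly, the behavior described above looks like this (a sketch, not the actual ConfigureFencing code):

// Give the first fence device a longer static delay than the rest so the two
// nodes cannot fence each other at the same instant.
for i, nodeName := range nodeNames {
	if nodeName == "" {
		continue // tolerate a missing node name
	}
	pcmkDelayBase := "1s"
	if i == 0 {
		pcmkDelayBase = "10s"
	}
	// the real code folds pcmk_delay_base=<value> into the pcs stonith command here
	_ = pcmkDelayBase
}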


236-247: LGTM!

The stonith command construction properly includes pcmk_delay_base and explicitly sets ssl_insecure to "0" when SSL verification is enabled. The increased wait time (120s) provides adequate time for device startup verification.

pkg/tnf/pkg/pcs/cluster.go (1)

35-39: LGTM!

The addition of migration-threshold=5 and start-failure-is-fatal=false improves cluster resilience by allowing recovery from transient failures without permanent resource stoppage.

pkg/tnf/auth/runner.go (1)

50-55: LGTM!

The refactoring to use pcs.Authenticate improves code organization by centralizing the authentication logic in the pcs package.

cmd/tnf-setup-runner/main.go (1)

119-130: LGTM — The new UpdateSetup command follows the established pattern of other TNF commands, with consistent error handling via klog.Fatal. The JobTypeUpdateSetup constant is properly defined in the tools package and used consistently throughout the codebase.

bindata/tnfdeployment/job.yaml (1)

6-6: LGTM!

The label addition for component identification and the MY_NODE_NAME environment variable sourced via Downward API are correctly configured. This enables the job to be properly labeled for TNF setup orchestration and provides the container with node identity awareness needed for fencing operations.

Also applies to: 28-32

pkg/tnf/pkg/tools/nodes_test.go (2)

11-135: LGTM!

Comprehensive table-driven tests covering all expected scenarios for IsNodeReady: True/False/Unknown conditions, empty conditions, and mixed conditions. Well-structured test cases.


137-354: LGTM!

Excellent coverage for GetNodeIPForPacemaker including edge cases: IPv4/IPv6, address priority, fallback behavior, error handling for empty addresses, invalid IP skipping, and IPv4-mapped IPv6 normalization. The tests properly validate the implementation behavior.

pkg/tnf/setup/runner.go (2)

59-75: LGTM!

The auth job waiting logic is correctly updated to use the centralized jobs.IsConditionTrue helper and properly validates that exactly 2 auth jobs are complete before proceeding. The nil check combined with length validation ensures robust handling.


103-103: LGTM!

The ConfigureFencing call correctly passes node names as a slice, aligning with the updated function signature in pkg/tnf/pkg/pcs/fencing.go.

pkg/tnf/after-setup/runner.go (2)

51-68: LGTM!

The setup job polling logic correctly validates a single setup job completion using the centralized jobs.IsConditionTrue helper. The guard ensures proper handling when jobs list is nil or unexpected count.


77-81: LGTM!

Delegating kubelet disabling to kubelet.Disable(ctx) improves modularity and separates concerns. Error handling is appropriate.

pkg/tnf/pkg/tools/jobs.go (2)

13-20: LGTM!

Timeout constants are well-organized with reasonable values. AllCompletedTimeout at 30 minutes provides adequate buffer for the combined auth + setup operations.


31-31: LGTM!

JobTypeUpdateSetup follows the established enum pattern, and GetSubCommand correctly handles the new type.

Also applies to: 44-45

pkg/tnf/operator/nodehandler_test.go (2)

34-158: LGTM!

Comprehensive test table covering essential scenarios: node count validation, node readiness checks, job existence detection, and error propagation. Good coverage of the handleNodes workflow.


302-344: LGTM!

Clean helper functions for test fixtures. The ready/not-ready node helpers and TNF job helper are well-defined and reusable.

pkg/tnf/pkg/tools/nodes.go (1)

10-19: LGTM!

IsNodeReady correctly iterates through conditions and returns true only when NodeReady condition is explicitly True. Returns false for missing condition, which is the safe default.
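
For reference, the described behavior amounts to roughly the following (a sketch, not necessarily the exact implementation):

// NodeReady must be explicitly True; a missing condition counts as not ready.
func isNodeReadySketch(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}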

pkg/tnf/fencing/runner.go (3)

52-73: LGTM!

The setup job polling correctly uses jobs.IsConditionTrue and validates exactly one completed setup job. The separate nil check (line 59) and length check (line 63) are explicit but could be combined for consistency with after-setup/runner.go which uses setupJobs.Items == nil || len(setupJobs.Items) != 1.


84-84: LGTM!

Using GetClusterConfigIgnoreMissingNode is appropriate for the fencing runner, as fencing configuration may need to run when one node is unavailable.


89-89: LGTM!

The ConfigureFencing call correctly passes node names as a slice, consistent with the updated function signature.

pkg/tnf/update-setup/runner.go (1)

180-190: LGTM!

The runCommands helper is clean and provides consistent logging and error handling for sequential command execution.

pkg/tnf/pkg/jobs/utils_test.go (1)

243-287: Good test for state transitions using reactors.

The "Job transitions from running to complete" test case effectively uses reactors to simulate real-world behavior where a job completes during polling.

pkg/tnf/pkg/config/cluster.go (3)

22-30: LGTM! Clean API design.

Good separation of concerns with two public functions delegating to a private implementation with the ignoreMissingNode flag.


55-72: LGTM!

Clean iteration with proper error propagation from GetNodeIPForPacemaker.


44-49: Callers of GetClusterConfigIgnoreMissingNode already handle partial ClusterConfig correctly.

The only caller of GetClusterConfigIgnoreMissingNode (in fencing/runner.go) passes NodeName2 and NodeIP2 to ConfigureFencing, which explicitly skips empty node names with a continue statement (line 72-73 of pcs/fencing.go). All other callers use the strict GetClusterConfig that requires exactly 2 nodes. No action needed.

pkg/tnf/pkg/jobs/tnf_test.go (1)

374-476: Good concurrency test design.

The test effectively validates that parallel invocations of RestartJobOrRunController are properly synchronized using the job-specific locks.

pkg/tnf/operator/starter.go (3)

70-87: LGTM! Appropriate async handling.

Running handleNodesWithRetry in a goroutine is correct to avoid blocking the event handler. The filtering for ready nodes before triggering is sensible.


246-250: LGTM!

Delegating to RestartJobOrRunController for fencing job management is clean and centralizes the restart logic.


136-140: DeleteFunc is a no-op - consider documenting or removing the handler.

The DeleteFunc does nothing. If this is intentional (as the comment suggests), consider whether the handler is needed at all, or ensure the comment is visible to future maintainers.

pkg/tnf/operator/nodehandler.go (7)

31-51: Well-structured retry and concurrency configuration.

The backoff configuration with exponential growth and cap is appropriate for handling transient failures. The test hooks via function variables enable proper unit testing.


53-103: Solid retry and degraded-status handling.

The retry logic with exponential backoff and the operator condition updates for both success and failure paths are well-implemented. The mutex ensures single execution.

Minor observation: Line 78 checks err != nil || setupErr != nil, but when wait.ExponentialBackoffWithContext returns context.DeadlineExceeded, setupErr contains the last error, so this correctly captures all failure scenarios.
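
The retry shape being described looks roughly like this (a sketch; handleNodes and backoff stand in for the file's actual identifiers):

var setupErr error
err := wait.ExponentialBackoffWithContext(ctx, backoff, func(ctx context.Context) (bool, error) {
	setupErr = handleNodes(ctx) // placeholder call
	return setupErr == nil, nil // retry on failure, remembering the last error
})
if err != nil || setupErr != nil {
	// err covers backoff exhaustion or context cancellation; setupErr carries
	// the last underlying failure used for the Degraded condition message
}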


105-167: Proper two-node cluster validation.

The function correctly validates exactly 2 control plane nodes and ensures both are ready before proceeding. The early returns for unsupported configurations (>2 nodes) and incomplete states (<2 nodes or not ready) are appropriate.


169-212: This is the key fix for the race condition.

Lines 193-196 add WaitForUpdatedRevision after bootstrap completion, ensuring all nodes have their static pod etcd at the latest revision before handing over to Pacemaker. This directly addresses the PR objective of preventing Pacemaker from starting etcd before the node's static pod etcd is ready.


233-295: Sequential job execution with proper completion waits.

The updateSetup function properly sequences auth → update-setup → after-setup jobs, waiting for completion between phases. This ensures proper ordering during TNF updates.


297-319: LGTM - Clean utility functions.

tnfSetupJobsExist uses label selector to detect existing TNF jobs, and waitForTnfAfterSetupJobsCompletion properly waits for all after-setup jobs to complete before returning.


214-231: This error handling is appropriate for in-cluster operation. The InClusterConfig() call is standard across the codebase (setup, after-setup, update-setup, fencing, auth runners) and is expected to succeed in this context, as the TNF operator runs within the cluster. The error propagation is consistent with the codebase pattern and provides necessary context if the config cannot be loaded.

Likely an incorrect or invalid review comment.

pkg/tnf/pkg/jobs/tnf.go (1)

25-35: Proper concurrency safeguards for controller tracking.

The dual-map approach with separate mutexes (runningControllers + restartJobLocks) correctly handles:

  1. Preventing duplicate controller starts
  2. Serializing restart operations per job
pkg/tnf/operator/starter_test.go (2)

275-321: Comprehensive test coverage for retry scenarios.

The tests properly cover:

  • Success on first attempt
  • Success after transient failures
  • Permanent failure after exhausting retries

The mock function signatures correctly match the updated handleNodes signature using NodeLister.


323-385: Good test isolation with proper cleanup.

The test correctly:

  • Stores and restores original function hooks
  • Uses shortened backoff for faster test execution
  • Validates operator condition status and messages
pkg/tnf/pkg/jobs/utils.go (3)

16-47: Well-designed wait utilities with appropriate error tolerance.

The decision to ignore errors including NotFound during polling (lines 33-36) is appropriate for handling deletion/recreation cycles. The condition-based approach with IsStopped and IsComplete provides clean separation.


49-84: Robust deletion with UID-based recreation detection.

The DeleteAndWait function correctly handles:

  • Job not existing (returns nil)
  • UID comparison to detect recreation (line 82)
  • Transient errors during polling (line 79)

The 1-minute timeout for deletion polling is reasonable.
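
The UID check works roughly like this (a sketch assuming a client-go Jobs client named jobsClient; not the exact implementation):

// Capture the UID before deleting, then poll until the Job is gone or has
// been recreated with a different UID.
orig, err := jobsClient.Get(ctx, jobName, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
	return nil // nothing to delete
}
if err != nil {
	return err
}
origUID := orig.UID
if err := jobsClient.Delete(ctx, jobName, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
	return err
}
return wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true,
	func(ctx context.Context) (bool, error) {
		cur, err := jobsClient.Get(ctx, jobName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // deleted
		}
		if err != nil {
			return false, nil // transient error: keep polling
		}
		return cur.UID != origUID, nil // a new UID means it was already recreated
	})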


86-109: LGTM - Clean condition helper functions.

The helper functions follow standard patterns for checking Kubernetes conditions. IsStopped correctly combines IsComplete and IsFailed.

Comment on lines +199 to +215
// Create etcd informer
operatorClientFake := operatorversionedclientfake.NewClientset()
etcdInformers := extinfops.NewSharedInformerFactory(operatorClientFake, 10*time.Minute)
etcdIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc})
require.NoError(t, etcdIndexer.Add(&operatorv1.Etcd{
ObjectMeta: metav1.ObjectMeta{
Name: ceohelpers.InfrastructureClusterName,
},
}))
etcdInformers.Operator().V1().Etcds().Informer().AddIndexers(etcdIndexer.GetIndexers())
ctx2, cancel := context.WithCancel(context.Background())
defer cancel()
etcdInformers.Start(ctx2.Done())
synced := etcdInformers.WaitForCacheSync(ctx2.Done())
for v, ok := range synced {
require.True(t, ok, "cache failed to sync: %v", v)
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n "AddIndexers" --type=go -B2 -A2

Repository: openshift/cluster-etcd-operator

Length of output: 1408


🏁 Script executed:

rg -n "NewIndexer" --type=go -B3 -A3 | head -100

Repository: openshift/cluster-etcd-operator

Length of output: 9348


🏁 Script executed:

sed -n '199,260p' pkg/tnf/operator/nodehandler_test.go

Repository: openshift/cluster-etcd-operator

Length of output: 2421


🏁 Script executed:

rg -n "Informer\(\)\.GetStore" --type=go -B2 -A2

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

sed -n '199,300p' pkg/tnf/operator/nodehandler_test.go | grep -n "operatorClientFake\|etcdIndexer\|etcdInformers"

Repository: openshift/cluster-etcd-operator

Length of output: 667


🏁 Script executed:

rg -n "operatorClientFake" --type=go -B3 -A10 | head -150

Repository: openshift/cluster-etcd-operator

Length of output: 2823


🏁 Script executed:

rg -n "operatorversionedclientfake.NewClientset" --type=go -B2 -A15 | head -120

Repository: openshift/cluster-etcd-operator

Length of output: 3141


🏁 Script executed:

rg -n "fakeClientset\|NewClientset" --type=go -A5 | grep -E "(Create|Add|Get).*\(" | head -40

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

rg -n "\.Create\(" --type=go pkg/tnf/operator/nodehandler_test.go pkg/tnf/operator/starter_test.go

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

rg -n "\.GetStore\(\)\.Add" --type=go

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

rg -n "NewClientset.*\n.*Create.*Etcd\|Create.*Etcd.*\n.*NewClientset" --type=go -A8 | head -80

Repository: openshift/cluster-etcd-operator

Length of output: 266


🏁 Script executed:

sed -n '190,280p' pkg/tnf/operator/nodehandler_test.go | cat -n

Repository: openshift/cluster-etcd-operator

Length of output: 3861


🏁 Script executed:

rg -n "Etcd\(\).Create" --type=go -B5 -A5

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

rg -n "operatorClientFake\.Operator\|fakeOperatorClient\.Operator" --type=go -B2 -A5 | head -100

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

sed -n '195,210p' pkg/tnf/operator/nodehandler_test.go

Repository: openshift/cluster-etcd-operator

Length of output: 843


🏁 Script executed:

rg -n "func handleNodes" --type=go -A20

Repository: openshift/cluster-etcd-operator

Length of output: 3300


🏁 Script executed:

rg -n "etcdInformer" pkg/tnf/operator/nodehandler.go -B2 -A5

Repository: openshift/cluster-etcd-operator

Length of output: 2325


🏁 Script executed:

rg -n "\.Lister\(\)" pkg/tnf/operator/nodehandler.go -B3 -A3

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

sed -n '183,186p' pkg/tnf/operator/nodehandler.go

Repository: openshift/cluster-etcd-operator

Length of output: 242


🏁 Script executed:

rg -n "GetStore\|\.Lister" pkg/tnf/operator/nodehandler.go -B2 -A2

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

rg -n "\.Create\(" pkg/tnf/operator/starter_test.go -B5 -A3 | head -60

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

rg -n "startTnfJobcontrollersFunc\|startTnfJobcontrollers" pkg/tnf/operator/nodehandler.go -A30 | head -80

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

sed -n '165,200p' pkg/tnf/operator/nodehandler.go

Repository: openshift/cluster-etcd-operator

Length of output: 1515


🏁 Script executed:

rg -n "etcdInformer.*Lister\|etcdInformer.*Get" pkg/tnf/operator/nodehandler.go

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

sed -n '192,220p' pkg/tnf/operator/nodehandler_test.go | grep -A5 "nodeInformer\|etcdInformer"

Repository: openshift/cluster-etcd-operator

Length of output: 1250


🏁 Script executed:

rg -n "operatorClientFake.*Create" pkg/tnf/operator/ -B3 -A3

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

rg -n "Informer\(\)\.GetIndexer\(\)\.Add" pkg/tnf/operator/ --type=go -B2 -A2

Repository: openshift/cluster-etcd-operator

Length of output: 498


🏁 Script executed:

rg -n "operatorClientFake\|fakeOperatorClient" pkg/tnf/operator/nodehandler_test.go -B3 -A3

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

sed -n '220,250p' pkg/tnf/operator/starter_test.go

Repository: openshift/cluster-etcd-operator

Length of output: 1514


🏁 Script executed:

sed -n '225,245p' pkg/tnf/operator/starter_test.go | cat -n

Repository: openshift/cluster-etcd-operator

Length of output: 1182


Add etcd object to informer's cache, not a separate indexer.

The etcd object is added to a separate etcdIndexer but only the index definitions are transferred to the informer via AddIndexers(). The actual object remains inaccessible to the informer. Follow the same pattern as nodes (line 194): use etcdInformers.Operator().V1().Etcds().Informer().GetIndexer().Add() to add the etcd object directly to the informer's cache.
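
Concretely, the suggested change amounts to something like (sketch):

// Add the Etcd object straight into the informer's own store so that lookups
// through the informer can actually see it.
require.NoError(t, etcdInformers.Operator().V1().Etcds().Informer().GetIndexer().Add(&operatorv1.Etcd{
	ObjectMeta: metav1.ObjectMeta{
		Name: ceohelpers.InfrastructureClusterName,
	},
}))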

🤖 Prompt for AI Agents
In pkg/tnf/operator/nodehandler_test.go around lines 199 to 215, the test adds
an Etcd object to a separate etcdIndexer and only transfers index definitions to
the informer; instead add the Etcd object directly into the informer's cache so
the informer can find it at runtime. Replace the separate etcdIndexer.Add(...)
usage with a call to
etcdInformers.Operator().V1().Etcds().Informer().GetIndexer().Add(...) (adding
the same metav1.Object with Name ceohelpers.InfrastructureClusterName), keep
AddIndexers if needed for custom indexes, then start the informer and
WaitForCacheSync as before.

Comment on lines +219 to 243
+		if oldObj != nil {
+			oldSecret, ok := oldObj.(*corev1.Secret)
+			if !ok {
+				klog.Warningf("failed to convert old object to Secret %+v", oldObj)
+			}
-			klog.Errorf("failed to get fencing job, will retry: %v", err)
-			return false, nil
-		}
-		jobFound = true
-		if tools.IsConditionTrue(job.Status.Conditions, batchv1.JobComplete) || tools.IsConditionTrue(job.Status.Conditions, batchv1.JobFailed) {
-			return true, nil
+			// check if data changed
+			changed := false
+			if len(oldSecret.Data) != len(secret.Data) {
+				changed = true
+			} else {
+				for key, oldValue := range oldSecret.Data {
+					newValue, exists := secret.Data[key]
+					if !exists || !bytes.Equal(oldValue, newValue) {
+						changed = true
+						break
+					}
+				}
+			}
+			if !changed {
+				return
+			}
+			klog.Infof("handling modified fencing secret %s", secret.GetName())
+		} else {
+			klog.Infof("handling added or deleted fencing secret %s", secret.GetName())
+		}

⚠️ Potential issue | 🟠 Major

Continue using uninitialized oldSecret after failed type assertion.

If the type assertion at line 220 fails, the code logs a warning but then proceeds to use oldSecret in the comparison logic (lines 226-236). This would cause a nil pointer dereference.

 		if oldObj != nil {
 			oldSecret, ok := oldObj.(*corev1.Secret)
 			if !ok {
 				klog.Warningf("failed to convert old object to Secret %+v", oldObj)
+				// Treat as changed since we can't compare
+				klog.Infof("handling fencing secret %s (couldn't compare with old)", secret.GetName())
+			} else {
-			}
 			// check if data changed
 			changed := false
 			if len(oldSecret.Data) != len(secret.Data) {
@@ ... @@
 			if !changed {
 				return
 			}
 			klog.Infof("handling modified fencing secret %s", secret.GetName())
+			}
 		} else {

Or restructure with early return:

 		if oldObj != nil {
 			oldSecret, ok := oldObj.(*corev1.Secret)
 			if !ok {
 				klog.Warningf("failed to convert old object to Secret %+v", oldObj)
+				return
 			}
🤖 Prompt for AI Agents
In pkg/tnf/operator/starter.go around lines 219 to 243, the code logs a warning
when the old object fails to convert to *corev1.Secret but then continues to
dereference oldSecret; change the flow so you never use oldSecret when the type
assertion fails — either return early after logging the warning or wrap the
subsequent comparison logic inside the successful assertion branch (i.e., only
perform the length/value comparisons if ok is true); ensure no nil pointer
dereference paths remain and keep the existing logs for modified vs
added/deleted secrets.

Comment on lines 86 to 91
 		if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
-			klog.Infof("static etcd removed: node %s, current rev %v, latest rev %v", nodeStatus.NodeName, nodeStatus.CurrentRevision, status.LatestAvailableRevision)
+			klog.Infof("node %q is running the latest etcd revision %q", nodeStatus.NodeName, nodeStatus.CurrentRevision)
 		} else {
-			klog.Infof("static etcd not removed yet: node %s, current rev %v, latest rev %v", nodeStatus.NodeName, nodeStatus.CurrentRevision, status.LatestAvailableRevision)
-			removed = false
+			klog.Infof("node %q is not running the latest etcd revision yet, expected %q, got %q", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
+			allUpdated = false
 		}

⚠️ Potential issue | 🟡 Minor

Incorrect format specifier for integer values.

Using %q (quoted string) for int32 revision values will produce unexpected output like '\x00' instead of the numeric value.

Apply this diff to fix the format specifiers:

		if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
-			klog.Infof("node %q is running the latest etcd revision %q", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+			klog.Infof("node %q is running the latest etcd revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
		} else {
-			klog.Infof("node %q is not running the latest etcd revision yet, expected %q, got %q", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
+			klog.Infof("node %q is not running the latest etcd revision yet, expected %d, got %d", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
			allUpdated = false
		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
klog.Infof("static etcd removed: node %s, current rev %v, latest rev %v", nodeStatus.NodeName, nodeStatus.CurrentRevision, status.LatestAvailableRevision)
klog.Infof("node %q is running the latest etcd revision %q", nodeStatus.NodeName, nodeStatus.CurrentRevision)
} else {
klog.Infof("static etcd not removed yet: node %s, current rev %v, latest rev %v", nodeStatus.NodeName, nodeStatus.CurrentRevision, status.LatestAvailableRevision)
removed = false
klog.Infof("node %q is not running the latest etcd revision yet, expected %q, got %q", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
allUpdated = false
}
if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
klog.Infof("node %q is running the latest etcd revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
} else {
klog.Infof("node %q is not running the latest etcd revision yet, expected %d, got %d", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
allUpdated = false
}
🤖 Prompt for AI Agents
In pkg/tnf/pkg/etcd/etcd.go around lines 86 to 91, the log calls use the
string/quoted format specifier %q for int32 revision values; change those format
specifiers to numeric ones (e.g., %d or %v) so the revisions print as numbers.
Update both klog.Infof calls: use %d (or %v) for nodeStatus.CurrentRevision and
status.LatestAvailableRevision while keeping %q for nodeStatus.NodeName.

Comment on lines +37 to +86
func RunTNFJobController(ctx context.Context, jobType tools.JobType, nodeName *string, controllerContext *controllercmd.ControllerContext, operatorClient v1helpers.StaticPodOperatorClient, kubeClient kubernetes.Interface, kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces, conditions []string) {
nodeNameForLogs := "undefined"
if nodeName != nil {
nodeNameForLogs = *nodeName
}

// Check if a controller for this jobType and nodeName is already running
controllerKey := jobType.GetJobName(nodeName)
runningControllersMutex.Lock()
if runningControllers[controllerKey] {
runningControllersMutex.Unlock()
klog.Infof("Two Node Fencing job controller for command %q on node %q is already running, skipping duplicate start", jobType.GetSubCommand(), nodeNameForLogs)
return
}
// Mark this controller as running
runningControllers[controllerKey] = true
runningControllersMutex.Unlock()

klog.Infof("starting Two Node Fencing job controller for command %q on node %q", jobType.GetSubCommand(), nodeNameForLogs)
tnfJobController := NewJobController(
jobType.GetJobName(nodeName),
bindata.MustAsset("tnfdeployment/job.yaml"),
controllerContext.EventRecorder,
operatorClient,
kubeClient,
kubeInformersForNamespaces.InformersFor(operatorclient.TargetNamespace).Batch().V1().Jobs(),
conditions,
[]factory.Informer{},
[]JobHookFunc{
func(_ *operatorv1.OperatorSpec, job *batchv1.Job) error {
if nodeName != nil {
job.Spec.Template.Spec.NodeName = *nodeName
}
job.SetName(jobType.GetJobName(nodeName))
job.Labels["app.kubernetes.io/name"] = jobType.GetNameLabelValue()
job.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE")
job.Spec.Template.Spec.Containers[0].Command[1] = jobType.GetSubCommand()
return nil
}}...,
)
go func() {
defer func() {
runningControllersMutex.Lock()
delete(runningControllers, controllerKey)
runningControllersMutex.Unlock()
klog.Infof("Two Node Fencing job controller for command %q on node %q stopped", jobType.GetSubCommand(), nodeNameForLogs)
}()
tnfJobController.Run(ctx, 1)
}()
}

⚠️ Potential issue | 🟡 Minor

Consider potential panic if job manifest lacks expected structure.

Line 72-73 directly access job.Spec.Template.Spec.Containers[0] without bounds checking. If the YAML manifest tnfdeployment/job.yaml is malformed or empty, this will panic.

Consider adding a defensive check:

 func(_ *operatorv1.OperatorSpec, job *batchv1.Job) error {
+    if len(job.Spec.Template.Spec.Containers) == 0 {
+        return fmt.Errorf("job manifest must have at least one container")
+    }
     if nodeName != nil {
         job.Spec.Template.Spec.NodeName = *nodeName
     }
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/tnf.go around lines 37 to 86, the job hook directly indexes
job.Spec.Template.Spec.Containers[0] and Containers[0].Command[1] (and assigns
job.Labels["..."]) which can panic if the manifest is malformed or missing those
fields; add defensive validation inside the hook: verify job != nil, job.Spec !=
nil, job.Spec.Template != nil, job.Spec.Template.Spec != nil, that
len(job.Spec.Template.Spec.Containers) > 0 and
len(job.Spec.Template.Spec.Containers[0].Command) > 1, and ensure job.Labels is
non-nil (initialize map if nil); if any check fails return a descriptive error
from the hook instead of mutating, so the controller can handle the failure
without panicking.

Comment on lines +88 to +144
func RestartJobOrRunController(
ctx context.Context,
jobType tools.JobType,
nodeName *string,
controllerContext *controllercmd.ControllerContext,
operatorClient v1helpers.StaticPodOperatorClient,
kubeClient kubernetes.Interface,
kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
conditions []string,
existingJobCompletionTimeout time.Duration) error {

// Acquire a lock for this specific jobType/nodeName combination to prevent parallel execution
jobName := jobType.GetJobName(nodeName)

restartJobLocksMutex.Lock()
jobLock, exists := restartJobLocks[jobName]
if !exists {
jobLock = &sync.Mutex{}
restartJobLocks[jobName] = jobLock
}
restartJobLocksMutex.Unlock()

jobLock.Lock()
defer jobLock.Unlock()

// Check if job already exists
jobExists := true
_, err := kubeClient.BatchV1().Jobs(operatorclient.TargetNamespace).Get(ctx, jobName, v1.GetOptions{})
if err != nil {
if !apierrors.IsNotFound(err) {
return fmt.Errorf("failed to check for existing job %s: %w", jobName, err)
}
jobExists = false
}

// always try to run the controller, CEO might have been restarted
RunTNFJobController(ctx, jobType, nodeName, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, conditions)

if !jobExists {
// we are done
return nil
}

// Job exists, wait for completion
klog.Infof("Job %s already exists, waiting for being stopped", jobName)
if err := WaitForStopped(ctx, kubeClient, jobName, operatorclient.TargetNamespace, existingJobCompletionTimeout); err != nil {
return fmt.Errorf("failed to wait for update-setup job %s to complete: %w", jobName, err)
}

// Delete the job so the controller can recreate it
klog.Infof("Deleting existing job %s", jobName)
if err := DeleteAndWait(ctx, kubeClient, jobName, operatorclient.TargetNamespace); err != nil {
return fmt.Errorf("failed to delete existing update-setup job %s: %w", jobName, err)
}

return nil
}

⚠️ Potential issue | 🟡 Minor

Hardcoded "update-setup" in error messages may be misleading.

Lines 134 and 140 mention "update-setup job" in error messages, but this function handles all job types. The error messages should use the actual jobName.

 if err := WaitForStopped(ctx, kubeClient, jobName, operatorclient.TargetNamespace, existingJobCompletionTimeout); err != nil {
-    return fmt.Errorf("failed to wait for update-setup job %s to complete: %w", jobName, err)
+    return fmt.Errorf("failed to wait for job %s to stop: %w", jobName, err)
 }
 
 // Delete the job so the controller can recreate it
 klog.Infof("Deleting existing job %s", jobName)
 if err := DeleteAndWait(ctx, kubeClient, jobName, operatorclient.TargetNamespace); err != nil {
-    return fmt.Errorf("failed to delete existing update-setup job %s: %w", jobName, err)
+    return fmt.Errorf("failed to delete existing job %s: %w", jobName, err)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func RestartJobOrRunController(
	ctx context.Context,
	jobType tools.JobType,
	nodeName *string,
	controllerContext *controllercmd.ControllerContext,
	operatorClient v1helpers.StaticPodOperatorClient,
	kubeClient kubernetes.Interface,
	kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
	conditions []string,
	existingJobCompletionTimeout time.Duration) error {

	// Acquire a lock for this specific jobType/nodeName combination to prevent parallel execution
	jobName := jobType.GetJobName(nodeName)
	restartJobLocksMutex.Lock()
	jobLock, exists := restartJobLocks[jobName]
	if !exists {
		jobLock = &sync.Mutex{}
		restartJobLocks[jobName] = jobLock
	}
	restartJobLocksMutex.Unlock()

	jobLock.Lock()
	defer jobLock.Unlock()

	// Check if job already exists
	jobExists := true
	_, err := kubeClient.BatchV1().Jobs(operatorclient.TargetNamespace).Get(ctx, jobName, v1.GetOptions{})
	if err != nil {
		if !apierrors.IsNotFound(err) {
			return fmt.Errorf("failed to check for existing job %s: %w", jobName, err)
		}
		jobExists = false
	}

	// always try to run the controller, CEO might have been restarted
	RunTNFJobController(ctx, jobType, nodeName, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, conditions)

	if !jobExists {
		// we are done
		return nil
	}

	// Job exists, wait for completion
	klog.Infof("Job %s already exists, waiting for being stopped", jobName)
	if err := WaitForStopped(ctx, kubeClient, jobName, operatorclient.TargetNamespace, existingJobCompletionTimeout); err != nil {
		return fmt.Errorf("failed to wait for job %s to stop: %w", jobName, err)
	}

	// Delete the job so the controller can recreate it
	klog.Infof("Deleting existing job %s", jobName)
	if err := DeleteAndWait(ctx, kubeClient, jobName, operatorclient.TargetNamespace); err != nil {
		return fmt.Errorf("failed to delete existing job %s: %w", jobName, err)
	}

	return nil
}
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/tnf.go around lines 88 to 144, the error messages currently
hardcode "update-setup job" which is misleading for other job types; update the
fmt.Errorf calls at the WaitForStopped and DeleteAndWait error returns to use
the actual jobName variable (or a formatted message including jobName) instead
of the literal "update-setup job" so the error text correctly reflects the job
that failed.
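Separately, the per-job locking shown in the suggested function (a map of lazily created mutexes, itself guarded by a package-level mutex, keyed by job name) is a small reusable pattern. A minimal standalone sketch with illustrative names, not the project's code:

package main

import (
	"fmt"
	"sync"
)

var (
	// locksMutex guards the locks map itself; each entry serializes work for one key.
	locksMutex sync.Mutex
	locks      = map[string]*sync.Mutex{}
)

// lockFor returns the mutex for key, creating it on first use. Because the map
// is only touched while locksMutex is held, concurrent callers for the same key
// always receive the same *sync.Mutex.
func lockFor(key string) *sync.Mutex {
	locksMutex.Lock()
	defer locksMutex.Unlock()
	if l, ok := locks[key]; ok {
		return l
	}
	l := &sync.Mutex{}
	locks[key] = l
	return l
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			l := lockFor("tnf-setup") // e.g. a job name
			l.Lock()
			defer l.Unlock()
			fmt.Println("worker", i, "holds the lock for tnf-setup")
		}(i)
	}
	wg.Wait()
}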

Comment on lines +109 to +113
	err = etcd.WaitForUpdatedRevision(ctx, operatorClient)
	if err != nil {
		klog.Error(err, "Failed to wait for etcd container transition")
		return err
	}

⚠️ Potential issue | 🟡 Minor

Incorrect klog.Error usage with format arguments.

klog.Error does not support format strings. Use klog.Errorf instead.

-		klog.Error(err, "Failed to wait for etcd container transition")
+		klog.Errorf("Failed to wait for etcd container transition: %v", err)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
	err = etcd.WaitForUpdatedRevision(ctx, operatorClient)
	if err != nil {
		klog.Errorf("Failed to wait for etcd container transition: %v", err)
		return err
	}
🤖 Prompt for AI Agents
In pkg/tnf/update-setup/runner.go around lines 109 to 113, the call uses
klog.Error with format-like arguments which is incorrect; replace it with
klog.Errorf and pass a formatted string including the error (e.g., "Failed to
wait for etcd container transition: %v", err) so the error is logged correctly,
leaving the subsequent return err unchanged.
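The etcd.WaitForUpdatedRevision helper under review comes from the companion PR and is not shown in this conversation. As a rough, hypothetical illustration of the kind of condition such a wait can poll for (assuming openshift/library-go's v1helpers.StaticPodOperatorClient and the apimachinery wait package; the function name, interval, and timeout are made up):

package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"

	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

// waitForUpdatedRevisionSketch polls the static pod operator status until every
// node reports the latest available revision. Illustrative only; it is not the
// WaitForUpdatedRevision implementation referenced by this PR.
func waitForUpdatedRevisionSketch(ctx context.Context, client v1helpers.StaticPodOperatorClient) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true, func(ctx context.Context) (bool, error) {
		_, status, _, err := client.GetStaticPodOperatorState()
		if err != nil {
			// Treat read errors as transient and keep polling until the timeout.
			return false, nil
		}
		for _, ns := range status.NodeStatuses {
			if ns.CurrentRevision != status.LatestAvailableRevision {
				return false, nil
			}
		}
		return true, nil
	})
}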

Comment on lines +129 to +133
	err = pcs.ConfigureFencing(ctx, kubeClient, []string{otherNodeName, currentNodeName})
	if err != nil {
		klog.Error(err, "Failed to configure fencing, skipping update of etcd! Restart update-setup job when fencing config is fixed!")
		return err
	}

⚠️ Potential issue | 🟡 Minor

Same klog.Error format issue.

Use klog.Errorf for formatted error messages.

-		klog.Error(err, "Failed to configure fencing, skipping update of etcd! Restart update-setup job when fencing config is fixed!")
+		klog.Errorf("Failed to configure fencing, skipping update of etcd! Restart update-setup job when fencing config is fixed! Error: %v", err)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
	err = pcs.ConfigureFencing(ctx, kubeClient, []string{otherNodeName, currentNodeName})
	if err != nil {
		klog.Errorf("Failed to configure fencing, skipping update of etcd! Restart update-setup job when fencing config is fixed! Error: %v", err)
		return err
	}
🤖 Prompt for AI Agents
In pkg/tnf/update-setup/runner.go around lines 129 to 133, the code calls
klog.Error with a formatted message and an error; replace it with klog.Errorf so
the message is formatted properly. Change the block to call klog.Errorf("Failed
to configure fencing, skipping update of etcd! Restart update-setup job when
fencing config is fixed!: %v", err) and keep returning the error as before.
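Both klog comments above flag the same point. For reference, a minimal sketch of the difference, assuming the k8s.io/klog/v2 API (klog.Error concatenates its arguments fmt.Sprint-style, klog.Errorf formats fmt.Sprintf-style):

package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	err := errors.New("connection refused")

	// klog.Error handles arguments like fmt.Sprint: operands are concatenated,
	// a space is inserted only between two non-string operands, and no format
	// verbs are interpreted, so this logs "connection refusedFailed to wait ...".
	klog.Error(err, "Failed to wait for etcd container transition")

	// klog.Errorf formats like fmt.Sprintf, which is what the suggested fixes rely on.
	klog.Errorf("Failed to wait for etcd container transition: %v", err)

	klog.Flush()
}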

@openshift-ci-robot

@fonta-rh: This pull request references Jira Issue OCPBUGS-68371, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @frajamomo

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@fonta-rh fonta-rh marked this pull request as ready for review December 18, 2025 17:36
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Dec 18, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 18, 2025

@fonta-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ovn-serial-1of2
Commit: 3295f0b
Required: true
Rerun command: /test e2e-aws-ovn-serial-1of2

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@clobrano
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 19, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, fonta-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fonta-rh
Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label Dec 19, 2025
@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot

@openshift-bot: This pull request references Jira Issue OCPBUGS-68371, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @frajamomo


In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
