OCPBUGS-68371: fix bootstrap race condition #1524
base: main
Conversation
Commits (all signed off by Marc Sluiter <msluiter@redhat.com>):
- For better overview, no actual changes yet
- ... by Carlo
Fixes race condition where Pacemaker tries to start etcd before /var/lib/etcd exists. The installer creates this directory, so we must wait for all nodes to reach the latest revision after bootstrap. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Walkthrough

This PR introduces Two-Node Fencing (TNF) orchestration for etcd clusters in Kubernetes/OpenShift. It adds job controllers for node authentication, cluster setup, fencing configuration, and post-setup operations. The implementation includes operator-level node handling with retry mechanisms, configuration refactoring for Pacemaker integration, and utility libraries for Job lifecycle management.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes. Areas requiring extra attention:
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
@fonta-rh: This pull request references Jira Issue OCPBUGS-68371, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@fonta-rh: This pull request references Jira Issue OCPBUGS-68371, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 7
🧹 Nitpick comments (12)
pkg/tnf/pkg/kubelet/kubelet.go (1)
11-16: Consider checking stderr for consistency with other `exec.Execute` usages. Other usages of `exec.Execute` in this codebase (e.g., in `pkg/tnf/pkg/pcs/etcd.go`) check both `err` and `stdErr`:

```go
stdOut, stdErr, err := exec.Execute(ctx, cmd)
if err != nil || len(stdErr) > 0 {
	// handle error
}
```

While `systemctl disable` typically doesn't write critical errors to stderr on success, checking stderr would maintain consistency and could catch edge cases.

pkg/tnf/pkg/pcs/auth.go (2)

49-53: The `--debug` flag may log sensitive authentication details. The `--debug` flag on `pcs host auth` will produce verbose output that may include tokens or credentials in logs. Consider whether this is appropriate for production use, or if it should be conditional/removed.

19-56: Missing stderr checks for `exec.Execute` calls. Similar to the pattern in `pkg/tnf/pkg/pcs/etcd.go`, consider checking stderr in addition to the error return to catch cases where the command succeeds but reports issues:

```go
_, stdErr, err := exec.Execute(ctx, command)
if err != nil || len(stdErr) > 0 {
	return false, fmt.Errorf("failed to ...: err=%v, stderr=%s", err, stdErr)
}
```

pkg/tnf/auth/runner.go (1)
57-62: Consider using the boolean return value from `Authenticate`. The `pcs.Authenticate` function returns `(bool, error)` but the boolean is discarded. Based on the relevant code snippet, this boolean indicates success. If the boolean conveys additional state (e.g., whether authentication was already done), consider logging or acting on it. If the boolean indicates "already authenticated" or similar state, consider:

```diff
-	_, err = pcs.Authenticate(ctx, configClient, cfg)
+	authenticated, err := pcs.Authenticate(ctx, configClient, cfg)
 	if err != nil {
 		klog.Errorf("Failed to authenticate: %v", err)
 		return err
 	}
+	klog.Infof("Authentication completed: %v", authenticated)
```

pkg/tnf/pkg/tools/nodes.go (1)
21-43: Fallback may return a non-IP address (e.g., hostname). The fallback at line 42 returns `addresses[0].Address` without validation. If the first address is a hostname (e.g., `NodeHostName` type), this could cause issues downstream in Pacemaker configuration expecting an IP. Consider validating the fallback address or documenting this behavior. If strict IP validation is desired for the fallback:

```diff
 	// fallback
-	return addresses[0].Address, nil
+	fallbackIP := net.ParseIP(addresses[0].Address)
+	if fallbackIP != nil {
+		return fallbackIP.String(), nil
+	}
+	return "", fmt.Errorf("node %q has no valid IP address", node.Name)
```

pkg/tnf/update-setup/runner.go (2)
146-160: Logic flaw: proceeds with undefined `unstartedMemberID` after error. When finding the unstarted member fails (line 149), the error is only logged but execution continues. However, the `unstartedMemberID` variable is only defined in the `else` branch, so if the condition at lines 149-150 is true and the code somehow falls through, you'd attempt to use an undefined variable. The current structure is correct due to the `else`, but the asymmetric error handling is concerning: a failure to list members is logged but swallowed, while a failure to remove a member returns an error. Consider returning the error or at least documenting why this is intentional:

```diff
 command = "podman exec etcd /usr/bin/etcdctl member list | grep unstarted | awk -F, '{ print $1 }'"
 stdOut, stdErr, err = exec.Execute(ctx, command)
 if err != nil {
 	klog.Errorf("Failed to find unstarted etcd member: %s, stdout: %s, stderr: %s, err: %v", command, stdOut, stdErr, err)
+	// Continue without removing unstarted member - this may be acceptable if none exists
 } else {
```
162-164: Hardcoded sleep is fragile and undocumented. A 10-second hardcoded sleep with a vague comment ("for some reason") indicates an unresolved timing issue. While this may work, it's brittle and could mask underlying race conditions.
Consider documenting the root cause or implementing a proper wait condition if possible.
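If the sleep is covering for etcd membership convergence, one option is to poll for the condition instead of sleeping. A minimal sketch, assuming the same `podman exec etcd ... etcdctl member list` command shown above and an `exec.Execute`-style runner passed in as a parameter (the helper name and timings are illustrative, not the PR's actual code):

```go
package updatesetup

import (
	"context"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForNoUnstartedMember polls the etcd member list until no "unstarted"
// entry remains, replacing a fixed sleep with an explicit condition.
func waitForNoUnstartedMember(ctx context.Context, run func(ctx context.Context, cmd string) (stdOut, stdErr string, err error)) error {
	const cmd = "podman exec etcd /usr/bin/etcdctl member list"
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 1*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			stdOut, _, err := run(ctx, cmd)
			if err != nil {
				// transient failure: keep polling until the timeout expires
				return false, nil
			}
			return !strings.Contains(stdOut, "unstarted"), nil
		})
}
```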
pkg/tnf/pkg/jobs/utils_test.go (2)
154-163: `errorContains` field is unused in all test cases. The `errorContains` field is defined in the test struct but never populated in any test case. Either remove it or add specific error message assertions.

```diff
 	tests := []struct {
 		name          string
 		setupClient   func() *fake.Clientset
 		jobName       string
 		jobNamespace  string
 		timeout       time.Duration
 		expectError   bool
-		errorContains string
 	}{
```

And remove the corresponding check at lines 299-301.
385-407: Test delay: the "Job deletion times out" test waits the full 1-minute hardcoded timeout in `DeleteAndWait` before failing. Since `DeleteAndWait` has no configurable timeout parameter, consider wrapping the `wait.PollUntilContextTimeout` call in a testable layer or using dependency injection to allow a faster timeout for tests.
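One lightweight form of that injection is a package-level timeout variable that `DeleteAndWait` reads and tests override. A sketch under the assumption that the timeout is currently a literal passed to `wait.PollUntilContextTimeout` (the variable and helper names are illustrative):

```go
package jobs

import (
	"testing"
	"time"
)

// deleteWaitTimeout would replace the hard-coded 1-minute literal that
// DeleteAndWait currently passes to its polling call.
var deleteWaitTimeout = 1 * time.Minute

// withShortDeleteTimeout lets a test temporarily shrink the timeout so the
// "deletion times out" case fails fast instead of waiting a full minute.
func withShortDeleteTimeout(t *testing.T, d time.Duration) {
	t.Helper()
	orig := deleteWaitTimeout
	deleteWaitTimeout = d
	t.Cleanup(func() { deleteWaitTimeout = orig })
}
```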
86-94: Tests mutate global state and are not safe for parallel execution. The tests directly modify package-level variables (`runningControllers`, `restartJobLocks`). This prevents running tests in parallel and could cause flaky behavior if tests are run concurrently. Consider refactoring the production code to use dependency injection or accepting this as a known limitation.
135-137: Timing-based assertion may cause flaky tests. The 100ms sleep to "give the goroutine a moment to start" is timing-sensitive and could be flaky on slow CI systems. Consider using a more deterministic synchronization mechanism.
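A common alternative to a fixed sleep is to poll for the observable effect with a deadline. A sketch using testify's `require.Eventually` (the `controllerIsRunning` predicate stands in for whatever state the test actually inspects and is an assumption, not the existing test code):

```go
package jobs

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// assertControllerStarted polls instead of sleeping a fixed 100ms: it passes
// as soon as the goroutine has started and only fails after a generous
// deadline, which is friendlier to slow CI systems.
func assertControllerStarted(t *testing.T, controllerIsRunning func() bool) {
	t.Helper()
	require.Eventually(t, controllerIsRunning, 5*time.Second, 10*time.Millisecond,
		"expected the TNF job controller goroutine to start")
}
```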
478-484: Parallel test assertion may be flaky. The assertion that "Delete should only be called once" relies on the locking mechanism working correctly. However, if the first goroutine completes the delete before others acquire the lock, subsequent goroutines will see `jobExists=false` (due to NotFound from Get) and skip deletion entirely. This is actually the expected behavior, but the comment and assertion could be clearer. Consider clarifying the comment:

```diff
-	// Verify delete was only called once due to locking
+	// Verify delete was only called once - either due to locking or because subsequent calls see NotFound
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (28)
- bindata/tnfdeployment/job.yaml (2 hunks)
- cmd/tnf-setup-runner/main.go (3 hunks)
- pkg/tnf/after-setup/runner.go (4 hunks)
- pkg/tnf/auth/runner.go (2 hunks)
- pkg/tnf/fencing/runner.go (3 hunks)
- pkg/tnf/operator/nodehandler.go (1 hunks)
- pkg/tnf/operator/nodehandler_test.go (1 hunks)
- pkg/tnf/operator/starter.go (3 hunks)
- pkg/tnf/operator/starter_test.go (6 hunks)
- pkg/tnf/pkg/config/cluster.go (4 hunks)
- pkg/tnf/pkg/etcd/etcd.go (2 hunks)
- pkg/tnf/pkg/jobs/jobcontroller.go (3 hunks)
- pkg/tnf/pkg/jobs/tnf.go (1 hunks)
- pkg/tnf/pkg/jobs/tnf_test.go (1 hunks)
- pkg/tnf/pkg/jobs/utils.go (1 hunks)
- pkg/tnf/pkg/jobs/utils_test.go (1 hunks)
- pkg/tnf/pkg/kubelet/kubelet.go (1 hunks)
- pkg/tnf/pkg/pcs/auth.go (1 hunks)
- pkg/tnf/pkg/pcs/cluster.go (1 hunks)
- pkg/tnf/pkg/pcs/etcd.go (1 hunks)
- pkg/tnf/pkg/pcs/fencing.go (4 hunks)
- pkg/tnf/pkg/pcs/fencing_test.go (6 hunks)
- pkg/tnf/pkg/tools/conditions.go (0 hunks)
- pkg/tnf/pkg/tools/jobs.go (3 hunks)
- pkg/tnf/pkg/tools/nodes.go (1 hunks)
- pkg/tnf/pkg/tools/nodes_test.go (1 hunks)
- pkg/tnf/setup/runner.go (3 hunks)
- pkg/tnf/update-setup/runner.go (1 hunks)
💤 Files with no reviewable changes (1)
- pkg/tnf/pkg/tools/conditions.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/tnfdeployment/job.yaml, cmd/tnf-setup-runner/main.go, pkg/tnf/update-setup/runner.go, pkg/tnf/pkg/tools/jobs.go, pkg/tnf/pkg/etcd/etcd.go, pkg/tnf/pkg/pcs/auth.go, pkg/tnf/pkg/tools/nodes_test.go, pkg/tnf/pkg/config/cluster.go, pkg/tnf/pkg/tools/nodes.go, pkg/tnf/pkg/jobs/utils_test.go, pkg/tnf/operator/nodehandler.go, pkg/tnf/pkg/pcs/fencing_test.go, pkg/tnf/fencing/runner.go, pkg/tnf/operator/nodehandler_test.go, pkg/tnf/pkg/jobs/tnf.go, pkg/tnf/after-setup/runner.go, pkg/tnf/auth/runner.go, pkg/tnf/pkg/pcs/etcd.go, pkg/tnf/pkg/pcs/cluster.go, pkg/tnf/pkg/jobs/utils.go, pkg/tnf/operator/starter_test.go, pkg/tnf/setup/runner.go, pkg/tnf/operator/starter.go, pkg/tnf/pkg/jobs/tnf_test.go, pkg/tnf/pkg/jobs/jobcontroller.go, pkg/tnf/pkg/pcs/fencing.go, pkg/tnf/pkg/kubelet/kubelet.go
🧬 Code graph analysis (13)
- pkg/tnf/pkg/etcd/etcd.go (2)
  - pkg/etcdcli/interfaces.go (1): Status (45-47)
  - pkg/operator/ceohelpers/common.go (1): CurrentRevision (220-238)
- pkg/tnf/pkg/pcs/auth.go (2)
  - pkg/tnf/pkg/config/cluster.go (1): ClusterConfig (15-20)
  - pkg/tnf/pkg/exec/exec.go (1): Execute (14-47)
- pkg/tnf/pkg/tools/nodes_test.go (1)
  - pkg/tnf/pkg/tools/nodes.go (2): IsNodeReady (12-19), GetNodeIPForPacemaker (24-43)
- pkg/tnf/pkg/config/cluster.go (1)
  - pkg/tnf/pkg/tools/nodes.go (1): GetNodeIPForPacemaker (24-43)
- pkg/tnf/pkg/jobs/utils_test.go (2)
  - pkg/etcdcli/interfaces.go (1): Status (45-47)
  - pkg/tnf/pkg/jobs/utils.go (4): IsComplete (90-92), IsFailed (94-96), WaitForCompletion (23-26), DeleteAndWait (50-84)
- pkg/tnf/operator/nodehandler.go (7)
  - pkg/tnf/pkg/tools/nodes.go (1): IsNodeReady (12-19)
  - pkg/tnf/pkg/etcd/etcd.go (1): WaitForUpdatedRevision (69-98)
  - pkg/tnf/pkg/jobs/tnf.go (2): RunTNFJobController (37-86), RestartJobOrRunController (88-144)
  - pkg/tnf/pkg/jobs/jobcontroller.go (2): DefaultConditions (30-30), AllConditions (31-31)
  - pkg/operator/bootstrapteardown/waitforceo.go (1): WaitForEtcdBootstrap (17-32)
  - pkg/tnf/pkg/jobs/utils.go (1): WaitForCompletion (23-26)
  - pkg/operator/operatorclient/interfaces.go (1): TargetNamespace (7-7)
- pkg/tnf/fencing/runner.go (3)
  - pkg/tnf/pkg/jobs/utils.go (1): IsConditionTrue (98-100)
  - pkg/tnf/pkg/config/cluster.go (1): GetClusterConfigIgnoreMissingNode (28-30)
  - pkg/tnf/pkg/pcs/fencing.go (1): ConfigureFencing (64-144)
- pkg/tnf/after-setup/runner.go (4)
  - pkg/tnf/pkg/tools/jobs.go (1): JobTypeSetup (28-28)
  - pkg/tnf/pkg/jobs/utils.go (1): IsConditionTrue (98-100)
  - pkg/etcdcli/interfaces.go (1): Status (45-47)
  - pkg/tnf/pkg/kubelet/kubelet.go (1): Disable (11-16)
- pkg/tnf/auth/runner.go (1)
  - pkg/tnf/pkg/pcs/auth.go (1): Authenticate (20-56)
- pkg/tnf/setup/runner.go (3)
  - pkg/tnf/pkg/tools/jobs.go (1): JobTypeAuth (27-27)
  - pkg/tnf/pkg/jobs/utils.go (1): IsConditionTrue (98-100)
  - pkg/tnf/pkg/pcs/fencing.go (1): ConfigureFencing (64-144)
- pkg/tnf/pkg/jobs/tnf_test.go (5)
  - pkg/tnf/pkg/tools/jobs.go (5): JobType (24-24), JobTypeAuth (27-27), JobTypeSetup (28-28), JobTypeFencing (30-30), JobTypeAfterSetup (29-29)
  - pkg/testutils/testutils.go (1): StaticPodOperatorStatus (211-222)
  - pkg/operator/operatorclient/interfaces.go (1): TargetNamespace (7-7)
  - pkg/tnf/pkg/jobs/tnf.go (2): RunTNFJobController (37-86), RestartJobOrRunController (88-144)
  - pkg/tnf/pkg/jobs/jobcontroller.go (1): DefaultConditions (30-30)
- pkg/tnf/pkg/jobs/jobcontroller.go (1)
  - pkg/tnf/pkg/jobs/utils.go (2): IsComplete (90-92), IsFailed (94-96)
- pkg/tnf/pkg/kubelet/kubelet.go (1)
  - pkg/tnf/pkg/exec/exec.go (1): Execute (14-47)
🔇 Additional comments (47)
pkg/tnf/pkg/pcs/etcd.go (1)
25-26: LGTM! Adding `meta migration-threshold=5` is a sensible configuration to prevent excessive resource migrations due to transient failures, which helps address the race condition mentioned in the PR objectives.

pkg/tnf/pkg/pcs/fencing_test.go (1)

275-276: Test expectations align with implementation changes. The updated test cases correctly reflect the new command format with `pcmk_delay_base`, explicit `ssl_insecure` values, and the increased wait timeout of 120 seconds.

Also applies to: 292-293, 309-310, 326-327, 344-345, 366-367
pkg/tnf/pkg/jobs/jobcontroller.go (1)
251-256: LGTM! The refactor to use the centralized `IsComplete` and `IsFailed` helpers from `utils.go` improves code reuse. The dereference pattern `IsComplete(*job)` is correct given the helper signatures accept value types.

Also applies to: 276-281, 307-308
pkg/tnf/pkg/pcs/auth.go (1)
30-34: The current implementation is correct and necessary for this use case. The `exec.Execute` function runs all commands through `nsenter` to execute them in the host's root namespace (as documented in the exec package). Using `os.WriteFile` would attempt to write to the container's filesystem instead of the host's `/var/lib/pcsd/token` path, breaking the functionality. The shell `echo` with `%q` quoting is appropriate here since the token value is a Kubernetes ClusterID (UUID format) and the command properly executes in the host context via nsenter.

Likely an incorrect or invalid review comment.
pkg/tnf/pkg/etcd/etcd.go (1)
49-66: LGTM! The refactored `waitForStaticContainerRemoved` correctly delegates to `WaitForUpdatedRevision` and only updates the transition status after all nodes have the latest revision. This properly addresses the race condition described in the PR objectives.

pkg/tnf/pkg/pcs/fencing.go (2)

64-93: LGTM! The iteration over `nodeNames` with empty entry filtering, combined with differentiated `pcmk_delay_base` values for fence devices, properly prevents fencing race conditions as described. The first device gets a longer delay (10s) while subsequent devices get 1s.

236-247: LGTM! The stonith command construction properly includes `pcmk_delay_base` and explicitly sets `ssl_insecure` to `"0"` when SSL verification is enabled. The increased wait time (120s) provides adequate time for device startup verification.

pkg/tnf/pkg/pcs/cluster.go (1)

35-39: LGTM! The addition of `migration-threshold=5` and `start-failure-is-fatal=false` improves cluster resilience by allowing recovery from transient failures without permanent resource stoppage.

pkg/tnf/auth/runner.go (1)

50-55: LGTM! The refactoring to use `pcs.Authenticate` improves code organization by centralizing the authentication logic in the pcs package.

cmd/tnf-setup-runner/main.go (1)

119-130: LGTM! The new `UpdateSetup` command follows the established pattern of other TNF commands, with consistent error handling via `klog.Fatal`. The `JobTypeUpdateSetup` constant is properly defined in the tools package and used consistently throughout the codebase.

bindata/tnfdeployment/job.yaml (1)

6-6: LGTM! The label addition for component identification and the `MY_NODE_NAME` environment variable sourced via the Downward API are correctly configured. This enables the job to be properly labeled for TNF setup orchestration and provides the container with the node identity awareness needed for fencing operations.

Also applies to: 28-32
pkg/tnf/pkg/tools/nodes_test.go (2)
11-135: LGTM! Comprehensive table-driven tests covering all expected scenarios for `IsNodeReady`: True/False/Unknown conditions, empty conditions, and mixed conditions. Well-structured test cases.

137-354: LGTM! Excellent coverage for `GetNodeIPForPacemaker`, including edge cases: IPv4/IPv6, address priority, fallback behavior, error handling for empty addresses, invalid IP skipping, and IPv4-mapped IPv6 normalization. The tests properly validate the implementation behavior.

pkg/tnf/setup/runner.go (2)

59-75: LGTM! The auth job waiting logic is correctly updated to use the centralized `jobs.IsConditionTrue` helper and properly validates that exactly 2 auth jobs are complete before proceeding. The nil check combined with length validation ensures robust handling.

103-103: LGTM! The `ConfigureFencing` call correctly passes node names as a slice, aligning with the updated function signature in `pkg/tnf/pkg/pcs/fencing.go`.

pkg/tnf/after-setup/runner.go (2)

51-68: LGTM! The setup job polling logic correctly validates a single setup job completion using the centralized `jobs.IsConditionTrue` helper. The guard ensures proper handling when the jobs list is nil or has an unexpected count.

77-81: LGTM! Delegating kubelet disabling to `kubelet.Disable(ctx)` improves modularity and separates concerns. Error handling is appropriate.

pkg/tnf/pkg/tools/jobs.go (2)

13-20: LGTM! Timeout constants are well-organized with reasonable values. `AllCompletedTimeout` at 30 minutes provides adequate buffer for the combined auth + setup operations.

31-31: LGTM! `JobTypeUpdateSetup` follows the established enum pattern, and `GetSubCommand` correctly handles the new type.

Also applies to: 44-45
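For illustration, a `JobType`-to-subcommand mapping of the kind described could look roughly like this. The underlying type and the exact subcommand strings are assumptions for the sketch; the real definitions live in pkg/tnf/pkg/tools/jobs.go and may differ:

```go
package tools

// JobType enumerates the TNF job flavors; the concrete representation in the
// real code may be different (this sketch assumes an int-based enum).
type JobType int

const (
	JobTypeAuth JobType = iota
	JobTypeSetup
	JobTypeAfterSetup
	JobTypeFencing
	JobTypeUpdateSetup
)

// GetSubCommand maps a job type to the tnf-setup-runner subcommand it runs.
// The string values here are illustrative assumptions.
func (t JobType) GetSubCommand() string {
	switch t {
	case JobTypeAuth:
		return "auth"
	case JobTypeSetup:
		return "setup"
	case JobTypeAfterSetup:
		return "after-setup"
	case JobTypeFencing:
		return "fencing"
	case JobTypeUpdateSetup:
		return "update-setup"
	default:
		return ""
	}
}
```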
pkg/tnf/operator/nodehandler_test.go (2)
34-158: LGTM! Comprehensive test table covering essential scenarios: node count validation, node readiness checks, job existence detection, and error propagation. Good coverage of the handleNodes workflow.

302-344: LGTM! Clean helper functions for test fixtures. The ready/not-ready node helpers and the TNF job helper are well-defined and reusable.
pkg/tnf/pkg/tools/nodes.go (1)
10-19: LGTM!
`IsNodeReady` correctly iterates through conditions and returns `true` only when the `NodeReady` condition is explicitly `True`. It returns `false` for a missing condition, which is the safe default.

pkg/tnf/fencing/runner.go (3)

52-73: LGTM! The setup job polling correctly uses `jobs.IsConditionTrue` and validates exactly one completed setup job. The separate nil check (line 59) and length check (line 63) are explicit but could be combined for consistency with `after-setup/runner.go`, which uses `setupJobs.Items == nil || len(setupJobs.Items) != 1`.

84-84: LGTM! Using `GetClusterConfigIgnoreMissingNode` is appropriate for the fencing runner, as fencing configuration may need to run when one node is unavailable.

89-89: LGTM! The `ConfigureFencing` call correctly passes node names as a slice, consistent with the updated function signature.

pkg/tnf/update-setup/runner.go (1)

180-190: LGTM! The `runCommands` helper is clean and provides consistent logging and error handling for sequential command execution.

pkg/tnf/pkg/jobs/utils_test.go (1)

243-287: Good test for state transitions using reactors. The "Job transitions from running to complete" test case effectively uses reactors to simulate real-world behavior where a job completes during polling.
pkg/tnf/pkg/config/cluster.go (3)
22-30: LGTM! Clean API design. Good separation of concerns, with two public functions delegating to a private implementation with the `ignoreMissingNode` flag.

55-72: LGTM! Clean iteration with proper error propagation from `GetNodeIPForPacemaker`.

44-49: Callers of `GetClusterConfigIgnoreMissingNode` already handle a partial `ClusterConfig` correctly. The only caller of `GetClusterConfigIgnoreMissingNode` (in `fencing/runner.go`) passes `NodeName2` and `NodeIP2` to `ConfigureFencing`, which explicitly skips empty node names with a `continue` statement (lines 72-73 of `pcs/fencing.go`). All other callers use the strict `GetClusterConfig` that requires exactly 2 nodes. No action needed.

pkg/tnf/pkg/jobs/tnf_test.go (1)
374-476: Good concurrency test design. The test effectively validates that parallel invocations of `RestartJobOrRunController` are properly synchronized using the job-specific locks.

pkg/tnf/operator/starter.go (3)

70-87: LGTM! Appropriate async handling. Running `handleNodesWithRetry` in a goroutine is correct to avoid blocking the event handler. The filtering for ready nodes before triggering is sensible.

246-250: LGTM! Delegating to `RestartJobOrRunController` for fencing job management is clean and centralizes the restart logic.

136-140: DeleteFunc is a no-op; consider documenting or removing the handler. The `DeleteFunc` does nothing. If this is intentional (as the comment suggests), consider whether the handler is needed at all, or ensure the comment is visible to future maintainers.

pkg/tnf/operator/nodehandler.go (7)
31-51: Well-structured retry and concurrency configuration. The backoff configuration with exponential growth and a cap is appropriate for handling transient failures. The test hooks via function variables enable proper unit testing.

53-103: Solid retry and degraded-status handling. The retry logic with exponential backoff and the operator condition updates for both success and failure paths are well-implemented. The mutex ensures single execution.

Minor observation: line 78 checks `err != nil || setupErr != nil`, but when `wait.ExponentialBackoffWithContext` returns `context.DeadlineExceeded`, `setupErr` contains the last error, so this correctly captures all failure scenarios.

105-167: Proper two-node cluster validation. The function correctly validates exactly 2 control plane nodes and ensures both are ready before proceeding. The early returns for unsupported configurations (>2 nodes) and incomplete states (<2 nodes or not ready) are appropriate.

169-212: This is the key fix for the race condition. Lines 193-196 add `WaitForUpdatedRevision` after bootstrap completion, ensuring all nodes have their static-pod etcd at the latest revision before handing over to Pacemaker. This directly addresses the PR objective of preventing Pacemaker from starting etcd before the node's static-pod etcd is ready.

233-295: Sequential job execution with proper completion waits. The `updateSetup` function properly sequences auth → update-setup → after-setup jobs, waiting for completion between phases. This ensures proper ordering during TNF updates.

297-319: LGTM - Clean utility functions. `tnfSetupJobsExist` uses a label selector to detect existing TNF jobs, and `waitForTnfAfterSetupJobsCompletion` properly waits for all after-setup jobs to complete before returning.
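For illustration, a label-selector lookup like the one described could look roughly like this; the label key/value and function name are assumptions for the sketch, not the operator's exact code:

```go
package operator

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tnfJobsExist reports whether any TNF setup jobs are present in the namespace.
// The label selector value is illustrative; the operator's real label may differ.
func tnfJobsExist(ctx context.Context, client kubernetes.Interface, namespace string) (bool, error) {
	jobs, err := client.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=tnf-setup",
	})
	if err != nil {
		return false, fmt.Errorf("failed to list TNF jobs in %s: %w", namespace, err)
	}
	return len(jobs.Items) > 0, nil
}
```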
214-231: This error handling is appropriate for in-cluster operation. The `InClusterConfig()` call is standard across the codebase (setup, after-setup, update-setup, fencing, auth runners) and is expected to succeed in this context, as the TNF operator runs within the cluster. The error propagation is consistent with the codebase pattern and provides necessary context if the config cannot be loaded.

Likely an incorrect or invalid review comment.
pkg/tnf/pkg/jobs/tnf.go (1)
25-35: Proper concurrency safeguards for controller tracking. The dual-map approach with separate mutexes (`runningControllers` + `restartJobLocks`) correctly handles:
- Preventing duplicate controller starts
- Serializing restart operations per job
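A minimal sketch of that pattern, using the map and mutex names that appear in the code under review (the `lockForJob` helper is an illustrative extraction, not a function in the PR):

```go
package jobs

import "sync"

var (
	// runningControllers tracks which job controllers have been started,
	// so duplicate starts are skipped.
	runningControllers      = map[string]bool{}
	runningControllersMutex sync.Mutex

	// restartJobLocks serializes restart operations per job name.
	restartJobLocks      = map[string]*sync.Mutex{}
	restartJobLocksMutex sync.Mutex
)

// lockForJob returns the per-job mutex, creating it on first use.
func lockForJob(name string) *sync.Mutex {
	restartJobLocksMutex.Lock()
	defer restartJobLocksMutex.Unlock()
	if _, ok := restartJobLocks[name]; !ok {
		restartJobLocks[name] = &sync.Mutex{}
	}
	return restartJobLocks[name]
}
```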
pkg/tnf/operator/starter_test.go (2)
275-321: Comprehensive test coverage for retry scenarios. The tests properly cover:
- Success on first attempt
- Success after transient failures
- Permanent failure after exhausting retries
The mock function signatures correctly match the updated `handleNodes` signature using `NodeLister`.
323-385: Good test isolation with proper cleanup. The test correctly:
- Stores and restores original function hooks
- Uses shortened backoff for faster test execution
- Validates operator condition status and messages
pkg/tnf/pkg/jobs/utils.go (3)
16-47: Well-designed wait utilities with appropriate error tolerance. The decision to ignore errors including `NotFound` during polling (lines 33-36) is appropriate for handling deletion/recreation cycles. The condition-based approach with `IsStopped` and `IsComplete` provides clean separation.

49-84: Robust deletion with UID-based recreation detection. The `DeleteAndWait` function correctly handles:
- Job not existing (returns nil)
- UID comparison to detect recreation (line 82)
- Transient errors during polling (line 79)
The 1-minute timeout for deletion polling is reasonable.
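For orientation, a UID-based recreation check of the kind described could look roughly like this; it assumes the original job's UID was captured before Delete was issued, and the helper name is illustrative rather than the PR's actual code:

```go
package jobs

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// jobDeleted is a poll condition: it reports true once the named Job is gone,
// or once a Job with a different UID exists (i.e. the controller already
// recreated it, so the original one is effectively deleted).
func jobDeleted(ctx context.Context, client kubernetes.Interface, name, namespace string, originalUID types.UID) (bool, error) {
	job, err := client.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return true, nil
	}
	if err != nil {
		// transient error: keep polling
		return false, nil
	}
	return job.UID != originalUID, nil
}
```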
86-109: LGTM - Clean condition helper functions. The helper functions follow standard patterns for checking Kubernetes conditions. `IsStopped` correctly combines `IsComplete` and `IsFailed`.
Review context from pkg/tnf/operator/nodehandler_test.go:

```go
// Create etcd informer
operatorClientFake := operatorversionedclientfake.NewClientset()
etcdInformers := extinfops.NewSharedInformerFactory(operatorClientFake, 10*time.Minute)
etcdIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc})
require.NoError(t, etcdIndexer.Add(&operatorv1.Etcd{
	ObjectMeta: metav1.ObjectMeta{
		Name: ceohelpers.InfrastructureClusterName,
	},
}))
etcdInformers.Operator().V1().Etcds().Informer().AddIndexers(etcdIndexer.GetIndexers())
ctx2, cancel := context.WithCancel(context.Background())
defer cancel()
etcdInformers.Start(ctx2.Done())
synced := etcdInformers.WaitForCacheSync(ctx2.Done())
for v, ok := range synced {
	require.True(t, ok, "cache failed to sync: %v", v)
}
```
🧩 Analysis chain (collapsed): a series of `rg` and `sed` verification scripts was run against the openshift/cluster-etcd-operator repository to inspect how the test wires the etcd indexer, informer factory, and fake operator client, and how `handleNodes` consumes the etcd informer.
Add etcd object to informer's cache, not a separate indexer.
The etcd object is added to a separate etcdIndexer but only the index definitions are transferred to the informer via AddIndexers(). The actual object remains inaccessible to the informer. Follow the same pattern as nodes (line 194): use etcdInformers.Operator().V1().Etcds().Informer().GetIndexer().Add() to add the etcd object directly to the informer's cache.
🤖 Prompt for AI Agents
In pkg/tnf/operator/nodehandler_test.go around lines 199 to 215, the test adds
an Etcd object to a separate etcdIndexer and only transfers index definitions to
the informer; instead add the Etcd object directly into the informer's cache so
the informer can find it at runtime. Replace the separate etcdIndexer.Add(...)
usage with a call to
etcdInformers.Operator().V1().Etcds().Informer().GetIndexer().Add(...) (adding
the same metav1.Object with Name ceohelpers.InfrastructureClusterName), keep
AddIndexers if needed for custom indexes, then start the informer and
WaitForCacheSync as before.
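A sketch of the suggested test wiring, as a fragment that slots into the existing test shown above (it reuses that test's `etcdInformers`, `require`, and `ceohelpers` identifiers; nothing here is new API beyond client-go's standard `GetIndexer().Add`):

```go
// Add the Etcd object directly to the informer's cache so runtime lookups
// succeed, instead of populating a separate indexer that the informer never sees.
etcdInformer := etcdInformers.Operator().V1().Etcds().Informer()
require.NoError(t, etcdInformer.GetIndexer().Add(&operatorv1.Etcd{
	ObjectMeta: metav1.ObjectMeta{
		Name: ceohelpers.InfrastructureClusterName,
	},
}))
```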
Review context from pkg/tnf/operator/starter.go (GitHub diff rendering; old and new lines are interleaved):

```go
if oldObj != nil {
	oldSecret, ok := oldObj.(*corev1.Secret)
	if !ok {
		klog.Warningf("failed to convert old object to Secret %+v", oldObj)
	}
	klog.Errorf("failed to get fencing job, will retry: %v", err)
	return false, nil
}
jobFound = true
if tools.IsConditionTrue(job.Status.Conditions, batchv1.JobComplete) || tools.IsConditionTrue(job.Status.Conditions, batchv1.JobFailed) {
	return true, nil
	// check if data changed
	changed := false
	if len(oldSecret.Data) != len(secret.Data) {
		changed = true
	} else {
		for key, oldValue := range oldSecret.Data {
			newValue, exists := secret.Data[key]
			if !exists || !bytes.Equal(oldValue, newValue) {
				changed = true
				break
			}
		}
	}
	if !changed {
		return
	}
	klog.Infof("handling modified fencing secret %s", secret.GetName())
} else {
	klog.Infof("handling added or deleted fencing secret %s", secret.GetName())
}
```
Continue using uninitialized oldSecret after failed type assertion.
If the type assertion at line 220 fails, the code logs a warning but then proceeds to use oldSecret in the comparison logic (lines 226-236). This would cause a nil pointer dereference.
```diff
 if oldObj != nil {
 	oldSecret, ok := oldObj.(*corev1.Secret)
 	if !ok {
 		klog.Warningf("failed to convert old object to Secret %+v", oldObj)
+		// Treat as changed since we can't compare
+		klog.Infof("handling fencing secret %s (couldn't compare with old)", secret.GetName())
+	} else {
-	}
 	// check if data changed
 	changed := false
 	if len(oldSecret.Data) != len(secret.Data) {
@@ ... @@
 	if !changed {
 		return
 	}
 	klog.Infof("handling modified fencing secret %s", secret.GetName())
+	}
 } else {
```

Or restructure with an early return:

```diff
 if oldObj != nil {
 	oldSecret, ok := oldObj.(*corev1.Secret)
 	if !ok {
 		klog.Warningf("failed to convert old object to Secret %+v", oldObj)
+		return
 	}
```
In pkg/tnf/operator/starter.go around lines 219 to 243, the code logs a warning
when the old object fails to convert to *corev1.Secret but then continues to
dereference oldSecret; change the flow so you never use oldSecret when the type
assertion fails — either return early after logging the warning or wrap the
subsequent comparison logic inside the successful assertion branch (i.e., only
perform the length/value comparisons if ok is true); ensure no nil pointer
dereference paths remain and keep the existing logs for modified vs
added/deleted secrets.
Review context from pkg/tnf/pkg/etcd/etcd.go (GitHub diff rendering; old and new lines are interleaved):

```go
if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
	klog.Infof("static etcd removed: node %s, current rev %v, latest rev %v", nodeStatus.NodeName, nodeStatus.CurrentRevision, status.LatestAvailableRevision)
	klog.Infof("node %q is running the latest etcd revision %q", nodeStatus.NodeName, nodeStatus.CurrentRevision)
} else {
	klog.Infof("static etcd not removed yet: node %s, current rev %v, latest rev %v", nodeStatus.NodeName, nodeStatus.CurrentRevision, status.LatestAvailableRevision)
	removed = false
	klog.Infof("node %q is not running the latest etcd revision yet, expected %q, got %q", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
	allUpdated = false
}
```
Incorrect format specifier for integer values.
Using %q (quoted string) for int32 revision values will produce unexpected output like '\x00' instead of the numeric value.
Apply this diff to fix the format specifiers:
```diff
 if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
-	klog.Infof("node %q is running the latest etcd revision %q", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+	klog.Infof("node %q is running the latest etcd revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
 } else {
-	klog.Infof("node %q is not running the latest etcd revision yet, expected %q, got %q", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
+	klog.Infof("node %q is not running the latest etcd revision yet, expected %d, got %d", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
 	allUpdated = false
 }
```
🤖 Prompt for AI Agents
In pkg/tnf/pkg/etcd/etcd.go around lines 86 to 91, the log calls use the
string/quoted format specifier %q for int32 revision values; change those format
specifiers to numeric ones (e.g., %d or %v) so the revisions print as numbers.
Update both klog.Infof calls: use %d (or %v) for nodeStatus.CurrentRevision and
status.LatestAvailableRevision while keeping %q for nodeStatus.NodeName.
Review context from pkg/tnf/pkg/jobs/tnf.go:

```go
func RunTNFJobController(ctx context.Context, jobType tools.JobType, nodeName *string, controllerContext *controllercmd.ControllerContext, operatorClient v1helpers.StaticPodOperatorClient, kubeClient kubernetes.Interface, kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces, conditions []string) {
	nodeNameForLogs := "undefined"
	if nodeName != nil {
		nodeNameForLogs = *nodeName
	}

	// Check if a controller for this jobType and nodeName is already running
	controllerKey := jobType.GetJobName(nodeName)
	runningControllersMutex.Lock()
	if runningControllers[controllerKey] {
		runningControllersMutex.Unlock()
		klog.Infof("Two Node Fencing job controller for command %q on node %q is already running, skipping duplicate start", jobType.GetSubCommand(), nodeNameForLogs)
		return
	}
	// Mark this controller as running
	runningControllers[controllerKey] = true
	runningControllersMutex.Unlock()

	klog.Infof("starting Two Node Fencing job controller for command %q on node %q", jobType.GetSubCommand(), nodeNameForLogs)
	tnfJobController := NewJobController(
		jobType.GetJobName(nodeName),
		bindata.MustAsset("tnfdeployment/job.yaml"),
		controllerContext.EventRecorder,
		operatorClient,
		kubeClient,
		kubeInformersForNamespaces.InformersFor(operatorclient.TargetNamespace).Batch().V1().Jobs(),
		conditions,
		[]factory.Informer{},
		[]JobHookFunc{
			func(_ *operatorv1.OperatorSpec, job *batchv1.Job) error {
				if nodeName != nil {
					job.Spec.Template.Spec.NodeName = *nodeName
				}
				job.SetName(jobType.GetJobName(nodeName))
				job.Labels["app.kubernetes.io/name"] = jobType.GetNameLabelValue()
				job.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE")
				job.Spec.Template.Spec.Containers[0].Command[1] = jobType.GetSubCommand()
				return nil
			}}...,
	)
	go func() {
		defer func() {
			runningControllersMutex.Lock()
			delete(runningControllers, controllerKey)
			runningControllersMutex.Unlock()
			klog.Infof("Two Node Fencing job controller for command %q on node %q stopped", jobType.GetSubCommand(), nodeNameForLogs)
		}()
		tnfJobController.Run(ctx, 1)
	}()
}
```
Consider potential panic if job manifest lacks expected structure.
Line 72-73 directly access job.Spec.Template.Spec.Containers[0] without bounds checking. If the YAML manifest tnfdeployment/job.yaml is malformed or empty, this will panic.
Consider adding a defensive check:
```diff
 func(_ *operatorv1.OperatorSpec, job *batchv1.Job) error {
+	if len(job.Spec.Template.Spec.Containers) == 0 {
+		return fmt.Errorf("job manifest must have at least one container")
+	}
 	if nodeName != nil {
 		job.Spec.Template.Spec.NodeName = *nodeName
 	}
```
}🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/tnf.go around lines 37 to 86, the job hook directly indexes
job.Spec.Template.Spec.Containers[0] and Containers[0].Command[1] (and assigns
job.Labels["..."]) which can panic if the manifest is malformed or missing those
fields; add defensive validation inside the hook: verify job != nil, job.Spec !=
nil, job.Spec.Template != nil, job.Spec.Template.Spec != nil, that
len(job.Spec.Template.Spec.Containers) > 0 and
len(job.Spec.Template.Spec.Containers[0].Command) > 1, and ensure job.Labels is
non-nil (initialize map if nil); if any check fails return a descriptive error
from the hook instead of mutating, so the controller can handle the failure
without panicking.
Review context from pkg/tnf/pkg/jobs/tnf.go:

```go
func RestartJobOrRunController(
	ctx context.Context,
	jobType tools.JobType,
	nodeName *string,
	controllerContext *controllercmd.ControllerContext,
	operatorClient v1helpers.StaticPodOperatorClient,
	kubeClient kubernetes.Interface,
	kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
	conditions []string,
	existingJobCompletionTimeout time.Duration) error {

	// Acquire a lock for this specific jobType/nodeName combination to prevent parallel execution
	jobName := jobType.GetJobName(nodeName)

	restartJobLocksMutex.Lock()
	jobLock, exists := restartJobLocks[jobName]
	if !exists {
		jobLock = &sync.Mutex{}
		restartJobLocks[jobName] = jobLock
	}
	restartJobLocksMutex.Unlock()

	jobLock.Lock()
	defer jobLock.Unlock()

	// Check if job already exists
	jobExists := true
	_, err := kubeClient.BatchV1().Jobs(operatorclient.TargetNamespace).Get(ctx, jobName, v1.GetOptions{})
	if err != nil {
		if !apierrors.IsNotFound(err) {
			return fmt.Errorf("failed to check for existing job %s: %w", jobName, err)
		}
		jobExists = false
	}

	// always try to run the controller, CEO might have been restarted
	RunTNFJobController(ctx, jobType, nodeName, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, conditions)

	if !jobExists {
		// we are done
		return nil
	}

	// Job exists, wait for completion
	klog.Infof("Job %s already exists, waiting for being stopped", jobName)
	if err := WaitForStopped(ctx, kubeClient, jobName, operatorclient.TargetNamespace, existingJobCompletionTimeout); err != nil {
		return fmt.Errorf("failed to wait for update-setup job %s to complete: %w", jobName, err)
	}

	// Delete the job so the controller can recreate it
	klog.Infof("Deleting existing job %s", jobName)
	if err := DeleteAndWait(ctx, kubeClient, jobName, operatorclient.TargetNamespace); err != nil {
		return fmt.Errorf("failed to delete existing update-setup job %s: %w", jobName, err)
	}

	return nil
}
```
Hardcoded "update-setup" in error messages may be misleading.
Lines 134 and 140 mention "update-setup job" in error messages, but this function handles all job types. The error messages should use the actual jobName.
```diff
 if err := WaitForStopped(ctx, kubeClient, jobName, operatorclient.TargetNamespace, existingJobCompletionTimeout); err != nil {
-	return fmt.Errorf("failed to wait for update-setup job %s to complete: %w", jobName, err)
+	return fmt.Errorf("failed to wait for job %s to stop: %w", jobName, err)
 }
 // Delete the job so the controller can recreate it
 klog.Infof("Deleting existing job %s", jobName)
 if err := DeleteAndWait(ctx, kubeClient, jobName, operatorclient.TargetNamespace); err != nil {
-	return fmt.Errorf("failed to delete existing update-setup job %s: %w", jobName, err)
+	return fmt.Errorf("failed to delete existing job %s: %w", jobName, err)
 }
```
🤖 Prompt for AI Agents
In pkg/tnf/pkg/jobs/tnf.go around lines 88 to 144, the error messages currently
hardcode "update-setup job" which is misleading for other job types; update the
fmt.Errorf calls at the WaitForStopped and DeleteAndWait error returns to use
the actual jobName variable (or a formatted message including jobName) instead
of the literal "update-setup job" so the error text correctly reflects the job
that failed.
Review context from pkg/tnf/update-setup/runner.go:

```go
err = etcd.WaitForUpdatedRevision(ctx, operatorClient)
if err != nil {
	klog.Error(err, "Failed to wait for etcd container transition")
	return err
}
```
Incorrect klog.Error usage with format arguments.
klog.Error does not support format strings. Use klog.Errorf instead.
```diff
-	klog.Error(err, "Failed to wait for etcd container transition")
+	klog.Errorf("Failed to wait for etcd container transition: %v", err)
```
🤖 Prompt for AI Agents
In pkg/tnf/update-setup/runner.go around lines 109 to 113, the call uses
klog.Error with format-like arguments which is incorrect; replace it with
klog.Errorf and pass a formatted string including the error (e.g., "Failed to
wait for etcd container transition: %v", err) so the error is logged correctly,
leaving the subsequent return err unchanged.
Review context from pkg/tnf/update-setup/runner.go:

```go
err = pcs.ConfigureFencing(ctx, kubeClient, []string{otherNodeName, currentNodeName})
if err != nil {
	klog.Error(err, "Failed to configure fencing, skipping update of etcd! Restart update-setup job when fencing config is fixed!")
	return err
}
```
Same klog.Error format issue.
Use klog.Errorf for formatted error messages.
```diff
-	klog.Error(err, "Failed to configure fencing, skipping update of etcd! Restart update-setup job when fencing config is fixed!")
+	klog.Errorf("Failed to configure fencing, skipping update of etcd! Restart update-setup job when fencing config is fixed! Error: %v", err)
```
🤖 Prompt for AI Agents
In pkg/tnf/update-setup/runner.go around lines 129 to 133, the code calls
klog.Error with a formatted message and an error; replace it with klog.Errorf so
the message is formatted properly. Change the block to call klog.Errorf("Failed
to configure fencing, skipping update of etcd! Restart update-setup job when
fencing config is fixed!: %v", err) and keep returning the error as before.
@fonta-rh: This pull request references Jira Issue OCPBUGS-68371, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@fonta-rh: The following test failed, say
Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: clobrano, fonta-rh. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/hold
/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.
@openshift-bot: This pull request references Jira Issue OCPBUGS-68371, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
On Hold until PR 1523 is merged in
NOTE
This PR is based off of this other PR: #1523
The only specific commit is 3295f0b.
Context
During MCE assisted-installer installation of Two Nodes with Fencing, a race condition occurs where Pacemaker tries to start etcd on a node before it has been started as a static pod after the node's bootstrap reset, and fails. See more details here: https://docs.google.com/document/d/1NNXtxBiPzr93jNn2yuri1vMh79l_-6EBTFE_dy_iUWs/edit?tab=t.0#heading=h.u3jtluqz5el2
Summary
This PR adds a new check after the existing waitForEtcdBootstrapCompleted, reusing the WaitForUpdatedRevision function introduced in PR 1523 mentioned above. With this change, the handover process to Pacemaker does not start until all members of the static etcd cluster have the latest revision.
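For orientation, the ordering described above looks roughly like this. It is a sketch based on the function names referenced in this PR and PR 1523; the signatures are simplified, and `startTnfSetup` is a hypothetical placeholder for the actual handover step:

```go
// handleBootstrapHandover sketches the sequencing introduced by this PR:
// only after bootstrap teardown AND after every node reports the latest
// static-pod etcd revision is etcd handed over to Pacemaker.
func handleBootstrapHandover(ctx context.Context, operatorClient v1helpers.StaticPodOperatorClient) error {
	// 1. existing step: wait until etcd bootstrap has completed
	if err := waitForEtcdBootstrapCompleted(ctx); err != nil {
		return fmt.Errorf("etcd bootstrap did not complete: %w", err)
	}

	// 2. new step (this PR): wait until all nodes run the latest etcd revision,
	//    so the static-pod etcd (and /var/lib/etcd) exists before Pacemaker starts etcd
	if err := etcd.WaitForUpdatedRevision(ctx, operatorClient); err != nil {
		return fmt.Errorf("etcd members not at latest revision: %w", err)
	}

	// 3. only now kick off the Pacemaker/TNF setup jobs (placeholder)
	return startTnfSetup(ctx)
}
```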
Validation
Tested manually on MCE assisted-installer for TNF. A cluster configured with both this and the other PR mentioned completes successfully; WaitForUpdatedRevision handles the wait properly.