Skip to content

refactor: Migrate Machine online-repair handler to WithTx#564

Open
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:with-tx-machine-online-repair
Open

refactor: Migrate Machine online-repair handler to WithTx#564
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:with-tx-machine-online-repair

Conversation

@chet
Copy link
Copy Markdown
Contributor

@chet chet commented May 21, 2026

Description

Picks up the last BeginTx site left in machine.go, which is the new the online-repair flow that was recently added.

Callouts are:

  • Uses WithTx (not WithTxResult) — um is set after success and timeoutResp already pulls toward WithTx.
  • Both inner branches (enable + disable) apply timeoutResp with their respective literal workflow names.
  • The iDAO.GetAll instance fetch stays inside the tx since it drives the status-mutation decision. No new advisory lock -- matches the existing maintenance-mode handler nearby.
  • The handler triggers a workflow (either CreateMachineHealthReportOverride or DeleteMachineHealthReportOverride depending on enable/disable) and waits on we.Get, so it's also adopting what now seems like our timeoutResp pattern.
  • Errors switched to NewAPIError.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • Flow - Flow service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@chet chet requested a review from a team as a code owner May 21, 2026 20:16
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9f2daac3-2981-483e-aad8-7e32c6df4f56

📥 Commits

Reviewing files that changed from the base of the PR and between adc2229 and e328d84.

📒 Files selected for processing (1)
  • api/pkg/api/handler/machine.go

Summary by CodeRabbit

  • Bug Fixes
    • More reliable enable/disable of machine online repair with improved error and timeout reporting, reducing inconsistent states and failure cases.
  • Refactor
    • Internal transaction handling streamlined to ensure state changes and external workflow calls behave consistently and surface meaningful errors to users.

Walkthrough

The PR refactors the online machine repair logic in UpdateMachineHandler from explicit transaction control to the cdb.WithTx helper pattern. Both enable and disable paths now execute Temporal health report override workflows synchronously, with symmetric label management and timeout-aware error handling.

Changes

Online Repair Transaction & Workflow Integration

Layer / File(s) Summary
Transaction Infrastructure & Import Cleanup
api/pkg/api/handler/machine.go
Import list cleaned to remove unused database/sql. Transaction handling refactored from manual BeginTx/Commit to cdb.WithTx closure pattern. Instance lookup and subsequent DB/remote actions now execute within a single transaction context, with timeoutResp variable introduced to capture workflow timeout conditions.
Enable Online Repair Path
api/pkg/api/handler/machine.go
Within the cdb.WithTx closure, enable branch validates Instance is Ready, builds CreateMachineHealthReportOverride request, manages instance labels to control auto-deletion on failure, updates Instance status to Repairing, and synchronously executes the Temporal workflow with timeout mapping to timeoutResp.
Disable Online Repair Path
api/pkg/api/handler/machine.go
Within the cdb.WithTx closure, disable branch validates Instance is Repairing, removes online-repair auto-deletion label, updates Instance status to Ready, and synchronously executes DeleteMachineHealthReportOverride Temporal workflow with timeout mapping to timeoutResp.
Post-Transaction Error Handling
api/pkg/api/handler/machine.go
After closure completion, non-timeout DB transaction errors are surfaced through common.HandleTxError. Timeout conditions captured during workflow execution are processed via timeoutResp() only when no earlier errors occurred, preserving proper error precedence.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the primary change: migrating the Machine online-repair handler from BeginTx to WithTx pattern, which matches the changeset.
Description check ✅ Passed The description directly addresses the changeset, explaining the refactoring rationale, implementation details (WithTx usage, timeoutResp pattern, transaction scope, workflow integration), and error handling changes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-21 20:18:09 UTC | Commit: adc2229

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1524-1530: The response is being built from stale data because the
handler reassigns um = machine after the DB update; remove the assignment "um =
machine" in the transaction completion block (the one checking timeoutResp and
err) so that any in-memory updates applied to um (e.g., label or maintenance
changes) are preserved; rely on the existing fallback "if um == nil { um =
machine }" to cover the online-repair-only case and keep the existing error
handling via common.HandleTxError.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: daedffd7-3dde-440c-831c-3c66674a219b

📥 Commits

Reviewing files that changed from the base of the PR and between f813f8d and adc2229.

📒 Files selected for processing (1)
  • api/pkg/api/handler/machine.go

Comment thread api/pkg/api/handler/machine.go Outdated
Picks up the last `BeginTx` site left in `machine.go`, which is the new the online-repair flow that was recently added.

Callouts are:
- Uses `WithTx` (not `WithTxResult`) — `um` is set after success and `timeoutResp` already pulls toward `WithTx`.
- Both inner branches (enable + disable) apply `timeoutResp` with their respective literal workflow names.
- The `iDAO.GetAll` instance fetch stays inside the tx since it drives the status-mutation decision. No new advisory lock -- matches the existing maintenance-mode handler nearby.
- The handler triggers a workflow (either `CreateMachineHealthReportOverride` or `DeleteMachineHealthReportOverride` depending on enable/disable) and waits on `we.Get`, so it's also adopting what now seems like our `timeoutResp` pattern.
- Errors switched to `NewAPIError`.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet force-pushed the with-tx-machine-online-repair branch from adc2229 to e328d84 Compare May 21, 2026 20:54
@github-actions
Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
nico-flow 66 4 34 18 2 8
nico-nsm 82 2 28 43 9 0
nico-psm 67 4 35 18 2 8
nico-rest-api 100 6 53 30 3 8
nico-rest-cert-manager 65 4 34 18 1 8
nico-rest-db 66 4 34 18 2 8
nico-rest-site-agent 65 4 34 18 1 8
nico-rest-site-manager 65 4 34 18 1 8
nico-rest-workflow 67 4 35 18 2 8
TOTAL 643 36 321 199 23 64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant