feat: Add switch leak detection and powering off switches on detection of l…#566

Open
srinivasadmurthy wants to merge 6 commits into
NVIDIA:mainfrom
srinivasadmurthy:sdmswitchleak

Conversation

@srinivasadmurthy
Contributor

…eak.

Description

Query NICo for leaking switch IDs and power them off when a leak is reported.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • Flow - Flow service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

…eak.

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
@srinivasadmurthy srinivasadmurthy requested a review from a team as a code owner May 21, 2026 22:19
@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

Review Change Stack

Walkthrough

Adds GetLeakingSwitchIds to the NICo Client (grpc and mock), integrates it into leak-detection to submit NVSwitch power-off tasks, and adds an optional per‑VPC routing_profile field to FlatInterfaceConfig in the proto.

Changes

Switch Leak Detection Feature

| Layer / File(s) | Summary |
|---|---|
| NICo API contract and implementations: flow/internal/nicoapi/mod.go, flow/internal/nicoapi/grpc.go, flow/internal/nicoapi/mock.go | Adds GetLeakingSwitchIds(ctx) ([]string, error) to the Client interface; grpcClient performs a timeout-scoped gRPC query filtering hardware-health.tray-leak-detection and returns switch IDs; mockClient stores and returns leakingSwitchIds. |
| Leak detection scheduler integration: flow/internal/scheduler/jobs/leakdetection/leakdetection.go, flow/internal/scheduler/jobs/leakdetection/leakdetection_test.go | runLeakDetectionOne now calls GetLeakingSwitchIds, logs the count, and submits force power-off tasks per switch using submitPowerOffTask with a componentType parameter; updated tests pass the component type when invoking submitPowerOffTask. |

Routing Profile Schema Extension

| Layer / File(s) | Summary |
|---|---|
| FlatInterfaceConfig routing profile: flow/internal/nicoapi/nicoproto/nico.proto | Adds optional routing_profile (field 20) to FlatInterfaceConfig for per‑VPC routing and marks ManagedHostNetworkConfigRequest.routing_profile as deprecated in favor of the new field. |
sequenceDiagram
  participant LeakDetection as LeakDetectionScheduler
  participant NICo as grpcClient
  participant Tasks as TaskSubmitter
  LeakDetection->>NICo: GetLeakingSwitchIds(ctx)
  NICo-->>LeakDetection: []switchIDs / error
  loop for each switchID
    LeakDetection->>Tasks: submitPowerOffTask(ctx, switchID, ComponentTypeNVSwitch)
    Tasks-->>LeakDetection: task creation result
  end

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description accurately conveys the PR's intent to query NICO for leaking switch IDs and power them off upon detection, matching the implemented changes across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly describes the main feature: adding switch leak detection and powering off switches when leaks are detected, which aligns with all file changes across the nicoapi and scheduler packages. |

@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-21 22:21:50 UTC | Commit: 4f36a52

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
flow/internal/nicoapi/nicoproto/nico.proto (1)

3979-3985: ⚡ Quick win

Encode deprecation in the field option, not only in comments.

Line 3985 is documented as deprecated but not marked deprecated in the schema, so generated clients won’t surface deprecation signals.

Suggested proto change
-  optional RoutingProfile routing_profile = 114;
+  optional RoutingProfile routing_profile = 114 [deprecated = true];

As per coding guidelines: **/*.proto: Review the Protobuf definitions, point out issues relative to compatibility, and expressiveness.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/nicoapi/nicoproto/nico.proto` around lines 3979 - 3985, The
field routing_profile (optional RoutingProfile routing_profile = 114) is only
marked deprecated in comments; update the proto to encode deprecation by adding
the field option [deprecated = true] to that field so generators surface
deprecation warnings, keep the explanatory comment, then regenerate
language-specific clients/IDLs (references: symbol RoutingProfile and field
routing_profile = 114) to ensure tooling emits deprecation signals.
flow/internal/nicoapi/mock.go (1)

66-68: ⚡ Quick win

Expose a setter for leakingSwitchIds in the mock client.

The mock now supports reading leaking switch IDs but not configuring them through the same interface pattern used for leaking machines. Please add SetLeakingSwitchIds(ids []string) for predictable unit setup.

As per coding guidelines: "Document when you have intentionally omitted code that the reader might otherwise expect to be present."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/nicoapi/mock.go` around lines 66 - 68, Add a public setter
method SetLeakingSwitchIds(ids []string) on the mockClient to mirror the
existing pattern used for leaking machines: it should assign the provided slice
to the mockClient.leakingSwitchIds field so tests can configure returned values
from GetLeakingSwitchIds(ctx). Also update or add a short comment above the
getter/setter block explaining the setter is intentionally provided to configure
mock state for unit tests.
flow/internal/nicoapi/mod.go (1)

35-35: ⚡ Quick win

Add a mock setter for leaking switch IDs to keep test hooks symmetric.

Client includes test-only mutators (e.g., SetLeakingMachineIds) but no equivalent for switches. Adding SetLeakingSwitchIds([]string) will keep scheduler tests decoupled from concrete mock type assertions.

As per coding guidelines: "Document when you have intentionally omitted code that the reader might otherwise expect to be present" and "Add TODO comments for features or nuances not important to implement right away."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/nicoapi/mod.go` at line 35, Add a test-only setter for leaking
switch IDs to keep the test hooks symmetric: add SetLeakingSwitchIds([]string)
to the Client interface alongside the existing GetLeakingSwitchIds(ctx
context.Context) ([]string, error) and implement the method in the mock client
type the same way SetLeakingMachineIds is implemented (store the slice on the
mock and return it from GetLeakingSwitchIds). Mark the implementation with a
brief TODO comment that this is a test-only mutator and why it’s intentionally
present to avoid surprising readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flow/internal/nicoapi/nicoproto/nico.proto`:
- Around line 4204-4206: The proto lacks a clear precedence rule when both the
per-interface RoutingProfile (routing_profile on the interface message) and the
legacy response-level routing_profile are set; update the comments to state that
the interface-level routing_profile takes precedence and is used for routing
decisions, and that if it is unset the implementation should fall back to the
legacy response-level routing_profile; apply this clarification to the doc
comment on the interface's routing_profile field and to the legacy
response-level routing_profile comment (and optionally mark the legacy field as
deprecated with the deprecation option if intended).

In `@flow/internal/scheduler/jobs/leakdetection/leakdetection.go`:
- Around line 65-71: The loop over leakingSwitchIds is calling
submitPowerOffTask which currently hardcodes ComponentTypeCompute and
machine-specific metadata; change the callsite and helper so switch remediation
targets switches: either create a new helper (e.g., submitPowerOffTaskForSwitch)
or extend submitPowerOffTask to accept a component type and switch-specific
metadata, and ensure it uses ComponentTypeSwitch and builds switch-relevant
metadata (not machine fields). Update the leakingSwitchIds loop to call the
new/updated function with ComponentTypeSwitch and appropriate switch identifiers
so the power-off task targets switch components correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 50f79ee6-f22a-45d8-a639-a29a8c38702e

📥 Commits

Reviewing files that changed from the base of the PR and between f813f8d and 4f36a52.

⛔ Files ignored due to path filters (1)
  • flow/internal/nicoapi/gen/nico.pb.go is excluded by !**/*.pb.go, !**/gen/**, !**/*.pb.go
📒 Files selected for processing (5)
  • flow/internal/nicoapi/grpc.go
  • flow/internal/nicoapi/mock.go
  • flow/internal/nicoapi/mod.go
  • flow/internal/nicoapi/nicoproto/nico.proto
  • flow/internal/scheduler/jobs/leakdetection/leakdetection.go

Comment on lines +4204 to +4206
// Route imports and tagging details for exports used by FNN configs.
// This is scoped to the VPC that owns this interface.
optional RoutingProfile routing_profile = 20;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Define precedence when both routing profile fields are present.

Line 4206 adds per-interface routing_profile, while the legacy response-level field at Line 3985 still exists. The contract should explicitly state precedence/fallback behavior to avoid divergent client behavior when both are populated.

Suggested comment-level contract clarification
-  // Route imports and tagging details for exports used by FNN configs.
-  // This is scoped to the VPC that owns this interface.
+  // Route imports and tagging details for exports used by FNN configs.
+  // This is scoped to the VPC that owns this interface.
+  // Precedence: when set, this field overrides ManagedHostNetworkConfigResponse.routing_profile.
+  // Fallback: use ManagedHostNetworkConfigResponse.routing_profile only when this field is unset.
   optional RoutingProfile routing_profile = 20;

As per coding guidelines: **/*.proto: Review the Protobuf definitions, point out issues relative to compatibility, and expressiveness.
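On the consumer side, the precedence rule proposed in this comment could be applied with a helper along these lines. This is only a sketch: `RoutingProfile` is reduced to a hypothetical one-field struct, `effectiveRoutingProfile` is an illustrative name, and it assumes the generated Go code represents an unset optional message field as a nil pointer.

```go
package main

// RoutingProfile is a hypothetical, simplified stand-in for the generated
// protobuf message of the same name.
type RoutingProfile struct {
	Name string
}

// effectiveRoutingProfile applies the proposed contract: the per-interface
// profile wins when it is set, and the legacy response-level profile is
// used only as a fallback. It returns nil when neither is set.
func effectiveRoutingProfile(perInterface, legacy *RoutingProfile) *RoutingProfile {
	if perInterface != nil {
		return perInterface
	}
	return legacy
}
```

Centralizing the rule in one function would keep callers from each choosing a field on their own, which is the divergence the comment warns about.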

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/nicoapi/nicoproto/nico.proto` around lines 4204 - 4206, The
proto lacks a clear precedence rule when both the per-interface RoutingProfile
(routing_profile on the interface message) and the legacy response-level
routing_profile are set; update the comments to state that the interface-level
routing_profile takes precedence and is used for routing decisions, and that if
it is unset the implementation should fall back to the legacy response-level
routing_profile; apply this clarification to the doc comment on the interface's
routing_profile field and to the legacy response-level routing_profile comment
(and optionally mark the legacy field as deprecated with the deprecation option
if intended).

Comment thread flow/internal/scheduler/jobs/leakdetection/leakdetection.go
@github-actions

github-actions Bot commented May 21, 2026

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
nico-flow 66 4 34 18 2 8
nico-nsm 82 2 28 43 9 0
nico-psm 67 4 35 18 2 8
nico-rest-api 100 6 53 30 3 8
nico-rest-cert-manager 65 4 34 18 1 8
nico-rest-db 66 4 34 18 2 8
nico-rest-site-agent 65 4 34 18 1 8
nico-rest-site-manager 65 4 34 18 1 8
nico-rest-workflow 67 4 35 18 2 8
TOTAL 643 36 321 199 23 64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@zhaozhongn zhaozhongn requested a review from kunzhao-nv May 22, 2026 00:04
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flow/internal/scheduler/jobs/leakdetection/leakdetection.go`:
- Line 51: Split the inline assign-and-check into two statements for both calls
to submitPowerOffTask: replace constructs like `if err :=
submitPowerOffTask(ctx, taskMgr, machineID, devicetypes.ComponentTypeCompute);
err != nil { ... }` with a separate assignment `derr := submitPowerOffTask(ctx,
taskMgr, machineID, devicetypes.ComponentTypeCompute)` followed by `if derr !=
nil { ... }`, and do the same for the second occurrence (the other
submitPowerOffTask call at the later site); ensure you update the error variable
name consistently (e.g., derr) and use that variable inside the existing error
handling block.
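For readers unfamiliar with the two styles, the difference requested here can be illustrated with a small self-contained sketch. `submitStub` is a hypothetical placeholder for submitPowerOffTask, and both functions behave identically; only the statement layout differs.

```go
package main

import "errors"

// errSubmit is a sample error used by the stub below.
var errSubmit = errors.New("submit failed")

// submitStub is a hypothetical placeholder for submitPowerOffTask.
func submitStub(fail bool) error {
	if fail {
		return errSubmit
	}
	return nil
}

// inlineCheck uses the inline assign-and-check form flagged in the review.
func inlineCheck(fail bool) string {
	if err := submitStub(fail); err != nil {
		return "failed: " + err.Error()
	}
	return "ok"
}

// splitCheck uses the requested form: a separate assignment to derr,
// followed by a plain nil check on that variable.
func splitCheck(fail bool) string {
	derr := submitStub(fail)
	if derr != nil {
		return "failed: " + derr.Error()
	}
	return "ok"
}
```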
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 30474e4d-5ddf-40b6-8f2e-dea821fb5699

📥 Commits

Reviewing files that changed from the base of the PR and between 4f36a52 and 530def3.

📒 Files selected for processing (1)
  • flow/internal/scheduler/jobs/leakdetection/leakdetection.go

Comment thread flow/internal/scheduler/jobs/leakdetection/leakdetection.go Outdated
@srinivasadmurthy srinivasadmurthy changed the title Add switch leak detection and powering off switches on detection of l… feat: Add switch leak detection and powering off switches on detection of l… May 22, 2026

log.Info().Msgf("Found %d leaking switch IDs", len(leakingSwitchIds))

for _, switchID := range leakingSwitchIds {

nit:
We have TestRunLeakDetectionOne_SubmitsTaskPerMachine but nothing for switches.

We can add e.g. TestRunLeakDetectionOne_SubmitsTaskPerSwitch:

  • nicoClient.SetLeakingSwitchIds([]string{"switch-1", "switch-2"}) (needs setter on Client — see mock comment)
  • run runLeakDetectionOne
  • assert 2 requests with External.Type == ComponentTypeNVSwitch and External.ID matching the switch IDs

func submitPowerOffTask(
ctx context.Context,
taskMgr taskmanager.Manager,
machineID string,

Now that this helper serves both compute and NVSwitch, the parameter name machineID is misleading. Consider renaming to externalComponentID or componentExternalID.

},
},
},
Description: fmt.Sprintf("Leak detection: force power-off machine %s", machineID),

Description, the empty-task error, and the success log still say "machine" / use machine_id even when componentType is ComponentTypeNVSwitch.

For switch remediation this will confuse on-call debugging. Suggest either:

  • branch on componentType for description + structured log field (switch_id vs machine_id), or
  • generic wording: "Leak detection: force power-off component %s" with Str("external_id", ...).
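The first option could be sketched as a small helper that picks both the description and the structured log field name from the component type. The `ComponentType` string values and the `powerOffLabels` name are hypothetical; the description strings and field names follow the wording suggested in this comment.

```go
package main

import "fmt"

// ComponentType and its values are hypothetical simplifications of the
// devicetypes constants referenced in the review.
type ComponentType string

const (
	ComponentTypeCompute  ComponentType = "compute"
	ComponentTypeNVSwitch ComponentType = "nvswitch"
)

// powerOffLabels returns the task description and the structured log field
// name for a component. Unknown types fall back to the generic wording.
func powerOffLabels(ct ComponentType, externalID string) (description, logField string) {
	switch ct {
	case ComponentTypeNVSwitch:
		return fmt.Sprintf("Leak detection: force power-off switch %s", externalID), "switch_id"
	case ComponentTypeCompute:
		return fmt.Sprintf("Leak detection: force power-off machine %s", externalID), "machine_id"
	default:
		return fmt.Sprintf("Leak detection: force power-off component %s", externalID), "external_id"
	}
}
```

The default branch doubles as the second option, so the generic wording remains available for any component type added later.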

}
}

leakingSwitchIds, err := nicoClient.GetLeakingSwitchIds(ctx)

If GetLeakingSwitchIds fails after machines were processed, we return without attempting switches. That's reasonable, but machine remediation already ran.

Is that the expected behavior?

SetAdminPowerControlError(err error)
AddMachineInterface(iface MachineInterface)
AddExpectedSwitchInfo(info ExpectedSwitchInfo)
SetLeakingMachineIds(ids []string)

nit:
SetLeakingMachineIds exists on Client for tests, but there's no SetLeakingSwitchIds. Tests that need switch leaks must type-assert to *mockClient or can't configure switch IDs cleanly.

Please add SetLeakingSwitchIds([]string) to the mock-only section of Client and implement it on mockClient (mirror SetLeakingMachineIds).

@@ -65,7 +65,7 @@ func TestSubmitPowerOffTask_Success(t *testing.T) {
mgr := &mockManager{}

Tests were updated for ComponentTypeCompute on submitPowerOffTask, but there is no coverage for:

  • ComponentTypeNVSwitch in TestSubmitPowerOffTask_Success
  • runLeakDetectionOne with leaking switches

Please extend the existing tests.

return ids, nil
}

func (c *grpcClient) GetLeakingSwitchIds(ctx context.Context) ([]string, error) {

Core SwitchSearchFilter (forge.proto) has only_with_health_alert but no only_with_power_state, unlike MachineSearchConfig. So we can't mirror the machine "on" filter without a Core proto/API change.

Is it acceptable to power off all switches with the leak alert regardless of power state? If not, we need a Core field (or post-filter via another RPC) before Flow can match machine behavior.

@@ -136,6 +136,27 @@ func (c *grpcClient) GetLeakingMachineIds(ctx context.Context) ([]string, error)
return ids, nil
}


We can add some comments on GetLeakingSwitchIds (same style as GetLeakingMachineIds): what NICo API is called, what IDs are returned (Core SwitchId), and that callers use them as Flow external_id for ComponentTypeNVSwitch.


3 participants