feat: Add switch leak detection and powering off switches on detection of l…#566
feat: Add switch leak detection and powering off switches on detection of l…#566srinivasadmurthy wants to merge 6 commits into
Conversation
…eak. Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
WalkthroughAdds GetLeakingSwitchIds to the NICo Client (grpc and mock), integrates it into leak-detection to submit NVSwitch power-off tasks, and adds an optional per‑VPC routing_profile field to FlatInterfaceConfig in the proto. ChangesSwitch Leak Detection Feature
Routing Profile Schema Extension
sequenceDiagram
participant LeakDetection as LeakDetectionScheduler
participant NICo as grpcClient
participant Tasks as TaskSubmitter
LeakDetection->>NICo: GetLeakingSwitchIds(ctx)
NICo-->>LeakDetection: []switchIDs / error
loop for each switchID
LeakDetection->>Tasks: submitPowerOffTask(ctx, switchID, ComponentTypeNVSwitch)
Tasks-->>LeakDetection: task creation result
end
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-05-21 22:21:50 UTC | Commit: 4f36a52 |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
flow/internal/nicoapi/nicoproto/nico.proto (1)
3979-3985: ⚡ Quick winEncode deprecation in the field option, not only in comments.
Line 3985 is documented as deprecated but not marked deprecated in the schema, so generated clients won’t surface deprecation signals.
Suggested proto change
- optional RoutingProfile routing_profile = 114; + optional RoutingProfile routing_profile = 114 [deprecated = true];As per coding guidelines:
**/*.proto: Review the Protobuf definitions, point out issues relative to compatibility, and expressiveness.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flow/internal/nicoapi/nicoproto/nico.proto` around lines 3979 - 3985, The field routing_profile (optional RoutingProfile routing_profile = 114) is only marked deprecated in comments; update the proto to encode deprecation by adding the field option [deprecated = true] to that field so generators surface deprecation warnings, keep the explanatory comment, then regenerate language-specific clients/IDLs (references: symbol RoutingProfile and field routing_profile = 114) to ensure tooling emits deprecation signals.flow/internal/nicoapi/mock.go (1)
66-68: ⚡ Quick winExpose a setter for
leakingSwitchIdsin the mock client.The mock now supports reading leaking switch IDs but not configuring them through the same interface pattern used for leaking machines. Please add
SetLeakingSwitchIds(ids []string)for predictable unit setup.As per coding guidelines: "Document when you have intentionally omitted code that the reader might otherwise expect to be present."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flow/internal/nicoapi/mock.go` around lines 66 - 68, Add a public setter method SetLeakingSwitchIds(ids []string) on the mockClient to mirror the existing pattern used for leaking machines: it should assign the provided slice to the mockClient.leakingSwitchIds field so tests can configure returned values from GetLeakingSwitchIds(ctx). Also update or add a short comment above the getter/setter block explaining the setter is intentionally provided to configure mock state for unit tests.flow/internal/nicoapi/mod.go (1)
35-35: ⚡ Quick winAdd a mock setter for leaking switch IDs to keep test hooks symmetric.
Clientincludes test-only mutators (e.g.,SetLeakingMachineIds) but no equivalent for switches. AddingSetLeakingSwitchIds([]string)will keep scheduler tests decoupled from concrete mock type assertions.As per coding guidelines: "Document when you have intentionally omitted code that the reader might otherwise expect to be present" and "Add TODO comments for features or nuances not important to implement right away."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flow/internal/nicoapi/mod.go` at line 35, Add a test-only setter for leaking switch IDs to keep the test hooks symmetric: add SetLeakingSwitchIds([]string) to the Client interface alongside the existing GetLeakingSwitchIds(ctx context.Context) ([]string, error) and implement the method in the mock client type the same way SetLeakingMachineIds is implemented (store the slice on the mock and return it from GetLeakingSwitchIds). Mark the implementation with a brief TODO comment that this is a test-only mutator and why it’s intentionally present to avoid surprising readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@flow/internal/nicoapi/nicoproto/nico.proto`:
- Around line 4204-4206: The proto lacks a clear precedence rule when both the
per-interface RoutingProfile (routing_profile on the interface message) and the
legacy response-level routing_profile are set; update the comments to state that
the interface-level routing_profile takes precedence and is used for routing
decisions, and that if it is unset the implementation should fall back to the
legacy response-level routing_profile; apply this clarification to the doc
comment on the interface's routing_profile field and to the legacy
response-level routing_profile comment (and optionally mark the legacy field as
deprecated with the deprecation option if intended).
In `@flow/internal/scheduler/jobs/leakdetection/leakdetection.go`:
- Around line 65-71: The loop over leakingSwitchIds is calling
submitPowerOffTask which currently hardcodes ComponentTypeCompute and
machine-specific metadata; change the callsite and helper so switch remediation
targets switches: either create a new helper (e.g., submitPowerOffTaskForSwitch)
or extend submitPowerOffTask to accept a component type and switch-specific
metadata, and ensure it uses ComponentTypeSwitch and builds switch-relevant
metadata (not machine fields). Update the leakingSwitchIds loop to call the
new/updated function with ComponentTypeSwitch and appropriate switch identifiers
so the power-off task targets switch components correctly.
---
Nitpick comments:
In `@flow/internal/nicoapi/mock.go`:
- Around line 66-68: Add a public setter method SetLeakingSwitchIds(ids
[]string) on the mockClient to mirror the existing pattern used for leaking
machines: it should assign the provided slice to the mockClient.leakingSwitchIds
field so tests can configure returned values from GetLeakingSwitchIds(ctx). Also
update or add a short comment above the getter/setter block explaining the
setter is intentionally provided to configure mock state for unit tests.
In `@flow/internal/nicoapi/mod.go`:
- Line 35: Add a test-only setter for leaking switch IDs to keep the test hooks
symmetric: add SetLeakingSwitchIds([]string) to the Client interface alongside
the existing GetLeakingSwitchIds(ctx context.Context) ([]string, error) and
implement the method in the mock client type the same way SetLeakingMachineIds
is implemented (store the slice on the mock and return it from
GetLeakingSwitchIds). Mark the implementation with a brief TODO comment that
this is a test-only mutator and why it’s intentionally present to avoid
surprising readers.
In `@flow/internal/nicoapi/nicoproto/nico.proto`:
- Around line 3979-3985: The field routing_profile (optional RoutingProfile
routing_profile = 114) is only marked deprecated in comments; update the proto
to encode deprecation by adding the field option [deprecated = true] to that
field so generators surface deprecation warnings, keep the explanatory comment,
then regenerate language-specific clients/IDLs (references: symbol
RoutingProfile and field routing_profile = 114) to ensure tooling emits
deprecation signals.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 50f79ee6-f22a-45d8-a639-a29a8c38702e
⛔ Files ignored due to path filters (1)
flow/internal/nicoapi/gen/nico.pb.gois excluded by!**/*.pb.go,!**/gen/**,!**/*.pb.go
📒 Files selected for processing (5)
flow/internal/nicoapi/grpc.goflow/internal/nicoapi/mock.goflow/internal/nicoapi/mod.goflow/internal/nicoapi/nicoproto/nico.protoflow/internal/scheduler/jobs/leakdetection/leakdetection.go
| // Route imports and tagging details for exports used by FNN configs. | ||
| // This is scoped to the VPC that owns this interface. | ||
| optional RoutingProfile routing_profile = 20; |
There was a problem hiding this comment.
Define precedence when both routing profile fields are present.
Line 4206 adds per-interface routing_profile, while the legacy response-level field at Line 3985 still exists. The contract should explicitly state precedence/fallback behavior to avoid divergent client behavior when both are populated.
Suggested comment-level contract clarification
- // Route imports and tagging details for exports used by FNN configs.
- // This is scoped to the VPC that owns this interface.
+ // Route imports and tagging details for exports used by FNN configs.
+ // This is scoped to the VPC that owns this interface.
+ // Precedence: when set, this field overrides ManagedHostNetworkConfigResponse.routing_profile.
+ // Fallback: use ManagedHostNetworkConfigResponse.routing_profile only when this field is unset.
optional RoutingProfile routing_profile = 20;As per coding guidelines: **/*.proto: Review the Protobuf definitions, point out issues relative to compatibility, and expressiveness.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Route imports and tagging details for exports used by FNN configs. | |
| // This is scoped to the VPC that owns this interface. | |
| optional RoutingProfile routing_profile = 20; | |
| // Route imports and tagging details for exports used by FNN configs. | |
| // This is scoped to the VPC that owns this interface. | |
| // Precedence: when set, this field overrides ManagedHostNetworkConfigResponse.routing_profile. | |
| // Fallback: use ManagedHostNetworkConfigResponse.routing_profile only when this field is unset. | |
| optional RoutingProfile routing_profile = 20; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@flow/internal/nicoapi/nicoproto/nico.proto` around lines 4204 - 4206, The
proto lacks a clear precedence rule when both the per-interface RoutingProfile
(routing_profile on the interface message) and the legacy response-level
routing_profile are set; update the comments to state that the interface-level
routing_profile takes precedence and is used for routing decisions, and that if
it is unset the implementation should fall back to the legacy response-level
routing_profile; apply this clarification to the doc comment on the interface's
routing_profile field and to the legacy response-level routing_profile comment
(and optionally mark the legacy field as deprecated with the deprecation option
if intended).
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@flow/internal/scheduler/jobs/leakdetection/leakdetection.go`:
- Line 51: Split the inline assign-and-check into two statements for both calls
to submitPowerOffTask: replace constructs like `if err :=
submitPowerOffTask(ctx, taskMgr, machineID, devicetypes.ComponentTypeCompute);
err != nil { ... }` with a separate assignment `derr := submitPowerOffTask(ctx,
taskMgr, machineID, devicetypes.ComponentTypeCompute)` followed by `if derr !=
nil { ... }`, and do the same for the second occurrence (the other
submitPowerOffTask call at the later site); ensure you update the error variable
name consistently (e.g., derr) and use that variable inside the existing error
handling block.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 30474e4d-5ddf-40b6-8f2e-dea821fb5699
📒 Files selected for processing (1)
flow/internal/scheduler/jobs/leakdetection/leakdetection.go
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
|
|
||
| log.Info().Msgf("Found %d leaking switch IDs", len(leakingSwitchIds)) | ||
|
|
||
| for _, switchID := range leakingSwitchIds { |
There was a problem hiding this comment.
nit:
We have TestRunLeakDetectionOne_SubmitsTaskPerMachine but nothing for switches.
We can add e.g. TestRunLeakDetectionOne_SubmitsTaskPerSwitch:
nicoClient.SetLeakingSwitchIds([]string{"switch-1", "switch-2"})(needs setter onClient— see mock comment)- run
runLeakDetectionOne - assert 2 requests with
External.Type == ComponentTypeNVSwitchandExternal.IDmatching the switch IDs
| func submitPowerOffTask( | ||
| ctx context.Context, | ||
| taskMgr taskmanager.Manager, | ||
| machineID string, |
There was a problem hiding this comment.
Now that this helper serves both compute and NVSwitch, the parameter name machineID is misleading. Consider renaming to externalComponentID or componentExternalID.
| }, | ||
| }, | ||
| }, | ||
| Description: fmt.Sprintf("Leak detection: force power-off machine %s", machineID), |
There was a problem hiding this comment.
Description, the empty-task error, and the success log still say "machine" / use machine_id even when componentType is ComponentTypeNVSwitch.
For switch remediation this will confuse on-call debugging. Suggest either:
- branch on
componentTypefor description + structured log field (switch_idvsmachine_id), or - generic wording:
"Leak detection: force power-off component %s"withStr("external_id", ...).
| } | ||
| } | ||
|
|
||
| leakingSwitchIds, err := nicoClient.GetLeakingSwitchIds(ctx) |
There was a problem hiding this comment.
If GetLeakingSwitchIds fails after machines were processed, we return without attempting switches. That's reasonable, but machine remediation already ran.
Is it as expectation?
| SetAdminPowerControlError(err error) | ||
| AddMachineInterface(iface MachineInterface) | ||
| AddExpectedSwitchInfo(info ExpectedSwitchInfo) | ||
| SetLeakingMachineIds(ids []string) |
There was a problem hiding this comment.
nit:
SetLeakingMachineIds exists on Client for tests, but there's no SetLeakingSwitchIds. Tests that need switch leaks must type-assert to *mockClient or can't configure switch IDs cleanly.
Please add SetLeakingSwitchIds([]string) to the mock-only section of Client and implement it on mockClient (mirror SetLeakingMachineIds).
| @@ -65,7 +65,7 @@ func TestSubmitPowerOffTask_Success(t *testing.T) { | |||
| mgr := &mockManager{} | |||
There was a problem hiding this comment.
Tests were updated for ComponentTypeCompute on submitPowerOffTask, but there is no coverage for:
ComponentTypeNVSwitchinTestSubmitPowerOffTask_SuccessrunLeakDetectionOnewith leaking switches
Please extend the existing tests.
| return ids, nil | ||
| } | ||
|
|
||
| func (c *grpcClient) GetLeakingSwitchIds(ctx context.Context) ([]string, error) { |
There was a problem hiding this comment.
Core SwitchSearchFilter (forge.proto) has only_with_health_alert but no only_with_power_state, unlike MachineSearchConfig. So we can't mirror the machine "on" filter without a Core proto/API change.
Is it acceptable to power off all switches with the leak alert regardless of power state? If not, we need a Core field (or post-filter via another RPC) before Flow can match machine behavior.
| @@ -136,6 +136,27 @@ func (c *grpcClient) GetLeakingMachineIds(ctx context.Context) ([]string, error) | |||
| return ids, nil | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
We can add some comments on GetLeakingSwitchIds (same style as GetLeakingMachineIds): what NICo API is called, what IDs are returned (Core SwitchId), and that callers use them as Flow external_id for ComponentTypeNVSwitch.
…eak.
Description
Query NICO for leaking switch ids and power them off when leak is reported.
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes