Skip to content

refactor(api)!: clean up and align rack manager proto with gRPC conventions#39

Open
jayzhudev wants to merge 3 commits into
NVIDIA:mainfrom
jayzhudev:rpc/cleanup
Open

refactor(api)!: clean up and align rack manager proto with gRPC conventions#39
jayzhudev wants to merge 3 commits into
NVIDIA:mainfrom
jayzhudev:rpc/cleanup

Conversation

@jayzhudev
Copy link
Copy Markdown
Contributor

@jayzhudev jayzhudev commented May 22, 2026

Description

This PR cleans up rack_manager.proto and updates the RMS code to match the new API.

The main goal is to make the proto easier for clients to use and harder to misuse. The changes remove fields RMS was not actually using, make naming more consistent, and use protobuf-native types where they fit better.

Key updates:

  • Removed request/response Metadata fields. Correlation and identity should come from gRPC metadata, headers, TLS, or interceptors.
  • Added explicit UNSPECIFIED = 0 enum values so proto3 defaults do not mean a real operation or success.
  • Standardized node naming: NewNodeInfo is now NodeInfo, devices is now nodes, and DeviceList RPCs are now NodeList.
  • Changed numeric fields that cannot be negative to unsigned types.
  • Changed firmware updateable from string to bool.
  • Replaced raw millisecond timestamps with google.protobuf.Timestamp.
  • Removed switch request fields such as port and verify_ssl where the handler ignored them and used inventory data instead.
  • Removed redundant response status fields from simple inventory list RPCs.
  • Aligned FromJson RPC/message naming.
  • Marked the deprecated component filter field with [deprecated = true].

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

This is a breaking proto/API cleanup. Clients will need to update renamed messages, fields, RPCs, and enum values. They will also need to stop sending removed metadata fields and ignored switch fields.

Migration notes:

  • Use gRPC metadata, headers, TLS identity, or interceptors instead of request/response Metadata.
  • Use NodeInfo and NodeSet.nodes.
  • Use *NodeList RPC/request/response names instead of *DeviceList.
  • Update enum references to the new normalized names.
  • Use protobuf Timestamp helpers for job time fields.
  • Stop sending port and verify_ssl on inventory-backed switch RPCs.

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

This PR does not try to normalize every result/status response shape. That would require broader behavior changes in RMS, especially for batch and partial-failure flows. This PR keeps that behavior stable and focuses on the proto cleanup that can be done safely now.

…ntions

## Description
This PR cleans up `rack_manager.proto` and updates the RMS code to match the new API.

The main goal is to make the proto easier for clients to use and harder to misuse. The changes remove fields RMS was not actually using, make naming more consistent, and use protobuf-native types where they fit better.

Key updates:
- Removed request/response `Metadata` fields. Correlation and identity should come from gRPC metadata, headers, TLS, or interceptors.
- Added explicit `UNSPECIFIED = 0` enum values so proto3 defaults do not mean a real operation or success.
- Standardized node naming: `NewNodeInfo` is now `NodeInfo`, `devices` is now `nodes`, and `DeviceList` RPCs are now `NodeList`.
- Changed numeric fields that cannot be negative to unsigned types.
- Changed firmware `updateable` from `string` to `bool`.
- Replaced raw millisecond timestamps with `google.protobuf.Timestamp`.
- Removed switch request fields such as `port` and `verify_ssl` where the handler ignored them and used inventory data instead.
- Removed redundant response status fields from simple inventory list RPCs.
- Aligned `FromJson` RPC/message naming.
- Marked the deprecated component filter field with `[deprecated = true]`.
- Updated RMS handlers, client code, tests, and benchmarks for the new proto.

## Type of Change
- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Breaking Changes
- [x] This PR contains breaking changes

This is a breaking proto/API cleanup. Clients will need to update renamed messages, fields, RPCs, and enum values. They will also need to stop sending removed metadata fields and ignored switch fields.

Migration notes:
- Use gRPC metadata, headers, TLS identity, or interceptors instead of request/response `Metadata`.
- Use `NodeInfo` and `NodeSet.nodes`.
- Use `*NodeList` RPC/request/response names instead of `*DeviceList`.
- Update enum references to the new normalized names.
- Use protobuf `Timestamp` helpers for job time fields.
- Stop sending `port` and `verify_ssl` on inventory-backed switch RPCs.

## Testing
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
This PR does not try to normalize every result/status response shape. That would require broader behavior changes in RMS, especially for batch and partial-failure flows. This PR keeps that behavior stable and focuses on the proto cleanup that can be done safely now.

Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
@jayzhudev jayzhudev self-assigned this May 22, 2026
@jayzhudev jayzhudev changed the title refactor(api)!: clean up and align rack manager proto with gRPC conve… refactor(api)!: clean up and align rack manager proto with gRPC conventions May 22, 2026
Comment thread proto/rack_manager.proto Outdated
int32 port = 4; // API port number (default: 443)
bool verify_ssl = 5; // Whether to verify SSL certificates
reserved 1, 4, 5;
reserved "metadata", "port", "verify_ssl";
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: removal of port and verify_ssl are reasoned by the request message being used in stateful (inventory) paths in RMS. These fields are never used in implementation.

Copy link
Copy Markdown

@Matthias247 Matthias247 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick passthrough so far. My initial thoughts:

  • I'd avoid the UNSPECIFIED_ changes.
  • Removal of metadata seems ok
  • Timestamp changes too
  • verify_ssl -> danger_accept_invalid_certs seems like a good idea
  • we should figure out what to do with the aggregate status results

rest is some details about certain APIs

Comment thread proto/rack_manager.proto
POWER_GRACEFUL_SHUTDOWN = 5; // Redfish ResetType GracefulShutdown
POWER_GRACEFUL_RESTART = 6; // Redfish ResetType GracefulRestart
POWER_FORCE_RESTART = 7; // Redfish ResetType ForceRestart
POWER_OPERATION_UNSPECIFIED = 0; // Caller left the operation missing or at the proto3 default
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old way of having a reasonable default was fine.
unspecified is maybe good to distinct between "not set and set". But we can always use the optional quantifier to achieve the same

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not trying to be pedantic: https://protobuf.dev/programming-guides/style/ but since we are doing this cleanup:

The first listed value should be a zero value enum and have the suffix of either _UNSPECIFIED or _UNKNOWN. This value may be used as an unknown/default value and should be distinct from any of the semantic values you expect to be explicitly set.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reason for not wanting the 0 defaults is that some are not really the desired operation, for example for enum PowerOperation, the previous 0 value was POWER_OPERATION_OFF which could be harmful.

Comment thread proto/rack_manager.proto Outdated
Comment thread proto/rack_manager.proto Outdated
Comment thread proto/rack_manager.proto Outdated
// If firmware_targets is non-empty, it takes precedence over the single filename/target fields.
message UpdateNodeFirmwareRequest {
Metadata metadata = 1;
message UpdateNodeFirmwareAsyncRequest {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anunna0 Do we actually want to continue calling this Async? It should become the only mechanism that is used and supported

Comment thread proto/rack_manager.proto
Comment thread proto/rack_manager.proto
NodeSet nodes = 2; // Switch nodes to update
string image_filename = 3; // Binary filename as it should exist on the switch
string local_file_path = 4; // Local file path on the target switch
string local_file_path = 4; // Local file path on the RMS host
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would the caller know where its stored on RMS? It should be a remote file path, shouldn't it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed not ideal. @vinodchitraliNVIDIA What do you think?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need now. As logic is moving to RMS

Comment thread proto/rack_manager.proto Outdated
Comment thread proto/rack_manager.proto Outdated
reserved 1;
reserved "metadata";
NodeSet nodes = 2; // Switch nodes to update
optional bool enabled = 3; // true = enabled, false = disabled
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as suggested by @amit-pabalkar earlier, we can remove optional here.

Comment thread proto/rack_manager.proto Outdated
@@ -1166,8 +1219,9 @@ message ConfigureScaleUpFabricManagerResponse {
* and config save across caller-supplied switches.
*/
message SetScaleUpFabricStateRequest {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion for renaming all the API with request and response:

rpc ConfigureScaleUpFabric(ScaleUpFabricConfigurationRequest) returns (ScaleUpFabricConfigurationResponse) {}

Comment thread proto/rack_manager.proto
Comment thread proto/rack_manager.proto
POWER_GRACEFUL_SHUTDOWN = 5; // Redfish ResetType GracefulShutdown
POWER_GRACEFUL_RESTART = 6; // Redfish ResetType GracefulRestart
POWER_FORCE_RESTART = 7; // Redfish ResetType ForceRestart
POWER_OPERATION_UNSPECIFIED = 0; // Caller left the operation missing or at the proto3 default
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think UNSPECIFIED, OFF, ON is enough.

The NICo has clippy integration where it gave Suggetion like this

enum PowerOperation {
On,
Off,
Reset
}

Copy link
Copy Markdown
Contributor Author

@jayzhudev jayzhudev May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's legacy reason mostly because of naming collision in C++ so the best practices for proto enum variant naming suggest including the prefix, which is unfortunate.

https://protobuf.dev/programming-guides/style/

Comment thread proto/rack_manager.proto
RACK_POWER_ON = 0; // Power on all nodes in the rack following power-on order
RACK_POWER_OFF = 1; // Power off all nodes in the rack
RACK_POWER_CYCLE = 2; // Power cycle all nodes in the rack
RACK_POWER_OPERATION_UNSPECIFIED = 0; // Caller left the operation missing or at the proto3 default
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Comment thread proto/rack_manager.proto
enum ReturnCode {
SUCCESS = 0; // Operation completed successfully
FAILURE = 1; // Operation failed (see message field for details)
RETURN_CODE_UNSPECIFIED = 0; // Server left the status unset or at the proto3 default
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Comment thread proto/rack_manager.proto
FW_UPDATE_MONITORING_TIMEOUT = 9; // Timed out waiting for update to complete
FW_UPDATE_TASK_EXCEPTION = 10; // Exception in task monitoring
FW_UPDATE_UNKNOWN = 11; // Unknown or unclassified error
FIRMWARE_UPDATE_ERROR_UNSPECIFIED = 0; // Server left the error code unset or at the proto3 default
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Comment thread proto/rack_manager.proto
NODE_TYPE_COMPUTE = 1; // Compute node (GPU server, CPU server)
NODE_TYPE_POWERSHELF = 2; // Power shelf (power distribution unit)
NODE_TYPE_SWITCH = 3; // Network switch
NODE_TYPE_UNSPECIFIED = 0; // Caller left the node type missing or at the proto3 default
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Comment thread proto/rack_manager.proto Outdated
}

// NodeResult reports the outcome of an operation on a specific node.
message NodeResult {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just Result or Some other name. This is used for all Ops.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result may have too large of a scope. Maybe we rename this to NodeOperationResult for what it actually represents. This way we are open to other types of results in the future.

Comment thread proto/rack_manager.proto
}

// NodeBatchResponse reports the common result shape for multi-node RPCs.
message NodeBatchResponse {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need node prepend ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need it.

Comment thread proto/rack_manager.proto Outdated
Comment thread proto/rack_manager.proto Outdated
* access_token only for artifact download and does not persist the token or object.
*/
rpc ApplyFirmwareObjectFromJSON(ApplyFirmwareObjectFromJsonRequest) returns (ApplyFirmwareObjectResponse) {}
rpc ApplyFirmwareObjectFromJson(ApplyFirmwareObjectFromJsonRequest) returns (ApplyFirmwareObjectFromJsonResponse) {}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rpc ApplyFirmwareObject(ApplyFirmwareObjectRequest) returns (ApplyFirmwareObjectResponse) {}

and

message ApplyFirmwareObjectRequest {
        FormatType format; 
....
}

Comment thread proto/rack_manager.proto Outdated

/*
* ApplySwitchSystemImageFromJSON ingests a firmware object from SOT JSON, downloads
* ApplySwitchSystemImageFromJson ingests a firmware object from SOT JSON, downloads
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can eliminate FromJson

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

RPC renames, request/response type renames, endpoint model consolidation,
inventory component naming, batch result stats, firmware object apply split,
switch firmware naming, ScaleUpFabric naming, and GetVersion probing.

Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Comment thread proto/rack_manager.proto
Metadata metadata = 1;
ReturnCode status = 2; // SUCCESS if jobs created, FAILURE otherwise
// BatchUpdateFirmwareByNodeTypeResponse reports the created async jobs.
message BatchUpdateFirmwareByNodeTypeResponse {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks the same as NodeBatchResponse to me?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah they do look similar but are actually different. I restructured BatchUpdateFirmwareByNodeTypeResponse, will push a commit in a sec.

Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Comment thread proto/rack_manager.proto
* config across all supplied switches.
*/
rpc SetScaleUpFabricState(SetScaleUpFabricStateRequest) returns (SetScaleUpFabricStateResponse) {}
rpc BatchSetScaleUpFabricState(BatchSetScaleUpFabricStateRequest) returns (BatchSetScaleUpFabricStateResponse) {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still prefer ConfigureScaleUpFabric here since nobody is really expected to call this on a subset of nodes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also since the configuration is not applied the same way to every node. Only one node will ever be configured as primary, and others as secondary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the implementation, this RPC is used for the stateless path, so RMS doesn't know the physical topology, and the requested switches can span across racks. I think we can keep this as-is for the stateless path. If we add a stateful/inventory-backed flow later, the new path would require enough RMS context to operate on whole node sets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants