refactor(api)!: clean up and align rack manager proto with gRPC conventions#39
refactor(api)!: clean up and align rack manager proto with gRPC conventions#39jayzhudev wants to merge 3 commits into
Conversation
…ntions ## Description This PR cleans up `rack_manager.proto` and updates the RMS code to match the new API. The main goal is to make the proto easier for clients to use and harder to misuse. The changes remove fields RMS was not actually using, make naming more consistent, and use protobuf-native types where they fit better. Key updates: - Removed request/response `Metadata` fields. Correlation and identity should come from gRPC metadata, headers, TLS, or interceptors. - Added explicit `UNSPECIFIED = 0` enum values so proto3 defaults do not mean a real operation or success. - Standardized node naming: `NewNodeInfo` is now `NodeInfo`, `devices` is now `nodes`, and `DeviceList` RPCs are now `NodeList`. - Changed numeric fields that cannot be negative to unsigned types. - Changed firmware `updateable` from `string` to `bool`. - Replaced raw millisecond timestamps with `google.protobuf.Timestamp`. - Removed switch request fields such as `port` and `verify_ssl` where the handler ignored them and used inventory data instead. - Removed redundant response status fields from simple inventory list RPCs. - Aligned `FromJson` RPC/message naming. - Marked the deprecated component filter field with `[deprecated = true]`. - Updated RMS handlers, client code, tests, and benchmarks for the new proto. ## Type of Change - [ ] **Add** - New feature or capability - [x] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Breaking Changes - [x] This PR contains breaking changes This is a breaking proto/API cleanup. Clients will need to update renamed messages, fields, RPCs, and enum values. They will also need to stop sending removed metadata fields and ignored switch fields. Migration notes: - Use gRPC metadata, headers, TLS identity, or interceptors instead of request/response `Metadata`. - Use `NodeInfo` and `NodeSet.nodes`. - Use `*NodeList` RPC/request/response names instead of `*DeviceList`. - Update enum references to the new normalized names. - Use protobuf `Timestamp` helpers for job time fields. - Stop sending `port` and `verify_ssl` on inventory-backed switch RPCs. ## Testing - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes This PR does not try to normalize every result/status response shape. That would require broader behavior changes in RMS, especially for batch and partial-failure flows. This PR keeps that behavior stable and focuses on the proto cleanup that can be done safely now. Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
| int32 port = 4; // API port number (default: 443) | ||
| bool verify_ssl = 5; // Whether to verify SSL certificates | ||
| reserved 1, 4, 5; | ||
| reserved "metadata", "port", "verify_ssl"; |
There was a problem hiding this comment.
Note: removal of port and verify_ssl are reasoned by the request message being used in stateful (inventory) paths in RMS. These fields are never used in implementation.
Matthias247
left a comment
There was a problem hiding this comment.
I did a quick passthrough so far. My initial thoughts:
- I'd avoid the
UNSPECIFIED_changes. - Removal of
metadataseems ok - Timestamp changes too
verify_ssl->danger_accept_invalid_certsseems like a good idea- we should figure out what to do with the aggregate status results
rest is some details about certain APIs
| POWER_GRACEFUL_SHUTDOWN = 5; // Redfish ResetType GracefulShutdown | ||
| POWER_GRACEFUL_RESTART = 6; // Redfish ResetType GracefulRestart | ||
| POWER_FORCE_RESTART = 7; // Redfish ResetType ForceRestart | ||
| POWER_OPERATION_UNSPECIFIED = 0; // Caller left the operation missing or at the proto3 default |
There was a problem hiding this comment.
I think the old way of having a reasonable default was fine.
unspecified is maybe good to distinct between "not set and set". But we can always use the optional quantifier to achieve the same
There was a problem hiding this comment.
Not trying to be pedantic: https://protobuf.dev/programming-guides/style/ but since we are doing this cleanup:
The first listed value should be a zero value enum and have the suffix of either _UNSPECIFIED or _UNKNOWN. This value may be used as an unknown/default value and should be distinct from any of the semantic values you expect to be explicitly set.
There was a problem hiding this comment.
A reason for not wanting the 0 defaults is that some are not really the desired operation, for example for enum PowerOperation, the previous 0 value was POWER_OPERATION_OFF which could be harmful.
| // If firmware_targets is non-empty, it takes precedence over the single filename/target fields. | ||
| message UpdateNodeFirmwareRequest { | ||
| Metadata metadata = 1; | ||
| message UpdateNodeFirmwareAsyncRequest { |
There was a problem hiding this comment.
@anunna0 Do we actually want to continue calling this Async? It should become the only mechanism that is used and supported
| NodeSet nodes = 2; // Switch nodes to update | ||
| string image_filename = 3; // Binary filename as it should exist on the switch | ||
| string local_file_path = 4; // Local file path on the target switch | ||
| string local_file_path = 4; // Local file path on the RMS host |
There was a problem hiding this comment.
How would the caller know where its stored on RMS? It should be a remote file path, shouldn't it?
There was a problem hiding this comment.
This is indeed not ideal. @vinodchitraliNVIDIA What do you think?
There was a problem hiding this comment.
we dont need now. As logic is moving to RMS
| reserved 1; | ||
| reserved "metadata"; | ||
| NodeSet nodes = 2; // Switch nodes to update | ||
| optional bool enabled = 3; // true = enabled, false = disabled |
There was a problem hiding this comment.
as suggested by @amit-pabalkar earlier, we can remove optional here.
| @@ -1166,8 +1219,9 @@ message ConfigureScaleUpFabricManagerResponse { | |||
| * and config save across caller-supplied switches. | |||
| */ | |||
| message SetScaleUpFabricStateRequest { | |||
There was a problem hiding this comment.
Suggestion for renaming all the API with request and response:
rpc ConfigureScaleUpFabric(ScaleUpFabricConfigurationRequest) returns (ScaleUpFabricConfigurationResponse) {}
| POWER_GRACEFUL_SHUTDOWN = 5; // Redfish ResetType GracefulShutdown | ||
| POWER_GRACEFUL_RESTART = 6; // Redfish ResetType GracefulRestart | ||
| POWER_FORCE_RESTART = 7; // Redfish ResetType ForceRestart | ||
| POWER_OPERATION_UNSPECIFIED = 0; // Caller left the operation missing or at the proto3 default |
There was a problem hiding this comment.
I think UNSPECIFIED, OFF, ON is enough.
The NICo has clippy integration where it gave Suggetion like this
enum PowerOperation {
On,
Off,
Reset
}
There was a problem hiding this comment.
There's legacy reason mostly because of naming collision in C++ so the best practices for proto enum variant naming suggest including the prefix, which is unfortunate.
| RACK_POWER_ON = 0; // Power on all nodes in the rack following power-on order | ||
| RACK_POWER_OFF = 1; // Power off all nodes in the rack | ||
| RACK_POWER_CYCLE = 2; // Power cycle all nodes in the rack | ||
| RACK_POWER_OPERATION_UNSPECIFIED = 0; // Caller left the operation missing or at the proto3 default |
| enum ReturnCode { | ||
| SUCCESS = 0; // Operation completed successfully | ||
| FAILURE = 1; // Operation failed (see message field for details) | ||
| RETURN_CODE_UNSPECIFIED = 0; // Server left the status unset or at the proto3 default |
| FW_UPDATE_MONITORING_TIMEOUT = 9; // Timed out waiting for update to complete | ||
| FW_UPDATE_TASK_EXCEPTION = 10; // Exception in task monitoring | ||
| FW_UPDATE_UNKNOWN = 11; // Unknown or unclassified error | ||
| FIRMWARE_UPDATE_ERROR_UNSPECIFIED = 0; // Server left the error code unset or at the proto3 default |
| NODE_TYPE_COMPUTE = 1; // Compute node (GPU server, CPU server) | ||
| NODE_TYPE_POWERSHELF = 2; // Power shelf (power distribution unit) | ||
| NODE_TYPE_SWITCH = 3; // Network switch | ||
| NODE_TYPE_UNSPECIFIED = 0; // Caller left the node type missing or at the proto3 default |
| } | ||
|
|
||
| // NodeResult reports the outcome of an operation on a specific node. | ||
| message NodeResult { |
There was a problem hiding this comment.
Just Result or Some other name. This is used for all Ops.
There was a problem hiding this comment.
Result may have too large of a scope. Maybe we rename this to NodeOperationResult for what it actually represents. This way we are open to other types of results in the future.
| } | ||
|
|
||
| // NodeBatchResponse reports the common result shape for multi-node RPCs. | ||
| message NodeBatchResponse { |
There was a problem hiding this comment.
Do we need node prepend ?
| * access_token only for artifact download and does not persist the token or object. | ||
| */ | ||
| rpc ApplyFirmwareObjectFromJSON(ApplyFirmwareObjectFromJsonRequest) returns (ApplyFirmwareObjectResponse) {} | ||
| rpc ApplyFirmwareObjectFromJson(ApplyFirmwareObjectFromJsonRequest) returns (ApplyFirmwareObjectFromJsonResponse) {} |
There was a problem hiding this comment.
rpc ApplyFirmwareObject(ApplyFirmwareObjectRequest) returns (ApplyFirmwareObjectResponse) {}
and
message ApplyFirmwareObjectRequest {
FormatType format;
....
}
|
|
||
| /* | ||
| * ApplySwitchSystemImageFromJSON ingests a firmware object from SOT JSON, downloads | ||
| * ApplySwitchSystemImageFromJson ingests a firmware object from SOT JSON, downloads |
There was a problem hiding this comment.
You can eliminate FromJson
RPC renames, request/response type renames, endpoint model consolidation, inventory component naming, batch result stats, firmware object apply split, switch firmware naming, ScaleUpFabric naming, and GetVersion probing. Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
| Metadata metadata = 1; | ||
| ReturnCode status = 2; // SUCCESS if jobs created, FAILURE otherwise | ||
| // BatchUpdateFirmwareByNodeTypeResponse reports the created async jobs. | ||
| message BatchUpdateFirmwareByNodeTypeResponse { |
There was a problem hiding this comment.
looks the same as NodeBatchResponse to me?
There was a problem hiding this comment.
Yeah they do look similar but are actually different. I restructured BatchUpdateFirmwareByNodeTypeResponse, will push a commit in a sec.
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
| * config across all supplied switches. | ||
| */ | ||
| rpc SetScaleUpFabricState(SetScaleUpFabricStateRequest) returns (SetScaleUpFabricStateResponse) {} | ||
| rpc BatchSetScaleUpFabricState(BatchSetScaleUpFabricStateRequest) returns (BatchSetScaleUpFabricStateResponse) {} |
There was a problem hiding this comment.
Still prefer ConfigureScaleUpFabric here since nobody is really expected to call this on a subset of nodes.
There was a problem hiding this comment.
also since the configuration is not applied the same way to every node. Only one node will ever be configured as primary, and others as secondary.
There was a problem hiding this comment.
Looking at the implementation, this RPC is used for the stateless path, so RMS doesn't know the physical topology, and the requested switches can span across racks. I think we can keep this as-is for the stateless path. If we add a stateful/inventory-backed flow later, the new path would require enough RMS context to operate on whole node sets.
Description
This PR cleans up
rack_manager.protoand updates the RMS code to match the new API.The main goal is to make the proto easier for clients to use and harder to misuse. The changes remove fields RMS was not actually using, make naming more consistent, and use protobuf-native types where they fit better.
Key updates:
Metadatafields. Correlation and identity should come from gRPC metadata, headers, TLS, or interceptors.UNSPECIFIED = 0enum values so proto3 defaults do not mean a real operation or success.NewNodeInfois nowNodeInfo,devicesis nownodes, andDeviceListRPCs are nowNodeList.updateablefromstringtobool.google.protobuf.Timestamp.portandverify_sslwhere the handler ignored them and used inventory data instead.FromJsonRPC/message naming.[deprecated = true].Type of Change
Breaking Changes
This is a breaking proto/API cleanup. Clients will need to update renamed messages, fields, RPCs, and enum values. They will also need to stop sending removed metadata fields and ignored switch fields.
Migration notes:
Metadata.NodeInfoandNodeSet.nodes.*NodeListRPC/request/response names instead of*DeviceList.Timestamphelpers for job time fields.portandverify_sslon inventory-backed switch RPCs.Testing
Additional Notes
This PR does not try to normalize every result/status response shape. That would require broader behavior changes in RMS, especially for batch and partial-failure flows. This PR keeps that behavior stable and focuses on the proto cleanup that can be done safely now.