From e57bdbcccda8366b8066b1fe68086e7949359f70 Mon Sep 17 00:00:00 2001 From: Martin Raumann Date: Wed, 20 May 2026 14:02:17 -0700 Subject: [PATCH 1/4] Add tenant lifecycle cleanup operations guide --- docs/index.yml | 2 + docs/operations/tenant-lifecycle-cleanup.md | 253 ++++++++++++++++++++ 2 files changed, 255 insertions(+) create mode 100644 docs/operations/tenant-lifecycle-cleanup.md diff --git a/docs/index.yml b/docs/index.yml index edff95343f..270ffa0294 100644 --- a/docs/index.yml +++ b/docs/index.yml @@ -108,6 +108,8 @@ navigation: - section: Operations contents: + - page: Tenant Lifecycle Cleanup + path: operations/tenant-lifecycle-cleanup.md - page: NVLink Partitioning path: manuals/nvlink_partitioning.md - page: Release Instance API Enhancements diff --git a/docs/operations/tenant-lifecycle-cleanup.md b/docs/operations/tenant-lifecycle-cleanup.md new file mode 100644 index 0000000000..cbab1b5f27 --- /dev/null +++ b/docs/operations/tenant-lifecycle-cleanup.md @@ -0,0 +1,253 @@ +# Tenant Lifecycle Cleanup + +This page covers what NICo does after a tenant releases an instance, how to +track cleanup progress, and how to verify that a host is ready for reuse. + +When an instance is released, NICo removes the host from tenant service, returns +networking to the admin side, runs cleanup and sanitization workflows, performs +the configured trust checks, validates the host, and returns the managed host to +`Ready` when it is eligible for allocation again. + +For reference, see: + +- [Managed Host State Diagrams](../architecture/state_machines/managedhost.md) +- [Release Instance API Enhancements](../manuals/breakfix_integration.md) +- [Measured Boot Ingest Guidance](../provisioning/ingesting-hosts.md#measured-boot) +- [Core Metrics](../manuals/metrics/core_metrics.md) + +## Release an Instance + +Use `ReleaseInstance` to release an allocated tenant instance. The admin CLI can +release by instance ID or by machine ID: + +```bash +carbide-admin-cli -c instance release --instance +carbide-admin-cli -c instance release --machine +``` + +The REST API also exposes instance release: + +```bash +curl -X POST /api/v1/instances/release \ + -H 'Content-Type: application/json' \ + -d '{"id": ""}' +``` + +When the release request is accepted, cleanup is asynchronous. Track the managed +host state until the host reaches `Ready` or enters a failure state. + +## Cleanup Flow + +NICo drives tenant cleanup through the managed-host state machine. The normal +release-to-ready flow is: + +```text +Assigned/BootingWithDiscoveryImage +Assigned/SwitchToAdminNetwork +Assigned/WaitingForNetworkReconfig +PostAssignedMeasuring/WaitingForMeasurements (when attestation is enabled) +WaitingForCleanup/Init +WaitingForCleanup/SecureEraseBoss (Dell BOSS platforms) +WaitingForCleanup/HostCleanup +WaitingForCleanup/CreateBossVolume (Dell BOSS platforms) +BomValidating/UpdatingInventory +Ready +``` + +If attestation is disabled, NICo moves from +`Assigned/WaitingForNetworkReconfig` directly into `WaitingForCleanup/Init`. + +During the flow, NICo: + +1. Reboots the host into the discovery image used by Scout. +2. Switches DPU and DPA networking back to the admin network. +3. Waits for network configuration, extension services, and cleanup-related + health reports to converge. +4. Deletes the instance record and releases tenant network resources. +5. Runs measured boot or attestation checks when configured. +6. Runs storage, memory-overwrite, and InfiniBand cleanup from Scout. +7. Applies Redfish power control where needed to complete cleanup and pending + platform changes. +8. Validates inventory before returning the host to `Ready`. + +## Track Progress + +Start with the managed-host state: + +```bash +carbide-admin-cli -c managed-host show +``` + +Check the machine view for state history and platform details: + +```bash +carbide-admin-cli -c machine show +``` + +Check health reports when cleanup appears blocked: + +```bash +carbide-admin-cli -c machine health-report show +``` + +Useful metrics for fleet-level monitoring include: + +| Metric | Use | +|---|---| +| `carbide_machines_per_state` | Count machines in each managed-host state. | +| `carbide_machines_per_state_above_sla` | Find machines that have remained in a state longer than the state-machine SLA. | +| `carbide_machines_time_in_state_seconds` | Review time spent in each state. | +| `carbide_reboot_attempts_in_booting_with_discovery_image` | Detect hosts that require repeated discovery-image reboots. | +| `carbide_measured_boot_machines_per_machine_state_total` | Review measured boot machine state coverage. | +| `carbide_pending_host_firmware_update_count` | Count hosts that need host firmware updates. | +| `carbide_pending_dpu_nic_firmware_update_count` | Count DPUs that need NIC firmware updates. | +| `carbide_active_host_firmware_update_count` | Count hosts actively updating firmware. | +| `carbide_running_dpu_updates_count` | Count DPUs actively updating firmware. | + +## Sanitization Steps + +Scout reports cleanup through `CleanupMachineCompleted`. The cleanup report can +include these step results: + +| Field | Meaning | +|---|---| +| `nvme` | NVMe cleanup result. | +| `hdd` | HDD/SAS block-device cleanup result. | +| `ram` | RAM cleanup result, when present. | +| `mem_overwrite` | UEFI `MemoryOverwriteRequestControl` validation result. | +| `ib` | InfiniBand cleanup result. | + +Each step has a result and a message. A failed NVMe cleanup moves the host to an +`NVMECleanFailed` failure state and keeps the host out of `Ready`. + +### NVMe Secure Erase + +Scout discovers NVMe controller devices and formats each namespace with secure +erase: + +```bash +nvme format -s2 -f -n +``` + +When namespace management is supported, Scout deletes existing namespaces after +format, creates a replacement namespace sized from controller capacity, and +attaches it to the controller. + +On supported Lenovo M.2 NVMe 2-Bay RAID Kit systems, Scout uses `mnv_cli` to +remove RAID virtual disks and send NVMe passthrough cleanup commands to the +underlying disks. + +### HDD and SAS Cleanup + +Scout also reports an `hdd` cleanup result for HDD/SAS block-device cleanup. +Treat a failed `hdd` result the same way as other cleanup failures: keep the +host out of allocation until the failure is remediated and the cleanup path +completes successfully. + +### Memory Overwrite + +Scout validates the UEFI memory-overwrite control variable: + +```text +MemoryOverwriteRequestControl-e20939be-32d4-41be-a150-897f85d49829 +``` + +The `mem_overwrite` cleanup step passes when the variable is set to `1`. If site +policy requires a manual volatile-memory procedure, such as a full AC drain, +complete that procedure before returning the host to allocation. + +### Dell BOSS Cleanup + +On supported Dell platforms with a BOSS controller, NICo performs additional +storage cleanup: + +1. Disable iDRAC lockdown for the storage operation. +2. Decommission the BOSS storage controller through Redfish. +3. Wait for the Redfish job to complete. +4. Run Scout host cleanup. +5. Recreate the BOSS virtual disk as `VD_0`. +6. Re-enable host lockdown. +7. Continue to post-cleanup validation. + +If the Redfish job fails, NICo retries the job path and may power-cycle the host +as part of the recovery loop. + +### InfiniBand Cleanup + +Scout reports InfiniBand cleanup through the `ib` cleanup step. NICo also uses +cleanup-related health reports, including `IbCleanupPending`, to prevent the +state machine from advancing before InfiniBand cleanup has cleared. + +## Platform Reset and Trust Controls + +Tenant cleanup includes platform and trust controls that run through Redfish, +firmware management, measured boot, and site policy. + +| Control | How to verify | +|---|---| +| Redfish power control | The state machine uses `ForceRestart` during cleanup and after Scout cleanup completion. Redfish `ForceRestart` is also the reset type used to apply pending BIOS or UEFI changes. | +| TPM clear | NICo includes vendor-specific Redfish support for TPM clear. Verify completion through the platform-specific cleanup evidence used by the site. | +| BIOS recommit | Verify that pending BIOS or UEFI settings have been applied after the cleanup `ForceRestart` path. | +| DPU restricted mode and BMC in-band restrictions | Verify that tenant-side network configuration has been removed, admin-network configuration has synced, and platform lockdown settings are in the expected post-cleanup state. | +| Firmware default version | Verify that host and DPU firmware match the configured site default or are under an approved firmware update workflow. | +| Measured boot | Verify measured boot state when attestation is enabled. Measured boot may be configured in permissive mode; in that mode, use measurement results as cleanup evidence according to site policy. | + +Useful attestation commands include: + +```bash +carbide-admin-cli -c attestation measured-boot machine show +carbide-admin-cli -c att mb machine show +``` + +## Return-to-Pool Checklist + +A released host is ready for reuse when all required gates pass: + +- The prior instance is released and no longer active. +- Tenant VPC prefix segments and DPU loopback IP allocations are released. +- DPU and DPA networking have returned to the admin network. +- Extension services from the prior tenant have terminated. +- Scout cleanup has completed. +- NVMe and HDD/SAS cleanup have succeeded, or an approved exception exists. +- The memory-overwrite check has passed, and any required manual + volatile-memory procedure is complete. +- InfiniBand cleanup has completed and blocking cleanup health reports are + clear. +- TPM, BIOS/UEFI, lockdown, and firmware checks satisfy site policy. +- Measured boot or attestation checks satisfy site policy. +- Inventory validation has completed. +- The managed host is in `Ready`. +- No blocking health report prevents allocation. + +## Troubleshooting Stuck Cleanup + +Use the current managed-host state to choose the next check. + +| State | What it means | Checks | +|---|---|---| +| `Assigned/BootingWithDiscoveryImage` | The host is rebooting into the discovery image. | Check BMC reachability, host power state, boot order, and repeated reboot metrics. | +| `Assigned/SwitchToAdminNetwork` | NICo is moving the host out of tenant networking. | Check DPU agent status, DPA status, and admin-network config generation. | +| `Assigned/WaitingForNetworkReconfig` | NICo is waiting for network configuration to converge. | Check DPU sync, DPA sync, extension-service termination, and cleanup-related health reports. | +| `PostAssignedMeasuring/WaitingForMeasurements` | Attestation is enabled and NICo is waiting for measurements. | Check measured boot machine state, trusted profile or bundle status, and site policy for permissive mode. | +| `WaitingForCleanup/SecureEraseBoss` | NICo is decommissioning Dell BOSS storage. | Check iDRAC lockdown state, Redfish job status, and BOSS controller reachability. | +| `WaitingForCleanup/HostCleanup` | NICo is waiting for Scout cleanup completion. | Check Scout logs, cleanup report submission, NVMe/HDD cleanup, memory-overwrite result, and InfiniBand cleanup result. | +| `WaitingForCleanup/CreateBossVolume` | NICo is recreating the Dell BOSS virtual disk. | Check Redfish job status and confirm the recreated volume is `VD_0`. | +| `BomValidating/UpdatingInventory` | Cleanup completed and NICo is validating inventory. | Check BMC reachability, inventory collection, firmware update status, and blocking health reports. | +| `Failed` with `NVMECleanFailed` | Storage cleanup failed. | Keep the host out of allocation, inspect the cleanup error message, remediate the storage issue, and rerun the approved cleanup recovery path. | + +For log review, start with the NICo API or state-controller logs, Scout cleanup +logs, DPU agent logs, hardware-health logs, and Redfish job status from the BMC. + +## Manual Procedures + +Some environments require additional manual assurance before a host is reused. +Apply these only when required by site policy: + +- Full AC drain for volatile-memory handling. +- Firmware bundle reflash. +- Manual TPM clear if automated platform cleanup is unavailable. +- Manual firmware remediation when a host or DPU does not match the configured + site default. + +Record the completed procedure, the target machine ID, the reason, the operator, +and the evidence used to approve return to allocation. From 47bd6c070fa349c1a82a102fd97e413bbb811deb Mon Sep 17 00:00:00 2001 From: Martin Raumann Date: Wed, 20 May 2026 14:58:23 -0700 Subject: [PATCH 2/4] Clarify tenant cleanup API URL placeholders --- docs/operations/tenant-lifecycle-cleanup.md | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/docs/operations/tenant-lifecycle-cleanup.md b/docs/operations/tenant-lifecycle-cleanup.md index cbab1b5f27..78d6fb3c1a 100644 --- a/docs/operations/tenant-lifecycle-cleanup.md +++ b/docs/operations/tenant-lifecycle-cleanup.md @@ -21,11 +21,16 @@ Use `ReleaseInstance` to release an allocated tenant instance. The admin CLI can release by instance ID or by machine ID: ```bash -carbide-admin-cli -c instance release --instance -carbide-admin-cli -c instance release --machine +carbide-admin-cli -c instance release --instance +carbide-admin-cli -c instance release --machine ``` -The REST API also exposes instance release: +`` is the NICo Core gRPC API endpoint used by +`carbide-admin-cli`. REST and `nicocli` commands use the REST API base URL from +the `nicocli` config. + +The REST API also exposes instance release. When calling it directly, use the +REST API base URL for the deployment: ```bash curl -X POST /api/v1/instances/release \ @@ -75,19 +80,19 @@ During the flow, NICo: Start with the managed-host state: ```bash -carbide-admin-cli -c managed-host show +carbide-admin-cli -c managed-host show ``` Check the machine view for state history and platform details: ```bash -carbide-admin-cli -c machine show +carbide-admin-cli -c machine show ``` Check health reports when cleanup appears blocked: ```bash -carbide-admin-cli -c machine health-report show +carbide-admin-cli -c machine health-report show ``` Useful metrics for fleet-level monitoring include: @@ -195,8 +200,8 @@ firmware management, measured boot, and site policy. Useful attestation commands include: ```bash -carbide-admin-cli -c attestation measured-boot machine show -carbide-admin-cli -c att mb machine show +carbide-admin-cli -c attestation measured-boot machine show +carbide-admin-cli -c att mb machine show ``` ## Return-to-Pool Checklist From f4e126f28e9b81e1c3ecdde1fd56e69ad180455c Mon Sep 17 00:00:00 2001 From: Martin Raumann Date: Wed, 20 May 2026 15:15:33 -0700 Subject: [PATCH 3/4] Prefer nicocli for tenant cleanup workflow --- docs/operations/tenant-lifecycle-cleanup.md | 90 ++++++++++++++++++--- 1 file changed, 77 insertions(+), 13 deletions(-) diff --git a/docs/operations/tenant-lifecycle-cleanup.md b/docs/operations/tenant-lifecycle-cleanup.md index 78d6fb3c1a..b2b22fe8a7 100644 --- a/docs/operations/tenant-lifecycle-cleanup.md +++ b/docs/operations/tenant-lifecycle-cleanup.md @@ -17,8 +17,29 @@ For reference, see: ## Release an Instance -Use `ReleaseInstance` to release an allocated tenant instance. The admin CLI can -release by instance ID or by machine ID: +Use `nicocli` for the tenant-facing release path: + +```bash +nicocli instance delete +``` + +In TUI mode: + +```text +nicocli tui +> instance delete +``` + +Instance deletion triggers the same cleanup and sanitization workflow described +on this page. Track the REST-side instance lifecycle with: + +```bash +nicocli instance status-history +nicocli instance get +``` + +`carbide-admin-cli` can also release by instance ID or by machine ID when a +Core gRPC operation is required: ```bash carbide-admin-cli -c instance release --instance @@ -29,17 +50,12 @@ carbide-admin-cli -c instance release --machine `carbide-admin-cli`. REST and `nicocli` commands use the REST API base URL from the `nicocli` config. -The REST API also exposes instance release. When calling it directly, use the -REST API base URL for the deployment: +To report a hardware, network, performance, or other issue during release, see +[Release Instance API Enhancements](../manuals/breakfix_integration.md). -```bash -curl -X POST /api/v1/instances/release \ - -H 'Content-Type: application/json' \ - -d '{"id": ""}' -``` - -When the release request is accepted, cleanup is asynchronous. Track the managed -host state until the host reaches `Ready` or enters a failure state. +When the release request is accepted, cleanup is asynchronous. Track the +instance lifecycle first, then inspect the managed-host state when site-level +cleanup detail is needed. ## Cleanup Flow @@ -77,7 +93,22 @@ During the flow, NICo: ## Track Progress -Start with the managed-host state: +Use two layers of inspection: + +| Layer | Tool | Use | +|---|---|---| +| REST tenant and provider lifecycle | `nicocli` | Instance deletion, instance status, status history, and tenant-visible errors. | +| Core site cleanup lifecycle | `carbide-admin-cli` | Managed-host state, machine state history, health reports, measured boot, and cleanup-specific debugging. | + +Start with the REST-side instance status: + +```bash +nicocli instance status-history +nicocli instance get +``` + +If cleanup progress is unclear from the instance lifecycle, check the +managed-host state: ```bash carbide-admin-cli -c managed-host show @@ -95,6 +126,24 @@ Check health reports when cleanup appears blocked: carbide-admin-cli -c machine health-report show ``` +### Happy Path Verification + +A normal release can be verified with this sequence: + +```bash +nicocli instance delete +nicocli instance status-history +carbide-admin-cli -c managed-host show +carbide-admin-cli -c machine health-report show +``` + +Success indicators: + +- The instance moves through deletion or termination from the REST perspective. +- The managed host progresses through the cleanup states and reaches `Ready`. +- Cleanup-related health reports are clear. +- No blocking health report prevents allocation. + Useful metrics for fleet-level monitoring include: | Metric | Use | @@ -228,6 +277,21 @@ A released host is ready for reuse when all required gates pass: Use the current managed-host state to choose the next check. +Start with the REST lifecycle: + +```bash +nicocli instance status-history +nicocli instance list --status error --output table +``` + +If the REST lifecycle does not explain the stall, inspect the Core cleanup +state: + +```bash +carbide-admin-cli -c managed-host show +carbide-admin-cli -c machine health-report show +``` + | State | What it means | Checks | |---|---|---| | `Assigned/BootingWithDiscoveryImage` | The host is rebooting into the discovery image. | Check BMC reachability, host power state, boot order, and repeated reboot metrics. | From 15e4c9a7a7d0c4cc180a8e9e990063dd9237a5f9 Mon Sep 17 00:00:00 2001 From: Martin Raumann Date: Thu, 21 May 2026 15:34:43 -0700 Subject: [PATCH 4/4] docs: tighten tenant cleanup introduction --- docs/operations/tenant-lifecycle-cleanup.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/operations/tenant-lifecycle-cleanup.md b/docs/operations/tenant-lifecycle-cleanup.md index b2b22fe8a7..4178a54055 100644 --- a/docs/operations/tenant-lifecycle-cleanup.md +++ b/docs/operations/tenant-lifecycle-cleanup.md @@ -1,7 +1,7 @@ # Tenant Lifecycle Cleanup -This page covers what NICo does after a tenant releases an instance, how to -track cleanup progress, and how to verify that a host is ready for reuse. +Use this workflow to release an instance, track NICo cleanup progress, and +verify that the host is ready for reuse. When an instance is released, NICo removes the host from tenant service, returns networking to the admin side, runs cleanup and sanitization workflows, performs @@ -17,7 +17,7 @@ For reference, see: ## Release an Instance -Use `nicocli` for the tenant-facing release path: +Release the instance: ```bash nicocli instance delete