diff --git a/docs/index.yml b/docs/index.yml index edff95343f..270ffa0294 100644 --- a/docs/index.yml +++ b/docs/index.yml @@ -108,6 +108,8 @@ navigation: - section: Operations contents: + - page: Tenant Lifecycle Cleanup + path: operations/tenant-lifecycle-cleanup.md - page: NVLink Partitioning path: manuals/nvlink_partitioning.md - page: Release Instance API Enhancements diff --git a/docs/operations/tenant-lifecycle-cleanup.md b/docs/operations/tenant-lifecycle-cleanup.md new file mode 100644 index 0000000000..4178a54055 --- /dev/null +++ b/docs/operations/tenant-lifecycle-cleanup.md @@ -0,0 +1,322 @@ +# Tenant Lifecycle Cleanup + +Use this workflow to release an instance, track NICo cleanup progress, and +verify that the host is ready for reuse. + +When an instance is released, NICo removes the host from tenant service, returns +networking to the admin side, runs cleanup and sanitization workflows, performs +the configured trust checks, validates the host, and returns the managed host to +`Ready` when it is eligible for allocation again. + +For reference, see: + +- [Managed Host State Diagrams](../architecture/state_machines/managedhost.md) +- [Release Instance API Enhancements](../manuals/breakfix_integration.md) +- [Measured Boot Ingest Guidance](../provisioning/ingesting-hosts.md#measured-boot) +- [Core Metrics](../manuals/metrics/core_metrics.md) + +## Release an Instance + +Release the instance: + +```bash +nicocli instance delete +``` + +In TUI mode: + +```text +nicocli tui +> instance delete +``` + +Instance deletion triggers the same cleanup and sanitization workflow described +on this page. Track the REST-side instance lifecycle with: + +```bash +nicocli instance status-history +nicocli instance get +``` + +`carbide-admin-cli` can also release by instance ID or by machine ID when a +Core gRPC operation is required: + +```bash +carbide-admin-cli -c instance release --instance +carbide-admin-cli -c instance release --machine +``` + +`` is the NICo Core gRPC API endpoint used by +`carbide-admin-cli`. REST and `nicocli` commands use the REST API base URL from +the `nicocli` config. + +To report a hardware, network, performance, or other issue during release, see +[Release Instance API Enhancements](../manuals/breakfix_integration.md). + +When the release request is accepted, cleanup is asynchronous. Track the +instance lifecycle first, then inspect the managed-host state when site-level +cleanup detail is needed. + +## Cleanup Flow + +NICo drives tenant cleanup through the managed-host state machine. The normal +release-to-ready flow is: + +```text +Assigned/BootingWithDiscoveryImage +Assigned/SwitchToAdminNetwork +Assigned/WaitingForNetworkReconfig +PostAssignedMeasuring/WaitingForMeasurements (when attestation is enabled) +WaitingForCleanup/Init +WaitingForCleanup/SecureEraseBoss (Dell BOSS platforms) +WaitingForCleanup/HostCleanup +WaitingForCleanup/CreateBossVolume (Dell BOSS platforms) +BomValidating/UpdatingInventory +Ready +``` + +If attestation is disabled, NICo moves from +`Assigned/WaitingForNetworkReconfig` directly into `WaitingForCleanup/Init`. + +During the flow, NICo: + +1. Reboots the host into the discovery image used by Scout. +2. Switches DPU and DPA networking back to the admin network. +3. Waits for network configuration, extension services, and cleanup-related + health reports to converge. +4. Deletes the instance record and releases tenant network resources. +5. Runs measured boot or attestation checks when configured. +6. Runs storage, memory-overwrite, and InfiniBand cleanup from Scout. +7. Applies Redfish power control where needed to complete cleanup and pending + platform changes. +8. Validates inventory before returning the host to `Ready`. + +## Track Progress + +Use two layers of inspection: + +| Layer | Tool | Use | +|---|---|---| +| REST tenant and provider lifecycle | `nicocli` | Instance deletion, instance status, status history, and tenant-visible errors. | +| Core site cleanup lifecycle | `carbide-admin-cli` | Managed-host state, machine state history, health reports, measured boot, and cleanup-specific debugging. | + +Start with the REST-side instance status: + +```bash +nicocli instance status-history +nicocli instance get +``` + +If cleanup progress is unclear from the instance lifecycle, check the +managed-host state: + +```bash +carbide-admin-cli -c managed-host show +``` + +Check the machine view for state history and platform details: + +```bash +carbide-admin-cli -c machine show +``` + +Check health reports when cleanup appears blocked: + +```bash +carbide-admin-cli -c machine health-report show +``` + +### Happy Path Verification + +A normal release can be verified with this sequence: + +```bash +nicocli instance delete +nicocli instance status-history +carbide-admin-cli -c managed-host show +carbide-admin-cli -c machine health-report show +``` + +Success indicators: + +- The instance moves through deletion or termination from the REST perspective. +- The managed host progresses through the cleanup states and reaches `Ready`. +- Cleanup-related health reports are clear. +- No blocking health report prevents allocation. + +Useful metrics for fleet-level monitoring include: + +| Metric | Use | +|---|---| +| `carbide_machines_per_state` | Count machines in each managed-host state. | +| `carbide_machines_per_state_above_sla` | Find machines that have remained in a state longer than the state-machine SLA. | +| `carbide_machines_time_in_state_seconds` | Review time spent in each state. | +| `carbide_reboot_attempts_in_booting_with_discovery_image` | Detect hosts that require repeated discovery-image reboots. | +| `carbide_measured_boot_machines_per_machine_state_total` | Review measured boot machine state coverage. | +| `carbide_pending_host_firmware_update_count` | Count hosts that need host firmware updates. | +| `carbide_pending_dpu_nic_firmware_update_count` | Count DPUs that need NIC firmware updates. | +| `carbide_active_host_firmware_update_count` | Count hosts actively updating firmware. | +| `carbide_running_dpu_updates_count` | Count DPUs actively updating firmware. | + +## Sanitization Steps + +Scout reports cleanup through `CleanupMachineCompleted`. The cleanup report can +include these step results: + +| Field | Meaning | +|---|---| +| `nvme` | NVMe cleanup result. | +| `hdd` | HDD/SAS block-device cleanup result. | +| `ram` | RAM cleanup result, when present. | +| `mem_overwrite` | UEFI `MemoryOverwriteRequestControl` validation result. | +| `ib` | InfiniBand cleanup result. | + +Each step has a result and a message. A failed NVMe cleanup moves the host to an +`NVMECleanFailed` failure state and keeps the host out of `Ready`. + +### NVMe Secure Erase + +Scout discovers NVMe controller devices and formats each namespace with secure +erase: + +```bash +nvme format -s2 -f -n +``` + +When namespace management is supported, Scout deletes existing namespaces after +format, creates a replacement namespace sized from controller capacity, and +attaches it to the controller. + +On supported Lenovo M.2 NVMe 2-Bay RAID Kit systems, Scout uses `mnv_cli` to +remove RAID virtual disks and send NVMe passthrough cleanup commands to the +underlying disks. + +### HDD and SAS Cleanup + +Scout also reports an `hdd` cleanup result for HDD/SAS block-device cleanup. +Treat a failed `hdd` result the same way as other cleanup failures: keep the +host out of allocation until the failure is remediated and the cleanup path +completes successfully. + +### Memory Overwrite + +Scout validates the UEFI memory-overwrite control variable: + +```text +MemoryOverwriteRequestControl-e20939be-32d4-41be-a150-897f85d49829 +``` + +The `mem_overwrite` cleanup step passes when the variable is set to `1`. If site +policy requires a manual volatile-memory procedure, such as a full AC drain, +complete that procedure before returning the host to allocation. + +### Dell BOSS Cleanup + +On supported Dell platforms with a BOSS controller, NICo performs additional +storage cleanup: + +1. Disable iDRAC lockdown for the storage operation. +2. Decommission the BOSS storage controller through Redfish. +3. Wait for the Redfish job to complete. +4. Run Scout host cleanup. +5. Recreate the BOSS virtual disk as `VD_0`. +6. Re-enable host lockdown. +7. Continue to post-cleanup validation. + +If the Redfish job fails, NICo retries the job path and may power-cycle the host +as part of the recovery loop. + +### InfiniBand Cleanup + +Scout reports InfiniBand cleanup through the `ib` cleanup step. NICo also uses +cleanup-related health reports, including `IbCleanupPending`, to prevent the +state machine from advancing before InfiniBand cleanup has cleared. + +## Platform Reset and Trust Controls + +Tenant cleanup includes platform and trust controls that run through Redfish, +firmware management, measured boot, and site policy. + +| Control | How to verify | +|---|---| +| Redfish power control | The state machine uses `ForceRestart` during cleanup and after Scout cleanup completion. Redfish `ForceRestart` is also the reset type used to apply pending BIOS or UEFI changes. | +| TPM clear | NICo includes vendor-specific Redfish support for TPM clear. Verify completion through the platform-specific cleanup evidence used by the site. | +| BIOS recommit | Verify that pending BIOS or UEFI settings have been applied after the cleanup `ForceRestart` path. | +| DPU restricted mode and BMC in-band restrictions | Verify that tenant-side network configuration has been removed, admin-network configuration has synced, and platform lockdown settings are in the expected post-cleanup state. | +| Firmware default version | Verify that host and DPU firmware match the configured site default or are under an approved firmware update workflow. | +| Measured boot | Verify measured boot state when attestation is enabled. Measured boot may be configured in permissive mode; in that mode, use measurement results as cleanup evidence according to site policy. | + +Useful attestation commands include: + +```bash +carbide-admin-cli -c attestation measured-boot machine show +carbide-admin-cli -c att mb machine show +``` + +## Return-to-Pool Checklist + +A released host is ready for reuse when all required gates pass: + +- The prior instance is released and no longer active. +- Tenant VPC prefix segments and DPU loopback IP allocations are released. +- DPU and DPA networking have returned to the admin network. +- Extension services from the prior tenant have terminated. +- Scout cleanup has completed. +- NVMe and HDD/SAS cleanup have succeeded, or an approved exception exists. +- The memory-overwrite check has passed, and any required manual + volatile-memory procedure is complete. +- InfiniBand cleanup has completed and blocking cleanup health reports are + clear. +- TPM, BIOS/UEFI, lockdown, and firmware checks satisfy site policy. +- Measured boot or attestation checks satisfy site policy. +- Inventory validation has completed. +- The managed host is in `Ready`. +- No blocking health report prevents allocation. + +## Troubleshooting Stuck Cleanup + +Use the current managed-host state to choose the next check. + +Start with the REST lifecycle: + +```bash +nicocli instance status-history +nicocli instance list --status error --output table +``` + +If the REST lifecycle does not explain the stall, inspect the Core cleanup +state: + +```bash +carbide-admin-cli -c managed-host show +carbide-admin-cli -c machine health-report show +``` + +| State | What it means | Checks | +|---|---|---| +| `Assigned/BootingWithDiscoveryImage` | The host is rebooting into the discovery image. | Check BMC reachability, host power state, boot order, and repeated reboot metrics. | +| `Assigned/SwitchToAdminNetwork` | NICo is moving the host out of tenant networking. | Check DPU agent status, DPA status, and admin-network config generation. | +| `Assigned/WaitingForNetworkReconfig` | NICo is waiting for network configuration to converge. | Check DPU sync, DPA sync, extension-service termination, and cleanup-related health reports. | +| `PostAssignedMeasuring/WaitingForMeasurements` | Attestation is enabled and NICo is waiting for measurements. | Check measured boot machine state, trusted profile or bundle status, and site policy for permissive mode. | +| `WaitingForCleanup/SecureEraseBoss` | NICo is decommissioning Dell BOSS storage. | Check iDRAC lockdown state, Redfish job status, and BOSS controller reachability. | +| `WaitingForCleanup/HostCleanup` | NICo is waiting for Scout cleanup completion. | Check Scout logs, cleanup report submission, NVMe/HDD cleanup, memory-overwrite result, and InfiniBand cleanup result. | +| `WaitingForCleanup/CreateBossVolume` | NICo is recreating the Dell BOSS virtual disk. | Check Redfish job status and confirm the recreated volume is `VD_0`. | +| `BomValidating/UpdatingInventory` | Cleanup completed and NICo is validating inventory. | Check BMC reachability, inventory collection, firmware update status, and blocking health reports. | +| `Failed` with `NVMECleanFailed` | Storage cleanup failed. | Keep the host out of allocation, inspect the cleanup error message, remediate the storage issue, and rerun the approved cleanup recovery path. | + +For log review, start with the NICo API or state-controller logs, Scout cleanup +logs, DPU agent logs, hardware-health logs, and Redfish job status from the BMC. + +## Manual Procedures + +Some environments require additional manual assurance before a host is reused. +Apply these only when required by site policy: + +- Full AC drain for volatile-memory handling. +- Firmware bundle reflash. +- Manual TPM clear if automated platform cleanup is unavailable. +- Manual firmware remediation when a host or DPU does not match the configured + site default. + +Record the completed procedure, the target machine ID, the reason, the operator, +and the evidence used to approve return to allocation.