diff --git a/README.md b/README.md index 2b3538d..f9d08a2 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ Jupyter notebook open with an 8 GB tensor on the GPU and went to lunch — `nvidia-smi` will show 1% utilization, but the card is *unusable* by anyone else. This tool measures that. -> **Status:** main is being reset around the bare-metal 1.0 scope. +> **Status:** main tracks the bare-metal 1.0 scope. > `gua doctor` checks only the current machine. `daemon` records NVML > telemetry from the current NVIDIA host, `report` reads the resulting > SQLite database, and `demo` runs anywhere with fake telemetry. The Go diff --git a/projects/auto-runtime-audit/plan.ko.md b/projects/auto-runtime-audit/plan.ko.md deleted file mode 100644 index 0f159ec..0000000 --- a/projects/auto-runtime-audit/plan.ko.md +++ /dev/null @@ -1,357 +0,0 @@ -# Auto Runtime Audit 개발 계획 - -상태: 보류 -범위: auto-runtime architecture 제안을 구현하기 위한 개발 계획 - -> 2026-05-14 scope reset: 1.0 제품은 auto-runtime/cluster-wide audit 대신 -> **설치된 현재 베어메탈 머신**만 진단하고 수집하는 방향으로 정리한다. 1.0 -> 기준 문서는 `projects/bare-metal-1.0/plan.ko.md`를 따른다. 이 문서는 -> Kubernetes, Slurm, Docker/Podman, scheduler allocation-aware report를 다시 -> 확장할 때 참고할 보류 문서로 남긴다. - -## 목표 - -`gpu-usage-audit`를 실제 GPU telemetry와 scheduler allocation context를 -결합하는 retrospective audit 도구로 만든다. - -제품은 다음 질문에 답해야 한다. - -- 누가 GPU capacity를 할당받았는가? -- 할당받은 GPU를 실제로 사용했는가? -- scheduler allocation 없이 GPU를 사용한 주체는 누구인가? -- 어떤 GPU가 memory-held 상태였지만 compute-idle이었는가? - -구현은 top-down으로 진행한다. 먼저 사용자에게 보일 report, runtime plan, -data model, fake end-to-end flow를 정의한다. 그 다음 실제 host, Kubernetes, -Slurm adapter를 붙인다. 
- -## 기대 아키텍처 - -기대하는 module 경계: - -```text -gpu_usage_audit/ - cli/ # gua doctor/start/status/report/stop - doctor/ # environment check와 RuntimePlan 생성 - runtime/ # collector가 어디에서 실행되는가 - telemetry/ # 실제 GPU fact, 보통 NVML - scheduler/ # allocation과 ownership context - attribution/ # PID -> pod/job/user 매핑 - storage/ # SQLite schema, migration, export, rollup - report/ # classification, aggregation, rendering - packaging/ # systemd unit, k8s manifest, OCI image -``` - -핵심 분리: - -```text -Runtime placement: collector process가 어디에서 실행되는가? -Telemetry source: 실제 GPU 상태를 어떻게 관측하는가? -Scheduler context: 누가 GPU capacity를 할당받았는가? -Attribution: 관측된 PID를 owner로 어떻게 되돌려 매핑하는가? -Report model: telemetry와 allocation을 어떻게 결합하는가? -``` - -Kubernetes와 Slurm은 scheduler context provider다. telemetry source가 아니다. -기본 telemetry source는 계속 NVML이다. - -## 지원 영역 - -| 영역 | Runtime | Telemetry | Scheduler | 기대 기능 | -|---|---|---|---|---| -| Bare metal | host systemd 또는 foreground | NVML | none | active / idle-held / truly-idle | -| Bare metal + Slurm | host systemd | NVML | Slurm | job, user, account audit | -| Kubernetes / GPU Operator | DaemonSet | pod 내부 NVML | Kubernetes | pod와 namespace audit | -| Local Docker/Podman | local container | container 내부 NVML | none | host 직접 실행이 불가능할 때 fallback | -| Demo/test | foreground | fake | fake 또는 none | GPU 접근 없이 제품 의미 검증 | - -## Delivery 원칙 - -- 모든 PR은 독립적으로 merge 가능해야 하며, merge 후 프로젝트는 동작 가능한 - 상태여야 한다. -- 새 `gua` command surface를 도입하는 동안 기존 command는 compatibility - alias로 유지할 수 있다. -- detection은 read-only여야 한다. package를 설치하거나 system/cluster 상태를 - 변경하면 안 된다. -- `start`는 system 또는 cluster 상태를 변경하기 전에 concrete plan을 보여줘야 - 한다. -- runtime placement와 scheduler context는 독립적으로 감지해야 한다. -- fake telemetry와 fake scheduler flow로 실제 cluster integration 전에 report - semantics를 검증해야 한다. - -## PR 계획 - -### PR 1: Proposal And Roadmap - -현재 PR. - -Deliver: - -- Auto-runtime architecture proposal. -- 한국어 번역본. -- 이 PR 단위 개발 계획. - -Working state: - -- 문서 변경만 포함한다. 
-- runtime behavior 변경은 없다. - -Merge 전 정리: - -- runtime placement와 scheduler context가 독립적이라는 점을 명확히 한다. -- Kubernetes owner identity는 안정적인 UID를 기준으로 두고, namespace/name은 - display field로 둔다. -- GPU request 없이 `NVIDIA_VISIBLE_DEVICES=all`이 있는 경우 anomaly로 다루되, - 이 collector, DCGM, NVIDIA device/plugin component 같은 GPU management - agent는 명시적으로 예외 처리한다. -- 의도하지 않은 Markdown trailing whitespace를 제거한다. 단, hard line break가 - 의도된 경우는 예외다. - -### PR 2: Command Surface Skeleton - -Deliver: - -- `gua` console entry point. -- Top-level commands: - -```sh -gua doctor -gua start --dry-run -gua status -gua report -gua stop -gua uninstall -``` - -- 기존 `gpu-usage-audit daemon/report/demo` compatibility path. -- unsupported 또는 아직 설치되지 않은 mode에 대한 명확한 placeholder behavior. -- CLI smoke test. - -Working state: - -- 사용자는 새 command surface를 실행해볼 수 있다. -- 기존 문서화된 command는 계속 동작한다. -- `start/status/stop`은 아무것도 조용히 변경하지 않는다. - -### PR 3: RuntimePlan And Doctor V1 - -Deliver: - -- `RuntimePlan` model. -- `gua doctor` human-readable output. -- `gua doctor --json`. -- `gua start --dry-run`에서 recommended plan 출력. -- 다음 항목에 대한 read-only check: - - OS/kernel/Python. - - `/dev/nvidia*`. - - NVML load/init/device count. - - `kubectl` 존재와 auth. - - Kubernetes runtime signal. - - Slurm command/config signal. - - Docker/Podman NVIDIA fallback signal. - -Working state: - -- 사용자는 아무것도 설치하지 않고 현재 machine에 어떤 runtime path가 추천되는지 - 이해할 수 있다. - -### PR 4: Data Model V2 And Migration - -Deliver: - -- Schema versioning과 migration. -- `node`. -- 확장된 `gpu_sample`. -- `gpu_process_sample`. -- `allocation_sample`. -- `owner_sample`. -- legacy DB read compatibility. - -Working state: - -- 기존 host daemon/report behavior가 새 schema에서도 계속 동작한다. -- scheduler allocation이 없어도 report는 기존 active / idle-held / truly-idle - view를 출력한다. 
- -### PR 5: Combined Classification And Fake Scheduler - -Deliver: - -- Allocation-aware classification: - -```text -allocated-active -allocated-idle-held -allocated-unused -unallocated-active -unallocated-idle-held -truly-idle -unknown-active -unknown-idle-held -unknown-unused -``` - -- Fake scheduler adapter. -- allocated, unallocated, unknown allocation state를 모두 포함하는 demo data. -- combined class report section. -- classification과 report aggregation test. - -Working state: - -- 실제 GPU, Kubernetes, Slurm 없이도 최종 제품 의미를 검증할 수 있다. - -### PR 6: Install State And Local Host Runtime - -Deliver: - -- Local install state file. -- Default DB path. -- Host foreground runtime adapter. -- `gua start --mode host --foreground`. -- `gua status`. -- `--db`가 생략되면 state를 사용하는 `gua report --since ...`. -- 가능한 foreground/state-aware flow에서 `gua stop`. - -Working state: - -- Single-host 사용자는 매 command마다 직접 `--db`를 넘기지 않고 새 `gua` - workflow를 사용할 수 있다. - -### PR 7: Systemd Host Runtime - -Deliver: - -- systemd unit template. -- `gua start --mode host`. -- `gua stop`. -- `gua uninstall`. -- `gua uninstall --delete-data`. -- `--dry-run`과 `--yes`. -- root/permission diagnostic. -- 기본 data 보존. - -Working state: - -- bare-metal host collection을 새 UX로 설치, 중지, 제거할 수 있다. - -### PR 8: Kubernetes Manifest Dry Run - -Deliver: - -- 내장 Kubernetes manifest template. -- Namespace, ServiceAccount, RBAC, ConfigMap, DaemonSet rendering. -- GPU-capable node targeting logic. -- `hostPID: true` 기본값. -- `--no-host-pid` opt-out. -- plan output의 security와 RBAC 설명. - -Working state: - -- 사용자는 Kubernetes cluster에 무엇이 설치될지 apply 없이 정확히 검토할 수 있다. - -### PR 9: Kubernetes Runtime Adapter - -Deliver: - -- 공식 OCI image path. -- `gua start --mode k8s`. -- `gua status --mode k8s`. -- `gua stop --mode k8s`. -- `kubectl apply/delete` integration. -- Collector pod discovery. -- Node별 hostPath SQLite DB. -- Node-level last-sample status. - -Working state: - -- Kubernetes GPU node에서 DaemonSet으로 collector를 실행할 수 있다. 
-- Scheduler attribution은 아직 limited일 수 있다. - -### PR 10: Kubernetes Report Export - -Deliver: - -- `gua report --since ... --node NODE`. -- `gua report --since ... --all-nodes`. -- Collector pod fan-out. -- Windowed export. -- JSONL export format. -- Parallel collection. -- `pods/exec` RBAC diagnostic. - -Working state: - -- 사용자는 per-node collector database에서 cluster-level report를 만들 수 있다. - -### PR 11: Kubernetes Scheduler Attribution - -Deliver: - -- Kubernetes API owner snapshot. -- Pod UID 기반 owner identity. -- PodResources API integration. -- Pod resource request/limit parsing. -- `/proc//cgroup` PID-to-pod mapping. -- cgroup v1/v2 parser coverage. -- `NVIDIA_VISIBLE_DEVICES=all` anomaly detection. -- GPU management pod exception. - -Working state: - -- Kubernetes report에서 pod/namespace별 allocated-active, allocated-unused, - unallocated-active, unallocated-idle-held를 볼 수 있다. - -### PR 12: Slurm Doctor And Scheduler Adapter - -Deliver: - -- Doctor의 Slurm detection. -- `scontrol`, `squeue`, optional `sacct` integration. -- Node-level running job allocation snapshot. -- job/user/account owner model. -- requested GPU count. -- cgroup PID-to-job mapping. -- best-effort exact GPU-to-job mapping. - -Working state: - -- Slurm compute node에서 job, user, account별 GPU usage report가 동작한다. - -### PR 13: Rollup And Retention - -Deliver: - -- Raw sample retention policy. -- 1-minute rollup table. -- Combined class rollup. -- Cleanup command. -- raw와 rollup window를 함께 읽는 report. - -Working state: - -- 장기 실행 collector가 core audit class를 잃지 않으면서 DB size를 통제한다. - -### PR 14: Packaging And Release Polish - -Deliver: - -- host, Kubernetes, Slurm, demo path를 위한 README quickstart. -- Troubleshooting matrix. -- Wheel release verification. -- OCI image release workflow. -- Manifest path가 안정화되었다면 optional Helm chart. - -Working state: - -- 새 사용자가 문서만 보고 install, start, inspect, report, uninstall을 진행할 수 - 있다. - -## 권장 Merge 순서 - -핵심 foundation은 PR 2부터 PR 5까지다. 
- -```text -CLI surface -> RuntimePlan/doctor -> schema V2 -> combined report semantics -``` - -그 다음 host, Kubernetes, Slurm은 안정된 contract 위에 붙는 adapter 작업이 된다. diff --git a/projects/auto-runtime-audit/plan.md b/projects/auto-runtime-audit/plan.md deleted file mode 100644 index 4683402..0000000 --- a/projects/auto-runtime-audit/plan.md +++ /dev/null @@ -1,359 +0,0 @@ -# Auto Runtime Audit Development Plan - -Status: on hold -Scope: development plan for the auto-runtime architecture proposal - -> 2026-05-14 scope reset: the 1.0 product is focused on diagnosing and -> collecting from **the currently installed bare-metal machine**, not -> auto-runtime or cluster-wide audit. The 1.0 plan of record is -> `projects/bare-metal-1.0/plan.ko.md`. This document remains as a deferred -> reference for a future expansion back into Kubernetes, Slurm, Docker/Podman, -> and scheduler allocation-aware reporting. - -## Goal - -Build `gpu-usage-audit` as a retrospective audit tool that joins actual GPU -telemetry with scheduler allocation context. - -The product should answer: - -- Who was allocated GPU capacity? -- Did they actually use it? -- Who used GPUs without scheduler allocation? -- Which GPUs were memory-held but compute-idle? - -The implementation should be top-down. First define the user-facing report, -runtime plan, data model, and fake end-to-end flow. Then attach real host, -Kubernetes, and Slurm adapters. 
- -## Architecture Shape - -Expected module boundaries: - -```text -gpu_usage_audit/ - cli/ # gua doctor/start/status/report/stop - doctor/ # environment checks and RuntimePlan creation - runtime/ # where the collector runs - telemetry/ # actual GPU facts, usually NVML - scheduler/ # allocation and ownership context - attribution/ # PID -> pod/job/user mapping - storage/ # SQLite schema, migration, export, rollup - report/ # classification, aggregation, rendering - packaging/ # systemd units, k8s manifests, OCI image -``` - -The core separation: - -```text -Runtime placement: where does the collector process run? -Telemetry source: how do we observe actual GPU state? -Scheduler context: who was allocated GPU capacity? -Attribution: how do observed PIDs map back to owners? -Report model: how do telemetry and allocation combine? -``` - -Kubernetes and Slurm are scheduler context providers. They are not telemetry -sources. The default telemetry source remains NVML. - -## Supported Areas - -| Area | Runtime | Telemetry | Scheduler | Expected capability | -|---|---|---|---|---| -| Bare metal | host systemd or foreground | NVML | none | active / idle-held / truly-idle | -| Bare metal + Slurm | host systemd | NVML | Slurm | job, user, account audit | -| Kubernetes / GPU Operator | DaemonSet | NVML inside pod | Kubernetes | pod and namespace audit | -| Local Docker/Podman | local container | NVML inside container | none | fallback when host execution is unavailable | -| Demo/test | foreground | fake | fake or none | product semantics without GPU access | - -## Delivery Principles - -- Every PR must merge independently and leave the project in a working state. -- Existing commands may remain as compatibility aliases while the new `gua` - command surface is introduced. -- Detection must be read-only. It must not install packages or mutate system or - cluster state. -- `start` must show a concrete plan before changing system or cluster state. 
-- Runtime placement and scheduler context must be detected independently. -- Fake telemetry and fake scheduler flows should prove the report semantics - before real cluster integrations are added. - -## PR Plan - -### PR 1: Proposal And Roadmap - -Current PR. - -Deliver: - -- Auto-runtime architecture proposal. -- Korean translation. -- This PR-based development plan. - -Working state: - -- Documentation-only change. -- No runtime behavior changes. - -Before merge: - -- Clarify that runtime placement and scheduler context are independent. -- Use Kubernetes UID as the stable owner identity, with namespace/name as - display fields. -- Treat `NVIDIA_VISIBLE_DEVICES=all` without GPU request as an anomaly, with - explicit exceptions for GPU management agents such as this collector, DCGM, - and NVIDIA device/plugin components. -- Remove unintended Markdown trailing whitespace unless a hard line break is - deliberately required. - -### PR 2: Command Surface Skeleton - -Deliver: - -- `gua` console entry point. -- Top-level commands: - -```sh -gua doctor -gua start --dry-run -gua status -gua report -gua stop -gua uninstall -``` - -- Existing `gpu-usage-audit daemon/report/demo` compatibility path. -- Clear placeholder behavior for unsupported or not-yet-installed modes. -- CLI smoke tests. - -Working state: - -- Users can run the new command surface. -- Existing documented commands still work. -- `start/status/stop` do not silently mutate anything. - -### PR 3: RuntimePlan And Doctor V1 - -Deliver: - -- `RuntimePlan` model. -- `gua doctor` human-readable output. -- `gua doctor --json`. -- `gua start --dry-run` rendering the recommended plan. -- Read-only checks for: - - OS/kernel/Python. - - `/dev/nvidia*`. - - NVML load/init/device count. - - `kubectl` presence and auth. - - Kubernetes runtime signals. - - Slurm command/config signals. - - Docker/Podman NVIDIA fallback signals. 
- -Working state: - -- Users can understand which runtime path is recommended on the current - machine without installing anything. - -### PR 4: Data Model V2 And Migration - -Deliver: - -- Schema versioning and migration. -- `node`. -- expanded `gpu_sample`. -- `gpu_process_sample`. -- `allocation_sample`. -- `owner_sample`. -- Legacy DB read compatibility. - -Working state: - -- Existing host daemon/report behavior continues on the new schema. -- Scheduler allocation may be absent, but reports still produce the legacy - active / idle-held / truly-idle view. - -### PR 5: Combined Classification And Fake Scheduler - -Deliver: - -- Allocation-aware classification: - -```text -allocated-active -allocated-idle-held -allocated-unused -unallocated-active -unallocated-idle-held -truly-idle -unknown-active -unknown-idle-held -unknown-unused -``` - -- Fake scheduler adapter. -- Demo data covering allocated, unallocated, and unknown allocation states. -- Report section for combined classes. -- Tests for classification and report aggregation. - -Working state: - -- The final product meaning is testable without real GPUs, Kubernetes, or Slurm. - -### PR 6: Install State And Local Host Runtime - -Deliver: - -- Local install state file. -- Default DB path. -- Host foreground runtime adapter. -- `gua start --mode host --foreground`. -- `gua status`. -- `gua report --since ...` using state when `--db` is omitted. -- `gua stop` for foreground/state-aware flows where applicable. - -Working state: - -- Single-host users can use the new `gua` workflow without manually passing - `--db` through every command. - -### PR 7: Systemd Host Runtime - -Deliver: - -- systemd unit template. -- `gua start --mode host`. -- `gua stop`. -- `gua uninstall`. -- `gua uninstall --delete-data`. -- `--dry-run` and `--yes`. -- root/permission diagnostics. -- Data preservation by default. - -Working state: - -- Bare-metal host collection can be installed, stopped, and removed through the - new UX. 
- -### PR 8: Kubernetes Manifest Dry Run - -Deliver: - -- Embedded Kubernetes manifest templates. -- Namespace, ServiceAccount, RBAC, ConfigMap, and DaemonSet rendering. -- GPU-capable node targeting logic. -- `hostPID: true` default. -- `--no-host-pid` opt-out. -- Security and RBAC explanation in the plan output. - -Working state: - -- Users can inspect exactly what would be installed in a Kubernetes cluster - without applying it. - -### PR 9: Kubernetes Runtime Adapter - -Deliver: - -- Official OCI image path. -- `gua start --mode k8s`. -- `gua status --mode k8s`. -- `gua stop --mode k8s`. -- `kubectl apply/delete` integration. -- Collector pod discovery. -- Per-node hostPath SQLite DB. -- Node-level last-sample status. - -Working state: - -- Kubernetes GPU nodes can run collectors through a DaemonSet. -- Scheduler attribution may still be limited. - -### PR 10: Kubernetes Report Export - -Deliver: - -- `gua report --since ... --node NODE`. -- `gua report --since ... --all-nodes`. -- Collector pod fan-out. -- Windowed export. -- JSONL export format. -- Parallel collection. -- `pods/exec` RBAC diagnostics. - -Working state: - -- Users can generate a cluster-level report from per-node collector databases. - -### PR 11: Kubernetes Scheduler Attribution - -Deliver: - -- Kubernetes API owner snapshot. -- Pod UID based owner identity. -- PodResources API integration. -- Pod resource request/limit parsing. -- `/proc//cgroup` PID-to-pod mapping. -- cgroup v1/v2 parser coverage. -- `NVIDIA_VISIBLE_DEVICES=all` anomaly detection. -- GPU management pod exceptions. - -Working state: - -- Kubernetes reports can show allocated-active, allocated-unused, - unallocated-active, and unallocated-idle-held by pod/namespace. - -### PR 12: Slurm Doctor And Scheduler Adapter - -Deliver: - -- Slurm detection in doctor. -- `scontrol`, `squeue`, and optional `sacct` integration. -- Node-level running job allocation snapshot. -- job/user/account owner model. -- Requested GPU count. 
-- cgroup PID-to-job mapping. -- Best-effort exact GPU-to-job mapping. - -Working state: - -- Slurm compute nodes can report GPU usage by job, user, and account. - -### PR 13: Rollup And Retention - -Deliver: - -- Raw sample retention policy. -- 1-minute rollup tables. -- Combined class rollup. -- Cleanup command. -- Report support for raw plus rollup windows. - -Working state: - -- Long-running collectors keep DB size under control without losing the core - audit classes. - -### PR 14: Packaging And Release Polish - -Deliver: - -- README quickstart for host, Kubernetes, Slurm, and demo paths. -- Troubleshooting matrix. -- Wheel release verification. -- OCI image release workflow. -- Optional Helm chart, if the manifest path has stabilized. - -Working state: - -- A new user can install, start, inspect, report, and uninstall using the docs. - -## Recommended Merge Order - -The critical foundation is PR 2 through PR 5: - -```text -CLI surface -> RuntimePlan/doctor -> schema V2 -> combined report semantics -``` - -After that, host, Kubernetes, and Slurm become adapter work against stable -contracts. diff --git a/projects/bare-metal-1.0/handoff.ko.md b/projects/bare-metal-1.0/handoff.ko.md index c1a4b29..fea4c2a 100644 --- a/projects/bare-metal-1.0/handoff.ko.md +++ b/projects/bare-metal-1.0/handoff.ko.md @@ -22,11 +22,15 @@ - `daemon`은 기존 DB 파일이 있으면 실패한다. - `report`는 DB 파일이 없으면 실패한다. - `gua`의 사용자 표면은 `doctor`만 남긴다. +- auto-runtime proposal/project 문서는 삭제했다. Kubernetes/Slurm/Docker/Podman + 확장을 다시 시작하려면 새 proposal로 시작한다. ## 현재 상태 - PR A: implemented in PR #9. - PR B: implemented in PR #10. +- Post-1.0 cleanup: 완료. auto-runtime 문서와 `RuntimePlan`/env detection + 잔재를 제거했다. - PR C: 구현 대부분은 README/CLI에 반영된 것으로 보이나 계획서에는 아직 완료 상태가 없다. - PR D: 대기. 현재 버전은 `0.4.1`이며 1.0 release bump는 아직 하지 않았다. @@ -34,14 +38,18 @@ 마지막 로컬 검증은 모두 통과했다. 
```sh -uv run pytest uv run ruff check uv run ruff format --check uv run mypy -uv build --out-dir /tmp/gua-dist-check-20260515 -bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-check-20260515/gpu_usage_audit-0.4.1-py3-none-any.whl +uv run pytest +uv build --out-dir /tmp/gua-dist-prune-20260515 +bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-prune-20260515/gpu_usage_audit-0.4.1-py3-none-any.whl ``` +cleanup 후 결과는 `pytest` 107 passed, `mypy` 25 source files, `ruff format` +26 files 기준이다. `/tmp/gua-dist-prune-20260515`로 build와 wheel smoke도 +통과했다. + ## 주의할 점 - 현재 로컬 개발 머신은 NVIDIA host가 아니다. `gua doctor`가 unsupported를 내는 것은 @@ -49,6 +57,8 @@ bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-check-20260515/gpu_usage_audit-0. - `/tmp/gua.db`가 이미 존재한다. 기본 경로 daemon 테스트는 이 파일 때문에 실패하는 것이 기대 동작이다. - 실제 1.0 acceptance는 NVIDIA 베어메탈 호스트에서만 닫을 수 있다. +- `daemon`과 `demo`는 host row의 `env_kind`를 항상 `"bare"`로 기록한다. 1.0은 + container/k8s runtime 감지를 하지 않는다. - PR C를 닫기 전에 문서만 보고 끝내지 말고, 기존 DB 존재/부재 error UX가 README와 CLI 출력에서 서로 같은 메시지를 주는지 확인한다. - PR D에서 tag를 만들기 전에는 `scripts/check-tag-version.py`가 tag와 diff --git a/projects/bare-metal-1.0/plan.ko.md b/projects/bare-metal-1.0/plan.ko.md index 1856439..ec119a6 100644 --- a/projects/bare-metal-1.0/plan.ko.md +++ b/projects/bare-metal-1.0/plan.ko.md @@ -234,7 +234,8 @@ Deliver: - [x] auto-runtime doctor 구현 제거 또는 축소. - [x] `gua doctor`를 local machine / host NVML readiness 전용으로 재작성. - [x] k8s/slurm/docker signal 제거. -- [x] `RuntimePlan`을 host/unsupported 중심으로 축소. +- [x] auto-runtime `RuntimePlan` 잔재를 제거하고 `gua doctor` 내부의 + `DoctorPlan`으로 축소. - [x] README의 제품 설명을 single-host bare-metal 중심으로 재정렬. - [x] `gua start/status/report/stop/uninstall` placeholder 사용자 표면 제거. - [x] `gua doctor --db PATH`로 실제 daemon/report DB 경로를 점검. @@ -313,16 +314,12 @@ gpu-usage-audit report --since 1h --interval 30s ## Deferred Work -아래는 1.0 GA 전 또는 1.0 이후 다시 검토한다. +아래는 1.0 GA 전 또는 이후 다시 검토할 수 있는 운영 품질 항목이다. 
Kubernetes, +Slurm, Docker/Podman, scheduler allocation, managed runtime 같은 1.0 이후 +제품 확장은 현재 코드베이스와 프로젝트 문서에서 제거했다. 다시 진행하려면 새 +proposal로 시작한다. - `nvidia-ml-py` upper bound 정책 (`>=12.535,<13` 같은 known-good range 여부). - `NVMLInfo.failure_kind` 같은 구조적 실패 타입 도입. - unsupported text output에 `Blockers:` 섹션을 별도로 노출할지 결정. - raw NVML detail의 redact 옵션 또는 JSON 필드 분리. -- Kubernetes current-node 진단. -- GPU Operator staged NVML path. -- Slurm allocation context. -- Docker/Podman fallback collector. -- scheduler allocation-aware report. -- DB schema v2. -- managed `gua start/status/stop/uninstall`. diff --git a/projects/bare-metal-1.0/status.ko.md b/projects/bare-metal-1.0/status.ko.md index f72629d..5962303 100644 --- a/projects/bare-metal-1.0/status.ko.md +++ b/projects/bare-metal-1.0/status.ko.md @@ -5,11 +5,11 @@ ## 요약 Bare Metal 1.0은 단일 NVIDIA 베어메탈 호스트만 대상으로 하는 방향으로 정리되어 -있다. PR A/B 범위는 구현 완료 상태이고, 다음으로는 PR C runbook hardening을 +있다. PR A/B 범위는 구현 완료 상태이고, 이번 cleanup에서 1.0 이후 확장을 위한 +auto-runtime 문서와 코드 잔재를 제거했다. 다음으로는 PR C runbook hardening을 닫을지 확인한 뒤 PR D release prep으로 넘어가면 된다. -점검 시작 시 워크트리는 깨끗했다. 현재 변경분은 이 `status.ko.md`와 -`handoff.ko.md` 추가뿐이다. +cleanup 시작 시 워크트리는 깨끗했다. ## 구현 상태 @@ -20,37 +20,48 @@ Bare Metal 1.0은 단일 NVIDIA 베어메탈 호스트만 대상으로 하는 | Packaging UX | 완료 | `nvidia-ml-py`가 기본 dependency이고 `nvml` extra는 빈 compatibility alias. | | `daemon`/`report` DB UX | 구현됨 | 기본 DB는 `/tmp/gua.db`; daemon은 기존 DB를 거부하고 report는 없는 DB를 거부. | | README bare-metal 문서 | 대부분 완료 | 2-shell flow, systemd 예시, 운영 notes가 들어가 있음. | +| Post-1.0 cleanup | 완료 | auto-runtime proposal/project 문서, k8s/docker env 감지, `RuntimePlan` 잔재 제거. | | PR C closure | 미확정 | 계획서에는 아직 완료 표시가 없다. README와 CLI UX를 기준으로 닫을지 최종 확인 필요. | | PR D release prep | 대기 | 현재 package version은 `0.4.1`; 1.0 릴리스 버전 bump와 릴리스 노트 정리가 남음. | | NVIDIA host acceptance | 미검증 | 현재 로컬 머신에는 NVIDIA device/driver가 없어 실제 host 수집 loop는 확인하지 못함. 
| ## 검증 결과 -2026-05-15 로컬 검증: +2026-05-15 cleanup 후 로컬 검증: ```sh git status --short -uv run pytest uv run ruff check uv run ruff format --check uv run mypy -uv build --out-dir /tmp/gua-dist-check-20260515 -bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-check-20260515/gpu_usage_audit-0.4.1-py3-none-any.whl +uv run pytest +uv build --out-dir /tmp/gua-dist-prune-20260515 +bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-prune-20260515/gpu_usage_audit-0.4.1-py3-none-any.whl env GITHUB_REF_NAME=v0.4.1 uv run python scripts/check-tag-version.py ``` 결과: -- `git status --short`: 점검 시작 시 변경 없음. 문서 작성 후에는 - `status.ko.md`, `handoff.ko.md`가 새 파일로 남아 있음. -- `pytest`: 118 passed. +- `git status --short`: cleanup 변경분만 존재. - `ruff check`: pass. -- `ruff format --check`: 28 files already formatted. -- `mypy`: no issues in 27 source files. +- `ruff format --check`: 26 files already formatted. +- `mypy`: no issues in 25 source files. +- `pytest`: 107 passed. - `uv build`: sdist/wheel build 성공. - wheel smoke: 성공. - tag-version check: `v0.4.1`과 `pyproject.toml` version 일치. +## 이번 cleanup 변경 + +- `proposals/design-auto-runtime*.md` 삭제. +- `projects/auto-runtime-audit/plan*.md` 삭제. +- `src/gpu_usage_audit/env.py`와 `tests/test_env.py` 삭제. +- `daemon`/`demo`는 1.0 계약대로 host `env_kind`를 `"bare"`로 직접 기록. +- `RuntimePlan` 모델 제거. `gua doctor`는 내부 `DoctorPlan`으로 host/unsupported, + reasons, blockers, warnings만 유지. +- `DoctorPlan` JSON에서 post-1.0 placeholder였던 `scheduler`, `telemetry`, + `confidence`, `required_privileges`, `actions` 필드 제거. 
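Note on the cleanup list above: the notes say `gua doctor` now keeps only an internal `DoctorPlan` with host/unsupported, reasons, blockers, and warnings, and that the `scheduler`, `telemetry`, `confidence`, `required_privileges`, and `actions` fields are gone from its JSON. The following is a minimal sketch of what such a reduced model could look like. The field name `mode`, the helper `plan_from_nvml`, and the message strings are assumptions for illustration only; the actual source is not shown in this diff.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Literal


@dataclass(frozen=True)
class DoctorPlan:
    """Hypothetical reduced plan: host/unsupported plus explanation lists."""

    mode: Literal["host", "unsupported"]  # assumed field name
    reasons: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Only the four fields above are serialized; no scheduler/telemetry keys.
        return json.dumps(asdict(self), sort_keys=True)


def plan_from_nvml(device_count: int | None) -> DoctorPlan:
    """Map an NVML probe result to a plan (hypothetical helper).

    device_count is None when NVML could not be loaded or initialized.
    """
    if device_count is None:
        return DoctorPlan("unsupported", blockers=["NVML could not be initialized"])
    if device_count == 0:
        return DoctorPlan("unsupported", blockers=["NVML initialized but reported 0 GPUs"])
    return DoctorPlan("host", reasons=[f"NVML reported {device_count} GPU(s)"])
```

On a machine without an NVIDIA driver, `plan_from_nvml(None)` yields an `unsupported` plan with one blocker, which matches the handoff note that `gua doctor` reporting unsupported on the local development machine is expected.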
+ ## 로컬 `doctor` 상태 현재 개발 머신은 NVIDIA host가 아니므로 `uv run gua doctor --json`은 diff --git a/proposals/design-auto-runtime.ko.md b/proposals/design-auto-runtime.ko.md deleted file mode 100644 index e733b68..0000000 --- a/proposals/design-auto-runtime.ko.md +++ /dev/null @@ -1,1144 +0,0 @@ -# gpu-usage-audit 자동 런타임 설계 - -상태: 초안 -작성일: 2026-05-12 - -## 개요 - -`gpu-usage-audit`는 사용자가 현재 머신이 베어메탈인지, Kubernetes인지, -컨테이너 런타임 호스트인지, Slurm compute node인지 몰라도 시작할 수 있는 -도구가 되어야 한다. - -목표 UX: - -```sh -gua doctor -gua start - -# 며칠 뒤 -gua status -gua report --since 3d -gua stop -``` - -제품은 적절한 collector 실행 방식을 자동으로 감지해야 한다. 단, 그 결정을 -숨기면 안 된다. 사용자는 배포 모델을 미리 알 필요가 없어야 하지만, `gua`는 -무엇을 선택했고 왜 그렇게 판단했는지 명확히 보여줘야 한다. - -예: - -```text -Detected environment: - host NVML: initialized, GPU count=0 - kubernetes: available - k8s NVIDIA runtime: available - slurm: not detected - -Recommended plan: - runtime: k8s-daemonset - telemetry: nvml - scheduler: k8s - -Reason: - GPUs are not visible from the host namespace, but they are visible inside - Kubernetes containers with NVIDIA_VISIBLE_DEVICES=all. -``` - -이것이 제품의 주요 변화다. `daemon`은 저수준 collector로 남기고, -`gua start`가 launcher/orchestrator 역할을 맡는다. - -## 동기와 차별점 - -이 프로젝트가 가져가야 할 영역은 raw GPU telemetry 그 자체가 아니다. -DCGM exporter, `nvidia-smi`, 여러 Grafana dashboard는 이미 utilization, -memory, temperature, process-level fact를 잘 보여준다. Slurm accounting, -Kubernetes metadata, cluster dashboard도 scheduler-side allocation과 -ownership을 보여준다. - -비어 있는 영역은 둘을 retrospective하게 join한 뷰다. - -```text -누가 GPU를 할당받았고, 그 GPU가 실제로 유의미한 일을 했는가? -scheduler allocation 없이 GPU를 사용한 주체는 누구인가? -어떤 GPU가 memory-held 상태였지만 compute-idle이었는가? -어떤 GPU가 할당됐지만 의미 있는 GPU process가 전혀 없었는가? -``` - -이 combined view가 핵심 가치다. 따라서 `gpu-usage-audit`는 또 하나의 live -GPU monitor가 되면 안 된다. 실제 NVML 관측과 scheduler context를 결합하는 -가벼운 retrospective audit 도구가 되어야 한다. - -가장 중요한 headline class는 다음이다. 
- -```text -allocated-idle-held # scheduler가 할당했고, process가 memory를 잡았지만 compute는 차가움 -allocated-unused # scheduler가 할당했지만, NVML상 의미 있는 사용이 없음 -unallocated-active # scheduler allocation 없이 GPU가 사용됨 -unallocated-idle-held # scheduler allocation 없이 GPU memory가 잡힘 -``` - -Kubernetes에서 `nvidia.com/gpu` request 없이 `NVIDIA_VISIBLE_DEVICES=all`이 -있는 pod는 first-class anomaly다. 이 pod는 scheduler accounting에 잡히지 -않는 GPU 접근 권한을 가질 수 있다. 이는 표준 GPU telemetry나 kube-state류 -metadata만으로는 만들어지지 않는 신호다. - -## 제품 목표 - -1. **첫 사용에 환경 지식이 필요 없어야 한다** - - 사용자는 node가 베어메탈인지, k8s인지, Docker인지, Slurm인지 몰라도 - `gua doctor`나 `gua start`를 실행할 수 있어야 한다. - -2. **마법처럼 숨기지 말고 투명해야 한다** - - auto mode는 선택한 plan, 판단 이유, 필요한 권한, 저장 위치, cleanup - 명령을 출력해야 한다. - - 고급 사용자는 `--mode host`, `--mode k8s`, `--mode slurm`, - `--mode container`로 명시 override할 수 있어야 한다. - -3. **Retrospective audit이 우선이다** - - 핵심 가치는 "지난 N시간/일 동안 무엇이 있었는가?"다. - - live dashboard, quota, scheduling decision, remediation은 첫 제품 - 표면이 아니다. - -4. **실제 GPU 사용과 scheduler allocation을 모두 측정한다** - - NVML은 "GPU가 일을 하고 있는가, memory를 잡고 있는가?"에 답한다. - - k8s/Slurm은 "이 GPU가 workload에 할당됐는가?"에 답한다. - - report는 둘을 결합해야 한다. - -5. **운영 부담이 낮아야 한다** - - 기본 저장소는 SQLite로 유지한다. - - 기본 경로에는 database service, web server, Prometheus, Grafana가 - 필요하지 않아야 한다. - -6. **실패 모드가 좋아야 한다** - - `gua`가 실행될 수 없다면 driver, NVML, device visibility, container - runtime, kubectl auth, Slurm config, permission 중 어느 층이 실패했는지 - 말해야 한다. - -## 비목표 - -- Slurm, Kubernetes, DCGM, Prometheus, Grafana, Open OnDemand, cluster - dashboard를 대체하지 않는다. -- quota를 enforce하거나 job을 kill하지 않는다. -- workload scheduling을 하지 않는다. -- 최소 제품에서 central server를 요구하지 않는다. -- 모든 설치를 silent하게 만들지 않는다. system 또는 cluster 상태 변경은 - 명시적이어야 한다. - -## 지원 환경 분류 - -### 1. 
베어메탈 host - -전형적인 형태: - -```text -/dev/nvidia0..N 이 host에서 보임 -host NVML init 성공 -host NVML device count > 0 -scheduler가 없거나 scheduler context 비활성 -``` - -Runtime: - -```text -runtime: host-systemd or host-foreground -telemetry: nvml -scheduler: none -``` - -현재 프로젝트와 가장 가까운 형태다. - -### 2. Kubernetes / GPU Operator - -전형적인 형태: - -```text -host에는 /dev/nvidiactl만 보일 수 있음 -host NVML device count가 0일 수 있음 -GPU device는 pod 안에 inject됨 -runtimeClassName=nvidia가 있을 수 있음 -NVIDIA_VISIBLE_DEVICES가 device 노출을 제어함 -``` - -Runtime: - -```text -runtime: k8s-daemonset -telemetry: nvml -scheduler: k8s -``` - -GPU가 container namespace 안에서만 보일 수 있으므로 collector는 Kubernetes -안에서 실행되어야 한다. - -사용자가 Docker를 직접 build하거나 run할 필요는 없어야 한다. 제품 내부에서 -공식 OCI image를 사용하는 것은 괜찮다. - -### 3. Slurm compute node - -전형적인 형태: - -```text -host /dev/nvidia0..N 이 보임 -Slurm이 GPU를 GRES로 관리함 -job이 --gres=gpu:N 또는 --gpus=N 으로 GPU를 요청함 -Slurm이 job step 안에 CUDA_VISIBLE_DEVICES를 설정함 -cgroup이 visible device file을 제한할 수 있음 -``` - -Runtime: - -```text -runtime: host-systemd or host-foreground -telemetry: nvml -scheduler: slurm -``` - -Slurm 지원의 핵심은 NVML을 동작시키는 것이 아니다. NVML 사용 상태와 Slurm -allocation state를 결합하는 것이다. - -### 4. 로컬 컨테이너 런타임 - -전형적인 형태: - -```text -host command를 직접 실행할 수 없거나 직접 실행하면 안 됨 -docker/podman이 NVIDIA container를 실행할 수 있음 -docker run --gpus all ... 에서 GPU가 보임 -``` - -Runtime: - -```text -runtime: local-container -telemetry: nvml -scheduler: none -``` - -fallback으로는 유용하지만, 기본 UX가 되어서는 안 된다. - -## 핵심 아키텍처 - -collector와 report 코드 전체에 환경 분기를 퍼뜨리면 안 된다. 제품을 세 축으로 -분리한다. - -```text -1. Collector Runtime - collector process가 어디에서 실행되는가? - -2. Telemetry Source - 실제 GPU 상태를 어떻게 읽는가? - -3. Scheduler Context - GPU가 누구에게 예약/할당되었는가? 
-``` - -구체적 조합: - -| Environment | Runtime | Telemetry | Scheduler | -|---|---|---|---| -| Bare metal | host-systemd | nvml | none | -| Kubernetes / GPU Operator | k8s-daemonset | nvml | k8s | -| Slurm | host-systemd | nvml | slurm | -| Docker-only | local-container | nvml | none | -| Demo/test | foreground | fake | none/fake | - -중요한 규칙: - -```text -Kubernetes와 Slurm은 telemetry source가 아니다. -telemetry source는 여전히 NVML이다. -Kubernetes와 Slurm은 runtime placement와 allocation context를 제공한다. -``` - -## CLI 설계 - -### 기본 명령 - -```text -gua doctor -gua start -gua status -gua report -gua stop -gua uninstall -``` - -### 저수준 명령 - -아래 명령은 유지할 수 있지만 첫 사용 UX의 중심이 되어서는 안 된다. - -```text -gua daemon run -gua daemon export -gua db inspect -``` - -현재 `gpu-usage-audit daemon`과 `gpu-usage-audit report`는 migration 기간에 -compatibility alias로 남길 수 있다. - -### `gua doctor` - -읽기 전용 환경 진단 명령이다. - -기본 출력은 사람이 읽기 쉬운 형태다. 자동화에는 `--json`을 사용한다. - -예: - -```sh -gua doctor -gua doctor --json -gua doctor --mode k8s -``` - -Doctor가 확인할 항목: - -- OS, kernel, Python, uv/pipx 가용성 -- `/dev/nvidia*` -- host NVML load/init/device count -- `/run/nvidia/driver` 아래 GPU Operator staged NVML -- staged NVML path를 host mode에 써야 하는지 여부 -- `nvidia-smi` 존재 여부 -- `kubectl` 가용성과 인증 상태 -- k8s runtime class -- k8s GPU pod/DaemonSet -- 필요한 k8s resource를 만들 수 있는 권한 -- Slurm command와 node GRES -- Docker/Podman NVIDIA runtime fallback - -Doctor는 `RuntimePlan`을 만든다. - -### `gua start` - -기본 mode는 `auto`다. - -```sh -gua start -gua start --mode auto -gua start --mode host -gua start --mode k8s -gua start --mode slurm -gua start --mode container -gua start --dry-run -gua start --yes -``` - -동작: - -1. doctor를 실행한다. -2. runtime plan을 선택한다. -3. plan을 출력한다. -4. system이나 cluster 상태를 변경하는 작업이라면 TTY에서 확인을 받는다. -5. install state를 local에 저장한다. 
- -예: - -```text -Plan: - mode: k8s-daemonset - namespace: gpu-usage-audit - image: ghcr.io/AI-Ocean/gpu-usage-audit:0.4.0 - db: hostPath /var/lib/gpu-usage-audit/gua.db - nodes: GPU-capable nodes - cleanup: gua stop --mode k8s - -Continue? [y/N] -``` - -### `gua status` - -설치/실행 중인 collector 상태를 보여준다. - -```text -mode: k8s-daemonset -collectors: - gpusystem: running, last sample 12s ago, GPUs visible=10 - ds02: running, last sample 10s ago, GPUs visible=4 -storage: - per-node SQLite under /var/lib/gpu-usage-audit/gua.db -``` - -### `gua report` - -기본적으로 저장된 install state를 사용한다. - -```sh -gua report --since 24h -gua report --since 3d --node gpusystem -gua report --since 3d --all-nodes -gua report --db /var/lib/gpu-usage-audit/gua.db --since 3d -``` - -k8s에서는 사용자가 DB 위치를 알 필요가 없어야 한다. CLI가 collector pod를 -발견하고 `kubectl exec` 등을 통해 export stream을 받아 local에서 집계할 수 -있다. - -### `gua stop`과 `gua uninstall` - -`stop`은 기본적으로 collector를 멈추되 data는 보존해야 한다. - -`uninstall`은 설치된 resource를 제거하고, 선택적으로 data도 지울 수 있다. - -```sh -gua stop -gua uninstall -gua uninstall --delete-data -``` - -## RuntimePlan 인터페이스 - -detector는 바로 실행하지 말고 구조화된 plan을 만들어야 한다. - -개념 모델: - -```python -class RuntimePlan: - mode: Literal[ - "host-systemd", - "host-foreground", - "k8s-daemonset", - "local-container", - "unsupported", - ] - telemetry: Literal["nvml", "fake"] - scheduler: Literal["none", "k8s", "slurm"] - confidence: Literal["high", "medium", "low"] - reasons: list[str] - blockers: list[str] - warnings: list[str] - required_privileges: list[str] - actions: list[PlannedAction] -``` - -Runtime adapter가 plan을 소비한다. - -```text -HostRuntimeAdapter -K8sRuntimeAdapter -ContainerRuntimeAdapter -``` - -Scheduler adapter는 snapshot을 enrich한다. - -```text -NoSchedulerAdapter -K8sSchedulerAdapter -SlurmSchedulerAdapter -``` - -Telemetry adapter는 hardware fact를 만든다. - -```text -NVMLTelemetry -FakeTelemetry -``` - -## 감지 순서 - -Auto mode는 모든 GPU를 볼 수 있는 가장 덜 놀라운 runtime을 선호해야 한다. - -제안 순서: - -1. 
Host NVML - - host NVML이 GPU를 보면 host runtime은 viable하다. - - Slurm이 감지되면 scheduler context는 `slurm`이다. - - 아니면 scheduler context는 `none`이다. - - host NVML이 version mismatch로 실패했지만 `/run/nvidia/driver` 아래 - GPU Operator staged NVML이 있으면 plan에 host runtime remediation을 - 기록한다. - - pynvml import 전에 `LD_LIBRARY_PATH`를 prepend하여 re-exec하거나, - - collector 시작 전에 library path를 설정하는 작은 launcher wrapper를 - 사용한다. - pynvml/libnvidia-ml이 이미 load된 뒤 `LD_LIBRARY_PATH`를 바꾸는 것은 - 충분하지 않다. - -2. Kubernetes - - host NVML이 GPU를 보지 못하지만 k8s가 있고 NVIDIA runtime이 pod 안에 - GPU를 노출할 수 있으면 `k8s-daemonset`을 사용한다. - - `node.status.capacity["nvidia.com/gpu"]`만 믿지 않는다. 일부 cluster는 - accounting이 unusual/custom이어도 pod 안에 GPU를 노출한다. - -3. Local container runtime - - Docker/Podman이 all GPU를 가진 NVIDIA container를 실행할 수 있으면 - `local-container`를 사용한다. - -4. Unsupported - - 가장 가까운 viable path를 설명한다. - -중요: detection은 package를 설치하거나 cluster를 변경하면 안 된다. - -## Kubernetes Runtime 설계 - -### 설치 형태 - -최소 설치: - -```text -Namespace: gpu-usage-audit -DaemonSet: gpu-usage-audit -ServiceAccount: gpu-usage-audit -ConfigMap: collector config -hostPath DB: /var/lib/gpu-usage-audit/gua.db -``` - -DaemonSet 요구사항: - -```yaml -runtimeClassName: nvidia -hostPID: true -env: - - name: NVIDIA_VISIBLE_DEVICES - value: all - - name: NVIDIA_DRIVER_CAPABILITIES - value: compute,utility -``` - -가능한 mount: - -```text -/var/lib/gpu-usage-audit read-write DB hostPath -/proc read-only host process metadata, if needed -/var/lib/kubelet/pod-resources read-only pod resources socket, if available -``` - -`hostPID: true`는 node-wide process attribution에 중요하다. NVML은 GPU -process PID를 보고할 수 있지만, host PID visibility가 없으면 collector가 -그 PID를 `/proc//cgroup`으로 다시 매핑하지 못할 수 있다. - -기본값은 `hostPID: true`가 되어야 하며 opt-out을 제공한다. 일부 cluster는 -restricted Pod Security profile을 강제하므로 -`gua start --mode k8s --no-host-pid`가 가능해야 한다. 단, plan은 -process-to-pod attribution이 약해진다고 명확히 말해야 한다. - -DaemonSet은 기본적으로 모든 node가 아니라 GPU-capable node만 대상으로 해야 -한다. 
선호 selector: - -```text -nvidia.com/gpu.present=true -feature.node.kubernetes.io/pci-10de.present=true -``` - -GPU Feature Discovery / Node Feature Discovery label이 없다면 더 넓은 -DaemonSet을 설치한 뒤 collector self-check로 fallback할 수 있다. - -### Kubernetes Allocation Context - -k8s adapter는 세 데이터 source를 결합해야 한다. - -1. Kubernetes API - - Pod, namespace, node name, owner reference, resource request/limit. - -2. Kubelet PodResources API - - 어떤 pod/container가 어떤 GPU device ID를 받았는지에 대한 가장 좋은 - source. - -3. Host `/proc//cgroup` - - 관측된 GPU process PID를 pod/container로 매핑하는 가장 좋은 source. - -이 구분이 중요한 이유는, 관측한 cluster에 다음 형태의 pod가 있었기 때문이다. - -```text -NVIDIA_VISIBLE_DEVICES=all -no nvidia.com/gpu request -all GPUs visible inside the container -``` - -이 pod들은 scheduler accounting이 깨끗하게 표현하지 못하는 방식으로 GPU를 -쓸 수 있다. - -adapter는 다음을 명시적으로 감지해야 한다. - -```text -NVIDIA_VISIBLE_DEVICES=all -NVIDIA_VISIBLE_DEVICES= -no nvidia.com/gpu request or limit -``` - -이는 raw environment variable로만 저장하지 말고 scheduler-accounting -anomaly로 표면화해야 한다. - -### Cgroup 호환성 - -Process attribution은 `/proc//cgroup`에 의존하지만 cgroup v1과 unified -cgroup v2는 path 표현이 다르다. Kubernetes와 Slurm 배포 모두 cgroup v2로 -이동하는 추세다. - -parser는 k8s adapter와 Slurm adapter가 공유하는 module이어야 한다. 지원할 -항목: - -```text -cgroup v1 controller-specific lines -cgroup v2 unified `0::/path` lines -systemd slice escaping -containerd / CRI-O pod and container IDs -Slurm job_ and step_ paths -``` - -process-to-owner attribution 구현 전에 이 결정을 내려야 한다. - -### Kubernetes Report 의미론 - -report는 scheduler allocation과 실제 GPU state를 모두 보여줘야 한다. 
- -```text -allocated-active -allocated-idle-held -allocated-unused -unallocated-active -unallocated-idle-held -truly-idle -``` - -정의: - -```text -allocated-unused = scheduler가 GPU를 할당했지만 의미 있는 NVML process/memory가 없음 -unallocated-active = NVML상 사용이 있지만 scheduler allocation이 없거나 알 수 없음 -unallocated-idle-held = scheduler allocation 없이 memory가 잡힘 -truly-idle = allocation도 없고 의미 있는 NVML 사용도 없음 -``` - -## Slurm Runtime 설계 - -Slurm은 일반적으로 GPU를 GRES로 관리한다. - -중요한 Slurm 사실: - -- GPU는 보통 `Name=gpu`인 GRES로 설정된다. -- job은 `--gres=gpu:N`, `--gpus=N` 또는 관련 flag로 GPU를 요청한다. -- Slurm은 job step에 `CUDA_VISIBLE_DEVICES`를 설정한다. -- Slurm은 cgroup으로 visible device file을 제한할 수 있다. -- Slurm은 `gres.conf`에서 NVML을 통해 NVIDIA GPU를 autodetect할 수 있다. - -Slurm 지원은 다음으로 다뤄야 한다. - -```text -runtime: host-systemd -telemetry: nvml -scheduler: slurm -``` - -collector는 user job 밖에서 compute node 위에 실행된다. 실제 GPU 사용은 -NVML로 읽고, allocation context는 Slurm에서 읽는다. - -### Slurm 감지 신호 - -```text -scontrol exists -sinfo exists -slurmd process or service exists -/etc/slurm/slurm.conf or $SLURM_CONF exists -scontrol show node reports Gres or CfgTRES with gpu -``` - -### Slurm Allocation Context - -초기 adapter source: - -```text -scontrol show node -squeue -h -w -scontrol show job -d -sacct, when available -/proc//cgroup for job_ or step_ -``` - -MVP가 지원해야 할 것: - -- 이 node에서 실행 중인 job. -- 각 job의 user. -- 각 job이 요청한 GPU 수. -- 가능하면 할당된 GPU device ID 또는 UUID. -- cgroup을 통한 GPU PID -> Slurm job ID 매핑. - -Slurm이 exact GPU ID를 노출하지 않는 경우 첫 버전에서는 per-GPU allocation을 -`allocated-unknown-gpu`로 표시해도 된다. - -## Data Model V2 - -현재 schema는 hardware sample과 process sample을 담는다. 여전히 유용하지만, -scheduler allocation은 first-class storage가 필요하다. 
- -제안 table: - -### `node` - -```text -node_id -hostname -first_seen -last_seen -runtime_mode # host-systemd / k8s-daemonset / local-container -scheduler_kind # none / k8s / slurm -driver_version -collector_version -``` - -### `gpu_sample` - -```text -ts -node_id -gpu_uuid -gpu_index -parent_uuid # nullable, set for MIG instances or virtual slices -mig_profile # nullable, e.g. 1g.5gb -share_id # nullable, for MIG/vGPU/time-slicing/MPS-style slices -bus_id -util_pct -mem_used_mb -mem_total_mb -``` - -### `gpu_process_sample` - -```text -ts -node_id -gpu_uuid -pid -process_name -mem_used_mb -loginuid_user -owner_key # nullable, references observed owner if resolved -``` - -### `allocation_sample` - -```text -ts -node_id -scheduler_kind # k8s / slurm -gpu_uuid # nullable if exact GPU unknown -parent_uuid # nullable, physical GPU for MIG/vGPU/shared allocations -owner_kind # k8s_pod / slurm_job -owner_key # stable ID: namespace/name or job ID -owner_name -namespace -user_name -account -requested_gpus -share_fraction # nullable, for fractional/shared GPU allocation -allocation_state # allocated / released / unknown -raw_ref -``` - -### `owner_sample` - -정규화된 report에 유용한 optional table: - -```text -ts -owner_kind -owner_key -owner_name -namespace -user_name -account -labels_json -``` - -### Migration - -기존 DB는 legacy mode로 읽을 수 있다. - -```text -scheduler_kind = none -allocation state = unknown -``` - -report는 기존 DB에서도 계속 동작해야 한다. - -### Retention과 Rollup - -Raw process sample은 빠르게 커질 수 있다. 바쁜 node는 tick마다 많은 row를 -만들 수 있다. - -```text -1 Hz * 10 GPUs * 50 GPU processes = 500 process rows/sec -``` - -SQLite는 유용한 short-term window를 감당할 수 있지만, 긴 retention에는 명시적 -정책이 필요하다. 기본 저장소는 운영 모델을 단순하게 유지해야 한다. 
- -```text -raw samples: 7-14 days by default -1-minute rollups: 90 days by default -5-minute rollups: optional long-term retention -``` - -제안 rollup table: - -```text -gpu_rollup_1m -owner_rollup_1m -allocation_rollup_1m -``` - -Rollup은 평균 utilization만 보존하면 안 되고 combined class를 보존해야 한다. -그렇지 않으면 `allocated-unused` 같은 핵심 신호가 downsampling 중 사라진다. - -## Classification Model - -기존 hardware classification은 유지한다. - -```text -util >= 10 -> active -util < 10 and mem > 100 -> idle-held -util < 10 and mem <= 100 -> truly-idle -``` - -scheduler allocation을 추가한다. - -```text -allocation known and present -> allocated -allocation absent -> unallocated -allocation unavailable -> unknown -``` - -Combined class: - -| Allocation | Hardware | Combined | -|---|---|---| -| allocated | active | allocated-active | -| allocated | idle-held | allocated-idle-held | -| allocated | truly-idle | allocated-unused | -| unallocated | active | unallocated-active | -| unallocated | idle-held | unallocated-idle-held | -| unallocated | truly-idle | truly-idle | -| unknown | active | active | -| unknown | idle-held | idle-held | -| unknown | truly-idle | truly-idle | - -이 모델은 기존 report 의미를 유지하면서 k8s/Slurm 가치를 추가한다. - -## Storage와 Reporting 전략 - -### Single Node - -기본: - -```text -/var/lib/gpu-usage-audit/gua.db -``` - -user-mode/foreground fallback: - -```text -~/.local/share/gpu-usage-audit/gua.db -``` - -### Kubernetes - -MVP: - -- node마다 hostPath 기반 SQLite DB 하나. -- `gua report`가 collector pod를 발견한다. -- `gua report`가 각 collector pod 안에서 - `gua daemon export --format jsonl`을 실행하고 local에서 집계한다. - -이 방식은 central database나 service를 피할 수 있지만 한계가 있다. - -- `pods/exec` RBAC는 종종 제한된다. -- 많은 node를 sequential exec하면 느리다. -- 큰 export에는 streaming, compression, time-window filtering이 필요하다. - -report 구현은 병렬 fan-out을 해야 하고 필요한 time window만 요청해야 한다. -또한 alternative export path를 지원해야 한다. - -나중: - -- 각 collector pod의 read-only HTTP export endpoint. -- `kubectl port-forward` 기반 report collection. 
-- cluster-internal aggregator Job. -- optional central PVC. -- optional Prometheus/exporter mode. -- optional object storage export. - -### Slurm - -MVP: - -- compute node마다 SQLite DB 하나. -- 먼저 local node report를 지원한다. - -나중: - -- Slurm controller-side aggregator. -- `gua report --partition` 또는 `--nodes`. - -## Packaging과 Installation - -### 기본 CLI 설치 - -권장: - -```sh -uv tool install gpu-usage-audit -``` - -또는: - -```sh -pipx install gpu-usage-audit -``` - -첫 사용 마찰을 줄이기 위해 `nvidia-ml-py`를 optional extra가 아니라 기본 -dependency로 둘지 검토한다. 작고, GPU audit 도구가 NVML binding 누락으로 첫 -실행에서 실패하는 것은 좋지 않다. - -### OCI Image - -k8s runtime에는 필요하다. - -```text -ghcr.io/AI-Ocean/gpu-usage-audit: -ghcr.io/AI-Ocean/gpu-usage-audit:latest -``` - -사용자가 Docker를 직접 실행할 필요는 없다. image는 k8s runtime adapter가 -사용하는 내부 구현 디테일이다. - -### Kubernetes 설치 - -초기 구현은 Python package 안에 manifest template을 내장할 수 있다. - -나중: - -- GitHub Releases에 standalone YAML 게시. -- Helm chart 게시. - -### One-Line Installer - -나중에 가능한 UX: - -```sh -curl -Ls https://github.com/AI-Ocean/gpu-usage-audit/releases/latest/download/install.sh | sh -``` - -이는 CLI만 설치해야 한다. systemd service나 k8s DaemonSet을 조용히 설치하면 -안 된다. - -## Security와 Permission - -### Host Mode - -필요: - -- NVML 접근. -- `/proc//loginuid`와 cgroup metadata read 권한. -- DB directory write 권한. -- systemd install에는 root 필요. - -테스트용으로 non-root foreground mode를 지원해야 한다. - -### Kubernetes Mode - -필요: - -- namespace, service account, configmap, daemonset, RBAC를 만들 수 있는 권한. -- target node의 모든 GPU에 접근할 수 있는 runtime 권한. -- pod와 node metadata read 권한. -- process attribution을 위한 hostPID와 read-only `/proc` 접근 가능성. -- SQLite DB를 위한 hostPath write 권한. -- exec 기반 export를 쓸 경우 `gua report`용 optional `pods/exec`. - -install plan은 resource를 적용하기 전에 이 권한들을 출력해야 한다. - -collector의 최소 RBAC는 다음에서 시작한다. - -```text -get/list/watch pods -get/list/watch nodes -``` - -`pods/exec`는 report-side에만 필요하며 collector 자체에는 필요하지 않아야 -한다. - -### Slurm Mode - -필요: - -- Host NVML 접근. 
-- Slurm command/config/accounting read 접근. -- process cgroup read 접근. -- systemd install에는 보통 admin 권한 필요. - -Slurm job user가 node-wide collector를 설치한다고 기대하면 안 된다. - -## 구현 마일스톤 - -### M0: 집중 ADR - -넓은 구현 전에 위험도가 높은 세부사항에 대해 짧은 architecture decision -record를 작성한다. - -- GPU Operator staged NVML loading과 host-mode re-exec. -- MIG, vGPU, MPS, time-slicing 표현. -- cgroup v1/v2 parser와 owner attribution. -- k8s report export path: `pods/exec` vs HTTP endpoint vs aggregator. - -### M1: Doctor와 RuntimePlan - -아직 collection 동작은 바꾸지 않는다. - -Deliver: - -- `gua doctor` -- host NVML/device check -- k8s check -- Slurm check -- structured JSON output -- recommended plan - -환경 가정을 설치 없이 검증하므로 가장 leverage가 높은 milestone이다. - -### M2: Schema V2와 Combined Report Model - -Deliver: - -- migration-safe DB schema -- allocation table -- combined classes -- fake scheduler tests -- old DB compatibility -- retention and rollup policy - -이것이 차별화 기능이다. 모든 runtime adapter가 같은 model을 target할 수 -있도록 일찍 들어가야 한다. - -### M3: CLI Surface와 State - -Deliver: - -- `gua start --dry-run` -- `gua status` -- local state file -- 기존 command compatibility alias - -아직 k8s install은 하지 않는다. - -### M4: Kubernetes Runtime Adapter - -Deliver: - -- official OCI image -- embedded DaemonSet manifest -- `gua start --mode k8s` -- `gua stop --mode k8s` -- parallel, windowed export 기반 collector pod report - -관측한 GPU Operator 환경을 해결한다. 
- -### M5: Kubernetes Scheduler Adapter - -Deliver: - -- pod/process attribution -- 가능한 경우 PodResources API integration -- namespace/pod/user별 report -- GPU request 없는 `NVIDIA_VISIBLE_DEVICES=all` pod 탐지 -- unrequested GPU access anomaly headline - -### M6: Host Runtime Adapter - -Deliver: - -- systemd unit install -- foreground mode -- host preflight -- GPU Operator staged NVML re-exec 또는 명확한 diagnostic - -### M7: Slurm Scheduler Adapter - -Deliver: - -- Slurm detection -- job allocation snapshot -- cgroup 기반 process-to-job mapping -- best-effort exact GPU-to-job mapping -- job/user/account별 report - -### M8: Documentation과 Release Polish - -Deliver: - -- quickstart -- architecture docs -- troubleshooting matrix -- wheel + OCI image release workflow -- optional Helm chart - -## 현재 서버 해석 - -관측한 `gpusystem` 서버는 다음에 해당한다. - -```text -runtime: k8s-daemonset -telemetry: nvml -scheduler: k8s -``` - -이유: - -- Host에는 `/dev/nvidiactl`만 보인다. -- Host NVML은 device를 보지 못한다. -- Kubernetes workload container 안에서는 `/dev/nvidia0..9`가 보인다. -- 일부 pod는 `runtimeClassName=nvidia`와 `nvidia.com/gpu` request를 쓴다. -- 일부 pod는 GPU request 없이 `NVIDIA_VISIBLE_DEVICES=all`을 노출한다. - -이 환경이 바로 runtime placement와 scheduler context를 분리해야 하는 이유다. - -## Open Questions - -제안 결정: - -1. `nvidia-ml-py`는 기본 dependency가 되어야 한다. -2. k8s DaemonSet은 `hostPID: true`를 기본값으로 하고 `--no-host-pid` opt-out을 - 제공한다. -3. k8s install은 기본적으로 GPU-capable node만 target해야 한다. -4. collector RBAC는 read-only로 시작한다: pods와 nodes. `pods/exec`는 - exec 기반 report transport에만 필요하다. -5. `gua report`는 local state가 node-scoped일 때 current node를 기본값으로 - 하고, cluster report에는 `--all-nodes`를 제공한다. -6. Slurm MVP는 detection, node-level job allocation, cgroup PID-to-job mapping을 - 포함해야 한다. Exact GPU-to-job mapping은 best effort다. -7. MIG field는 schema v2에 미리 들어가야 한다. report는 초기에는 MIG를 일반 - GPU-like device처럼 다뤄도 된다. -8. `gua`를 primary command로 둔다. `gpu-usage-audit`는 compatibility alias로 - 유지한다. - -아직 열려 있는 질문: - -1. 
첫 k8s report transport는 `pods/exec`, HTTP export, 또는 둘 다 중 무엇인가? -2. 바쁜 node에서 acceptable한 기본 raw retention window는 얼마인가? -3. rollup은 collector process에서 계산할 것인가, report/export 시점에 계산할 - 것인가? -4. HAMi/vGPU/time-slicing의 fractional sharing을 scheduler 간 어떻게 정규화할 - 것인가? - -## 참고 자료 - -- NVIDIA DCGM Exporter deployment patterns: - https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html -- NVIDIA Container Toolkit GPU environment variables: - https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.18.1/docker-specialized.html -- NVIDIA GPU Operator overview: - https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html -- NVIDIA GPU Operator CDI and GPU Management Containers: - https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/cdi.html -- Kubernetes Device Plugins: - https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/ -- Kubernetes kubelet files and Pod Resources API path: - https://kubernetes.io/docs/reference/node/kubelet-files/ -- Slurm GRES GPU scheduling: - https://slurm.schedmd.com/gres.html -- Slurm `gres.conf`: - https://slurm.schedmd.com/gres.conf.html -- Slurm cgroups: - https://slurm.schedmd.com/cgroups.html -- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN - Training Workloads", USENIX ATC 2019: - https://www.usenix.org/conference/atc19/presentation/jeon -- Hu et al., "Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for - Deep Learning Training Jobs", ASPLOS 2023: - https://doi.org/10.1145/3575693.3575705 diff --git a/proposals/design-auto-runtime.md b/proposals/design-auto-runtime.md deleted file mode 100644 index b0ae944..0000000 --- a/proposals/design-auto-runtime.md +++ /dev/null @@ -1,1145 +0,0 @@ -# gpu-usage-audit auto-runtime design - -Status: draft -Date: 2026-05-12 - -## Summary - -`gpu-usage-audit` should become a tool that a user can start without knowing -whether the machine is bare metal, 
Kubernetes, a container runtime host, or a -Slurm compute node. - -Target UX: - -```sh -gua doctor -gua start - -# days later -gua status -gua report --since 3d -gua stop -``` - -The product should auto-detect the right collector runtime, but it must not hide -the decision. The user should not need to know the deployment model up front, -but `gua` should clearly report what it chose and why. - -Example: - -```text -Detected environment: - host NVML: initialized, GPU count=0 - kubernetes: available - k8s NVIDIA runtime: available - slurm: not detected - -Recommended plan: - runtime: k8s-daemonset - telemetry: nvml - scheduler: k8s - -Reason: - GPUs are not visible from the host namespace, but they are visible inside - Kubernetes containers with NVIDIA_VISIBLE_DEVICES=all. -``` - -This is the main product shift: `daemon` remains a low-level collector, while -`gua start` becomes the launcher/orchestrator. - -## Motivation and Differentiation - -The gap this project should own is not raw GPU telemetry alone. DCGM exporter, -`nvidia-smi`, and many Grafana dashboards already expose utilization, memory, -temperature, and process-level facts. Slurm accounting, Kubernetes metadata, and -cluster dashboards already expose scheduler-side allocation and ownership. - -The missing view is the retrospective join between the two: - -```text -Who was allocated a GPU, and did that GPU actually do useful work? -Who used a GPU without a scheduler allocation? -Which GPUs were memory-held but compute-idle? -Which GPUs were allocated but had no meaningful GPU process at all? -``` - -That combined view is the unique value. `gpu-usage-audit` should therefore -avoid becoming another live GPU monitor. It should be a lightweight retrospective -audit tool that correlates actual NVML observations with scheduler context. 
- -The most important headline classes are: - -```text -allocated-idle-held # scheduler allocated it, process held memory, compute was cold -allocated-unused # scheduler allocated it, but NVML saw no meaningful use -unallocated-active # GPU was used without visible scheduler allocation -unallocated-idle-held # GPU memory was held without visible scheduler allocation -``` - -In Kubernetes, `NVIDIA_VISIBLE_DEVICES=all` without a corresponding -`nvidia.com/gpu` request is a first-class anomaly. It means a pod can access GPUs -that scheduler accounting may not represent. This is one of the signals that -standard GPU telemetry and kube-state style metadata do not provide by -themselves. - -## Product Goals - -1. **No environment knowledge required for first use** - - The user can run `gua doctor` or `gua start` without knowing whether the - node is bare metal, k8s, Docker, or Slurm. - -2. **Transparent, not magical** - - Auto mode must print the selected plan, reasons, required privileges, - storage location, and cleanup command. - - Advanced users can override with `--mode host`, `--mode k8s`, - `--mode slurm`, or `--mode container`. - -3. **Retrospective audit first** - - The core value remains "what happened over the last N hours/days?" - - Live dashboards, quotas, scheduling decisions, and remediation are not the - first product surface. - -4. **Measure both actual GPU use and scheduler allocation** - - NVML answers: "is a GPU doing work or holding memory?" - - k8s/Slurm answer: "was this GPU allocated to a workload?" - - The report should combine both. - -5. **Low operational footprint** - - SQLite remains the default local storage. - - No database service, web server, Prometheus, or Grafana required for the - default path. - -6. **Good failure modes** - - If `gua` cannot run, it should say which layer failed: driver, NVML, - device visibility, container runtime, kubectl auth, Slurm config, or - permissions. 
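Goal 4 above (combining NVML state with scheduler allocation) can be sketched as a small pure function. This is only an illustration under stated assumptions: the thresholds (`util >= 10`, `mem > 100`) and the combined-class table are taken from this design's Classification Model section, the units (percent and MB) are assumed from the `util_pct` and `mem_used_mb` column names, and the function names `hardware_class` and `combined_class` are hypothetical.

```python
from typing import Literal

Hardware = Literal["active", "idle-held", "truly-idle"]
Allocation = Literal["allocated", "unallocated", "unknown"]

# Thresholds quoted from the Classification Model section.
# Units (percent utilization, MB of GPU memory) are an assumption.
UTIL_ACTIVE_PCT = 10
MEM_HELD_MB = 100


def hardware_class(util_pct: float, mem_used_mb: float) -> Hardware:
    """Classify one GPU sample from NVML facts alone."""
    if util_pct >= UTIL_ACTIVE_PCT:
        return "active"
    if mem_used_mb > MEM_HELD_MB:
        return "idle-held"
    return "truly-idle"


def combined_class(allocation: Allocation, hardware: Hardware) -> str:
    """Join scheduler allocation with the hardware class."""
    if allocation == "unknown":
        # No scheduler context: keep the legacy hardware-only class.
        return hardware
    if hardware == "truly-idle":
        # An allocated but untouched GPU is the headline "allocated-unused" case.
        return "allocated-unused" if allocation == "allocated" else "truly-idle"
    return f"{allocation}-{hardware}"


if __name__ == "__main__":
    # A pod holds 8 GB on an allocated GPU while compute stays cold.
    print(combined_class("allocated", hardware_class(2, 8000)))
```

Keeping the join free of I/O would let fake telemetry and fake scheduler fixtures exercise every row of the table before any real adapter exists.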
- -## Non-Goals - -- Replacing Slurm, Kubernetes, DCGM, Prometheus, Grafana, Open OnDemand, or - cluster dashboards. -- Enforcing quotas or killing jobs. -- Scheduling workloads. -- Requiring a central server in the minimum viable product. -- Making every install silent. Cluster or system changes should be explicit. - -## Supported Environment Classes - -### 1. Bare Metal Host - -Typical shape: - -```text -/dev/nvidia0..N visible on host -host NVML init succeeds -host NVML device count > 0 -no scheduler detected, or scheduler context disabled -``` - -Runtime: - -```text -runtime: host-systemd or host-foreground -telemetry: nvml -scheduler: none -``` - -This is closest to the current project. - -### 2. Kubernetes / GPU Operator - -Typical shape: - -```text -host may only show /dev/nvidiactl -host NVML device count may be 0 -GPU devices are injected into pods -runtimeClassName=nvidia may exist -NVIDIA_VISIBLE_DEVICES controls device exposure -``` - -Runtime: - -```text -runtime: k8s-daemonset -telemetry: nvml -scheduler: k8s -``` - -The collector must run inside Kubernetes because the GPUs may only be visible in -container namespaces. - -The user should not need to build or run Docker manually. The product can still -use an official OCI image internally. - -### 3. Slurm Compute Node - -Typical shape: - -```text -host /dev/nvidia0..N visible -Slurm manages GPUs as GRES -jobs request GPUs with --gres=gpu:N or --gpus=N -Slurm sets CUDA_VISIBLE_DEVICES inside job steps -cgroups may restrict visible device files -``` - -Runtime: - -```text -runtime: host-systemd or host-foreground -telemetry: nvml -scheduler: slurm -``` - -Slurm support is not mainly about making NVML work. It is about combining NVML -use with Slurm allocation state. - -### 4. Local Container Runtime - -Typical shape: - -```text -host command cannot or should not run directly -docker/podman can run NVIDIA containers -docker run --gpus all ... 
sees GPUs -``` - -Runtime: - -```text -runtime: local-container -telemetry: nvml -scheduler: none -``` - -This is useful as a fallback, but should not be the primary UX. - -## Core Architecture - -Do not spread environment branches throughout the collector and report code. -Separate the product into three axes. - -```text -1. Collector Runtime - Where does the collector process run? - -2. Telemetry Source - How does it read actual GPU state? - -3. Scheduler Context - Who has the GPU reserved or allocated? -``` - -Concrete combinations: - -| Environment | Runtime | Telemetry | Scheduler | -|---|---|---|---| -| Bare metal | host-systemd | nvml | none | -| Kubernetes / GPU Operator | k8s-daemonset | nvml | k8s | -| Slurm | host-systemd | nvml | slurm | -| Docker-only | local-container | nvml | none | -| Demo/test | foreground | fake | none/fake | - -The important rule: - -```text -Kubernetes and Slurm are not telemetry sources. -NVML is still the telemetry source. -Kubernetes and Slurm provide runtime placement and allocation context. -``` - -## CLI Design - -### Primary Commands - -```text -gua doctor -gua start -gua status -gua report -gua stop -gua uninstall -``` - -### Low-Level Commands - -These can remain available, but should not be the primary first-run UX. - -```text -gua daemon run -gua daemon export -gua db inspect -``` - -The current `gpu-usage-audit daemon` and `gpu-usage-audit report` can remain as -compatibility aliases during migration. - -### `gua doctor` - -Read-only environment diagnosis. - -Default output is human-readable. `--json` is required for automation. 
- -Example: - -```sh -gua doctor -gua doctor --json -gua doctor --mode k8s -``` - -Doctor checks: - -- OS, kernel, Python, uv/pipx availability -- `/dev/nvidia*` -- host NVML load/init/device count -- GPU Operator staged NVML under `/run/nvidia/driver` -- whether the staged NVML path should be used for host mode -- `nvidia-smi` presence if available -- `kubectl` availability and auth -- k8s runtime classes -- k8s GPU pods/DaemonSets -- ability to create required k8s resources -- Slurm commands and node GRES -- Docker/Podman NVIDIA runtime fallback - -Doctor produces a `RuntimePlan`. - -### `gua start` - -Default mode is `auto`. - -```sh -gua start -gua start --mode auto -gua start --mode host -gua start --mode k8s -gua start --mode slurm -gua start --mode container -gua start --dry-run -gua start --yes -``` - -Behavior: - -1. Run doctor. -2. Select a runtime plan. -3. Print the plan. -4. If the action mutates system or cluster state, ask for confirmation when - running in a TTY. -5. Persist install state locally. - -Example: - -```text -Plan: - mode: k8s-daemonset - namespace: gpu-usage-audit - image: ghcr.io/AI-Ocean/gpu-usage-audit:0.4.0 - db: hostPath /var/lib/gpu-usage-audit/gua.db - nodes: GPU-capable nodes - cleanup: gua stop --mode k8s - -Continue? [y/N] -``` - -### `gua status` - -Shows the installed/running collector state. - -```text -mode: k8s-daemonset -collectors: - gpusystem: running, last sample 12s ago, GPUs visible=10 - ds02: running, last sample 10s ago, GPUs visible=4 -storage: - per-node SQLite under /var/lib/gpu-usage-audit/gua.db -``` - -### `gua report` - -Default should use the saved install state. - -```sh -gua report --since 24h -gua report --since 3d --node gpusystem -gua report --since 3d --all-nodes -gua report --db /var/lib/gpu-usage-audit/gua.db --since 3d -``` - -For k8s, `gua report` should not require users to know where the DB is. 
It can -query collector pods through `kubectl exec` and stream an export format back to -the local CLI. - -### `gua stop` and `gua uninstall` - -`stop` should stop the collector but preserve data by default. - -`uninstall` can remove installed resources and optionally data. - -```sh -gua stop -gua uninstall -gua uninstall --delete-data -``` - -## RuntimePlan Interface - -The detector should produce a structured plan, not directly perform actions. - -Conceptual model: - -```python -class RuntimePlan: - mode: Literal[ - "host-systemd", - "host-foreground", - "k8s-daemonset", - "local-container", - "unsupported", - ] - telemetry: Literal["nvml", "fake"] - scheduler: Literal["none", "k8s", "slurm"] - confidence: Literal["high", "medium", "low"] - reasons: list[str] - blockers: list[str] - warnings: list[str] - required_privileges: list[str] - actions: list[PlannedAction] -``` - -Runtime adapters consume a plan: - -```text -HostRuntimeAdapter -K8sRuntimeAdapter -ContainerRuntimeAdapter -``` - -Scheduler adapters enrich snapshots: - -```text -NoSchedulerAdapter -K8sSchedulerAdapter -SlurmSchedulerAdapter -``` - -Telemetry adapters produce hardware facts: - -```text -NVMLTelemetry -FakeTelemetry -``` - -## Detection Order - -Auto mode should prefer the least surprising runtime that can see all GPUs. - -Proposed order: - -1. Host NVML - - If host NVML sees GPUs, host runtime is viable. - - If Slurm is detected, scheduler context becomes `slurm`. - - Otherwise scheduler context is `none`. - - If host NVML fails with a likely version mismatch but staged GPU Operator - NVML exists under `/run/nvidia/driver`, the plan should record a host - runtime remediation: - - re-exec with `LD_LIBRARY_PATH` prepended before importing pynvml, or - - use a tiny launcher wrapper that sets the library path before starting - the collector. - Changing `LD_LIBRARY_PATH` after pynvml/libnvidia-ml has already been - loaded is not sufficient. - -2. 
Kubernetes - - If host NVML cannot see GPUs, but k8s is available and NVIDIA runtime can - expose GPUs in a pod, use `k8s-daemonset`. - - Do not rely only on `node.status.capacity["nvidia.com/gpu"]`; some - clusters expose GPUs to pods even when accounting is unusual or custom. - -3. Local container runtime - - If Docker/Podman can run an NVIDIA container with all GPUs, use - `local-container`. - -4. Unsupported - - Explain the nearest viable path. - -Important: detection should never install packages or mutate the cluster. - -## Kubernetes Runtime Design - -### Installation Shape - -Minimum viable install: - -```text -Namespace: gpu-usage-audit -DaemonSet: gpu-usage-audit -ServiceAccount: gpu-usage-audit -ConfigMap: collector config -hostPath DB: /var/lib/gpu-usage-audit/gua.db -``` - -DaemonSet requirements: - -```yaml -runtimeClassName: nvidia -hostPID: true -env: - - name: NVIDIA_VISIBLE_DEVICES - value: all - - name: NVIDIA_DRIVER_CAPABILITIES - value: compute,utility -``` - -Likely mounts: - -```text -/var/lib/gpu-usage-audit read-write DB hostPath -/proc read-only host process metadata, if needed -/var/lib/kubelet/pod-resources read-only pod resources socket, if available -``` - -`hostPID: true` is important for node-wide process attribution. NVML can report -PIDs for GPU processes, but without host PID visibility the collector may not be -able to map those PIDs back to `/proc//cgroup`. - -Default should be `hostPID: true` with an opt-out mode. Some clusters enforce -restricted Pod Security profiles, so `gua start --mode k8s --no-host-pid` should -be possible, but the plan must say that process-to-pod attribution will be -weaker. - -The DaemonSet should target GPU-capable nodes by default, not every node. 
-Preferred selectors: - -```text -nvidia.com/gpu.present=true -feature.node.kubernetes.io/pci-10de.present=true -``` - -If GPU Feature Discovery / Node Feature Discovery labels are absent, the -installer can fall back to a broader DaemonSet plus collector self-checks. - -### Kubernetes Allocation Context - -The k8s adapter should combine three data sources: - -1. Kubernetes API - - Pods, namespaces, node names, owner references, resource requests/limits. - -2. Kubelet PodResources API - - Best source for which pod/container received which GPU device IDs. - -3. Host `/proc//cgroup` - - Best source for mapping an observed GPU process PID to a pod/container. - -This distinction matters because the current observed cluster has pods with: - -```text -NVIDIA_VISIBLE_DEVICES=all -no nvidia.com/gpu request -all GPUs visible inside the container -``` - -Those pods can use GPUs even though scheduler accounting may not represent the -use cleanly. - -The adapter should explicitly detect: - -```text -NVIDIA_VISIBLE_DEVICES=all -NVIDIA_VISIBLE_DEVICES= -no nvidia.com/gpu request or limit -``` - -These should be surfaced as scheduler-accounting anomalies, not just stored as -raw environment variables. - -### Cgroup Compatibility - -Process attribution depends on `/proc//cgroup`, but cgroup v1 and unified -cgroup v2 encode paths differently. Kubernetes and Slurm deployments are both -moving toward cgroup v2. - -The parser should be a shared module used by the k8s and Slurm adapters. It -should support: - -```text -cgroup v1 controller-specific lines -cgroup v2 unified `0::/path` lines -systemd slice escaping -containerd / CRI-O pod and container IDs -Slurm job_ and step_ paths -``` - -This should be decided before implementing process-to-owner attribution. 
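A minimal sketch of the shared cgroup parser described above, assuming the standard per-process cgroup file format of `hierarchy-ID:controller-list:path`, where cgroup v2 uses a single `0::/path` line. The regular expressions reflect commonly seen kubelet and Slurm path layouts and are assumptions, not an exhaustive list; `CgroupEntry`, `parse_cgroup_text`, and `owner_from_cgroup` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CgroupEntry:
    hierarchy_id: int
    controllers: tuple[str, ...]  # empty for the cgroup v2 unified line
    path: str


def parse_cgroup_text(text: str) -> list[CgroupEntry]:
    """Parse the contents of a per-process cgroup file (v1 or v2)."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        controllers = tuple(c for c in parts[1].split(",") if c)
        entries.append(CgroupEntry(int(parts[0]), controllers, parts[2]))
    return entries


# Assumed layouts: the cgroupfs driver keeps dashes in the pod UID, while the
# systemd driver escapes them as underscores inside the slice name.
_POD_UID = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")
# Assumed layout: Slurm paths contain a job_<id> component.
_SLURM_JOB = re.compile(r"/job_(\d+)")


def owner_from_cgroup(text: str) -> Optional[tuple[str, str]]:
    """Return (owner_kind, raw owner id), or None if no owner is recognized."""
    for entry in parse_cgroup_text(text):
        pod = _POD_UID.search(entry.path)
        if pod:
            return ("k8s_pod", pod.group(1).replace("_", "-"))
        job = _SLURM_JOB.search(entry.path)
        if job:
            return ("slurm_job", job.group(1))
    return None


def owner_for_pid(pid: int) -> Optional[tuple[str, str]]:
    """Read the cgroup file for a PID; requires host PID visibility."""
    with open(f"/proc/{pid}/cgroup", encoding="utf-8") as fh:
        return owner_from_cgroup(fh.read())
```

For Kubernetes the value returned here is a pod UID, so an adapter would still need to resolve it to the `namespace/name` owner key through the Kubernetes API. Keeping the parser as a pure function over text makes it straightforward to test against captured v1 and v2 samples.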
- -### Kubernetes Report Semantics - -The report should show both scheduler allocation and actual GPU state: - -```text -allocated-active -allocated-idle-held -allocated-unused -unallocated-active -unallocated-idle-held -truly-idle -``` - -Where: - -```text -allocated-unused = scheduler allocated GPU, but no meaningful NVML process/mem -unallocated-active = NVML shows use, but scheduler allocation is absent/unknown -unallocated-idle-held = memory held without scheduler allocation -truly-idle = no allocation and no meaningful NVML use -``` - -## Slurm Runtime Design - -Slurm generally manages GPUs through GRES. - -Important Slurm facts: - -- GPUs are configured as GRES, usually `Name=gpu`. -- Jobs request GPUs with `--gres=gpu:N`, `--gpus=N`, or related flags. -- Slurm sets `CUDA_VISIBLE_DEVICES` for job steps. -- Slurm can use cgroups to restrict visible device files. -- Slurm can autodetect NVIDIA GPUs with NVML in `gres.conf`. - -Slurm support should be treated as: - -```text -runtime: host-systemd -telemetry: nvml -scheduler: slurm -``` - -The collector runs on compute nodes outside user jobs. It reads NVML for actual -GPU use and Slurm for allocation context. - -### Slurm Detection Signals - -```text -scontrol exists -sinfo exists -slurmd process or service exists -/etc/slurm/slurm.conf or $SLURM_CONF exists -scontrol show node reports Gres or CfgTRES with gpu -``` - -### Slurm Allocation Context - -Initial adapter sources: - -```text -scontrol show node -squeue -h -w -scontrol show job -d -sacct, when available -/proc//cgroup for job_ or step_ -``` - -MVP should support: - -- Which jobs are running on this node. -- Which users own those jobs. -- How many GPUs each job requested. -- If available, which GPU device IDs or UUIDs are allocated. -- Mapping GPU PIDs back to Slurm job IDs via cgroup. - -It is acceptable for the first version to mark per-GPU allocation as -`allocated-unknown-gpu` if Slurm does not expose exact GPU IDs in the available -commands. 
- -## Data Model V2 - -The current schema captures hardware samples and process samples. That is still -useful, but scheduler allocation needs first-class storage. - -Proposed tables: - -### `node` - -```text -node_id -hostname -first_seen -last_seen -runtime_mode # host-systemd / k8s-daemonset / local-container -scheduler_kind # none / k8s / slurm -driver_version -collector_version -``` - -### `gpu_sample` - -```text -ts -node_id -gpu_uuid -gpu_index -parent_uuid # nullable, set for MIG instances or virtual slices -mig_profile # nullable, e.g. 1g.5gb -share_id # nullable, for MIG/vGPU/time-slicing/MPS-style slices -bus_id -util_pct -mem_used_mb -mem_total_mb -``` - -### `gpu_process_sample` - -```text -ts -node_id -gpu_uuid -pid -process_name -mem_used_mb -loginuid_user -owner_key # nullable, references observed owner if resolved -``` - -### `allocation_sample` - -```text -ts -node_id -scheduler_kind # k8s / slurm -gpu_uuid # nullable if exact GPU unknown -parent_uuid # nullable, physical GPU for MIG/vGPU/shared allocations -owner_kind # k8s_pod / slurm_job -owner_key # stable ID: namespace/name or job ID -owner_name -namespace -user_name -account -requested_gpus -share_fraction # nullable, for fractional/shared GPU allocation -allocation_state # allocated / released / unknown -raw_ref -``` - -### `owner_sample` - -Optional but useful for normalized reporting: - -```text -ts -owner_kind -owner_key -owner_name -namespace -user_name -account -labels_json -``` - -### Migration - -The existing DB can be read as legacy mode: - -```text -scheduler_kind = none -allocation state = unknown -``` - -Reports should continue to work on old DBs. - -### Retention and Rollups - -Raw process samples can become large quickly. A busy node can produce many rows -per tick: - -```text -1 Hz * 10 GPUs * 50 GPU processes = 500 process rows/sec -``` - -SQLite can handle useful short-term windows, but long retention needs an -explicit policy. 
Default storage should keep the operational model simple: - -```text -raw samples: 7-14 days by default -1-minute rollups: 90 days by default -5-minute rollups: optional long-term retention -``` - -Proposed rollup tables: - -```text -gpu_rollup_1m -owner_rollup_1m -allocation_rollup_1m -``` - -Rollups should preserve the combined classes, not just average utilization. -Otherwise the core signal, such as `allocated-unused`, disappears during -downsampling. - -## Classification Model - -Keep the existing hardware classification: - -```text -util >= 10 -> active -util < 10 and mem > 100 -> idle-held -util < 10 and mem <= 100 -> truly-idle -``` - -Add scheduler allocation: - -```text -allocation known and present -> allocated -allocation absent -> unallocated -allocation unavailable -> unknown -``` - -Combined classes: - -| Allocation | Hardware | Combined | -|---|---|---| -| allocated | active | allocated-active | -| allocated | idle-held | allocated-idle-held | -| allocated | truly-idle | allocated-unused | -| unallocated | active | unallocated-active | -| unallocated | idle-held | unallocated-idle-held | -| unallocated | truly-idle | truly-idle | -| unknown | active | active | -| unknown | idle-held | idle-held | -| unknown | truly-idle | truly-idle | - -This lets the product keep the original report semantics while adding k8s/Slurm -value. - -## Storage and Reporting Strategy - -### Single Node - -Default: - -```text -/var/lib/gpu-usage-audit/gua.db -``` - -User-mode/foreground fallback: - -```text -~/.local/share/gpu-usage-audit/gua.db -``` - -### Kubernetes - -MVP: - -- One SQLite DB per node via hostPath. -- `gua report` discovers collector pods. -- `gua report` runs `gua daemon export --format jsonl` inside each collector - pod and aggregates locally. - -This avoids a central database or service, but it has known limits: - -- `pods/exec` RBAC is often restricted. -- Sequential exec across many nodes is slow. 
-- Large exports need streaming, compression, and time-window filtering. - -The report implementation should fan out in parallel and request only the -needed time window. It should also support an alternative export path. - -Later: - -- Optional read-only HTTP export endpoint in each collector pod. -- Optional `kubectl port-forward` based report collection. -- Optional cluster-internal aggregator Job. -- Optional central PVC. -- Optional Prometheus/exporter mode. -- Optional object storage export. - -### Slurm - -MVP: - -- One SQLite DB per compute node. -- Local node reports first. - -Later: - -- Slurm controller-side aggregator. -- `gua report --partition` or `--nodes`. - -## Packaging and Installation - -### Primary CLI Install - -Recommended: - -```sh -uv tool install gpu-usage-audit -``` - -or: - -```sh -pipx install gpu-usage-audit -``` - -To reduce first-run friction, consider making `nvidia-ml-py` a default -dependency instead of an optional extra. It is small, and missing NVML bindings -should not be the reason a GPU audit tool fails on first use. - -### OCI Image - -Needed for k8s runtime. - -```text -ghcr.io/AI-Ocean/gpu-usage-audit: -ghcr.io/AI-Ocean/gpu-usage-audit:latest -``` - -The user does not need to run Docker manually. The image is an implementation -detail used by the k8s runtime adapter. - -### Kubernetes Install - -Initial implementation can embed a manifest template in the Python package. - -Later: - -- Publish standalone YAML in GitHub Releases. -- Publish Helm chart. - -### One-Line Installer - -Optional later UX: - -```sh -curl -Ls https://github.com/AI-Ocean/gpu-usage-audit/releases/latest/download/install.sh | sh -``` - -This should install the CLI only. It should not silently install a systemd -service or k8s DaemonSet. - -## Security and Permissions - -### Host Mode - -Needs: - -- NVML access. -- Read access to `/proc//loginuid` and cgroup metadata. -- Write access to DB directory. -- systemd install requires root. 
- -Non-root foreground mode should be supported for testing. - -### Kubernetes Mode - -Needs: - -- Ability to create namespace, service account, configmap, daemonset, and RBAC. -- Runtime access to all GPUs on the target node. -- Read access to pod and node metadata. -- Potential hostPID and read-only `/proc` access for process attribution. -- hostPath write access for SQLite DB. -- Optional `pods/exec` for `gua report` if using exec-based export. - -The install plan must print these privileges before applying resources. - -Minimum collector RBAC should start with: - -```text -get/list/watch pods -get/list/watch nodes -``` - -`pods/exec` should be report-side only, not required by the collector itself. - -### Slurm Mode - -Needs: - -- Host NVML access. -- Read access to Slurm commands/config/accounting. -- Read access to process cgroups. -- systemd install usually requires admin privileges. - -Slurm job users should not be expected to install node-wide collectors. - -## Implementation Milestones - -### M0: Focused ADRs - -Before broad implementation, write short architecture decision records for the -highest-risk details: - -- GPU Operator staged NVML loading and host-mode re-exec. -- MIG, vGPU, MPS, and time-slicing representation. -- cgroup v1/v2 parser and owner attribution. -- k8s report export path: `pods/exec` versus HTTP endpoint versus aggregator. - -### M1: Doctor and RuntimePlan - -No behavior changes to collection yet. - -Deliver: - -- `gua doctor` -- host NVML/device checks -- k8s checks -- Slurm checks -- structured JSON output -- recommended plan - -This is the highest leverage milestone because it validates environment -assumptions without installing anything. - -### M2: Schema V2 and Combined Report Model - -Deliver: - -- migration-safe DB schema -- allocation table -- combined classes -- fake scheduler tests -- old DB compatibility -- retention and rollup policy - -This is the differentiating feature. 
It should land early so every runtime -adapter can target the same model. - -### M3: CLI Surface and State - -Deliver: - -- `gua start --dry-run` -- `gua status` -- local state file -- compatibility aliases for old commands - -No k8s install yet. - -### M4: Kubernetes Runtime Adapter - -Deliver: - -- official OCI image -- embedded DaemonSet manifest -- `gua start --mode k8s` -- `gua stop --mode k8s` -- `gua report` from collector pods with parallel, windowed export - -This solves the observed GPU Operator environment. - -### M5: Kubernetes Scheduler Adapter - -Deliver: - -- pod/process attribution -- PodResources API integration where available -- report by namespace/pod/user -- detection of `NVIDIA_VISIBLE_DEVICES=all` pods without GPU requests -- anomaly headline for unrequested GPU access - -### M6: Host Runtime Adapter - -Deliver: - -- systemd unit install -- foreground mode -- host preflight -- GPU Operator staged NVML re-exec or clear diagnostic - -### M7: Slurm Scheduler Adapter - -Deliver: - -- Slurm detection -- job allocation snapshots -- process-to-job mapping through cgroups -- exact GPU-to-job mapping on a best-effort basis -- report by job/user/account - -### M8: Documentation and Release Polish - -Deliver: - -- quickstart -- architecture docs -- troubleshooting matrix -- release workflow for wheel + OCI image -- optional Helm chart - -## Current Server Interpretation - -The observed `gpusystem` server fits: - -```text -runtime: k8s-daemonset -telemetry: nvml -scheduler: k8s -``` - -Why: - -- Host only shows `/dev/nvidiactl`. -- Host NVML cannot see devices. -- Kubernetes workload containers can see `/dev/nvidia0..9`. -- Some pods use `runtimeClassName=nvidia` and `nvidia.com/gpu` requests. -- Some pods expose `NVIDIA_VISIBLE_DEVICES=all` without GPU requests. - -This environment is exactly why runtime placement and scheduler context must be -separate abstractions. - -## Open Questions - -Proposed decisions: - -1. 
`nvidia-ml-py` should become a default dependency. -2. k8s DaemonSet should default to `hostPID: true`, with `--no-host-pid` opt-out. -3. k8s install should target GPU-capable nodes by default. -4. Collector RBAC should be read-only: pods and nodes. `pods/exec` is only - needed for the exec-based report transport. -5. `gua report` should default to the current node when local state is - node-scoped, and support `--all-nodes` for cluster reports. -6. Slurm MVP should include detection, node-level job allocation, and cgroup - PID-to-job mapping. Exact GPU-to-job mapping is best effort. -7. MIG fields should be in schema v2 even if reports initially treat them as - ordinary GPU-like devices. -8. `gua` should become the primary command. `gpu-usage-audit` should remain as - a compatibility alias. - -Still open: - -1. Should the first k8s report transport be `pods/exec`, HTTP export, or both? -2. What default raw retention window is acceptable for busy nodes? -3. Should rollups be computed in the collector process or during report/export? -4. How should fractional sharing from HAMi/vGPU/time-slicing be normalized - across schedulers? 
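
To put the open question about raw retention in numbers, the sketch below extends the plan's busy-node estimate (1 Hz, 10 GPUs, 50 GPU processes each) across the proposed 7 to 14 day raw window. The function name `raw_process_rows` is hypothetical, and the figures are row counts only; on-disk size depends on the final schema and indexes, which this plan does not fix.

```python
SECONDS_PER_DAY = 86_400


def raw_process_rows(sample_hz: int, gpus: int, procs_per_gpu: int, days: int) -> int:
    """Worst-case number of raw process-sample rows retained for `days`."""
    rows_per_second = sample_hz * gpus * procs_per_gpu
    return rows_per_second * SECONDS_PER_DAY * days


if __name__ == "__main__":
    for days in (7, 14):
        rows = raw_process_rows(sample_hz=1, gpus=10, procs_per_gpu=50, days=days)
        print(f"{days} days -> {rows:,} rows")
```

Under these assumptions a single busy node holds roughly 300 to 600 million raw process rows, which is the scale the default retention window and the rollup design have to account for.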
- -## References - -- NVIDIA DCGM Exporter deployment patterns: - https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html -- NVIDIA Container Toolkit GPU environment variables: - https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.18.1/docker-specialized.html -- NVIDIA GPU Operator overview: - https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html -- NVIDIA GPU Operator CDI and GPU Management Containers: - https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/cdi.html -- Kubernetes Device Plugins: - https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/ -- Kubernetes kubelet files and Pod Resources API path: - https://kubernetes.io/docs/reference/node/kubelet-files/ -- Slurm GRES GPU scheduling: - https://slurm.schedmd.com/gres.html -- Slurm `gres.conf`: - https://slurm.schedmd.com/gres.conf.html -- Slurm cgroups: - https://slurm.schedmd.com/cgroups.html -- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN - Training Workloads", USENIX ATC 2019: - https://www.usenix.org/conference/atc19/presentation/jeon -- Hu et al., "Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for - Deep Learning Training Jobs", ASPLOS 2023: - https://doi.org/10.1145/3575693.3575705 diff --git a/src/gpu_usage_audit/__init__.py b/src/gpu_usage_audit/__init__.py index 0c77716..30d729c 100644 --- a/src/gpu_usage_audit/__init__.py +++ b/src/gpu_usage_audit/__init__.py @@ -1,8 +1,7 @@ """gpu-usage-audit — surfaces idle-held NVIDIA GPU memory. -이 패키지의 외부 API 는 아직 *진행 중*. v0.2.0 알파 단계에서 -Go v0.1.0 의 5-section report 를 Python 으로 옮기는 작업이 진행 중. -v0.2.0 stable 까지는 import path 가 바뀔 수 있음. +1.0 scope는 단일 로컬 베어메탈 NVIDIA 호스트의 NVML telemetry를 SQLite에 +기록하고, active / idle-held / truly-idle retrospective report를 출력하는 것. """ # 런타임에서 버전 노출. pyproject.toml 의 [project.version] 과 동기 유지. 
diff --git a/src/gpu_usage_audit/__main__.py b/src/gpu_usage_audit/__main__.py index f67f0b2..8813ab4 100644 --- a/src/gpu_usage_audit/__main__.py +++ b/src/gpu_usage_audit/__main__.py @@ -34,7 +34,6 @@ doctor_report_to_dict, render_doctor, ) -from .env import detect_env_kind from .identity import system_user_lookup from .model import HostMeta from .nvml import NVMLNotAvailableError, NVMLTier @@ -64,6 +63,7 @@ "d": "days", } DEFAULT_DB_PATH = DOCTOR_DEFAULT_DB_PATH +LOCAL_ENV_KIND = "bare" def _duration(s: str) -> timedelta: @@ -225,7 +225,7 @@ def _cmd_daemon(args: argparse.Namespace) -> int: return 1 host = HostMeta( hostname=socket.gethostname() or "unknown", - env_kind=detect_env_kind("/proc"), + env_kind=LOCAL_ENV_KIND, driver_version=driver, first_seen=datetime.now(UTC), ) @@ -295,7 +295,7 @@ def _cmd_demo(args: argparse.Namespace) -> int: driver = tier.probe() host = HostMeta( hostname=socket.gethostname() or "unknown", - env_kind=detect_env_kind("/proc"), + env_kind=LOCAL_ENV_KIND, driver_version=driver, first_seen=datetime.now(UTC), ) diff --git a/src/gpu_usage_audit/doctor.py b/src/gpu_usage_audit/doctor.py index 39476e6..e566bb1 100644 --- a/src/gpu_usage_audit/doctor.py +++ b/src/gpu_usage_audit/doctor.py @@ -16,10 +16,10 @@ from pathlib import Path from typing import Literal -from .model import RuntimePlan from .nvml import NVMLNotAvailableError, _decode, _load_pynvml, nvml_init_error_message type CheckStatus = Literal["ok", "warning", "error", "skipped"] +type ReadinessMode = Literal["host", "unsupported"] type Which = Callable[[str], str | None] DEFAULT_COMMAND_TIMEOUT_SECONDS = 3.0 @@ -114,11 +114,21 @@ class DetectionFacts: database: DatabaseInfo +@dataclass(slots=True) +class DoctorPlan: + """`gua doctor` 의 로컬 베어메탈 readiness 판정.""" + + mode: ReadinessMode + reasons: list[str] = field(default_factory=list) + blockers: list[str] = field(default_factory=list) + warnings: list[str] = field(default_factory=list) + + @dataclass(slots=True) class 
DoctorReport: generated_at: datetime checks: list[DoctorCheck] - plan: RuntimePlan + plan: DoctorPlan def run_command(cmd: Sequence[str], timeout: float) -> CommandResult: @@ -173,7 +183,7 @@ def build_doctor_report( nvml=nvml_info, database=database_info, ) - plan = select_runtime_plan(facts) + plan = select_doctor_plan(facts) return DoctorReport( generated_at=generated_at, checks=[ @@ -446,15 +456,12 @@ def probe_default_db(db_path: str | Path = DEFAULT_DB_PATH) -> tuple[DatabaseInf ) -def select_runtime_plan(facts: DetectionFacts) -> RuntimePlan: +def select_doctor_plan(facts: DetectionFacts) -> DoctorPlan: blockers = _unsupported_blockers(facts) warnings = _host_warnings(facts) if blockers: - return RuntimePlan( + return DoctorPlan( mode="unsupported", - telemetry="nvml", - scheduler="none", - confidence="high", reasons=[ "This command only audits the local machine, and host readiness is incomplete." ], @@ -462,21 +469,14 @@ def select_runtime_plan(facts: DetectionFacts) -> RuntimePlan: warnings=warnings, ) - return RuntimePlan( + return DoctorPlan( mode="host", - telemetry="nvml", - scheduler="none", - confidence="high", reasons=[ f"Local NVML initialized and sees {facts.nvml.device_count} GPU(s).", "`nvidia-smi -L` lists GPUs on this machine.", "The 1.0 workflow writes local NVML samples to a local SQLite database.", ], warnings=warnings, - required_privileges=[ - "permission to read NVML GPU and process state", - "write access to the collector database path", - ], ) @@ -535,7 +535,7 @@ def doctor_report_to_dict(report: DoctorReport) -> dict[str, object]: "read_only": True, "no_system_changes": True, "checks": [doctor_check_to_dict(check) for check in report.checks], - "plan": runtime_plan_to_dict(report.plan), + "plan": doctor_plan_to_dict(report.plan), } if report.plan.mode == "host": data["recommended_commands"] = _recommended_commands_for(report) @@ -552,18 +552,12 @@ def doctor_check_to_dict(check: DoctorCheck) -> dict[str, object]: } -def 
runtime_plan_to_dict(plan: RuntimePlan) -> dict[str, object]: +def doctor_plan_to_dict(plan: DoctorPlan) -> dict[str, object]: return { "mode": plan.mode, - "telemetry": plan.telemetry, - "scheduler": plan.scheduler, - "confidence": plan.confidence, "reasons": plan.reasons, "blockers": plan.blockers, "warnings": plan.warnings, - "required_privileges": plan.required_privileges, - # schema_version=1 호환을 위해 RuntimePlan 모델 필드 없이 빈 리스트를 유지한다. - "actions": [], } diff --git a/src/gpu_usage_audit/env.py b/src/gpu_usage_audit/env.py deleted file mode 100644 index 8c5015d..0000000 --- a/src/gpu_usage_audit/env.py +++ /dev/null @@ -1,57 +0,0 @@ -"""호스트 환경 분류 — `/proc/1/cgroup` 의 마지막 필드를 보고 bare/docker/k8s 결정. - -PID 1 은 부팅 직후 커널이 띄우는 init — bare 머신이면 systemd 관리 -경로(`/system.slice/...`, `/init.scope` 등), 컨테이너 안이면 -`/docker/...` 또는 `/kubepods/...` 같은 시그니처가 등장한다. - -매칭 우선순위: k8s → docker → bare → unknown. -- k8s 를 먼저 보는 이유: k8s 파드는 내부적으로 docker/containerd 위에 - 도는 경우가 흔해 docker 시그니처가 false positive 가 될 수 있다. -- unknown 은 silent 폴백 — *알 수 없는 환경* 을 "bare 인 척" 하면 위험. -""" - -from __future__ import annotations - -from pathlib import Path - - -def detect_env_kind(proc_root: str | Path = "/proc") -> str: - """`proc_root/1/cgroup` 을 읽고 "bare"/"docker"/"k8s"/"unknown" 반환. - - Args: - proc_root: 일반적으로 `/proc`. 테스트에서는 t.TempDir() 같은 - pyfakefs 대신 *실 파일* 픽스처를 깔아도 동작 — Go 의 - DetectEnvKind 와 동일한 시그니처. - - Returns: - 분류 문자열. 파일 부재/읽기 실패 시 "unknown". - """ - path = Path(proc_root) / "1" / "cgroup" - try: - data = path.read_text() - except OSError: - return "unknown" - - if "kubepods" in data: - return "k8s" - if "docker" in data or "containerd" in data: - return "docker" - - # cgroup 라인 형식: "::" (v1) 또는 - # "0::" (v2). 마지막 필드가 systemd 관리 경로면 bare. 
- for raw_line in data.splitlines(): - line = raw_line.strip() - if not line: - continue - parts = line.split(":", 2) - if len(parts) != 3: - continue - p = parts[2] - if ( - p == "/" - or p == "/init.scope" - or p.startswith("/system.slice") - or p.startswith("/user.slice") - ): - return "bare" - return "unknown" diff --git a/src/gpu_usage_audit/model.py b/src/gpu_usage_audit/model.py index 198d131..aad0f23 100644 --- a/src/gpu_usage_audit/model.py +++ b/src/gpu_usage_audit/model.py @@ -10,15 +10,9 @@ from dataclasses import dataclass, field from datetime import datetime -from typing import Literal from .classify import Class -RuntimeMode = Literal["host", "unsupported"] -TelemetrySource = Literal["nvml"] -SchedulerSource = Literal["none"] -PlanConfidence = Literal["high", "medium", "low"] - @dataclass(slots=True) class GPUSample: @@ -58,9 +52,9 @@ class Snapshot: class HostMeta: """데몬 startup 에 한 번 결정하고 *수명 내내 들고 다니는* 호스트 컨텍스트. - hostname/env_kind/driver_version 은 데몬 lifetime 동안 변하지 않는다는 - 가정. first_seen 은 host row 의 immutable 필드 (재시작 후에도 첫 - INSERT 시각 보존), last_seen 은 매 틱 갱신. + 1.0은 로컬 베어메탈 호스트 전용이므로 env_kind 는 "bare" 로 기록한다. + hostname/env_kind/driver_version 은 데몬 lifetime 동안 변하지 않는다는 가정. + first_seen 은 host row 의 immutable 필드, last_seen 은 매 틱 갱신. """ hostname: str @@ -69,20 +63,6 @@ class HostMeta: first_seen: datetime -@dataclass(slots=True) -class RuntimePlan: - """`gua doctor` 가 만든 로컬 호스트 readiness 판정.""" - - mode: RuntimeMode - telemetry: TelemetrySource - scheduler: SchedulerSource - confidence: PlanConfidence - reasons: list[str] = field(default_factory=list) - blockers: list[str] = field(default_factory=list) - warnings: list[str] = field(default_factory=list) - required_privileges: list[str] = field(default_factory=list) - - @dataclass(slots=True) class HostRow: """report 측이 host 테이블에서 읽어 헤더에 노출하는 모양. 
diff --git a/src/gpu_usage_audit/tier.py b/src/gpu_usage_audit/tier.py index 786f30a..b2ccac3 100644 --- a/src/gpu_usage_audit/tier.py +++ b/src/gpu_usage_audit/tier.py @@ -1,8 +1,7 @@ -"""데이터 소스 추상 + 학습/테스트용 FakeTier. +"""GPU telemetry source 추상 + demo/test용 FakeTier. Tier 는 "한 틱의 GPU 텔레메트리를 어디서 받아오는가" 의 추상. 운영용 -NVMLTier (v0.2.0 후속에서 추가) 와 학습/테스트용 FakeTier 가 같은 -자리에 꽂힌다. +NVMLTier 와 demo/test용 FakeTier 가 같은 자리에 꽂힌다. Python 에는 typing.Protocol — Go 의 interface 와 *구조적 호환* (implements 선언 불필요). FakeTier 와 NVMLTier 가 같은 모양을 가지면 diff --git a/tests/test_doctor.py b/tests/test_doctor.py index 37d657c..30ccf97 100644 --- a/tests/test_doctor.py +++ b/tests/test_doctor.py @@ -62,8 +62,6 @@ def test_build_doctor_report_checks_only_local_bare_metal(tmp_path: Path) -> Non ] assert runner.calls == [("nvidia-smi", "-L")] assert report.plan.mode == "host" - assert report.plan.telemetry == "nvml" - assert report.plan.scheduler == "none" rendered = render_doctor(report) assert "Scope:\n machine: local" in rendered @@ -324,7 +322,8 @@ def test_doctor_report_json_is_local_scope(tmp_path: Path) -> None: assert isinstance(plan, dict) assert isinstance(checks, list) assert plan["mode"] == "host" - assert plan["actions"] == [] + assert "scheduler" not in plan + assert "actions" not in plan assert [check["id"] for check in checks if isinstance(check, dict)] == [ "os", "nvidia_devices", diff --git a/tests/test_env.py b/tests/test_env.py deleted file mode 100644 index d68dd5d..0000000 --- a/tests/test_env.py +++ /dev/null @@ -1,55 +0,0 @@ -"""DetectEnvKind 테스트. 
Go v0.1.0 의 TestDetectEnvKind 와 동일 케이스.""" - -from __future__ import annotations - -from pathlib import Path - -import pytest - -from gpu_usage_audit.env import detect_env_kind - - -@pytest.mark.parametrize( - ("name", "content", "want"), - [ - ( - "k8s — kubepods 경로", - "12:devices:/kubepods/besteffort/pod-abc/container-xyz\n", - "k8s", - ), - ( - "k8s 우선순위 — kubepods + docker 둘 다", - "12:devices:/kubepods/...\n11:cpu:/docker/abc\n", - "k8s", - ), - ("docker — docker 경로", "12:devices:/docker/abcdef\n", "docker"), - ("docker — containerd 경로", "12:devices:/containerd/xyz\n", "docker"), - ("bare — system.slice", "0::/system.slice/gpu-audit.service\n", "bare"), - ("bare — init.scope", "0::/init.scope\n", "bare"), - ("bare — 루트 경로", "0::/\n", "bare"), - ("bare — user.slice", "0::/user.slice/user-1000.slice\n", "bare"), - ("unknown — 모르는 경로", "0::/some/weird/path\n", "unknown"), - ], -) -def test_detect_env_kind_from_content( - tmp_path: Path, - name: str, - content: str, - want: str, -) -> None: - proc_dir = tmp_path / "1" - proc_dir.mkdir() - (proc_dir / "cgroup").write_text(content) - got = detect_env_kind(tmp_path) - assert got == want, f"{name}: got {got!r}, want {want!r}\n content={content!r}" - - -def test_detect_env_kind_missing_file(tmp_path: Path) -> None: - # proc_root 자체는 존재하지만 1/cgroup 파일 없음 — unknown 폴백. - assert detect_env_kind(tmp_path) == "unknown" - - -def test_detect_env_kind_missing_root(tmp_path: Path) -> None: - # proc_root 자체가 없는 경로도 OSError 흡수 → unknown. 
- nonexistent = tmp_path / "does-not-exist" - assert detect_env_kind(nonexistent) == "unknown" diff --git a/tests/test_smoke.py b/tests/test_smoke.py index e45dafd..055c60d 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -24,8 +24,7 @@ gua_main, main, ) -from gpu_usage_audit.doctor import DoctorCheck, DoctorReport -from gpu_usage_audit.model import RuntimePlan +from gpu_usage_audit.doctor import DoctorCheck, DoctorPlan, DoctorReport def test_version_string_is_nonempty() -> None: @@ -207,11 +206,8 @@ def _fake_doctor_report(*, db_path: str | Path = DEFAULT_DB_PATH) -> DoctorRepor details={"path": str(db_path), "is_default": Path(db_path) == DEFAULT_DB_PATH}, ), ], - plan=RuntimePlan( + plan=DoctorPlan( mode="host", - telemetry="nvml", - scheduler="none", - confidence="high", reasons=["Local NVML initialized and sees 2 GPU(s)."], ), )