diff --git a/CHANGELOG.md b/CHANGELOG.md index 835eeda..79693f9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,12 @@ # Changelog +## Unreleased + +- Made `gua` the documented command surface for daemon, report, demo, and doctor output. +- Made `gua daemon` start the collector in the background by default, with + `gua daemon --foreground` available for systemd and debugging. +- Added `gua start`, `gua status`, and `gua stop` for background collector management. + ## 1.0.0 - 2026-05-15 Bare-metal 1.0 narrows `gpu-usage-audit` to one clear workflow: inspect the diff --git a/README.md b/README.md index 0056a6f..7b0340a 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ Jupyter notebook open with an 8 GB tensor on the GPU and went to lunch — `nvidia-smi` will show 1% utilization, but the card is *unusable* by anyone else. This tool measures that. -> **Status:** bare-metal 1.0 release candidate. +> **Status:** bare-metal 1.0. > `gua doctor` checks only the current machine. `daemon` records NVML > telemetry from the current NVIDIA host, `report` reads the resulting > SQLite database, and `demo` runs anywhere with fake telemetry. The Go @@ -30,8 +30,10 @@ runtime. If Python downloads are disabled by local policy, install Python uv tool install gpu-usage-audit gua doctor -gpu-usage-audit daemon --interval 30s -gpu-usage-audit report --since 1h --interval 30s +gua daemon --interval 30s +gua status +gua report --since 1h --interval 30s +gua stop ``` `gua doctor` is intentionally read-only. It checks only the current @@ -46,7 +48,8 @@ with GPU UUIDs, so review it before sharing it outside your team. `gua doctor` does not need `sudo`; run it as the same user that will run the daemon. -Available `gua` subcommands: `doctor`. +Available `gua` subcommands: `doctor`, `daemon`, `start`, `status`, +`stop`, `report`, `demo`, `version`, `help`. 
Update or remove the installed tool with uv: @@ -74,8 +77,8 @@ uvx --from "./$WHEEL" gua doctor ## What you get ``` -$ gpu-usage-audit report --since 1h --interval 30s -gpu-usage-audit — lab-a100 (bare, driver 560.35.05) Window: 1:00:00 +$ gua report --since 1h --interval 30s +gua — lab-a100 (bare, driver 560.35.05) Window: 1:00:00 §1 Headline █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░ @@ -113,7 +116,7 @@ The `demo` subcommand records 30 ticks of fake telemetry and prints the report — all in one process, no second shell needed. ```sh -gpu-usage-audit demo +gua demo ``` The bundled `FakeTier` produces a deterministic 5-tick workload — @@ -146,21 +149,28 @@ can collect real telemetry. Then run the collector: ```sh -gpu-usage-audit daemon --interval 30s +gua daemon --interval 30s +gua status ``` -Run the report from another shell: +Run the report: ```sh -gpu-usage-audit report --since 1h --interval 30s +gua report --since 1h --interval 30s +``` + +Stop the background collector when the collection window is done: + +```sh +gua stop ``` If `--db` is omitted, both `daemon` and `report` use `/tmp/gua.db`. `daemon` refuses to start when that database file already exists, so a new collection run does not silently append to an old test database. If `gua doctor` reports that the database already exists, either run -`gpu-usage-audit report` against the existing data or choose a fresh -`--db PATH` for the next daemon run. +`gua report` against the existing data or choose a fresh `--db PATH` for +the next daemon run. > The daemon requires the NVIDIA driver and `libnvidia-ml.so.1`. On a > driverless host it exits with a friendly NVML initialization error. For @@ -168,18 +178,24 @@ new collection run does not silently append to an old test database. If ## Usage -`gpu-usage-audit` has three commands sharing one SQLite file: +`gua` has commands sharing one SQLite file. 
The `gpu-usage-audit` entry +point remains installed for compatibility, but new examples use `gua`. | Command | What it does | | -------- | ----------------------------------------------------------- | -| `daemon` | Long-running background process. Samples real NVML telemetry on every tick and writes to a new database. Stop with Ctrl+C (SIGINT) or `systemctl stop`. NVIDIA host required. | +| `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. | +| `start` | Alias for `gua daemon`. | +| `status` | Shows whether the background collector PID is still running. | +| `stop` | Stops the background collector with SIGTERM. | | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. | | `demo` | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. | -### `daemon` +### `daemon` / `start` ``` -gpu-usage-audit daemon [--db PATH] [--interval D] +gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH] +gua start [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH] +gua daemon --foreground [--db PATH] [--interval D] ``` - `--db PATH` (default `/tmp/gua.db`) — SQLite file to create and write @@ -187,14 +203,21 @@ gpu-usage-audit daemon [--db PATH] [--interval D] is enabled automatically. - `--interval D` (default `30s`) — how often to sample. Accepts `30s`, `1m`, `200ms`, etc. +- `--pid-file PATH` (default `/tmp/gua.pid`) — background PID file. +- `--log-file PATH` (default `/tmp/gua.log`) — stdout/stderr from the + background collector. +- `--foreground` — keep the collector attached to the current process. + Use this for systemd or debugging. -Each tick prints a one-line summary to stdout; on shutdown the cumulative -row count is printed. 
+By default, `gua daemon` returns after the collector starts. Each tick is +written to the log file; on shutdown the cumulative row count is written +there too. `gua daemon --foreground` prints the tick summaries directly +to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`. ### `report` ``` -gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N] +gua report [--db PATH] [--since D] [--interval D] [--width N] ``` - `--db PATH` (default `/tmp/gua.db`) — same SQLite file the daemon writes @@ -211,7 +234,7 @@ gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N] ### `demo` ``` -gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D] +gua demo [--db PATH] [--ticks N] [--interval D] ``` - `--db PATH` (optional) — if omitted, a fresh temporary database is @@ -223,7 +246,7 @@ gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D] ### Operational notes - **Same `--interval` on both sides.** If you ran the daemon with - `--interval 30s`, run `report --interval 30s` too. + `--interval 30s`, run `gua report --interval 30s` too. - **Let it run for a while.** §1/§3 are meaningful after one tick; §4 (Top identities) needs hours; §5 (Heatmap) needs days. - **WAL leaves sidecar files** (`gua.db-wal`, `gua.db-shm`). 
They are @@ -238,12 +261,12 @@ For a long-running deployment, drop a unit file in ```ini [Unit] -Description=gpu-usage-audit daemon +Description=gua daemon After=network.target [Service] Type=simple -ExecStart=/usr/local/bin/gpu-usage-audit daemon --db /var/lib/gua/gua.db --interval 30s +ExecStart=/usr/local/bin/gua daemon --foreground --db /var/lib/gua/gua.db --interval 30s Restart=on-failure User=gua @@ -283,7 +306,7 @@ uv sync # create .venv, install dev deps uv run pytest # run the test suite uv run ruff check # lint uv run mypy # type-check (strict) -uv run gpu-usage-audit demo # see the report shape locally +uv run gua demo # see the report shape locally ``` CI runs ruff + format check + mypy + pytest, then builds and smoke-tests diff --git a/pyproject.toml b/pyproject.toml index 7e698fe..7e1059a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -23,7 +23,7 @@ dependencies = ["nvidia-ml-py>=12.535"] nvml = [] [project.scripts] -# Entry point that pip / uvx will create. Runs with the one-liner `uvx gpu-usage-audit ...`. +# `gua` is the canonical CLI; `gpu-usage-audit` remains as a compatibility alias. gpu-usage-audit = "gpu_usage_audit.__main__:main" gua = "gpu_usage_audit.__main__:gua_main" diff --git a/scripts/smoke-dist-wheel.sh b/scripts/smoke-dist-wheel.sh index 37a7270..51cfbc3 100644 --- a/scripts/smoke-dist-wheel.sh +++ b/scripts/smoke-dist-wheel.sh @@ -113,4 +113,4 @@ if "NVML initialization failed" in summary: raise SystemExit(f"summary still has duplicate init prefix: {summary}") PY -"$tmpdir/venv/bin/gpu-usage-audit" demo --ticks 1 --interval 1ms >/dev/null +"$tmpdir/venv/bin/gua" demo --ticks 1 --interval 1ms >/dev/null diff --git a/src/gpu_usage_audit/__main__.py b/src/gpu_usage_audit/__main__.py index 8813ab4..84fb731 100644 --- a/src/gpu_usage_audit/__main__.py +++ b/src/gpu_usage_audit/__main__.py @@ -1,7 +1,10 @@ -"""CLI entry point. Both `python -m gpu_usage_audit` and `uvx gpu-usage-audit` land here. +"""CLI entry point. 
`gua` is the canonical CLI; `gpu-usage-audit` is the compatibility CLI. Subcommands: daemon load real NVIDIA NVML telemetry into SQLite (operational, background) + start daemon alias + status check background collector status + stop stop the background collector report §1~§5 retrospective report from the accumulated DB demo demo only: record 30 ticks of fake telemetry, then report immediately (one process) version print the version @@ -13,13 +16,18 @@ from __future__ import annotations import argparse +import contextlib import json import logging +import os import re +import signal import socket +import subprocess import sys import tempfile import threading +import time from datetime import UTC, datetime, timedelta from pathlib import Path @@ -63,7 +71,11 @@ "d": "days", } DEFAULT_DB_PATH = DOCTOR_DEFAULT_DB_PATH +DEFAULT_PID_PATH = Path("/tmp/gua.pid") +DEFAULT_LOG_PATH = Path("/tmp/gua.log") +DISPLAY_COMMAND_ENV = "GPU_USAGE_AUDIT_DISPLAY_COMMAND" LOCAL_ENV_KIND = "bare" +STARTUP_CHECK_SECONDS = 0.3 def _duration(s: str) -> timedelta: @@ -164,11 +176,84 @@ def build_parser() -> argparse.ArgumentParser: return parser +def _add_daemon_args(parser: argparse.ArgumentParser) -> None: + parser.add_argument( + "--db", + default=str(DEFAULT_DB_PATH), + help=f"Path to a new SQLite database file [default: {DEFAULT_DB_PATH}]", + ) + parser.add_argument( + "--interval", + type=_duration, + default=timedelta(seconds=30), + help="Tick interval (e.g. 30s, 1m, 200ms) [default: 30s]", + ) + + +def _add_report_args(parser: argparse.ArgumentParser) -> None: + parser.add_argument( + "--db", + default=str(DEFAULT_DB_PATH), + help=f"Path to SQLite database file [default: {DEFAULT_DB_PATH}]", + ) + parser.add_argument( + "--since", + type=_duration, + default=timedelta(hours=1), + help="Report window (e.g. 
1h, 24h, 5m) [default: 1h]", + ) + parser.add_argument( + "--interval", + type=_duration, + default=timedelta(seconds=30), + help="Daemon tick interval — for §2 Waste / §4 time conversion [default: 30s]", + ) + parser.add_argument( + "--width", + type=int, + default=60, + help="Width of the headline bar [default: 60]", + ) + + +def _add_demo_args(parser: argparse.ArgumentParser) -> None: + parser.add_argument( + "--db", + default=None, + help="SQLite database path [default: a fresh temporary file]", + ) + parser.add_argument( + "--ticks", + type=int, + default=30, + help="Number of fake ticks to record before printing the report [default: 30]", + ) + parser.add_argument( + "--interval", + type=_duration, + default=timedelta(seconds=1), + help="Tick interval for the fake daemon [default: 1s]", + ) + + +def _add_runtime_file_args(parser: argparse.ArgumentParser) -> None: + parser.add_argument( + "--pid-file", + default=str(DEFAULT_PID_PATH), + help=f"Background daemon PID file [default: {DEFAULT_PID_PATH}]", + ) + parser.add_argument( + "--log-file", + default=str(DEFAULT_LOG_PATH), + help=f"Background daemon log file [default: {DEFAULT_LOG_PATH}]", + ) + + def build_gua_parser() -> argparse.ArgumentParser: - """Build the local bare-metal `gua` command surface.""" + """Build the user-facing `gua` command surface.""" parser = argparse.ArgumentParser( prog="gua", - description="Local bare-metal readiness checks for gpu-usage-audit.", + description="Audit local bare-metal NVIDIA GPU usage.", formatter_class=argparse.RawDescriptionHelpFormatter, epilog='Use "gua -h" for command-specific flags.', ) @@ -192,6 +277,58 @@ def build_gua_parser() -> argparse.ArgumentParser: ) p_doctor.set_defaults(func=_cmd_gua_doctor) + p_daemon = sub.add_parser( + "daemon", + help="Start the collector in the background", + ) + _add_daemon_args(p_daemon) + _add_runtime_file_args(p_daemon) + p_daemon.add_argument( + "--foreground", + action="store_true", + help="Run in the foreground 
instead of starting a background 
process", + ) + p_daemon.set_defaults(func=_cmd_gua_daemon) + + p_start = sub.add_parser( + "start", + help="Alias for `gua daemon`", + ) + _add_daemon_args(p_start) + _add_runtime_file_args(p_start) + p_start.set_defaults(func=_cmd_gua_start) + + p_status = sub.add_parser( + "status", + help="Show background collector status", + ) + _add_runtime_file_args(p_status) + p_status.set_defaults(func=_cmd_gua_status) + + p_stop = sub.add_parser( + "stop", + help="Stop the background collector", + ) + _add_runtime_file_args(p_stop) + p_stop.set_defaults(func=_cmd_gua_stop) + + p_report = sub.add_parser( + "report", + help="Print §1–§5 retrospective report", + ) + _add_report_args(p_report) + p_report.set_defaults(func=_cmd_report, display_command="gua report") + + p_demo = sub.add_parser( + "demo", + help="Run a self-contained demo with fake telemetry", + ) + _add_demo_args(p_demo) + p_demo.set_defaults(func=_cmd_demo) + + sub.add_parser("version", help="Print version") + sub.add_parser("help", help="Show this message") + return parser @@ -205,54 +342,181 @@ def _cmd_gua_doctor(args: argparse.Namespace) -> int: return 0 +def _cmd_gua_daemon(args: argparse.Namespace) -> int: + if args.foreground: + args.display_command = "gua daemon --foreground" + return _cmd_daemon(args) + return _cmd_gua_start(args) + + +def _cmd_gua_start(args: argparse.Namespace) -> int: + db_path = Path(args.db) + pid_path = Path(args.pid_file) + log_path = Path(args.log_file) + + existing_pid = _read_pid(pid_path) + if existing_pid is not None and _pid_alive(existing_pid): + print(f"gua daemon: already running (pid {existing_pid})") + return 0 + if existing_pid is not None: + _unlink_if_exists(pid_path) + + if db_path.exists(): + print( + f"gua daemon: {db_path} already exists; " + "run `gua report` for existing data or choose another --db path.", + file=sys.stderr, + ) + return 2 + + pid_path.parent.mkdir(parents=True, exist_ok=True) + log_path.parent.mkdir(parents=True, exist_ok=True) + command = 
[ + sys.executable, + "-m", + "gpu_usage_audit", + "daemon", + "--db", + str(db_path), + "--interval", + _duration_cli_value(args.interval), + ] + env = os.environ.copy() + env[DISPLAY_COMMAND_ENV] = "gua daemon --foreground" + with log_path.open("ab") as log: + proc = subprocess.Popen( + command, + stdin=subprocess.DEVNULL, + stdout=log, + stderr=log, + env=env, + start_new_session=True, + ) + + pid_path.write_text(f"{proc.pid}\n", encoding="utf-8") + time.sleep(STARTUP_CHECK_SECONDS) + rc = proc.poll() + if rc is not None: + _unlink_if_exists(pid_path) + print(f"gua daemon: failed to start (exit {rc}); log: {log_path}", file=sys.stderr) + tail = _tail_text(log_path) + if tail: + print(tail, file=sys.stderr) + return rc or 1 + + print(f"gua daemon: started pid {proc.pid}") + print(f" db: {db_path}") + print(f" log: {log_path}") + print(" stop: gua stop") + return 0 + + +def _cmd_gua_status(args: argparse.Namespace) -> int: + pid_path = Path(args.pid_file) + log_path = Path(args.log_file) + pid = _read_pid(pid_path) + if pid is None: + print("gua daemon: not running") + return 0 + if _pid_alive(pid): + print(f"gua daemon: running (pid {pid})") + print(f" pid file: {pid_path}") + print(f" log: {log_path}") + return 0 + print(f"gua daemon: not running (stale pid {pid})") + _unlink_if_exists(pid_path) + return 0 + + +def _cmd_gua_stop(args: argparse.Namespace) -> int: + pid_path = Path(args.pid_file) + pid = _read_pid(pid_path) + if pid is None: + print("gua daemon: not running") + return 0 + if not _pid_alive(pid): + _unlink_if_exists(pid_path) + print(f"gua daemon: not running (removed stale pid {pid})") + return 0 + + try: + os.kill(pid, signal.SIGTERM) + except PermissionError: + print(f"gua daemon: permission denied stopping pid {pid}", file=sys.stderr) + return 1 + except ProcessLookupError: + _unlink_if_exists(pid_path) + print(f"gua daemon: not running (removed stale pid {pid})") + return 0 + + deadline = time.monotonic() + 5.0 + while time.monotonic() < 
deadline: + if not _pid_alive(pid): + _unlink_if_exists(pid_path) + print(f"gua daemon: stopped pid {pid}") + return 0 + time.sleep(0.1) + + print(f"gua daemon: sent SIGTERM to pid {pid}, but it is still running", file=sys.stderr) + return 1 + + def _cmd_daemon(args: argparse.Namespace) -> int: """Real NVML daemon, for operational use.""" + display_command = getattr( + args, + "display_command", + os.environ.get(DISPLAY_COMMAND_ENV, "gpu-usage-audit daemon"), + ) db_path = Path(args.db) if db_path.exists(): print( - f"gpu-usage-audit daemon: {db_path} already exists; " + f"{display_command}: {db_path} already exists; " "choose another --db path or remove the existing file before starting.", file=sys.stderr, ) return 2 - conn = open_db(args.db) tier = NVMLTier() try: try: driver = tier.probe() except NVMLNotAvailableError as e: - print(f"gpu-usage-audit daemon: {e}", file=sys.stderr) + print(f"{display_command}: {e}", file=sys.stderr) return 1 - host = HostMeta( - hostname=socket.gethostname() or "unknown", - env_kind=LOCAL_ENV_KIND, - driver_version=driver, - first_seen=datetime.now(UTC), - ) - stop = threading.Event() - install_signal_handlers(stop) - run_daemon( - tier=tier, - db=conn, - host=host, - interval=args.interval, - lookup=system_user_lookup, - stop=stop, - ) - total = conn.execute("SELECT COUNT(*) FROM gpu_sample").fetchone()[0] - print(f"\n{args.db}: {total} total gpu_sample rows") - return 0 + conn = open_db(args.db) + try: + host = HostMeta( + hostname=socket.gethostname() or "unknown", + env_kind=LOCAL_ENV_KIND, + driver_version=driver, + first_seen=datetime.now(UTC), + ) + stop = threading.Event() + install_signal_handlers(stop) + run_daemon( + tier=tier, + db=conn, + host=host, + interval=args.interval, + lookup=system_user_lookup, + stop=stop, + ) + total = conn.execute("SELECT COUNT(*) FROM gpu_sample").fetchone()[0] + print(f"\n{args.db}: {total} total gpu_sample rows") + return 0 + finally: + conn.close() finally: tier.close() - conn.close() def 
_cmd_report(args: 
argparse.Namespace) -> int: + display_command = getattr(args, "display_command", "gpu-usage-audit report") db_path = Path(args.db) if not db_path.exists(): print( - f"gpu-usage-audit report: {db_path} does not exist; " - "run `gpu-usage-audit daemon` first or pass --db PATH.", + f"{display_command}: {db_path} does not exist; " + "run `gua daemon` first or pass --db PATH.", file=sys.stderr, ) return 2 @@ -333,12 +597,7 @@ def _cmd_demo(args: argparse.Namespace) -> int: def main(argv: list[str] | None = None) -> int: """Entry point. Uses sys.argv when argv is None.""" - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s %(message)s", - datefmt="%Y/%m/%d %H:%M:%S", - stream=sys.stderr, - ) + _configure_logging() parser = build_parser() args = parser.parse_args(argv) @@ -359,9 +618,16 @@ def main(argv: list[str] | None = None) -> int: def gua_main(argv: list[str] | None = None) -> int: """Entry point for the new `gua` command surface.""" + _configure_logging() parser = build_gua_parser() args = parser.parse_args(argv) + if args.command == "version": + print(__version__) + return 0 + if args.command == "help": + parser.print_help() + return 0 if hasattr(args, "func"): result: int = args.func(args) return result @@ -370,5 +636,59 @@ def gua_main(argv: list[str] | None = None) -> int: return 2 +def _configure_logging() -> None: + logging.basicConfig( + level=logging.INFO, + format="%(asctime)s %(message)s", + datefmt="%Y/%m/%d %H:%M:%S", + stream=sys.stderr, + ) + + +def _duration_cli_value(value: timedelta) -> str: + seconds = value.total_seconds() + milliseconds = seconds * 1000 + if 0 < milliseconds < 1000 and milliseconds.is_integer(): + return f"{int(milliseconds)}ms" + return f"{seconds:g}s" + + +def _read_pid(path: Path) -> int | None: + try: + raw = path.read_text(encoding="utf-8").strip() + except OSError: + return None + if not raw: + return None + try: + pid = int(raw) + except ValueError: + return None + return pid if pid > 0 else None + + +def 
_pid_alive(pid: 
int) -> bool: + try: + os.kill(pid, 0) + except ProcessLookupError: + return False + except PermissionError: + return True + return True + + +def _unlink_if_exists(path: Path) -> None: + with contextlib.suppress(FileNotFoundError): + path.unlink() + + +def _tail_text(path: Path, *, max_lines: int = 12) -> str: + try: + lines = path.read_text(encoding="utf-8", errors="replace").splitlines() + except OSError: + return "" + return "\n".join(lines[-max_lines:]) + + if __name__ == "__main__": raise SystemExit(main()) diff --git a/src/gpu_usage_audit/doctor.py b/src/gpu_usage_audit/doctor.py index e566bb1..2f60bd2 100644 --- a/src/gpu_usage_audit/doctor.py +++ b/src/gpu_usage_audit/doctor.py @@ -24,8 +24,8 @@ DEFAULT_COMMAND_TIMEOUT_SECONDS = 3.0 DEFAULT_DB_PATH = Path("/tmp/gua.db") -COLLECT_COMMAND = "gpu-usage-audit daemon --interval 30s" -REPORT_COMMAND = "gpu-usage-audit report --since 1h --interval 30s" +COLLECT_COMMAND = "gua daemon --interval 30s" +REPORT_COMMAND = "gua report --since 1h --interval 30s" @dataclass(slots=True) @@ -618,13 +618,13 @@ def _recommended_commands_for(report: DoctorReport) -> dict[str, str]: def _collect_command(db_path: str) -> str: if Path(db_path) == DEFAULT_DB_PATH: return COLLECT_COMMAND - return f"gpu-usage-audit daemon --db {shlex.quote(db_path)} --interval 30s" + return f"gua daemon --db {shlex.quote(db_path)} --interval 30s" def _report_command(db_path: str) -> str: if Path(db_path) == DEFAULT_DB_PATH: return REPORT_COMMAND - return f"gpu-usage-audit report --db {shlex.quote(db_path)} --since 1h --interval 30s" + return f"gua report --db {shlex.quote(db_path)} --since 1h --interval 30s" def _short_error(result: CommandResult) -> str: @@ -669,7 +669,7 @@ def _host_warnings(facts: DetectionFacts) -> list[str]: warnings: list[str] = [] if facts.database.exists and facts.database.is_file: warnings.append( - f"{facts.database.path} already exists; `gpu-usage-audit daemon` will refuse " + f"{facts.database.path} already exists; `gua 
daemon` will refuse " "this path until it is removed or another --db path is provided." ) elif ( diff --git a/src/gpu_usage_audit/render.py b/src/gpu_usage_audit/render.py index 2f39fd0..c3881b2 100644 --- a/src/gpu_usage_audit/render.py +++ b/src/gpu_usage_audit/render.py @@ -40,15 +40,12 @@ def render_headline( assigned so the sum always equals width, keeping the last cell from looking empty. """ if not host.hostname: - print( - f"gpu-usage-audit (no host row — daemon hasn't run yet?) Window: {since}\n", - file=w, - ) + print(f"gua (no host row — daemon hasn't run yet?) Window: {since}\n", file=w) else: ctx = host.env_kind if host.driver_version: ctx = f"{host.env_kind}, driver {host.driver_version}" - print(f"gpu-usage-audit — {host.hostname} ({ctx}) Window: {since}\n", file=w) + print(f"gua — {host.hostname} ({ctx}) Window: {since}\n", file=w) print("§1 Headline", file=w) if h.samples == 0: diff --git a/tests/test_doctor.py b/tests/test_doctor.py index 30ccf97..8aabdd6 100644 --- a/tests/test_doctor.py +++ b/tests/test_doctor.py @@ -70,6 +70,8 @@ def test_build_doctor_report_checks_only_local_bare_metal(tmp_path: Path) -> Non assert "NVML: ok, initialized, GPU count=2, driver 560.35.05" in rendered assert "status: absent, ready for a new daemon run" in rendered assert "Recommended commands:" in rendered + assert "collect: gua daemon --" in rendered + assert "report after collecting: gua report --" in rendered assert "Kubernetes" not in rendered assert "Slurm" not in rendered assert "Docker" not in rendered @@ -257,10 +259,10 @@ def test_custom_db_path_is_rendered_and_shell_quoted(tmp_path: Path) -> None: rendered = render_doctor(report) quoted = f"'{db_path}'" assert f"target: {db_path}" in rendered - assert f"collect: gpu-usage-audit daemon --db {quoted} --interval 30s" in rendered + assert f"collect: gua daemon --db {quoted} --interval 30s" in rendered assert ( - f"report after collecting: gpu-usage-audit report --db {quoted} --since 1h --interval 30s" - ) in 
rendered + f"report after collecting: gua 
report --db {quoted} --since 1h --interval 30s" in rendered + ) def test_nvidia_smi_counts_mig_instances(tmp_path: Path) -> None: diff --git a/tests/test_render.py b/tests/test_render.py index 53eae9c..90ab7f6 100644 --- a/tests/test_render.py +++ b/tests/test_render.py @@ -40,7 +40,7 @@ def test_render_headline_with_host_and_samples() -> None: timedelta(hours=1), width=60, ) - assert "gpu-usage-audit — lab-a100 (bare, driver 560.35.05)" in out + assert "gua — lab-a100 (bare, driver 560.35.05)" in out assert "Window: 1:00:00" in out assert "§1 Headline" in out assert "(8 samples)" in out diff --git a/tests/test_smoke.py b/tests/test_smoke.py index 055c60d..8298a9d 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -12,12 +12,14 @@ import tomllib from datetime import UTC, datetime, timedelta from pathlib import Path +from typing import Any import pytest from gpu_usage_audit import __version__ from gpu_usage_audit.__main__ import ( DEFAULT_DB_PATH, + DISPLAY_COMMAND_ENV, _duration, build_gua_parser, build_parser, @@ -25,6 +27,7 @@ main, ) from gpu_usage_audit.doctor import DoctorCheck, DoctorPlan, DoctorReport +from gpu_usage_audit.nvml import NVMLNotAvailableError def test_version_string_is_nonempty() -> None: @@ -66,6 +69,10 @@ def test_gua_parser_registers_command_surface() -> None: assert ns.command == "doctor" assert ns.db == "/var/lib/gua/gua.db" + for cmd in ("daemon", "start", "status", "stop", "report", "demo", "version", "help"): + ns = p.parse_args([cmd]) + assert ns.command == cmd + def _required_args_for(cmd: str) -> list[str]: # --db is optional for daemon/report/demo; version/help take no extra arguments. 
@@ -157,6 +164,124 @@ def test_daemon_refuses_existing_db_before_nvml( assert f"{db_path} already exists" in captured.err +def test_daemon_does_not_create_db_when_nvml_is_unavailable( + monkeypatch: pytest.MonkeyPatch, + tmp_path: Path, + capsys: pytest.CaptureFixture[str], +) -> None: + db_path = tmp_path / "gua.db" + + class FailingTier: + def probe(self) -> str: + raise NVMLNotAvailableError("NVML unavailable") + + def close(self) -> None: + pass + + monkeypatch.setattr("gpu_usage_audit.__main__.NVMLTier", FailingTier) + + rc = main(["daemon", "--db", str(db_path)]) + captured = capsys.readouterr() + assert rc == 1 + assert "NVML unavailable" in captured.err + assert not db_path.exists() + + +def test_gua_daemon_background_refuses_existing_db_before_start( + tmp_path: Path, + capsys: pytest.CaptureFixture[str], +) -> None: + db_path = tmp_path / "gua.db" + db_path.write_text("existing", encoding="utf-8") + + rc = gua_main( + [ + "daemon", + "--db", + str(db_path), + "--pid-file", + str(tmp_path / "gua.pid"), + "--log-file", + str(tmp_path / "gua.log"), + ] + ) + + captured = capsys.readouterr() + assert rc == 2 + assert f"{db_path} already exists" in captured.err + assert "gua report" in captured.err + + +def test_gua_daemon_foreground_uses_foreground_daemon_path( + tmp_path: Path, + capsys: pytest.CaptureFixture[str], +) -> None: + db_path = tmp_path / "gua.db" + db_path.write_text("existing", encoding="utf-8") + + rc = gua_main(["daemon", "--foreground", "--db", str(db_path)]) + + captured = capsys.readouterr() + assert rc == 2 + assert f"gua daemon --foreground: {db_path} already exists" in captured.err + + +def test_gua_daemon_background_starts_subprocess( + monkeypatch: pytest.MonkeyPatch, + tmp_path: Path, + capsys: pytest.CaptureFixture[str], +) -> None: + pid_file = tmp_path / "gua.pid" + log_file = tmp_path / "gua.log" + db_path = tmp_path / "gua.db" + seen: dict[str, Any] = {} + + class FakeProc: + pid = 4242 + + def poll(self) -> int | None: + return 
None + + def fake_popen(command: list[str], **kwargs: Any) -> FakeProc: + seen["command"] = command + seen["kwargs"] = kwargs + return FakeProc() + + monkeypatch.setattr("gpu_usage_audit.__main__.subprocess.Popen", fake_popen) + monkeypatch.setattr("gpu_usage_audit.__main__.time.sleep", lambda _seconds: None) + + rc = gua_main( + [ + "daemon", + "--db", + str(db_path), + "--interval", + "200ms", + "--pid-file", + str(pid_file), + "--log-file", + str(log_file), + ] + ) + + captured = capsys.readouterr() + command = seen["command"] + kwargs = seen["kwargs"] + assert rc == 0 + assert pid_file.read_text(encoding="utf-8") == "4242\n" + assert command[:3] == [sys.executable, "-m", "gpu_usage_audit"] + assert command[3:] == [ + "daemon", + "--db", + str(db_path), + "--interval", + "200ms", + ] + assert kwargs["env"][DISPLAY_COMMAND_ENV] == "gua daemon --foreground" + assert kwargs["start_new_session"] is True + assert "started pid 4242" in captured.out + + def test_report_refuses_missing_db_without_creating_it( tmp_path: Path, capsys: pytest.CaptureFixture[str], @@ -170,6 +295,33 @@ def test_report_refuses_missing_db_without_creating_it( assert not db_path.exists() +def test_gua_report_refuses_missing_db_without_creating_it( + tmp_path: Path, + capsys: pytest.CaptureFixture[str], +) -> None: + db_path = tmp_path / "missing.db" + + rc = gua_main(["report", "--db", str(db_path)]) + captured = capsys.readouterr() + assert rc == 2 + assert f"gua report: {db_path} does not exist" in captured.err + assert "gua daemon" in captured.err + assert not db_path.exists() + + +def test_gua_status_and_stop_are_idempotent_without_pid_file( + tmp_path: Path, + capsys: pytest.CaptureFixture[str], +) -> None: + pid_file = tmp_path / "missing.pid" + + assert gua_main(["status", "--pid-file", str(pid_file)]) == 0 + assert "not running" in capsys.readouterr().out + + assert gua_main(["stop", "--pid-file", str(pid_file)]) == 0 + assert "not running" in capsys.readouterr().out + + def 
_fake_doctor_report(*, db_path: str | Path = DEFAULT_DB_PATH) -> DoctorReport: return DoctorReport( generated_at=datetime(2026, 5, 14, 0, 0, tzinfo=UTC), @@ -251,6 +403,28 @@ def test_demo_command_records_and_prints_report( assert db_path.exists() +def test_gua_demo_command_records_and_prints_report( + tmp_path: Path, + capsys: pytest.CaptureFixture[str], +) -> None: + db_path = tmp_path / "demo.db" + rc = gua_main( + [ + "demo", + "--db", + str(db_path), + "--ticks", + "1", + "--interval", + "10ms", + ] + ) + captured = capsys.readouterr() + assert rc == 0 + assert "§1 Headline" in captured.out + assert db_path.exists() + + @pytest.mark.parametrize( ("text", "want"), [