Module Reference

Applies to v1.0.0 — updated 2026-05-02

Public interfaces for every module in llm-valet. Internal helpers (prefixed _) are omitted unless they are part of a documented contract.

`llm_valet/api.py`

`create_app(settings)`

def create_app(settings: Settings | None = None) -> FastAPI

Application factory. Constructs the provider and collector from settings, wires up the Watchdog, registers all security middleware, and attaches all route handlers. When settings is None, calls load_settings() automatically.

Called once at module level to produce the app singleton that uvicorn serves. Tests call it directly to inject a custom Settings.

Middleware stack (outermost to innermost):

Layer	Purpose
`CORSMiddleware`	Blocks cross-origin requests not listed in `cors_origins`
Body size check (`@app.middleware`)	Rejects bodies > 64 KB before JSON parsing; handles chunked encoding
`TrustedHostMiddleware`	Rejects `Host` headers not in the allowlist (DNS rebinding mitigation)
`require_api_key` (per-route dep)	Enforces `X-API-Key` for non-localhost callers

Startup sequence (via FastAPI lifespan):

1. Checks for root (os.getuid() == 0 → immediate exit).
2. Configures JSON rotating-file logging and suppresses httpx/uvicorn access noise.
3. Runs an overcommit check: if a model is already loaded and its footprint exceeds ram_pause_pct, logs a structured warning.
4. Starts watchdog.run() as an asyncio task.

Shutdown sequence: calls watchdog.stop(), cancels the watchdog task.
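The start/stop pairing can be sketched with the standard library alone. This is a hedged illustration, not llm-valet's lifespan: `FakeWatchdog`, `lifespan`, and `demo` are hypothetical names, and the root check, logging setup, and overcommit check are left out.

```python
import asyncio
import contextlib


class FakeWatchdog:
    """Hypothetical stand-in for Watchdog; only run() and stop() are modeled."""

    def __init__(self) -> None:
        self.running = False
        self.ticks = 0

    async def run(self) -> None:
        self.running = True
        while self.running:
            self.ticks += 1
            await asyncio.sleep(0.01)

    async def stop(self) -> None:
        self.running = False


@contextlib.asynccontextmanager
async def lifespan(watchdog: FakeWatchdog):
    # Startup: run the watchdog loop as a background task.
    task = asyncio.create_task(watchdog.run())
    try:
        yield task
    finally:
        # Shutdown: ask the loop to stop, then cancel the task.
        await watchdog.stop()
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task


async def demo() -> tuple[int, bool]:
    watchdog = FakeWatchdog()
    async with lifespan(watchdog) as task:
        await asyncio.sleep(0.05)  # the application would serve requests here
    return watchdog.ticks, task.done()
```

Running `asyncio.run(demo())` returns the number of ticks observed and whether the task had finished by the time the context exited.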

First-run setup flow (`/setup/*`)

Three endpoints handle first-run API key acknowledgment and network binding configuration. All three are localhost-only (_is_local() guard) and excluded from the OpenAPI schema.

Path	Method	Description
`/setup`	GET	Returns `{needs_setup, api_key}`. `api_key` is only included for localhost requests before `key_acknowledged` is set; after acknowledgment it is always `null`.
`/setup/acknowledge`	POST	Marks the key as seen; persists `key_acknowledged: true` to `config.yaml`.
`/setup/apply`	POST	Validates and applies a `{host, port}` change, persists to `config.yaml`, then triggers a graceful restart via `os._exit(0)` after a 1s delay so the HTTP response returns first.

`_is_local(request)`

def _is_local(request: Request) -> bool

Returns True if the request client address is 127.0.0.1 or ::1. Used to gate the /setup/* endpoints and the single-display of the generated API key.

`require_api_key` dependency

async def require_api_key(
    request: Request,
    x_api_key: Annotated[str, Header()] = "",
) -> None

FastAPI dependency injected via Auth = Annotated[None, Depends(require_api_key)]. Skips auth when the client address is 127.0.0.1 or ::1. For all other clients, requires a non-empty api_key in config and compares the X-API-Key header against it with hmac.compare_digest (constant-time). Responds with HTTP 403 if api_key is not configured and HTTP 401 if the key does not match.
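The decision order can be expressed as a small pure function. This is a sketch under stated assumptions: `auth_status` and `LOCAL_ADDRESSES` are hypothetical names, and the real dependency works on a FastAPI `Request` and raises rather than returning a status code.

```python
from __future__ import annotations

import hmac

LOCAL_ADDRESSES = frozenset({"127.0.0.1", "::1"})


def auth_status(client_host: str | None, presented_key: str, configured_key: str) -> int:
    """Return the HTTP status the rules above would produce (200 = allowed)."""
    if client_host in LOCAL_ADDRESSES:
        return 200  # localhost callers skip the key check
    if not configured_key:
        return 403  # no api_key configured: remote access is disabled
    # compare_digest avoids leaking where the two values first differ.
    # Encoding to bytes keeps it safe for non-ASCII input.
    if hmac.compare_digest(presented_key.encode(), configured_key.encode()):
        return 200
    return 401
```

For example, `auth_status("192.168.1.20", "wrong", "secret")` yields 401, while the same call from `"127.0.0.1"` yields 200 regardless of the key.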

Endpoint Reference

Method	Path	Auth	Description
GET	`/`	No	Serve `static/index.html` or fallback JSON if static dir absent
GET	`/status`	Yes	Provider state + resource snapshot + watchdog state + overcommit flag
GET	`/watchdog`	Yes	Watchdog FSM state and last transition reason
GET	`/metrics`	Yes	Live `SystemMetrics` from `ResourceCollector`
POST	`/pause`	Yes	Manual graceful pause (rate-limited: 2s cooldown)
POST	`/pause/force`	Yes	Force-evict model by killing runner processes then `keep_alive=0`
POST	`/resume`	Yes	Manual resume — pre-warm model (rate-limited: 2s cooldown)
GET	`/models`	Yes	List all locally available models
POST	`/load`	Yes	Load a specific model; unloads current model first if different
DELETE	`/models/{model_name}`	Yes	Delete a model from local storage
POST	`/models/pull`	Yes	Pull model from registry; requires 5 GB free disk (rate-limited: 5s)
POST	`/start`	Yes	Full service start via svcmgr; returns immediately (rate-limited: 3s)
POST	`/stop`	Yes	Graceful service shutdown via svcmgr; returns immediately (rate-limited: 3s)
POST	`/stop/force`	Yes	Kill runners then stop service; returns immediately (rate-limited: 3s, shared key with `/stop`)
POST	`/restart`	Yes	stop → 2s sleep → start; returns immediately (rate-limited: 3s)
GET	`/config`	Yes	Read current `ResourceThresholds` as JSON
PUT	`/config`	Yes	Partial threshold update; validates hysteresis invariant; persists to `config.yaml`
GET	`/docs`	No	Auto-generated OpenAPI UI (FastAPI default)
GET	`/setup`	No	First-run key display (localhost only)
POST	`/setup/acknowledge`	No	Mark key as seen (localhost only)
POST	`/setup/apply`	No	Apply host/port config and restart (localhost only)

The /start, /stop, /stop/force, and /restart endpoints return immediately with {"ok": true, "action": "..."} and complete the operation in a FastAPI BackgroundTask. Poll /status to observe the result.
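The per-endpoint cooldowns listed in the table could be implemented along these lines. `Cooldown` is a hypothetical helper written for illustration; how llm-valet actually stores its rate-limit state, and what response it sends when a call is rejected, is not stated in this reference.

```python
import time


class Cooldown:
    """Hypothetical per-key cooldown: one accepted call per `seconds` window."""

    def __init__(self, seconds: float, clock=time.monotonic) -> None:
        self._seconds = seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._last: dict[str, float] = {}

    def allow(self, key: str) -> bool:
        now = self._clock()
        last = self._last.get(key)
        if last is not None and now - last < self._seconds:
            return False  # still cooling down; the caller would reject the request
        self._last[key] = now
        return True
```

Because /stop and /stop/force share one rate-limit key, a single `Cooldown(3.0)` queried with the same key string for both routes would reproduce that behavior.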

`llm_valet/watchdog.py`

`WatchdogState`

class WatchdogState(enum.Enum):
    RUNNING       = "running"
    PAUSING       = "pausing"
    PAUSED        = "paused"
    RESUMING      = "resuming"
    PROVIDER_DOWN = "provider_down"

`Watchdog`

class Watchdog:
    def __init__(
        self,
        provider: LLMProvider,
        collector: ResourceCollector,
        thresholds: ResourceThresholds,
    ) -> None

Constructs the watchdog with injected provider and collector. Never calls psutil or any platform API directly for resource data — all platform specifics are delegated to collector. Creates a ThresholdEngine from thresholds.

Properties

Property	Type	Description
`state`	`WatchdogState`	Current FSM state. Read by `api.py` for `/status` and `/watchdog`.
`last_reason`	`str`	Structured reason string from the most recent state transition. Examples: `"RAM 87.3% >= 85.0% threshold"`, `"game detected — steamapps/common/Hades"`, `"manual pause"`.

Methods

async def run(self) -> None

Main loop. Runs until stop() is called. Calls _tick() every check_interval_seconds. Catches and logs any exception from _tick() without stopping the loop.

async def stop(self) -> None

Sets the running flag to False, causing run() to exit after the current sleep. Does not call provider.pause() — the provider is left in its current state.
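A reduced model of the run()/stop() contract, showing why a failing tick does not end the loop. `LoopSketch` is hypothetical and the simulated failure is only there to exercise the exception path; the real `_tick()` collects metrics and drives the FSM.

```python
import asyncio
import logging

logger = logging.getLogger("watchdog_sketch")


class LoopSketch:
    """Hypothetical reduction of Watchdog.run()/stop(); not the real class."""

    def __init__(self, interval: float) -> None:
        self._interval = interval
        self._running = False
        self.ticks = 0

    async def _tick(self) -> None:
        self.ticks += 1
        if self.ticks == 2:
            raise RuntimeError("simulated collector failure")

    async def run(self) -> None:
        self._running = True
        while self._running:
            try:
                await self._tick()
            except Exception:
                # Log and keep going: one bad tick must not end the loop.
                logger.exception("tick failed")
            await asyncio.sleep(self._interval)

    async def stop(self) -> None:
        self._running = False


async def demo() -> int:
    loop = LoopSketch(interval=0.01)
    task = asyncio.create_task(loop.run())
    await asyncio.sleep(0.1)
    await loop.stop()
    await task  # run() returns once the current sleep finishes
    return loop.ticks
```

The tick count returned by `demo()` keeps growing past the second tick, which is the one that raises.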

def notify_manual_pause(self) -> None

Called by api.py after a successful POST /pause or POST /pause/force. Sets state to PAUSED, records _paused_at = time.monotonic(), and sets last_reason = "manual pause". Syncs the grace period clock so auto-resume behaves correctly after a manual pause.

def notify_manual_resume(self) -> None

Called by api.py after a successful POST /resume or POST /load. Sets state to RUNNING, clears _paused_at and _pause_trigger, and sets last_reason = "manual resume". Bypasses evaluate_resume() — the model is already loaded.

`llm_valet/config.py`

`Settings`

@dataclass
class Settings:
    host: str = "127.0.0.1"
    port: int = 8765
    provider: str = "ollama"
    ollama_url: str = "http://127.0.0.1:11434"
    model_name: str | None = None
    api_key: str = ""
    key_acknowledged: bool = False
    cors_origins: list[str] = field(default_factory=list)
    extra_allowed_hosts: list[str] = field(default_factory=list)
    thresholds: ResourceThresholds = field(default_factory=ResourceThresholds)
    log_file: str = "~/.llm-valet/valet.log"

Field	Type	Default	Description
`host`	`str`	`"127.0.0.1"`	Bind address. Use `"0.0.0.0"` for LAN access (requires `api_key`)
`port`	`int`	`8765`	Listen port
`provider`	`str`	`"ollama"`	Provider name; only `"ollama"` is supported in v1.0
`ollama_url`	`str`	`"http://127.0.0.1:11434"`	Ollama base URL; validated to localhost or RFC1918 only (T6)
`model_name`	`str \| None`	`None`	Default model for pause/resume; `None` uses the currently loaded model
`api_key`	`str`	`""`	API key for LAN access; empty disables all non-localhost requests
`key_acknowledged`	`bool`	`False`	True after first-run setup flow completes
`cors_origins`	`list[str]`	`[]`	Explicit CORS origin allowlist; never `"*"`
`extra_allowed_hosts`	`list[str]`	`[]`	Additional hosts for `TrustedHostMiddleware`
`thresholds`	`ResourceThresholds`	(see below)	Pause/resume threshold configuration
`log_file`	`str`	`"~/.llm-valet/valet.log"`	Rotating JSON log file path (tilde-expanded)

`Settings` methods

def acknowledge_key(self) -> None

Sets key_acknowledged = True and persists to config.yaml.

def apply_network_config(self, host: str, port: int) -> None

Updates host and port, sets key_acknowledged = True, and persists.

def update_thresholds(self, data: dict[str, Any]) -> dict[str, Any]

Applies a partial threshold update dict. Validates all percentage fields are in (0, 100], validates check_interval_seconds >= 1, enforces the ram_resume_pct < ram_pause_pct hysteresis invariant. Raises ValueError on any violation. Persists on success and returns the full updated threshold dict.
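The validation rules can be sketched as a pure function over plain dicts. `validate_thresholds` and `PCT_FIELDS` are hypothetical names; this sketch does not persist anything and does not reject unknown keys, which the real method may handle differently.

```python
from __future__ import annotations

from typing import Any

PCT_FIELDS = ("ram_pause_pct", "ram_resume_pct", "cpu_pause_pct", "gpu_vram_pause_pct")


def validate_thresholds(current: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Merge a partial update into current values, raising ValueError on violations."""
    merged = {**current, **update}
    for name in PCT_FIELDS:
        value = merged[name]
        if not 0 < value <= 100:
            raise ValueError(f"{name} must be in (0, 100], got {value}")
    if merged["check_interval_seconds"] < 1:
        raise ValueError("check_interval_seconds must be >= 1")
    # Hysteresis invariant: the resume threshold sits strictly below the pause threshold.
    if not merged["ram_resume_pct"] < merged["ram_pause_pct"]:
        raise ValueError("ram_resume_pct must be below ram_pause_pct")
    return merged
```

Validating the merged result rather than the update alone matters: raising only `ram_resume_pct` to 90.0 is invalid against the default `ram_pause_pct` of 85.0 even though 90.0 is a legal percentage on its own.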

`load_settings()`

def load_settings() -> Settings

Loads ~/.llm-valet/config.yaml if it exists, warns if the file is readable by group or other users (either read bit set), applies YAML values, then applies env var overrides. Returns a Settings instance with defaults for any missing values. If config.yaml is corrupt YAML, logs an error and continues with defaults.

ollama_url from YAML is validated — scheme must be http or https, host must be localhost, ::1, a .local mDNS name, or an RFC1918 address. Invalid values are logged and ignored.
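A possible shape for the URL check, using only `urllib.parse` and `ipaddress`. `is_allowed_ollama_url` is a hypothetical name, and accepting the whole loopback range (not just the literals listed above) is an assumption of this sketch. The RFC1918 ranges are checked explicitly because `ipaddress`'s `is_private` covers more than those three networks.

```python
import ipaddress
from urllib.parse import urlsplit

RFC1918_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
)


def is_allowed_ollama_url(url: str) -> bool:
    """Accept http(s) URLs whose host is loopback, a .local name, or RFC1918."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # e.g. malformed IPv6 brackets
    host = parts.hostname  # lowercased, brackets stripped from IPv6 literals
    if parts.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".local"):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # any other hostname is rejected
    if address.is_loopback:
        return True  # assumption: all of 127.0.0.0/8 plus ::1
    return any(address in network for network in RFC1918_NETWORKS)
```

Under these rules `http://192.168.1.50:11434` and `http://[::1]:11434` pass, while `http://8.8.8.8:11434` and `http://172.32.0.1` do not.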

`_apply_env_overrides(settings)`

def _apply_env_overrides(settings: Settings) -> None

Applies environment variable overrides after YAML loading. Env vars take precedence over config.yaml.

Variable	Field	Notes
`LLM_VALET_HOST`	`host`	Bind address override
`LLM_VALET_PORT`	`port`	Must be a valid integer; warning logged and default kept on parse failure
`LLM_VALET_API_KEY`	`api_key`	API key override
`LLM_VALET_PROVIDER`	`provider`	Provider name override
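The override step in the table above might look roughly like this. `MiniSettings` is a hypothetical four-field stand-in for `Settings`, and the environment is passed in as a mapping so the function can be exercised without touching the real process environment.

```python
from __future__ import annotations

import logging
from collections.abc import Mapping
from dataclasses import dataclass

logger = logging.getLogger("config_sketch")


@dataclass
class MiniSettings:
    """Hypothetical subset of Settings covering only the overridable fields."""

    host: str = "127.0.0.1"
    port: int = 8765
    provider: str = "ollama"
    api_key: str = ""


def apply_env_overrides(settings: MiniSettings, env: Mapping[str, str]) -> None:
    """Overwrite fields in place for each LLM_VALET_* variable that is present."""
    if "LLM_VALET_HOST" in env:
        settings.host = env["LLM_VALET_HOST"]
    if "LLM_VALET_PORT" in env:
        try:
            settings.port = int(env["LLM_VALET_PORT"])
        except ValueError:
            # Keep the existing port and warn instead of failing startup.
            logger.warning("ignoring non-integer LLM_VALET_PORT")
    if "LLM_VALET_API_KEY" in env:
        settings.api_key = env["LLM_VALET_API_KEY"]
    if "LLM_VALET_PROVIDER" in env:
        settings.provider = env["LLM_VALET_PROVIDER"]
```

In real use the second argument would be `os.environ`.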

`llm_valet/providers/base.py`

`ProviderStatus`

@dataclass
class ProviderStatus:
    running: bool
    model_loaded: bool
    model_name: str | None
    memory_used_mb: int | None
    size_vram_mb: int | None = None
    loaded_context_length: int | None = None

Field	Type	Description
`running`	`bool`	True if the provider process is reachable
`model_loaded`	`bool`	True if a model is currently resident in memory
`model_name`	`str \| None`	Name of the loaded model, or `None` if none loaded
`memory_used_mb`	`int \| None`	Total memory used by the loaded model in MB (Ollama `/api/ps` field `size`)
`size_vram_mb`	`int \| None`	VRAM portion of model memory in MB (Ollama `/api/ps` field `size_vram`)
`loaded_context_length`	`int \| None`	Active context window in tokens (Ollama `/api/ps` field `context_length`)

`ModelInfo`

@dataclass
class ModelInfo:
    name: str
    size_mb: int
    context_length: int | None

`LLMProvider` ABC

class LLMProvider(ABC):
    @abstractmethod
    async def start(self) -> bool: ...

    @abstractmethod
    async def stop(self) -> bool: ...

    @abstractmethod
    async def pause(self) -> bool: ...

    @abstractmethod
    async def resume(self) -> bool: ...

    @abstractmethod
    async def force_pause(self) -> bool: ...

    @abstractmethod
    async def status(self) -> ProviderStatus: ...

    @abstractmethod
    async def health_check(self) -> bool: ...

    @abstractmethod
    async def list_models(self) -> list[ModelInfo]: ...

    @abstractmethod
    async def load_model(self, model_name: str, num_ctx: int | None = None) -> bool: ...

    @abstractmethod
    async def delete_model(self, model_name: str) -> bool: ...

    @abstractmethod
    async def pull_model(self, model_name: str) -> bool: ...

Methods that return bool use True to indicate success. status() returns a ProviderStatus that callers must not cache, because it reflects live provider state at call time. list_models() returns a list of ModelInfo. health_check() is a lightweight liveness probe (no model state, 5s timeout); status() is a heavier call that includes model metadata via /api/ps.

`llm_valet/providers/ollama.py`

`OllamaProvider`

class OllamaProvider(LLMProvider):
    def __init__(
        self,
        base_url: str = "http://127.0.0.1:11434",
        model_name: str | None = None,
        request_timeout: float = 15.0,
    ) -> None

Parameter	Type	Default	Description
`base_url`	`str`	`"http://127.0.0.1:11434"`	Ollama API base URL (trailing slash stripped)
`model_name`	`str \| None`	`None`	Default model name; `None` triggers auto-detection via `/api/ps`
`request_timeout`	`float`	`15.0`	Default HTTP timeout in seconds for all Ollama API calls

Pause / Resume Mechanism

Both operations go through Ollama's /api/generate endpoint with stream: false:

Pause: POST /api/generate {model, keep_alive: 0, stream: false} — Ollama evicts the model from memory and returns done_reason: "unload". stream: false is required; without it Ollama sends a chunked response and only the first chunk is parsed, so done_reason is never seen.
Resume: POST /api/generate {model, keep_alive: -1, stream: false} — Ollama pre-warms the model into memory. Uses a 60s timeout (model loading from slow storage can take tens of seconds). If _last_loaded_ctx was captured at pause time, it is restored via options: {num_ctx: ...}.

Before pause() sends keep_alive: 0, it calls status() to capture loaded_context_length. The /api/ps endpoint returns empty after eviction, so context length must be captured before the eviction call.
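The request bodies described above can be built as plain dicts. The function names are hypothetical; only the field names and values (`keep_alive`, `stream`, `options.num_ctx`, `done_reason`) come from this section.

```python
from __future__ import annotations

from typing import Any


def pause_payload(model: str) -> dict[str, Any]:
    """Body for POST /api/generate that asks Ollama to evict the model."""
    return {"model": model, "keep_alive": 0, "stream": False}


def resume_payload(model: str, num_ctx: int | None = None) -> dict[str, Any]:
    """Body for POST /api/generate that pre-warms the model."""
    payload: dict[str, Any] = {"model": model, "keep_alive": -1, "stream": False}
    if num_ctx is not None:
        # Restore the context length captured before eviction.
        payload["options"] = {"num_ctx": num_ctx}
    return payload


def was_unloaded(response_body: dict[str, Any]) -> bool:
    """True when a non-streamed response reports that the model was evicted."""
    return response_body.get("done_reason") == "unload"
```

Serializing either payload with `json.dumps` turns `False` into the JSON literal `false` that Ollama expects for `stream`.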

`force_pause`

async def force_pause(self) -> bool

Used when pause() is blocked by an active inference request. Sequence:

1. Call status() to capture model name and context length before the runner is killed.
2. Call _kill_ollama_runners(), which finds and kills processes named ollama_llama_runner or Ollama binaries invoked with a runner subcommand (excluding serve) and returns the count killed. It uses psutil.Process.kill(): no shell, no injection surface.
3. Sleep 500ms to let Ollama register the runner exit.
4. Call pause() (keep_alive: 0) regardless of whether any runner was killed. The keep_alive call signals Ollama to release the model slot and prevents auto-restart.

Falls back gracefully to regular pause() if no runner processes are found.

`_resolve_model()`

async def _resolve_model(self) -> str | None

Returns the model name to act on. Resolution order:

1. self._model_name (from config or constructor): returned if set and passes name validation.
2. First model in /api/ps: the currently loaded model.
3. self._last_loaded_model: cached at pause time. /api/ps returns empty after eviction; this cache allows resume() to restore the right model.

Returns None if no model name can be determined. All names are validated against ^[a-zA-Z0-9:._-]{1,200}$ before use. When pause() gets None back from _resolve_model(), it skips silently and returns True (no model loaded is not an error).
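The resolution order and name validation can be sketched without any I/O. `resolve_model` and `is_valid_name` are hypothetical, and letting an invalid candidate fall through to the next source is an assumption; the text above only states that behavior for the configured name.

```python
from __future__ import annotations

import re

MODEL_NAME_RE = re.compile(r"^[a-zA-Z0-9:._-]{1,200}$")


def is_valid_name(name: str | None) -> bool:
    # fullmatch is used so a trailing newline cannot slip past `$`.
    return bool(name) and MODEL_NAME_RE.fullmatch(name) is not None


def resolve_model(
    configured: str | None,
    loaded: list[str],
    last_loaded: str | None,
) -> str | None:
    """Pick the first valid name: configured, then loaded, then cached."""
    candidates = [configured, loaded[0] if loaded else None, last_loaded]
    for candidate in candidates:
        if is_valid_name(candidate):
            return candidate
    return None
```

Here `loaded` stands in for the model names reported by /api/ps and `last_loaded` for the value cached at pause time.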

`load_model`

async def load_model(self, model_name: str, num_ctx: int | None = None) -> bool

Serialized by _load_lock (asyncio.Lock) — concurrent /load calls are queued, not raced. Sequence: unload current model via keep_alive: 0 if a different model is loaded, then pre-warm the new model via keep_alive: -1. Updates _model_name and _last_loaded_model on success.

num_ctx overrides Ollama's default context window. Must be >= 512 if provided; silently ignored if below that floor.
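How an asyncio.Lock turns concurrent loads into a queue can be seen in a reduced model. `LoaderSketch` is hypothetical; the sleeps stand in for the two Ollama requests and no HTTP is performed.

```python
from __future__ import annotations

import asyncio


class LoaderSketch:
    """Hypothetical reduction of load_model(); records the order of operations."""

    def __init__(self) -> None:
        self._load_lock = asyncio.Lock()
        self.current: str | None = None
        self.events: list[str] = []

    async def load_model(self, model_name: str) -> bool:
        async with self._load_lock:  # second caller waits here until the first finishes
            if self.current is not None and self.current != model_name:
                self.events.append(f"unload {self.current}")
                await asyncio.sleep(0)  # stands in for the keep_alive: 0 request
            self.events.append(f"load {model_name}")
            await asyncio.sleep(0.01)  # stands in for the keep_alive: -1 request
            self.current = model_name
            return True


async def demo() -> list[str]:
    loader = LoaderSketch()
    await asyncio.gather(loader.load_model("a"), loader.load_model("b"))
    return loader.events
```

Without the lock, the second call could start while the first is still awaiting its pre-warm and would see `current` as unset, skipping the unload step.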

`llm_valet/resources/base.py`

`PressureLevel`

class PressureLevel(enum.Enum):
    NORMAL   = "normal"
    WARN     = "warn"
    CRITICAL = "critical"

On macOS, sourced from the memory_pressure CLI (OS-native signal for Apple Silicon unified memory). On Linux and Windows, derived from RAM% thresholds. Reported in /metrics for informational purposes. Not used as a pause trigger — loading a large model on Apple Silicon routinely produces transient CRITICAL readings even within a safe RAM% budget.

Metric Dataclasses

`MemoryMetrics`

@dataclass
class MemoryMetrics:
    total_mb: int
    used_mb: int
    used_pct: float
    pressure: PressureLevel

`CPUMetrics`

@dataclass
class CPUMetrics:
    used_pct: float   # 1-second average
    core_count: int

`GPUMetrics`

@dataclass
class GPUMetrics:
    available: bool           # False if no GPU driver accessible
    vram_total_mb: int | None
    vram_used_mb: int | None
    vram_used_pct: float | None
    compute_pct: float | None

Callers must check available before trusting any other field. On Apple Silicon Macs, the GPU and CPU share unified memory, so the vram_* fields reflect the GPU portion of that unified pool.

`DiskMetrics`

@dataclass
class DiskMetrics:
    path: str           # "/" on macOS/Linux, "C:\\" on Windows
    total_mb: int
    used_mb: int
    free_mb: int
    used_pct: float

Used to gate model pulls — /models/pull rejects requests when free_mb < 5120 (5 GB).
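The pull gate reduces to one comparison. This sketch uses `shutil.disk_usage` from the standard library in place of psutil.disk_usage, and both function names are hypothetical; treating 1 MB as 1024 × 1024 bytes is an assumption consistent with 5120 MB being described as 5 GB.

```python
import shutil

MIN_FREE_MB = 5120  # 5 GB floor quoted above


def free_mb(path: str) -> int:
    """Free space at `path` in whole MB."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def pull_allowed(free_megabytes: int) -> bool:
    """True when there is enough free disk to accept a pull request."""
    return free_megabytes >= MIN_FREE_MB
```

A caller would combine the two as `pull_allowed(free_mb("/"))`, substituting the platform's root path.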

`SystemMetrics`

@dataclass
class SystemMetrics:
    memory: MemoryMetrics
    cpu: CPUMetrics
    gpu: GPUMetrics
    disk: DiskMetrics
    timestamp: float = field(default_factory=time.time)

Complete snapshot returned by ResourceCollector.collect(). timestamp is a Unix epoch float set at collection time.

`ResourceCollector` ABC

class ResourceCollector(ABC):
    @abstractmethod
    def collect(self) -> SystemMetrics: ...

    @abstractmethod
    def supported_metrics(self) -> set[str]: ...

    def collect_disk(self) -> DiskMetrics: ...

collect() returns a full SystemMetrics snapshot. supported_metrics() returns a set of strings indicating which fields are populated from real hardware data — e.g. {"memory", "cpu", "gpu", "pressure", "disk"}. Callers check this before trusting optional GPU fields on platforms where GPU data is unavailable.

collect_disk() is a concrete base implementation using psutil.disk_usage(). It is identical on macOS, Linux, and Windows and does not need to be overridden in platform subclasses.

`ResourceThresholds`

@dataclass
class ResourceThresholds:
    ram_pause_pct: float = 85.0
    ram_resume_pct: float = 60.0
    cpu_pause_pct: float = 90.0
    cpu_sustained_seconds: int = 30
    gpu_vram_pause_pct: float = 85.0
    pause_timeout_seconds: int = 120
    check_interval_seconds: int = 10
    auto_resume_on_ram_pressure: bool = True

Field	Type	Default	Description
`ram_pause_pct`	`float`	`85.0`	RAM% at which to pause; must be > `ram_resume_pct`
`ram_resume_pct`	`float`	`60.0`	RAM% below which resume is allowed; hysteresis gap prevents oscillation
`cpu_pause_pct`	`float`	`90.0`	CPU% threshold for pause trigger
`cpu_sustained_seconds`	`int`	`30`	Seconds CPU must stay above threshold before pausing
`gpu_vram_pause_pct`	`float`	`85.0`	GPU VRAM% at which to pause; triggers immediately (no sustained window)
`pause_timeout_seconds`	`int`	`120`	Grace period in seconds after pressure clears before auto-resume
`check_interval_seconds`	`int`	`10`	Watchdog tick interval in seconds
`auto_resume_on_ram_pressure`	`bool`	`True`	When `False`, RAM-triggered pauses require manual `/resume` to prevent oscillation on machines where the model is the dominant RAM consumer

The config layer enforces ram_resume_pct < ram_pause_pct at load time and on PUT /config. A PUT /config request that violates the invariant is rejected with HTTP 400.

`ThresholdEngine`

class ThresholdEngine:
    def __init__(self, thresholds: ResourceThresholds) -> None

Pure logic, no I/O. Holds a reference to ResourceThresholds for threshold values.

def evaluate(self, metrics: SystemMetrics) -> tuple[bool, str]

Returns (should_pause, reason). Checks in order: RAM, CPU, GPU VRAM. Returns on the first breach. evaluate() itself applies no time window: it returns True as soon as any threshold is exceeded. For RAM and GPU VRAM that breach triggers a pause immediately; for CPU, the watchdog caller tracks cpu_sustained_seconds externally before acting. Reason strings are structured for log parsing, e.g. "RAM 87.3% >= 85.0% threshold".

def evaluate_resume(self, metrics: SystemMetrics) -> tuple[bool, str]

Returns (safe_to_resume, reason). All metrics must be below their resume thresholds for this to return True. RAM uses ram_resume_pct (hysteresis); CPU and GPU use their respective pause thresholds (the grace period provides the sustained buffer on the resume side).
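The pause and resume checks, including the RAM hysteresis gap, can be sketched over a flattened snapshot. `Snapshot`, `Limits`, and the CPU and GPU reason strings are hypothetical; only the RAM reason format appears in this reference.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical flattened stand-in for SystemMetrics."""

    ram_pct: float
    cpu_pct: float
    vram_pct: float | None = None  # None when no GPU data is available


@dataclass
class Limits:
    """Subset of ResourceThresholds with the documented defaults."""

    ram_pause_pct: float = 85.0
    ram_resume_pct: float = 60.0
    cpu_pause_pct: float = 90.0
    gpu_vram_pause_pct: float = 85.0


def evaluate(m: Snapshot, t: Limits) -> tuple[bool, str]:
    """Return (should_pause, reason); the first breach in RAM, CPU, GPU order wins."""
    if m.ram_pct >= t.ram_pause_pct:
        return True, f"RAM {m.ram_pct:.1f}% >= {t.ram_pause_pct:.1f}% threshold"
    if m.cpu_pct >= t.cpu_pause_pct:
        return True, f"CPU {m.cpu_pct:.1f}% >= {t.cpu_pause_pct:.1f}% threshold"
    if m.vram_pct is not None and m.vram_pct >= t.gpu_vram_pause_pct:
        return True, f"GPU VRAM {m.vram_pct:.1f}% >= {t.gpu_vram_pause_pct:.1f}% threshold"
    return False, "all metrics below pause thresholds"


def evaluate_resume(m: Snapshot, t: Limits) -> tuple[bool, str]:
    """Return (safe_to_resume, reason); RAM must drop below the lower resume line."""
    if m.ram_pct >= t.ram_resume_pct:
        return False, f"RAM {m.ram_pct:.1f}% >= {t.ram_resume_pct:.1f}% resume threshold"
    if m.cpu_pct >= t.cpu_pause_pct:
        return False, f"CPU {m.cpu_pct:.1f}% >= {t.cpu_pause_pct:.1f}% threshold"
    if m.vram_pct is not None and m.vram_pct >= t.gpu_vram_pause_pct:
        return False, f"GPU VRAM {m.vram_pct:.1f}% >= {t.gpu_vram_pause_pct:.1f}% threshold"
    return True, "all metrics below resume thresholds"
```

At 70% RAM with the defaults, `evaluate` does not ask for a pause and `evaluate_resume` does not permit a resume: that band between 60% and 85% is the hysteresis gap that keeps the watchdog from oscillating.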

`svcmgr/macos.py`

Manages the Ollama service on macOS. Handles two install variants. Detection checks filesystem paths at call time — no caching.

Variant	Detection condition	Start mechanism	Stop mechanism
App	`/Applications/Ollama.app` directory exists	`open -a Ollama`	`osascript -e 'quit app "Ollama"'`; falls back to psutil SIGTERM by exe path if osascript fails
Brew CLI	`~/Library/LaunchAgents/homebrew.mxcl.ollama.plist` or `com.ollama.ollama.plist` exists	`launchctl bootstrap gui/<uid> <plist>`	`launchctl bootout gui/<uid>/<label>`

def start_service() -> bool

Detects the install variant and starts accordingly. Returns False if neither variant is found.

def stop_service() -> bool

Stops and prevents automatic respawn. For the Brew variant, "No such process" in stderr is treated as success (already stopped). For the App variant, AppleScript failure falls back to psutil SIGTERM by matching the process exe path against /Applications/Ollama.app/Contents/MacOS/Ollama.

All subprocess calls use shell=False. The user domain (gui/<uid>) is used throughout — root is never required.

Supports both homebrew.mxcl.ollama.plist (current formula) and com.ollama.ollama.plist (older formula), checked in that order.
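Variant detection by filesystem path might be structured like this. `detect_variant` is a hypothetical function that takes the two locations as parameters so it can be tried against a temporary directory; checking the App variant before the Brew variant is an assumption based on the table order.

```python
from __future__ import annotations

from pathlib import Path

# Checked in this order, matching the text above.
PLIST_NAMES = ("homebrew.mxcl.ollama.plist", "com.ollama.ollama.plist")


def detect_variant(app_dir: Path, launch_agents: Path) -> tuple[str, Path] | None:
    """Return ("app", path) or ("brew", plist), or None if neither is installed."""
    if app_dir.is_dir():
        return "app", app_dir
    for name in PLIST_NAMES:
        plist = launch_agents / name
        if plist.exists():
            return "brew", plist
    return None
```

On a real machine the arguments would be `Path("/Applications/Ollama.app")` and `Path.home() / "Library" / "LaunchAgents"`. Because nothing is cached, each call reflects the filesystem at that moment.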

`svcmgr/linux.py`

def start_service() -> bool

Detection order:

1. If a systemd user unit ollama.service exists (systemctl --user cat ollama.service returns 0): runs systemctl --user start ollama.service.
2. Otherwise: spawns ollama serve as a detached background process (start_new_session=True) so it survives llm-valet restarts. Searches PATH then /usr/local/bin/ollama, /usr/bin/ollama, ~/.local/bin/ollama.

def stop_service() -> bool

If a systemd user unit exists: runs systemctl --user stop ollama.service and returns True.

If only a root-owned system service exists (the official Ollama Linux installer default) or Ollama is running as a bare process: returns False. ollama.py then handles termination via psutil SIGTERM / SIGKILL fallback. llm-valet never runs as root, so it cannot control root-owned system services directly.

`svcmgr/windows.py`

def start_service() -> bool

Detection order:

1. If a Windows Service named "Ollama" is registered (sc query Ollama returns 0): runs sc start Ollama.
2. Otherwise: launches the Ollama executable directly with DETACHED_PROCESS | CREATE_NO_WINDOW flags so the process survives if llm-valet's console window is closed. Checks %LOCALAPPDATA%\Programs\Ollama\ollama.exe first, then PATH.

def stop_service() -> bool

If a Windows Service exists: runs sc stop Ollama. Treats error 1062 ("service not started") as success.

If Ollama is running as a tray application (the default for the official installer): returns False. ollama.py handles termination via psutil. The Windows Service path is included for enterprise deployments that register Ollama manually — it is not the default install.

All subprocess calls use shell=False.
