Model-loading: pre-painted dialog, throttled cold loads, keep_alive setting, startup warm by mrjeeves · Pull Request #210 · mrjeeves/MyOwnLLM

mrjeeves · 2026-05-29T01:19:16Z

Makes loading a local model far less disruptive — especially on laptops where a cold load could freeze the whole machine.

1. Cold-start "loading the model" dialog

A non-blocking dialog over the chat surface while the model loads, with a Cancel button. Adapts copy for local vs. mesh ("Loading <model>…" / "Waiting for <host>…").

Pre-painted before the freeze: before sending, we check Ollama's /api/ps to see if the model is resident. On a predicted cold start we show the dialog and force an actual paint (tick + double rAF) before firing the request — so it's on screen ahead of any load freeze. Warm sends keep a lightweight 5s reactive fallback timer.
Tears down on the first agent frame; not re-armed for warm later turns.
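
The cold-start prediction comes down to asking whether the target model is among the names `/api/ps` reports as loaded. A minimal sketch of that comparison, assuming the loaded names have already been extracted from the response; `normalize`, `is_resident` and `should_prepaint` are hypothetical helpers, not the PR's actual functions:

```rust
/// Ollama treats a model name without a tag as `<name>:latest`,
/// so both sides are normalised before comparing.
/// (Assumption: names with a registry host and port are out of scope here.)
fn normalize(name: &str) -> String {
    if name.contains(':') {
        name.to_string()
    } else {
        format!("{name}:latest")
    }
}

/// True when `model` appears among the names reported as loaded.
fn is_resident(loaded: &[&str], model: &str) -> bool {
    let wanted = normalize(model);
    loaded.iter().any(|name| normalize(name) == wanted)
}

/// Hypothetical decision: pre-paint the dialog only on a predicted cold start.
fn should_prepaint(loaded: &[&str], model: &str) -> bool {
    !is_resident(loaded, model)
}
```

A warm send (model already resident) would skip the pre-paint and rely on the 5s reactive timer instead.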

2. Throttle the load so the machine stays usable

The freeze happens inside the Ollama server while it pages multi-GB weights in from disk. Since the app spawns that server itself, we now throttle it:

Lower IO priority of the spawned server — Linux ionice -c3 (idle) + a small renice; macOS taskpolicy -b; Windows BelowNormal priority class. IO is the real lever (loading is disk-bound), so this keeps the desktop responsive with negligible impact on inference (compute-bound). Best-effort; only applies when we own the process (not a system/tray Ollama).
Cap memory pressure on spawn via OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_NUM_PARALLEL=1, cutting the swap thrash behind the hardest freezes.
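
As a rough illustration of the memory caps, the two environment variables can be set on the child process before it is spawned. This sketch assumes the server is started as `ollama serve`; `ollama_serve_command` is a hypothetical name, and the command is only built here, not spawned:

```rust
use std::process::Command;

/// Build (but do not spawn) the server command with the memory caps applied.
/// The binary path and the `serve` argument are assumptions for illustration.
fn ollama_serve_command(binary: &str) -> Command {
    let mut cmd = Command::new(binary);
    cmd.arg("serve")
        // Keep at most one model resident at a time.
        .env("OLLAMA_MAX_LOADED_MODELS", "1")
        // Serve one request at a time per model.
        .env("OLLAMA_NUM_PARALLEL", "1");
    cmd
}
```

Because the variables are set on the child only, a system or tray Ollama that the app did not spawn is unaffected.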

3. Warm the model proactively at startup

After launch we warm the active chat model in the background (throttled), so the one-time cold load happens at a predictable moment — with the dialog shown — instead of on the first message. Skipped when the model isn't downloaded or keep_alive is 0.
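
The skip conditions can be read as a small predicate. This is only a sketch: `WarmInputs` and `should_warm_on_startup` are hypothetical names, and representing keep_alive as seconds is an assumption:

```rust
/// Hypothetical snapshot of what the startup warm needs to know.
struct WarmInputs {
    /// Whether the active chat model is present locally.
    model_downloaded: bool,
    /// Configured keep_alive in seconds; 0 means "unload immediately".
    keep_alive_secs: i64,
}

/// Warm only when the model exists locally and would actually stay loaded.
fn should_warm_on_startup(inputs: &WarmInputs) -> bool {
    inputs.model_downloaded && inputs.keep_alive_secs != 0
}
```

With keep_alive at 0 the model would be unloaded as soon as the warm request finished, so warming would cost a load without saving one.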

4. Configurable `keep_alive` (the repeated-cold-start fix)

Previously, chat requests never set Ollama's keep_alive and relied on Ollama's 5-minute default; once the app sat idle past that, the model unloaded, making the next message a cold start.

Chat requests (streaming + one-shot) now send keep_alive, read in Rust from config.
Configurable in Settings → Hardware → Performance → Model memory, default 30m, from "Unload immediately" (frees RAM/VRAM for transcription on tight machines) to "Until the app quits".
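
One way to picture the setting is as an enum that encodes to the value of the request's `keep_alive` field. Ollama accepts a duration string such as "30m", 0 to unload immediately, and a negative number to keep the model loaded indefinitely. The `KeepAlive` type below is hypothetical, and mapping "Until the app quits" to -1 is an assumption about how the app might implement that option:

```rust
/// Hypothetical model of the "Model memory" setting.
enum KeepAlive {
    UnloadImmediately,
    Minutes(u32),
    UntilQuit,
}

impl Default for KeepAlive {
    /// The PR's stated default is 30m.
    fn default() -> Self {
        KeepAlive::Minutes(30)
    }
}

impl KeepAlive {
    /// JSON fragment for the `keep_alive` field of a chat request.
    fn to_json(&self) -> String {
        match self {
            // 0 unloads the model right after the response.
            KeepAlive::UnloadImmediately => "0".to_string(),
            // Duration string, e.g. "30m".
            KeepAlive::Minutes(m) => format!("\"{m}m\""),
            // Assumption: a negative value keeps the model resident
            // until the server (and so the app) exits.
            KeepAlive::UntilQuit => "-1".to_string(),
        }
    }
}
```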

5. Live resource readout in the dialog

Shows live CPU / RAM / GPU + disk free so the user can see why a load is slow (e.g. RAM near full → paging from disk).

Reuses the existing usage_live_snapshot command; the LiveSnapshot type was promoted to types.ts and is now shared with the Usage settings tab.
Polled only while the dialog is visible (1.2s), with interval cleanup on teardown.

New backend surface

ollama_model_loaded(model) → bool (via /api/ps)
ollama_warm(model) (ensures running, then warms)
process::lower_priority(pid) — cross-platform best-effort throttle
ollama_keep_alive config field (default 30m)

Verification

pnpm run check — 0 errors, 0 warnings
pnpm run build — succeeds
cargo check + cargo clippy — my code clean (only pre-existing mesh/ unused-import warnings)
cargo test ... ollama — 4 passed

Notes / possible follow-ups

Throttling/mem-caps only apply when MyOwnLLM spawns Ollama. If Ollama runs as a system/tray service (common on Windows/macOS), we don't own the PID — could be extended to discover and throttle it.
The dialog's resource readout is whole-machine, not Ollama-process-specific; an /api/ps-based "model X — 100% GPU, 3.2 GB VRAM" line is a possible follow-up.

https://claude.ai/code/session_01Eze77o5msnfo5CBnJjd3Sd

When the first token doesn't arrive within 5s of sending, surface a small non-blocking dialog explaining the model is loading into memory, with a Cancel button. The dialog tears down on the first frame (delta, tool call, or terminal event) and isn't re-armed for warm later turns.

Two improvements to the model-loading experience:

1. Cold-start fix: chat requests now send Ollama a keep_alive so the model stays resident between turns instead of relying on Ollama's 5-minute default (which caused repeated cold-start reloads). The value is user-configurable in Settings > Hardware > Performance, defaulting to 30m, with options from 'unload immediately' (for memory-tight machines coexisting with transcription) to 'keep until the app quits'. Read in Rust from config so both the streaming and one-shot chat paths pick it up.
2. The model-loading dialog now shows a live CPU / RAM / GPU readout (and disk free) so the user can see why a load is slow, e.g. RAM near full means the model is paging in from disk. Reuses the existing usage_live_snapshot command and the LiveSnapshot type, now promoted to types.ts and shared with the Usage settings tab.

Addresses laptops freezing so hard during a cold model load that the load dialog never paints, and warms the model without locking up the machine:

- Throttle the Ollama server we spawn: lower its IO priority (Linux ionice idle + small renice; macOS taskpolicy -b; Windows BelowNormal) so the disk thrash of paging weights in no longer starves the desktop. Best-effort, only when we own the process.
- Cap memory pressure via OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_NUM_PARALLEL=1 on spawn, cutting the swap thrash that causes the hardest freezes.
- Pre-paint the load dialog: before sending, check /api/ps to see if the model is resident; on a predicted cold start, show the dialog and force a paint BEFORE firing the request so it's on screen ahead of any freeze. Warm starts keep the lightweight 5s reactive timer.
- Warm the chat model in the background at startup (throttled) so the one-time cold load happens at a predictable moment instead of on the first message. Skipped when the model is missing or keep_alive is 0.
- warm() now honors the configured keep_alive instead of a fixed 10m.

The live usage sampler returned None for system-wide CPU% and RAM-used on macOS (only the per-app figures populated), so the load dialog and Usage tab showed app metrics but blank system metrics.

- System CPU%: sum every process's ps %cpu and normalise by core count (single fast call; no host_statistics FFI, no top -l 2 stall).
- System RAM used: (active + wired + compressed) pages x page size from vm_stat (the components Activity Monitor reports as Memory Used), using vm_stat's own header page size for self-consistent math.

Parsing is factored into pure helpers with unit tests so the logic is verified on any host even though the macOS shell calls only run there.
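
To make the macOS math concrete, here is a minimal sketch of what such pure parsing helpers might look like. The function names are hypothetical. The vm_stat labels used ("Pages active:", "Pages wired down:", "Pages occupied by compressor:") are the ones vm_stat prints, but treating exactly these three as "Memory Used" follows the description above rather than the PR's actual code:

```rust
/// Page size from vm_stat's header line, e.g.
/// "Mach Virtual Memory Statistics: (page size of 16384 bytes)".
fn parse_page_size(header: &str) -> Option<u64> {
    let rest = header.split("page size of ").nth(1)?;
    rest.split_whitespace().next()?.parse().ok()
}

/// Read one counter line of the form "Label:      12345.".
fn parse_pages(output: &str, label: &str) -> Option<u64> {
    output.lines().find_map(|line| {
        let value = line.trim().strip_prefix(label)?;
        value.trim().trim_end_matches('.').parse().ok()
    })
}

/// (active + wired + compressed) pages times the header's page size.
fn ram_used_bytes(vm_stat: &str) -> Option<u64> {
    let page_size = parse_page_size(vm_stat.lines().next()?)?;
    let active = parse_pages(vm_stat, "Pages active:")?;
    let wired = parse_pages(vm_stat, "Pages wired down:")?;
    let compressed = parse_pages(vm_stat, "Pages occupied by compressor:")?;
    Some((active + wired + compressed) * page_size)
}

/// Sum a column of per-process %cpu values (the header line and any
/// non-numeric lines are skipped) and normalise by core count.
fn system_cpu_percent(ps_cpu_column: &str, cores: usize) -> Option<f32> {
    if cores == 0 {
        return None;
    }
    let total: f32 = ps_cpu_column
        .lines()
        .filter_map(|line| line.trim().parse::<f32>().ok())
        .sum();
    // Clamp, since per-process samples can briefly sum past 100% per core.
    Some((total / cores as f32).min(100.0))
}
```

Because the helpers take the command output as a string, they can be unit-tested on Linux or Windows with captured macOS output.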

Mechanical, behavior-preserving cleanup via cargo fix / clippy --fix:

- Remove unused re-exports (mesh identity/roster/signing).
- Inline format args in format!/anyhow!/write! across asr, diarize, mesh, transcribe, cli, main.

cargo check is now warning-free. The remaining clippy-only lints (result_large_err, doc-list indentation) need invasive manual changes and are left for a focused follow-up.

macOS inference was crippled because the throttle used taskpolicy -b (background QoS), which demotes the whole process to efficiency cores and throttles compute, not just disk. Switch to IO-only throttling so the machine stays responsive during a load while token generation runs at full speed:

- macOS: taskpolicy -d throttle (disk IO policy only; CPU/QoS untouched).
- Linux: ionice best-effort low (-c2 -n7) instead of idle, and drop the renice so inference keeps full CPU.
- Windows: unchanged (BelowNormal is a mild priority nudge, not a compute throttle).

Daemon binary search no longer logs a 'skipping ...' line for every probed-but-inapplicable location on the happy path. Reasons are now collected and printed only when the search actually fails (no usable binary, or every candidate fails to spawn).

Clean up the warnings surfaced by a Windows build (verified via x86_64-pc-windows-gnu cross-check):

- usage.rs: drop unused std::ffi::c_void import.
- process.rs: drop redundant CommandExt import (tokio Command has an inherent creation_flags).
- ollama.rs: allow(unreachable_code) on install(): the tail Ok(()) is the Linux/unsupported fallback, unreachable on macOS/Windows by design.
- hardware.rs: cfg-gate the Linux-only parsers' dead_code allowance.

Promote performance settings out of the Hardware tab into their own Performance tab (listed right after Hardware), and make the load throttle user-tunable:

- New ollama_throttle config (off | io | aggressive), default io.
  - off: no throttle (fastest load, can bog the machine down).
  - io: ease disk IO priority only; inference stays full speed (default).
  - aggressive: also demote CPU/QoS; most responsive desktop, slower inference.
- lower_priority() now takes the mode and branches per platform; the Ollama spawn reads the config and skips throttling entirely when off.
- New PerformanceSection.svelte hosts both the keep-model-loaded (keep_alive) and load-throttle settings; removed the inline Performance group from HardwareSection.
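
The three-valued setting maps naturally onto an enum with `io` as the default. A minimal sketch, assuming the config stores the mode as a plain string; the `ThrottleMode` type, `throttle_from_config`, and the fall-back-to-default behaviour for unknown values are all hypothetical:

```rust
use std::str::FromStr;

/// Hypothetical in-memory form of the `ollama_throttle` config value.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum ThrottleMode {
    Off,
    #[default]
    Io,
    Aggressive,
}

impl FromStr for ThrottleMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "off" => Ok(Self::Off),
            "io" => Ok(Self::Io),
            "aggressive" => Ok(Self::Aggressive),
            other => Err(format!("unknown ollama_throttle value: {other}")),
        }
    }
}

/// Assumption: a missing or unrecognised value falls back to the default (io).
fn throttle_from_config(raw: Option<&str>) -> ThrottleMode {
    raw.and_then(|s| s.parse().ok()).unwrap_or_default()
}
```

Falling back rather than failing keeps a hand-edited or stale config from preventing the server from starting.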

cargo fmt --check failed on two lines the earlier cargo fix / clippy --fix pass left wrapped non-canonically (embedder.rs anyhow! call and roster.rs re-export list). Reflow to rustfmt's canonical form. fmt --check, clippy --all-targets, and cargo test all pass locally on the pinned 1.88.0 toolchain.

The io throttle was applied post-spawn via taskpolicy -p, which is a no-op on macOS, so the server ran unthrottled and a load could starve the display/networking and freeze the machine. And the previous fix left the CPU fully open to the server (IO-only), which is what starved the system in the first place.

Fix: throttle at launch with a moderate nice so the server yields CPU to the system (display, networking, WebView) when they need it, but still gets the bulk of the cores when nothing competes: a responsive machine without crippled inference. Applied as an argv prefix (nice execs the target), which is also the only reliable way to set macOS IO policy.

- io (balanced, default): nice -n 10 (+ low best-effort ionice on Linux).
- aggressive: nice -n 19 + idle ionice (Linux) / background QoS (macOS).
- Windows: post-spawn priority class (BelowNormal / Idle).
- Fallback to a direct spawn if the wrapper can't bring Ollama up, so a missing/incompatible tool never disables the LLM.

Restore warm_on_startup to default ON (the load now runs under the throttle, so it won't lock up the machine); it remains a toggle in Settings → Performance.
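
The argv-prefix approach can be sketched as a function that returns the wrapper arguments for each mode and platform, which are then placed in front of the real command. The names `throttle_prefix`, `wrapped_argv` and the `Os` enum are hypothetical, and the exact ordering of the wrappers is an assumption; the nice and ionice values follow the list above, with `ionice -c2 -n7` as best-effort low, `ionice -c3` as idle, and `taskpolicy -b` for background QoS:

```rust
/// Hypothetical throttle modes (mirrors off | io | aggressive).
#[derive(Clone, Copy)]
enum ThrottleMode {
    Off,
    Io,
    Aggressive,
}

/// Platforms that throttle via an argv prefix (Windows adjusts the
/// priority class after spawn instead, so it is not modelled here).
#[derive(Clone, Copy)]
enum Os {
    Linux,
    MacOs,
}

/// Wrapper arguments to place in front of the server command.
fn throttle_prefix(mode: ThrottleMode, os: Os) -> Vec<&'static str> {
    match (mode, os) {
        (ThrottleMode::Off, _) => vec![],
        (ThrottleMode::Io, Os::Linux) => vec!["nice", "-n", "10", "ionice", "-c2", "-n7"],
        (ThrottleMode::Io, Os::MacOs) => vec!["nice", "-n", "10"],
        (ThrottleMode::Aggressive, Os::Linux) => vec!["nice", "-n", "19", "ionice", "-c3"],
        (ThrottleMode::Aggressive, Os::MacOs) => vec!["nice", "-n", "19", "taskpolicy", "-b"],
    }
}

/// Full argv: each wrapper execs the next, ending with the target.
fn wrapped_argv<'a>(mode: ThrottleMode, os: Os, target: &[&'a str]) -> Vec<&'a str> {
    let mut argv: Vec<&'a str> = throttle_prefix(mode, os);
    argv.extend_from_slice(target);
    argv
}
```

If spawning the wrapped argv fails (for example, a tool is missing), the caller can retry with the bare target, matching the direct-spawn fallback described above.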

Replace the floating load dialog with an in-bubble indicator that takes the place of the typing dots while the model loads — no jolting overlay. Minimal prose: a reassurance word that rotates every 3s with a moving shine (recreated per change so it fades in), plus a quiet live CPU/RAM line as proof the machine is still working. The composer's Stop button already covers cancel, so the modal's Cancel/heading/spinner are gone.

Extract the cold-start indicator (rotating shining word + live CPU/RAM) into a reusable LoadingPulse component and use it in two places:

- In chat: still shown in place of the typing dots whenever a call is slow (cold load or a long-running turn). Behavior is unchanged, now via the shared component.
- At startup: when warm-on-startup runs, hold a full-screen loading screen (spinner + LoadingPulse beneath it) over the chat until the model is resident, instead of dropping into a chat that feels sluggish while it competes with the cold load. The chat still mounts behind the screen, so it's ready the moment the screen lifts; a Continue button is the escape hatch.

LoadingPulse self-manages its word rotation and usage poll (mount/unmount lifecycle), so Chat no longer hand-rolls those.

The indicator covers both a cold model load and a slow in-progress turn, so model-specific phrases (Loading the model / Reading the weights / Warming up) wrongly implied a reload mid-chat. Swap for neutral 'work is underway' phrases that fit either case.

claude added 2 commits May 29, 2026 01:19

mrjeeves changed the title ~~Add cold-start model-loading dialog to chat~~ Model-loading dialog: cold-start UX, keep_alive setting, live resource readout May 29, 2026

mrjeeves changed the title ~~Model-loading dialog: cold-start UX, keep_alive setting, live resource readout~~ Model-loading: pre-painted dialog, throttled cold loads, keep_alive setting, startup warm May 29, 2026

claude added 9 commits May 29, 2026 05:04

mrjeeves merged commit d266d1b into main May 29, 2026
4 checks passed

mrjeeves deleted the claude/gifted-tesla-HAIae branch May 29, 2026 06:54

mrjeeves merged 12 commits into main from claude/gifted-tesla-HAIae
