Model-loading: pre-painted dialog, throttled cold loads, keep_alive setting, startup warm#210
Merged
Conversation
When the first token doesn't arrive within 5s of sending, surface a small non-blocking dialog explaining the model is loading into memory, with a Cancel button. The dialog tears down on the first frame (delta, tool call, or terminal event) and isn't re-armed for warm later turns.
Two improvements to the model-loading experience: 1. Cold-start fix: chat requests now send Ollama a keep_alive so the model stays resident between turns instead of relying on Ollama's 5-minute default (which caused repeated cold-start reloads). The value is user-configurable in Settings > Hardware > Performance, defaulting to 30m, with options from 'unload immediately' (for memory-tight machines coexisting with transcription) to 'keep until the app quits'. Read in Rust from config so both the streaming and one-shot chat paths pick it up. 2. The model-loading dialog now shows a live CPU / RAM / GPU readout (and disk free) so the user can see why a load is slow — e.g. RAM near full means the model is paging in from disk. Reuses the existing usage_live_snapshot command and the LiveSnapshot type, now promoted to types.ts and shared with the Usage settings tab.
Addresses laptops freezing so hard during a cold model load that the load dialog never paints, and warms the model without locking up the machine: - Throttle the Ollama server we spawn: lower its IO priority (Linux ionice idle + small renice; macOS taskpolicy -b; Windows BelowNormal) so the disk thrash of paging weights in no longer starves the desktop. Best-effort, only when we own the process. - Cap memory pressure via OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_NUM_PARALLEL=1 on spawn, cutting the swap thrash that causes the hardest freezes. - Pre-paint the load dialog: before sending, check /api/ps to see if the model is resident; on a predicted cold start, show the dialog and force a paint BEFORE firing the request so it's on screen ahead of any freeze. Warm starts keep the lightweight 5s reactive timer. - Warm the chat model in the background at startup (throttled) so the one-time cold load happens at a predictable moment instead of on the first message. Skipped when the model is missing or keep_alive is 0. - warm() now honors the configured keep_alive instead of a fixed 10m.
The live usage sampler returned None for system-wide CPU% and RAM-used on macOS (only the per-app figures populated), so the load dialog and Usage tab showed app metrics but blank system metrics. - System CPU%: sum every process's ps %cpu and normalise by core count (single fast call; no host_statistics FFI, no top -l 2 stall). - System RAM used: (active + wired + compressed) pages x page size from vm_stat — the components Activity Monitor reports as Memory Used — using vm_stat's own header page size for self-consistent math. Parsing is factored into pure helpers with unit tests so the logic is verified on any host even though the macOS shell calls only run there.
Mechanical, behavior-preserving cleanup via cargo fix / clippy --fix: - Remove unused re-exports (mesh identity/roster/signing). - Inline format args in format!/anyhow!/write! across asr, diarize, mesh, transcribe, cli, main. cargo check is now warning-free. The remaining clippy-only lints (result_large_err, doc-list indentation) need invasive manual changes and are left for a focused follow-up.
macOS inference was crippled because the throttle used taskpolicy -b (background QoS), which demotes the whole process to efficiency cores and throttles compute, not just disk. Switch to IO-only throttling so the machine stays responsive during a load while token generation runs at full speed: - macOS: taskpolicy -d throttle (disk IO policy only; CPU/QoS untouched). - Linux: ionice best-effort low (-c2 -n7) instead of idle, and drop the renice so inference keeps full CPU. - Windows: unchanged (BelowNormal is a mild priority nudge, not a compute throttle). Daemon binary search no longer logs a 'skipping ...' line for every probed-but-inapplicable location on the happy path. Reasons are now collected and printed only when the search actually fails (no usable binary, or every candidate fails to spawn). Clean up the warnings surfaced by a Windows build (verified via x86_64-pc-windows-gnu cross-check): - usage.rs: drop unused std::ffi::c_void import. - process.rs: drop redundant CommandExt import (tokio Command has an inherent creation_flags). - ollama.rs: allow(unreachable_code) on install() — the tail Ok(()) is the Linux/unsupported fallback, unreachable on macOS/Windows by design. - hardware.rs: cfg-gate the Linux-only parsers' dead_code allowance.
Promote performance settings out of the Hardware tab into their own
Performance tab (listed right after Hardware), and make the load
throttle user-tunable:
- New ollama_throttle config (off | io | aggressive), default io.
- off: no throttle (fastest load, can bog the machine down).
- io: ease disk IO priority only; inference stays full speed (default).
- aggressive: also demote CPU/QoS; most responsive desktop, slower
inference.
- lower_priority() now takes the mode and branches per platform; the
Ollama spawn reads the config and skips throttling entirely when off.
- New PerformanceSection.svelte hosts both the keep-model-loaded
(keep_alive) and load-throttle settings; removed the inline
Performance group from HardwareSection.
cargo fmt --check failed on two lines the earlier cargo fix / clippy --fix pass left wrapped non-canonically (embedder.rs anyhow! call and roster.rs re-export list). Reflow to rustfmt's canonical form. fmt --check, clippy --all-targets, and cargo test all pass locally on the pinned 1.88.0 toolchain.
The io throttle was applied post-spawn via taskpolicy -p, which is a no-op on macOS, so the server ran unthrottled and a load could starve the display/networking and freeze the machine. And the previous fix left the CPU fully open to the server (IO-only), which is what starved the system in the first place. Fix: throttle at launch with a moderate 0 so the server yields CPU to the system (display, networking, WebView) when they need it, but still gets the bulk of the cores when nothing competes — responsive machine, inference not crippled. Applied as an argv prefix (nice execs the target), which is also the only reliable way to set macOS IO policy. - io (balanced, default): nice -n 10 (+ low best-effort ionice on Linux). - aggressive: nice -n 19 + idle ionice (Linux) / background QoS (macOS). - Windows: post-spawn priority class (BelowNormal / Idle). - Fallback to a direct spawn if the wrapper can't bring Ollama up, so a missing/incompatible tool never disables the LLM. Restore warm_on_startup to default ON (the load now runs under the throttle, so it won't lock up the machine); it remains a toggle in Settings → Performance.
Replace the floating load dialog with an in-bubble indicator that takes the place of the typing dots while the model loads — no jolting overlay. Minimal prose: a reassurance word that rotates every 3s with a moving shine (recreated per change so it fades in), plus a quiet live CPU/RAM line as proof the machine is still working. The composer's Stop button already covers cancel, so the modal's Cancel/heading/spinner are gone.
Extract the cold-start indicator (rotating shining word + live CPU/RAM) into a reusable LoadingPulse component and use it in two places: - In chat: still shown in place of the typing dots whenever a call is slow (cold load or a long-running turn) — unchanged behavior, now via the shared component. - At startup: when warm-on-startup runs, hold a full-screen loading screen (spinner + LoadingPulse beneath it) over the chat until the model is resident, instead of dropping into a chat that feels sluggish while it competes with the cold load. The chat still mounts behind the screen, so it's ready the moment the screen lifts; a Continue button is the escape hatch. LoadingPulse self-manages its word rotation and usage poll (mount/ unmount lifecycle), so Chat no longer hand-rolls those.
The indicator covers both a cold model load and a slow in-progress turn, so model-specific phrases (Loading the model / Reading the weights / Warming up) wrongly implied a reload mid-chat. Swap for neutral 'work is underway' phrases that fit either case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Makes loading a local model far less disruptive — especially on laptops where a cold load could freeze the whole machine.
1. Cold-start "loading the model" dialog
A non-blocking dialog over the chat surface while the model loads, with a Cancel button. Adapts copy for local vs. mesh ("Loading
<model>…" / "Waiting for<host>…")./api/psto see if the model is resident. On a predicted cold start we show the dialog and force an actual paint (tick + double rAF) before firing the request — so it's on screen ahead of any load freeze. Warm sends keep a lightweight 5s reactive fallback timer.2. Throttle the load so the machine stays usable
The freeze happens inside the Ollama server while it pages multi-GB weights in from disk. Since the app spawns that server itself, we now throttle it:
ionice -c3(idle) + a smallrenice; macOStaskpolicy -b; WindowsBelowNormalpriority class. IO is the real lever (loading is disk-bound), so this keeps the desktop responsive with negligible impact on inference (compute-bound). Best-effort; only applies when we own the process (not a system/tray Ollama).OLLAMA_MAX_LOADED_MODELS=1andOLLAMA_NUM_PARALLEL=1, cutting the swap thrash behind the hardest freezes.3. Warm the model proactively at startup
After launch we warm the active chat model in the background (throttled), so the one-time cold load happens at a predictable moment — with the dialog shown — instead of on the first message. Skipped when the model isn't downloaded or
keep_aliveis0.4. Configurable
keep_alive(the repeated-cold-start fix)Chat requests never set Ollama's
keep_alive, so they relied on Ollama's 5-minute default; idle past that and the model unloaded, making the next message a cold start.keep_alive, read in Rust from config.5. Live resource readout in the dialog
Shows live CPU / RAM / GPU + disk free so the user can see why a load is slow (e.g. RAM near full → paging from disk).
usage_live_snapshotcommand; theLiveSnapshottype was promoted totypes.tsand is now shared with the Usage settings tab.New backend surface
ollama_model_loaded(model)→ bool (via/api/ps)ollama_warm(model)(ensures running, then warms)process::lower_priority(pid)— cross-platform best-effort throttleollama_keep_aliveconfig field (default30m)Verification
pnpm run check— 0 errors, 0 warningspnpm run build— succeedscargo check+cargo clippy— my code clean (only pre-existingmesh/unused-import warnings)cargo test ... ollama— 4 passedNotes / possible follow-ups
/api/ps-based "model X — 100% GPU, 3.2 GB VRAM" line is a possible follow-up.https://claude.ai/code/session_01Eze77o5msnfo5CBnJjd3Sd