Cut inference and AX hot-path waste: P-core threads, gated logprob, halved KV, geometry cache#667
Merged
Merged
Conversation
…lizer statics App half of the #661 performance pass; pairs with cotabbyinference 6e1a9ba (P-core decode threads, gated logprob, halved KV allocation). - LlamaRuntimeCore now tells the engine to skip per-token log-probabilities whenever confidenceFloor == -infinity, the shipping default where ConfidenceSuppressionPolicy returns before reading the value. Every generated token was paying two O(vocab) passes plus a vocab-wide exp() to produce a number that was summed and discarded. Suggestions are byte-identical; raising the floor re-enables the computation per request. - AXHelper.displayGeometries() is cached and invalidated on didChangeScreenParameters instead of rebuilding NSScreen.screens + CGDisplayBounds for every AX rect conversion at the focus-poll cadence. - SuggestionTextNormalizer: <think>-stripping short-circuits on a contains check before compiling either regex (both patterns require the literal tag), and the scaffolding-label list is length-sorted once, statically, instead of on every prediction.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
High-confidence performance work for #661: cuts per-token CPU, decode-thread oversubscription, resident KV RAM, and hot-path rebuild work, with no behavior or UX change in the shipping configuration. Pairs with engine commit
6e1a9baon cotabbyinference main ("Cut idle inference costs: P-core threads, gated logprob, right-sized KV").Engine side (cotabbyinference, already on main):
hw.perflevel0.physicalcpu, physical-core fallback on Intel) instead of all logical cores. Barriered matmul threads on E-cores stall every layer; with full Metal offload the CPU threads only orchestrate, so logical-core oversubscription was wasted wakeups.SampleResult.logprob(two O(vocab) passes + a vocab-wideexp()per generated token) is now computed only when the sequence opts in via the newsetComputeLogprob. Engine default stays ON, so existing callers are untouched.MAX_SEQUENCES4 -> 2 halves the shared KV allocation (n_ctx = window * MAX_SEQUENCES). The app holds at most one live sequence (destroy-before-create everywhere); 2 keeps a spare slot for tests/evals. Per-sequence window unchanged.App side (this PR):
LlamaRuntimeCoredisables the engine's per-token logprob wheneverconfidenceFloor == -infinity— the shipping default, whereConfidenceSuppressionPolicy.shouldSuppressreturns before ever reading the value. In the default config every generated token was paying ~2x vocab-size float ops plus a vocab-wide exp() to produce a number that was summed and discarded. Byte-identical suggestions; the confidence-suppression feature stays fully wired for anyone who raises the floor.AXHelper.displayGeometries()is now cached and invalidated ondidChangeScreenParameters. It was rebuildingNSScreen.screens+CGDisplayBoundsfor every AX rect conversion — many per focus resolve at the poll cadence — for identical results between display changes.SuggestionTextNormalizer:<think>-block stripping now short-circuits with acontainscheck before any regex work (both patterns require the literal tag; most completions have none), and the scaffolding-label list is length-sorted once instead of per call.Considered and deliberately excluded (uncertain benefit or behavior tradeoffs): OCR result dedupe (the capture band contains the blinking caret, so "identical" content rarely produces identical pixels), debounce/interval tuning, OCR
.accurate->.fast, greedy sampling at low temp, menu-bar observation split, InputMonitor allocation tweaks.Validation
Engine (cotabbyinference @ 6e1a9ba):
App (this branch, resolved against engine 6e1a9ba):
Linked issues
Refs #661.
Risk / rollout notes
6e1a9ba. The package is pinnedbranch: mainwith an untrackedPackage.resolved, so CI and fresh checkouts resolve it automatically; locally, re-resolve packages (or deletePackage.resolved) to pick it up.setComputeLogprobis additive (a setter, not aSamplingConfigfield, precisely so existing memberwise initializers keep compiling), logprob defaults to ON engine-side, and the sequence-count reduction is invisible to an app that holds one sequence.gpu_layers = -1, the shipping config) CPU threads only orchestrate/sample, so throughput is unaffected; in a CPU-bound fallback, P-cores-only is llama.cpp's own Apple Silicon guidance (E-core stragglers stall every layer barrier). Validated end-to-end generation in the engine suite on this machine.xcodegen generate.Greptile Summary
This PR applies focused hot-path optimizations across three files: gating per-token logprob computation in
LlamaRuntimeCorebehind theconfidenceFloorcheck, caching display geometries inAXHelperwithdidChangeScreenParametersinvalidation, and short-circuiting<think>-block stripping and sortingscaffoldingLabelsonce at startup inSuggestionTextNormalizer.engine.setComputeLogprob(seqID, options.confidenceFloor > -.infinity)on both the fresh-sequence and KV-reuse paths, skipping two O(vocab) passes per generated token in the default (suppression-off) configuration.cachedDisplayGeometries(a static optional) populated on first call and cleared by a lazily-registeredNSApplication.didChangeScreenParametersNotificationobserver, eliminating repeatedNSScreen.screens+CGDisplayBoundswork per AX rect conversion at the focus-poll cadence.stripThinkBlockswith acontains(\"<think>\")fast-path and promotesscaffoldingLabels.sortedfrom per-call to astatic let, both on the per-prediction critical path.Confidence Score: 4/5
Safe to merge. All three changes are tightly scoped optimisations with no behaviour change in the default configuration.
The logprob-gating and normalizer changes are straightforward and correct. The display-geometry cache is well-structured with documented main-thread assumptions and a notification-based invalidation path. The one item worth a second look is the
setComputeLogprobcall on the KV-reuse path: the comment says the flag must be re-asserted per request regardless of incremental decoding, but the call is nested insideif !remaining.isEmpty, making that re-assertion conditional. Under currentreusableTokenCountsemanticsremainingis never empty, so this is harmless today, but it contradicts the stated invariant.Cotabby/Services/Runtime/LlamaRuntimeCore.swift — specifically the placement of
setComputeLogprobon the KV-reuse path.Important Files Changed
if !remaining.isEmptyguard that is unreachable in practice — a future change to reusableTokenCount semantics could silently skip it.Sequence Diagram
sequenceDiagram participant C as LlamaRuntimeCore participant E as LlamaEngine participant AX as AXHelper participant NS as NSScreen/CGDisplay Note over C,E: Per-request path (fresh sequence) C->>E: setForceWordContinuation(seqID, ...) C->>E: "setComputeLogprob(seqID, floor > -inf)" C->>E: decodePrompt(seqID, tokens) loop sampleNext C->>E: sampleNext(seqID) E-->>C: SampleResult end Note over C,E: Per-request path (KV reuse) C->>E: trimKV(seqID, reusableCount) C->>E: "setComputeLogprob(seqID, floor > -inf)" C->>E: decodePrompt(seqID, remaining) Note over AX,NS: Display geometry cache AX->>AX: "_ = displayChangeObserver (lazy init)" alt cache hit AX-->>AX: return cachedDisplayGeometries else cache miss AX->>NS: NSScreen.screens + CGDisplayBounds NS-->>AX: raw geometries AX->>AX: "cachedDisplayGeometries = geometries" end Note over AX: Invalidated by didChangeScreenParametersComments Outside Diff (1)
Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 376-392 (link)setComputeLogprobre-assertion belongs outside theif !remaining.isEmptyguard. Ifremainingwere ever empty (full KV hit), the sequence would carry its previous flag value into the next generation, silently enabling the O(vocab) work for a request that expects it off. Moving the call up matches the stated design intent ("re-assert per request") without changing current behaviour.Reviews (1): Last reviewed commit: "Skip discarded per-token logprob, cache ..." | Re-trigger Greptile