Cut inference and AX hot-path waste: P-core threads, gated logprob, halved KV, geometry cache by FuJacob · Pull Request #667 · FuJacob/cotabby

FuJacob · 2026-06-11T04:06:41Z

Summary

High-confidence performance work for #661: cuts per-token CPU, decode-thread oversubscription, resident KV RAM, and hot-path rebuild work, with no behavior or UX change in the shipping configuration. Pairs with engine commit 6e1a9ba on cotabbyinference main ("Cut idle inference costs: P-core threads, gated logprob, right-sized KV").

Engine side (cotabbyinference, already on main):

Decode threads now match the performance-core count (hw.perflevel0.physicalcpu, physical-core fallback on Intel) instead of all logical cores. Barriered matmul threads on E-cores stall every layer; with full Metal offload the CPU threads only orchestrate, so logical-core oversubscription was wasted wakeups.
SampleResult.logprob (two O(vocab) passes + a vocab-wide exp() per generated token) is now computed only when the sequence opts in via the new setComputeLogprob. Engine default stays ON, so existing callers are untouched.
MAX_SEQUENCES 4 -> 2 halves the shared KV allocation (n_ctx = window * MAX_SEQUENCES). The app holds at most one live sequence (destroy-before-create everywhere); 2 keeps a spare slot for tests/evals. Per-sequence window unchanged.
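
The thread-count rule above can be sketched as follows. This is a minimal illustration, not the engine's code: the helper names `sysctlInt` and `decodeThreadCount` are hypothetical, and the final fallback to `ProcessInfo` is an assumption added so the sketch also runs off macOS. Only the sysctl key `hw.perflevel0.physicalcpu` and the physical-core fallback come from the PR text.

```swift
import Foundation

/// Hypothetical helper: reads an integer sysctl by name.
/// Returns nil when the key is missing (for example on Intel Macs,
/// which have no perflevel keys) or when not running on Darwin.
func sysctlInt(_ name: String) -> Int? {
    #if canImport(Darwin)
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0, value > 0 else {
        return nil
    }
    return Int(value)
    #else
    return nil
    #endif
}

/// Hypothetical helper: decode threads = performance cores when known,
/// else physical cores, else whatever the process reports (assumed fallback).
func decodeThreadCount() -> Int {
    sysctlInt("hw.perflevel0.physicalcpu")
        ?? sysctlInt("hw.physicalcpu")
        ?? max(1, ProcessInfo.processInfo.activeProcessorCount)
}

print("decode threads:", decodeThreadCount())
```

On Apple Silicon the first key is expected to report the P-core count, so efficiency cores never join the per-layer barrier.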

App side (this PR):

LlamaRuntimeCore disables the engine's per-token logprob whenever confidenceFloor == -infinity — the shipping default, where ConfidenceSuppressionPolicy.shouldSuppress returns before ever reading the value. In the default config every generated token was paying ~2x vocab-size float ops plus a vocab-wide exp() to produce a number that was summed and discarded. Byte-identical suggestions; the confidence-suppression feature stays fully wired for anyone who raises the floor.
AXHelper.displayGeometries() is now cached and invalidated on didChangeScreenParameters. It was rebuilding NSScreen.screens + CGDisplayBounds for every AX rect conversion — many per focus resolve at the poll cadence — for identical results between display changes.
SuggestionTextNormalizer: <think>-block stripping now short-circuits with a contains check before any regex work (both patterns require the literal tag; most completions have none), and the scaffolding-label list is length-sorted once instead of per call.
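
To make the skipped work concrete, here is a minimal sketch of a gated per-token log-probability. The function names `shouldComputeLogprob` and `sampledLogprob` are hypothetical and the engine's real implementation is not shown in this PR. The sketch assumes the value is a log-softmax over raw logits, which matches the "two O(vocab) passes plus a vocab-wide exp()" description, and returns 0 when disabled, matching the engine test that `setComputeLogprob(false)` zeroes logprob.

```swift
import Foundation

/// The gate described above: logprob is only needed when a finite floor is set.
func shouldComputeLogprob(confidenceFloor: Double) -> Bool {
    confidenceFloor > -.infinity
}

/// Hypothetical sketch: log-probability of `token` under softmax(logits).
/// Returns 0 without reading the logits when `compute` is false.
func sampledLogprob(logits: [Float], token: Int, compute: Bool) -> Float {
    guard compute, logits.indices.contains(token) else { return 0 }
    // Pass 1 over the vocabulary: max logit, for numerical stability.
    var maxLogit = -Float.infinity
    for value in logits { maxLogit = max(maxLogit, value) }
    // Pass 2 over the vocabulary: one exp() per entry.
    var sumExp: Float = 0
    for value in logits { sumExp += exp(value - maxLogit) }
    // log softmax = (logit - max) - log(sum of shifted exponentials)
    return (logits[token] - maxLogit) - log(sumExp)
}

let gate = shouldComputeLogprob(confidenceFloor: -.infinity)
print(sampledLogprob(logits: [1.0, 2.0, 3.0], token: 2, compute: gate))
```

With the shipping floor of -infinity the gate is false, so both vocabulary passes are skipped for every token.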

Considered and deliberately excluded (uncertain benefit or behavior tradeoffs): OCR result dedupe (the capture band contains the blinking caret, so "identical" content rarely produces identical pixels), debounce/interval tuning, OCR .accurate -> .fast, greedy sampling at low temp, menu-bar observation split, InputMonitor allocation tweaks.

Validation

Engine (cotabbyinference @ 6e1a9ba):

COTABBY_TEST_MODEL_PATH=.../Qwen3-0.6B-Q4_K_M.gguf swift test
# 22 tests, 0 failures — includes the two-concurrent-sequence path under MAX_SEQUENCES=2,
# the default-on logprob assertion, and a new test that setComputeLogprob(false) zeroes logprob.

App (this branch, resolved against engine 6e1a9ba):

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build
# ** BUILD SUCCEEDED **

xcodebuild ... -only-testing:CotabbyTests/SuggestionTextNormalizerTests \
  -only-testing:CotabbyTests/DisplayCoordinateConverterTests \
  -only-testing:CotabbyTests/AXTextGeometryResolverTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO test
# ** TEST SUCCEEDED ** — 38 tests, 0 failures (3 pre-existing environment-gated skips)

swiftlint lint --quiet <changed files>
# exit 0

Linked issues

Refs #661.

Risk / rollout notes

Engine dependency. The Swift plumbing requires cotabbyinference main at or after 6e1a9ba. The package is pinned branch: main with an untracked Package.resolved, so CI and fresh checkouts resolve it automatically; locally, re-resolve packages (or delete Package.resolved) to pick it up.
Engine main is backward-compatible for other branches: setComputeLogprob is additive (a setter, not a SamplingConfig field, precisely so existing memberwise initializers keep compiling), logprob defaults to ON engine-side, and the sequence-count reduction is invisible to an app that holds one sequence.
Thread-count change is the one non-byte-identical item: decode threads drop from all logical cores (12 on M3 Max) to P-cores only. With Metal offload (gpu_layers = -1, the shipping config) CPU threads only orchestrate/sample, so throughput is unaffected; in a CPU-bound fallback, P-cores-only is llama.cpp's own Apple Silicon guidance (E-core stragglers stall every layer barrier). Validated end-to-end generation in the engine suite on this machine.
KV reduction halves resident inference RAM at load time; per-sequence context window is unchanged, so no prompt that fit before can stop fitting.
No schema, settings, or pbxproj changes; no new files, so no xcodegen generate.
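
The KV arithmetic behind the halving can be checked with a short sketch. The relation `n_ctx = window * MAX_SEQUENCES` is from the summary above. The window value of 4096 is a hypothetical placeholder (the PR does not state the real one), and the sketch assumes the KV allocation scales linearly with `n_ctx`.

```swift
/// Shared context length as described in the PR: window * MAX_SEQUENCES.
func sharedContextLength(window: Int, maxSequences: Int) -> Int {
    window * maxSequences
}

let window = 4096  // hypothetical per-sequence window, for illustration only
let before = sharedContextLength(window: window, maxSequences: 4)
let after = sharedContextLength(window: window, maxSequences: 2)

// The per-sequence window is the same in both cases; only the shared total shrinks.
print(before, after, Double(after) / Double(before))  // prints "16384 8192 0.5"
```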

Greptile Summary

This PR applies focused hot-path optimizations across three files: gating per-token logprob computation in LlamaRuntimeCore behind the confidenceFloor check, caching display geometries in AXHelper with didChangeScreenParameters invalidation, and short-circuiting <think>-block stripping and sorting scaffoldingLabels once at startup in SuggestionTextNormalizer.

LlamaRuntimeCore.swift: Calls engine.setComputeLogprob(seqID, options.confidenceFloor > -.infinity) on both the fresh-sequence and KV-reuse paths, skipping two O(vocab) passes per generated token in the default (suppression-off) configuration.
AXHelper.swift: Adds cachedDisplayGeometries (a static optional) populated on first call and cleared by a lazily-registered NSApplication.didChangeScreenParametersNotification observer, eliminating repeated NSScreen.screens + CGDisplayBounds work per AX rect conversion at the focus-poll cadence.
SuggestionTextNormalizer.swift: Guards stripThinkBlocks with a contains("<think>") fast-path and promotes scaffoldingLabels.sorted from per-call to a static let, both on the per-prediction critical path.
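
The two normalizer changes can be sketched as below. This is an illustration under stated assumptions, not the code in SuggestionTextNormalizer: the type name `NormalizerSketch`, the label list, and the single regex pattern are all hypothetical (the PR mentions two patterns but does not show them).

```swift
import Foundation

enum NormalizerSketch {
    // Hypothetical labels; the real scaffolding list is not shown in the PR.
    static let labels = ["Output", "Output:", "Completion:"]

    /// Sorted longest-first once, on first access, instead of on every call.
    /// Longest-first matters when one label is a prefix of another.
    static let labelsByLength = labels.sorted { $0.count > $1.count }

    static func stripThinkBlocks(_ text: String) -> String {
        // Fast path: the pattern requires the literal tag, so skip regex work without it.
        guard text.contains("<think>") else { return text }
        // Assumed pattern: removes closed <think>...</think> blocks, spanning newlines.
        return text.replacingOccurrences(
            of: "<think>[\\s\\S]*?</think>",
            with: "",
            options: .regularExpression)
    }

    static func stripLeadingLabel(_ text: String) -> String {
        for label in labelsByLength where text.hasPrefix(label) {
            return String(text.dropFirst(label.count))
        }
        return text
    }
}

print(NormalizerSketch.stripThinkBlocks("<think>plan</think>done"))
```

Because most completions contain no tag, the `contains` check is the only work done on the common path.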

Confidence Score: 4/5

Safe to merge. All three changes are tightly scoped optimisations with no behaviour change in the default configuration.

The logprob-gating and normalizer changes are straightforward and correct. The display-geometry cache is well-structured with documented main-thread assumptions and a notification-based invalidation path. The one item worth a second look is the setComputeLogprob call on the KV-reuse path: the comment says the flag must be re-asserted per request regardless of incremental decoding, but the call is nested inside if !remaining.isEmpty, making that re-assertion conditional. Under current reusableTokenCount semantics remaining is never empty, so this is harmless today, but it contradicts the stated invariant.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift — specifically the placement of setComputeLogprob on the KV-reuse path.

Important Files Changed

Filename	Overview
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Adds setComputeLogprob gating on both fresh and reuse paths; correct for all current inputs, but the re-assertion on the reuse path is nested inside an `if !remaining.isEmpty` guard. The empty case is unreachable in practice today, but a future change to reusableTokenCount semantics could silently skip the call.
Cotabby/Support/AXHelper.swift	Adds a lazily-initialized display-geometry cache with correct notification-based invalidation; main-thread assumption is documented and upheld by existing callers.
Cotabby/Support/SuggestionTextNormalizer.swift	Guards stripThinkBlocks with contains() fast-path and hoists scaffolding label sort to a static let; both correct.

Sequence Diagram

sequenceDiagram
    participant C as LlamaRuntimeCore
    participant E as LlamaEngine
    participant AX as AXHelper
    participant NS as NSScreen/CGDisplay

    Note over C,E: Per-request path (fresh sequence)
    C->>E: setForceWordContinuation(seqID, ...)
    C->>E: "setComputeLogprob(seqID, floor > -inf)"
    C->>E: decodePrompt(seqID, tokens)
    loop sampleNext
        C->>E: sampleNext(seqID)
        E-->>C: SampleResult
    end

    Note over C,E: Per-request path (KV reuse)
    C->>E: trimKV(seqID, reusableCount)
    C->>E: "setComputeLogprob(seqID, floor > -inf)"
    C->>E: decodePrompt(seqID, remaining)

    Note over AX,NS: Display geometry cache
    AX->>AX: "_ = displayChangeObserver (lazy init)"
    alt cache hit
        AX-->>AX: return cachedDisplayGeometries
    else cache miss
        AX->>NS: NSScreen.screens + CGDisplayBounds
        NS-->>AX: raw geometries
        AX->>AX: "cachedDisplayGeometries = geometries"
    end
    Note over AX: Invalidated by didChangeScreenParameters
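
The cache branch of the diagram can be sketched without AppKit. The types `DisplayGeometry` and `GeometryCache` are hypothetical stand-ins. In the app, the rebuild step reads NSScreen.screens and CGDisplayBounds, and `invalidate()` would be called from the didChangeScreenParameters observer. The sketch assumes single-threaded access, in line with the main-thread assumption noted in the review.

```swift
import Foundation

/// Hypothetical stand-in for the per-display geometry the app caches.
struct DisplayGeometry: Equatable {
    let id: UInt32
    let width: Double
    let height: Double
}

/// Sketch of a lazily filled cache with explicit invalidation.
/// Not thread-safe: assumes every call happens on one thread.
final class GeometryCache {
    private var cached: [DisplayGeometry]?
    private let build: () -> [DisplayGeometry]

    init(build: @escaping () -> [DisplayGeometry]) {
        self.build = build
    }

    /// Cache hit returns the stored value; cache miss rebuilds and stores it.
    func geometries() -> [DisplayGeometry] {
        if let hit = cached { return hit }
        let fresh = build()
        cached = fresh
        return fresh
    }

    /// Called when the display configuration changes.
    func invalidate() {
        cached = nil
    }
}
```

A caller constructs the cache once with the expensive rebuild closure, reads through `geometries()` on every rect conversion, and calls `invalidate()` only from the display-change handler.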

Comments Outside Diff (1)

Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 376-392 (link)

The setComputeLogprob re-assertion belongs outside the if !remaining.isEmpty guard. If remaining were ever empty (full KV hit), the sequence would carry its previous flag value into the next generation: a stale "on" silently re-enables the O(vocab) work for a request that expects it off, and a stale "off" leaves a request that raised the floor with a zeroed logprob. Moving the call up matches the stated design intent ("re-assert per request") without changing current behaviour.
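
The suggested placement can be sketched as follows. The protocol `SequenceEngine`, the function `prepareReusedSequence`, and the recording mock are hypothetical, reduced to the three calls shown in the sequence diagram. The actual method signatures in LlamaRuntimeCore and the engine may differ.

```swift
import Foundation

/// Hypothetical slice of the engine surface used on the KV-reuse path.
protocol SequenceEngine {
    func trimKV(_ seqID: Int, _ keep: Int)
    func setComputeLogprob(_ seqID: Int, _ enabled: Bool)
    func decodePrompt(_ seqID: Int, _ tokens: [Int])
}

/// Sketch of the reuse path with the flag re-asserted outside the guard.
func prepareReusedSequence(engine: SequenceEngine, seqID: Int, reusableCount: Int,
                           remaining: [Int], confidenceFloor: Double) {
    engine.trimKV(seqID, reusableCount)
    // Unconditional: runs even when there is nothing left to decode.
    engine.setComputeLogprob(seqID, confidenceFloor > -.infinity)
    if !remaining.isEmpty {
        engine.decodePrompt(seqID, remaining)
    }
}

/// Recording mock, used only to show the call order.
final class RecordingEngine: SequenceEngine {
    var calls: [String] = []
    func trimKV(_ seqID: Int, _ keep: Int) { calls.append("trimKV") }
    func setComputeLogprob(_ seqID: Int, _ enabled: Bool) {
        calls.append("setComputeLogprob(\(enabled))")
    }
    func decodePrompt(_ seqID: Int, _ tokens: [Int]) { calls.append("decodePrompt") }
}
```

With an empty `remaining`, the mock still records the `setComputeLogprob` call, which is the invariant the review comment asks for.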


…lizer statics

App half of the #661 performance pass; pairs with cotabbyinference 6e1a9ba (P-core decode threads, gated logprob, halved KV allocation).

- LlamaRuntimeCore now tells the engine to skip per-token log-probabilities whenever confidenceFloor == -infinity, the shipping default where ConfidenceSuppressionPolicy returns before reading the value. Every generated token was paying two O(vocab) passes plus a vocab-wide exp() to produce a number that was summed and discarded. Suggestions are byte-identical; raising the floor re-enables the computation per request.
- AXHelper.displayGeometries() is cached and invalidated on didChangeScreenParameters instead of rebuilding NSScreen.screens + CGDisplayBounds for every AX rect conversion at the focus-poll cadence.
- SuggestionTextNormalizer: <think>-stripping short-circuits on a contains check before compiling either regex (both patterns require the literal tag), and the scaffolding-label list is length-sorted once, statically, instead of on every prediction.

FuJacob merged commit 1a42cf2 into main Jun 11, 2026
4 checks passed

FuJacob mentioned this pull request Jun 11, 2026

fix(insertion): IME-safe accept so Japanese/CJK suggestions land on Tab #668

Merged

FuJacob merged 1 commit into main from perf-inference-hot-paths
