Skip LLM for description when the page already provides one by biafra23 · Pull Request #13 · biafra23/URLVault

biafra23 · 2026-04-29T17:06:55Z

Summary

generateDescription previously called into the LLM unconditionally for every provider (AICore, llama.cpp, LEAP, LiteRT-LM). But most bookmarkable pages already carry a publisher-written 1-2 sentence summary in <meta property="og:description"> or <meta name="description"> — these are tuned for social cards / SERP display and almost always better than what a small on-device model would produce by rewriting them.

Apply the same short-circuit generateTitle already uses for <title>:

val nativeDesc = pageContent?.let { it.ogDescription ?: it.metaDescription }
if (!nativeDesc.isNullOrBlank()) {
    return@runCatching validateDescription(nativeDesc.trim())
}
// only here does the LLM fire — with visibleText as source

Why this matters per-runtime

AICore (Gemini Nano) — saves a few seconds per generation on devices where it works at all.
llama.cpp — saves multi-second CPU work for the 80%+ of pages with metadata.
LEAP / LFM2-Extract — biggest win. LFM2-Extract is fine-tuned for extraction, not generation. Asking it to "rewrite" an already-good summary often produced degenerate output ({"description":":\",\",..."}) because the grammar's minLength: 1 forced something. With this change, it only fires when extraction from visibleText (the long-form body) is genuinely needed.
LiteRT-LM — on Pixel Tensor CPU, ~70 s tag generations collapse to <100 ms native lookups for pages with metadata.

What stays unchanged

The looksDegenerate rejection in LeapModelProvider still applies on the LLM-fallback path for pages without metadata.
The debug provenance tag (e.g. liteRt[CPU]:gemma-3-1b-it-int4:N.NNs) only fires on the LLM path now — its absence from a saved bookmark signals the description came from the page itself, which is a useful signal for the talk demo.
The fall-through prompts now describe their input as "Page text:" instead of "Page summary:" since visibleText is body text, not a pre-written summary.

Branched off `claude/ai-activity-status-line`

This branch is based on PR #12 (the AI router status strip / HF token / LiteRT fixes) so it has the latest device fixes. After #12 merges, this PR's diff will shrink to just the description short-circuit.

Test plan

Bookmark a page with og:description (e.g. a blog post on hashnode / medium / substack). Confirm the description field populates instantly with the publisher's text and no debug provenance tag appears in the saved tags.
Bookmark a page with only <meta name="description"> (e.g. an old static site). Same expected outcome.
Bookmark a page with neither (e.g. a raw HTML file or single-page-app shell). Confirm the LLM fires (status strip shows description: …) and the saved bookmark carries the debug provenance tag.
On LEAP specifically: trigger description for both metadata-rich and metadata-bare pages. Bare-page output should still pass through looksDegenerate rejection if the model generates garbage.
Tags continue to be LLM-generated regardless of metadata (tags are a different task and the page rarely provides good ones directly).

The Toast spammed the screen on every generateXxx call and obscured the bookmark form so the user couldn't interact while it was visible. Replace it with a slim Material strip pinned to the bottom of the screen that: - shows "<action>: <model>…" with a spinner while inference runs - shows "<action> via <model> — 1.23 s" on completion - shows "<action>: no model ready (<reason>)" on failure - auto-hides 3.5 s after Completed, 5 s after NoProvider To time the call, LocalModelRouter now wraps each provider call with a wall-clock measurement and emits a new RouteEvent.Completed with durationMs + success. Picked is still emitted up front for the in-flight state. Layout: the strip lives as the last child of the outer Column with a weighted Box wrapping the active screen, so when AnimatedVisibility expands the strip from 0-height the screen above is pushed up — the Save / Cancel button row stays visible and responsive. The outer Column consumes the navigation bar inset once via navigationBarsPadding(), so neither the screen's BottomAppBar nor the strip applies its own gesture-pill padding (which previously caused the BottomAppBar's own box to grow when the strip slid in). DEBUG-only, matching the gating of the previous Toast.

Three connected changes for downloading gated LiteRT-LM / Gemma bundles: 1. Stop sending Authorization on cross-origin redirects in ModelDownloadManager. HF 302s gated downloads to a pre-signed cas-bridge.xethub.hf.co URL that already authenticates via query-string token; an extra Bearer header makes the CDN return 401. Track the original host and only attach the header while the request is on it. 2. Add a Settings UI for the HF token (HuggingFaceTokenRow in SettingsScreen). The infrastructure (LocalModelPreferences.saveHfToken / loadHfToken, EncryptedSharedPreferences) already existed but no screen ever wrote the value. The row also surfaces a build-time fallback so users of CI builds don't need to paste anything. 3. Bake an optional default into the APK from `HF_TOKEN`: - androidApp/build.gradle.kts reads HF_TOKEN env var or the `hfToken` property in <repo>/local.properties (gitignored) and emits BuildConfig.HF_TOKEN_DEFAULT, with a non-token-character filter so the generated string literal is always safe. - LocalModelPreferences.loadHfToken() now falls back to the BuildConfig field when the user hasn't saved one. - build.yml and release.yml workflows pass HF_TOKEN through from a GitHub repo secret. Both still build with the secret absent. Plus two unrelated polish items folded in: - BookmarkListScreen: replace the 🔄 / ⚙️ emoji in the top app bar with Icons.Default.Sync / Icons.Default.Settings (added compose.materialIconsExtended to shared/commonMain). - README: new "Hugging Face Token (gated models)" section explaining the user-entered, env-var, and local.properties paths plus the one-time per-repo licence acknowledgement step.

…OLDEST - runTimed now uses System.nanoTime() instead of System.currentTimeMillis() so wall-clock jumps (NTP / manual changes) can't produce negative or inflated durationMs. Pointed out by Gemini and Copilot independently. - runTimed wraps the call in try/finally so RouteEvent.Completed is always emitted, including when block() throws (notably CancellationException, which runCatching re-raises). Without this, a cancelled call left the status strip stuck in "Running…" until the next event fired. Matches the KDoc claim that Completed fires after the call "returns or throws". - _events now uses BufferOverflow.DROP_OLDEST. With Picked + Completed pairs per call, a slow / backgrounded collector could overflow the 16-slot buffer; tryEmit dropping the latest event would leave the UI stuck. With DROP_OLDEST, tryEmit always succeeds and the consumer eventually catches up to current state. - AiActivityStatusLine KDoc rewritten to match the actual usage in MainActivity (last child of a Column, content above given weight(1f)) rather than the original "in an overlaying Box" advice that would reintroduce the obscuring behaviour the strip was meant to fix. The Copilot suggestion to make `block: suspend () -> Result<T>` (and crossinline) was checked but not applied — `runTimed` is `suspend inline` already, which is exactly the construct that lets a non-suspending lambda parameter call suspending provider methods (the body is inlined into the suspend caller). The existing form compiles and runs correctly.

Every entry in ModelCatalog is text-only, but the bridge was passing `visionBackend = backend` and `audioBackend = backend` to EngineConfig. The SDK then probes for vision/audio sections in the .litertlm bundle during initialize() and aborts text-only models with: Failed to create engine: NOT_FOUND: TF_LITE_VISION_ENCODER not found in the model. Affected every text-only LiteRT-LM model — FunctionGemma 270M, Gemma 3 270M, Qwen3 0.6B, etc. — on every backend (NPU → GPU → CPU all fail the same way), so load() always threw. Set visionBackend / audioBackend to null. When a true multi-modal Gemma 4 E2B bundle lands later, set them per-entry instead of bridge- wide.

Observed in the wild: for a page where LFM2-Extract had little to extract from, the bridge returned {"description":":\",\",\",\",\",\",\",\",\",\",\",\",..."} JSON shape is valid (the grammar constraint did its job) but the value is degenerate punctuation repeated until the maxLength budget runs out — the sampler picks whatever satisfies the grammar when the model has nothing meaningful to say. Two defences: 1. `looksDegenerate` heuristic — checks letter ratio, distinct-char count, and minimum length. Rejected output throws and the UI surfaces "AI description generation failed" instead of persisting `:","",...` as the bookmark's description. Applied to both description and title (tags pass through length / charset filters that would already catch this). 2. Strengthened the description prompt to: - explicitly require natural-language sentences and forbid punctuation-only output; - give an exact canonical fallback ("No summary available.") for when the supplied text doesn't have enough to extract, so the model has a way out instead of being cornered by the grammar. Note that detection is the safety net; the prompt change is what we hope reduces the failure rate. The pre-existing comment already warned about this exact mode — it had regressed for thin pages.

The description-success LaunchedEffect previously chained tag generation only on success. With degenerate-output rejection now landing on the failure branch (and observed: same URL where LEAP's description sampler went sideways produced perfectly clean tags), falling back to "no tags either" leaves the user with nothing. Tags are an independent extraction and don't need the description as context. On description error, kick them off from URL + title alone.

When a URL is shared, MainActivity composes with aiCoreEnabled briefly false because `anyProviderReady` is computed via produceState on Dispatchers.Default and hasn't finished by the time LaunchedEffect(prefilledUrl) fires in AddEditBookmarkScreen. That sent the first trigger down the legacy metadata-extraction branch, and the single-key LaunchedEffect never re-ran when aiCoreEnabled later flipped to true — leaving the user with regex-extracted tags and no AI output. Two prior fixes contributed to widening the race window: - tightening produceState keys (removed downloadStates) so the read is deferred to the first true state change rather than every tick, - moving the readiness probe off the main thread. Both correct individually; together they make the startup gap visible. Add aiCoreEnabled to the LaunchedEffect keys and pass force=true so the second invocation bypasses the aiTriggeredForUrl dedup. The legacy result-handling LaunchedEffects already gate their applies on `if (!aiCoreEnabled)`, so once AI takes over the racing legacy results are discarded harmlessly.

Adds a synthetic last tag of the form `dbg:<provider-id>@<HH:mm:ss>` to every successful generateTags call in DEBUG builds, so a glance at the saved bookmark tells you which SDK / model variant produced those tags and when — useful when comparing LEAP vs llama.cpp vs LiteRT vs AICore live during the talk demo. Examples in saved bookmarks: dbg:leap:lfm2-1.2b-extract@14:13:14 dbg:litertlm:qwen3-0.6b@14:14:02 dbg:mlkit:gemini-nano-active@14:14:48 Stripped at the router level in release builds so synced Bitwarden entries don't carry `dbg:…` tags into production. Only affects the bookmark form path (BookmarkViewModel → router → provider). The ModelComparisonRunner calls providers directly and is unaffected — its UI already labels each row by provider.

Was: dbg:leap:lfm2-1.2b-extract@14:13:14 (wall-clock time of generation) Now: dbg:leap@2.34s (how long generation took) Short SDK names map ModelRuntime to: ML_KIT -> aicore LLAMA_CPP -> llama LEAP -> leap MEDIAPIPE -> liteRt Duration formatting: ms under 1s, two-decimal seconds above. Hand- rolled rather than String.format to avoid host-locale dependence.

The ProviderResultCard rendered `result.runtime.name`, leaking the enum constant. `ModelRuntime.MEDIAPIPE` is a historical leftover from when the LiteRT-LM bundle was loaded via MediaPipe-LLM; the runtime today is LiteRT-LM, so display it as such. Added a `runtimeLabel()` helper that mirrors the same mapping already in SettingsScreen.

Without a BackHandler, the Activity finishes on system back and the user is dropped out to the home screen — not what tapping the in-screen back arrow does. Mirror the same per-screen routing as the arrow buttons: Comparison -> Settings, Settings/AddEdit -> List. Disabled on List so the OS default (close the app) still applies at the root.

Symptom on Pixel 7a (Tensor G2): the GPU backend's `Engine.initialize()` succeeds fine, then the first `Session.generateContent` call throws com.google.ai.edge.litertlm.LiteRtLmJniException: Failed to generate content: UNKNOWN: Can not find OpenCL library on this device Cause: LiteRT-LM's Top-K sampler dlopens OpenCL even when the engine runs through WebGPU, and Tensor doesn't ship OpenCL drivers. The sampler-factory log shows it falling back to a statically-linked OpenCL sampler, which then fails the same way. Generation is fatal; load was happy. Fix: catch the failure in runCollect, blocklist the broken backend for the lifetime of the process, reload the same model on the next candidate (NPU → GPU → CPU strategy means we land on CPU here), and retry generation once. We hold the mutex throughout so no other caller can race in mid-recovery. Refactored: - `load()` body extracted into `loadInternalLocked()` so recovery can re-enter it without dropping the mutex, - `runCollect()` now wraps `runCollectOnce()` with the catch+retry, - `runtimeBlockedBackends` set is filtered out at the start of `loadInternalLocked()` and surfaces a clear "all backends blocked" error if we exhaust them. Detection is narrow today (`isRecoverableRuntimeError` only matches 'opencl'); widen as new device-specific failures show up.

Previous OpenCL self-heal (b8cc107) closed the broken GPU engine and reloaded on the next backend in the same mutex hold, then retried the failed generate call. Engine.close() doesn't release the GPU pipeline's native memory synchronously, so the in-flight reload of the CPU engine held both pipelines in RAM briefly. Pixel 7a / Tensor G2 peaked at 5957197824 B (~5.96 GB), the LMK reaped the app: am_pss : [12767, ..., 5957197824, ...] killinfo: [12767, ...] Process com.jaeckel.urlvault (pid 12767) has died: fg TOP User saw it as "LiteRt now crashes the app" with no FATAL exception in logcat — classic LMK signature. New behaviour: catch the OpenCL-style runtime failure, blocklist the broken backend for the session, release the engine, **fail this call**. The very next entry-point call (provider.generateXxx → bridge.load) sees engine == null and runs a fresh load through the strategy with the blocklist applied — single-pipeline peak, no concurrent two-engine window. UX is one failed generation followed by a working one.

The SDK auto-selects the Top-K sampler based on engine backend; there's no public API knob to override it. With Backend.GPU, the SDK pulls the WebGPU/OpenCL sampler chain, which on Pixel Tensor (7a / 8 / 9) all ends at OpenCL — including the static fallback baked into liblitertlm_jni.so. Generation throws `Can not find OpenCL library on this device` even when the engine itself is happily running through WebGPU. Deprioritising GPU below CPU in the strategy makes the *first* call on Tensor land directly on `Backend.CPU`, which uses the CPU sampler and works. The previous order (NPU → GPU → CPU) made the first call always fail before recovery kicked in (and earlier in this branch, recovery itself OOMed the process at peak memory). Cost: on devices where GPU sampling works (older Snapdragon w/ proper OpenCL drivers), we miss the GPU speedup. Acceptable — GPU is a hypothetical performance win for the model graph; correctness on widely- deployed Pixel hardware is more important. The runtime self-heal in runCollect still catches the OpenCL error if a user forces GPU via a custom strategy.

Two changes after running on a Pixel 10 Pro Fold and seeing 70 s tag generation, with no easy way to tell whether GPU acceleration was even in play: 1. Restore the original NPU → GPU → CPU strategy. The earlier reorder (3300669) was an over-correction for the Pixel 7a OpenCL bug. With the OOM-free recovery now in place (3f75958, blocklist + throw, no inline retry) the trade-off goes the other way: G2 takes one failed first call before settling on CPU; G5 / 10 Pro Fold lands on GPU directly and gets the speedup. Always-CPU was strictly worse for the device class with working OpenCL/WebGPU. 2. Surface the actual loaded backend in the debug provenance tag for LiteRT-LM. The saved bookmark now carries liteRt[GPU]:gemma-3-1b-it-int4:2.34s instead of liteRt:gemma-3-1b-it-int4:70.20s so "is this NPU/GPU/CPU?" is answerable at a glance, no logcat needed. AICore / llama.cpp / Leap don't expose a comparable backend distinction in this app, so the suffix only fires for LiteRT. Plumbing: LiteRtLmNativeBridge gains `currentBackendLabel()` (default null), LiteRtLmSdkBridge returns its `currentBackend` field, LiteRtLmModelProvider exposes a public method that calls into the bridge, and the router casts to LiteRtLmModelProvider to read it when building the tag.

Subsequent shares of the same URL produced two debug provenance tags on the saved bookmark, with different durations — proof of two real generation runs, not a duplicate state callback. Cause: the share-intent LaunchedEffect was keyed on `(prefilledUrl, aiCoreEnabled)` and called `triggerAiForUrl(force=true)`. The `force=true` was needed for the startup race (legacy first, AI second when the readiness probe settled) but it bypassed the URL dedup *unconditionally* — any unrelated recomposition that flipped `aiCoreEnabled` (or any other state Compose decided to invalidate the effect on) re-fired the whole AI flow. Description ran twice → each description-success chained tag generation → two `router.generateTags` calls → two debug tags. Fix: track the *mode* the URL was last triggered in (`"ai"` / `"legacy"`) alongside the URL itself, and dedup on the pair. Same URL + same mode = no-op. Legacy → AI is a real mode change and falls through (preserves the startup-race fix). The share-intent effect drops `force = true`; the dedup now does the right thing on its own. Other call sites that explicitly clear `aiTriggeredForUrl = null` to allow a re-trigger continue to work — null != targetUrl short-circuits the first AND clause, so the mode comparison never gates them.

Earlier (3300669 → 49a004e) I bounced this back and forth: deprioritised GPU after the Pixel 7a / Tensor G2 OpenCL failure, then re-prioritised it on the assumption that newer Tensor (G5 / Pixel 10 Pro Fold) might have proper OpenCL drivers. Turns out it doesn't — the user reports the same `Can not find OpenCL library on this device` on the Fold, so the LiteRT-LM 0.10.x GPU sampler is broken on every Pixel Tensor we've tested. Final strategy (and the right one until/unless we get a working-OpenCL device to test on): NPU → CPU → GPU. NPU rarely works (vendor libs aren't packaged in the LiteRT-LM AAR for any Pixel we've seen) and falls through to CPU, which has a CPU sampler and works correctly. GPU stays in the list as a last resort for hypothetical future devices, gated behind the runtime self-heal that blocklists it on the OpenCL error if anyone forces GPU via a custom strategy. Saves ~5–10 s of cold-start time on every first generation on the user's two Pixel devices.

Every provider's generateDescription previously called into the model unconditionally. But for any page with `og:description` or `<meta name="description">` (i.e. most pages worth bookmarking), the publisher already wrote a 1-2 sentence summary tuned for social-card / SERP display — the LLM can't beat that, and asking it to "rewrite" the existing string is wasted work that often degrades the result. Mirror the same short-circuit `generateTitle` already uses for `<title>` across all four providers (AICore, llama.cpp, LEAP, LiteRT-LM): if pageContent has og:description or meta description: return it verbatim (after URL stripping + length truncation) else: LLM as before, with `pageContent.visibleText` as the source Particularly important for LEAP: LFM2-Extract is fine-tuned for extraction, not generation. Asking it to summarise an already-good summary often produced degenerate output (`{"description":":\",\",..."}`) because the grammar's `minLength: 1` cornering forced *something* even when the model had nothing to add. With this change, LFM2-Extract only fires for pages where extraction is genuinely needed (and from `visibleText` — the actual long-form body — rather than a pre-written summary). The fall-through prompts now describe their input as "Page text:" instead of "Page summary:" since visibleText is body text, not a summary. Side effects: - ~70 s LLM calls on Tensor CPU collapse to <100 ms native lookups for the 80%+ of pages with metadata. - The debug provenance tag (e.g. `leap:lfm2-1.2b-extract:N.NNs`) only fires on the LLM path now — its absence from a saved bookmark signals the description came from the page itself. - The `looksDegenerate` LEAP rejection path stays in place for the visible-text-only case where the failure mode can still surface.

gemini-code-assist

Code Review

This pull request introduces support for Hugging Face gated models by allowing users to provide an access token via settings, environment variables, or local properties. It optimizes AI description generation by short-circuiting when page metadata is available and implements heuristics to detect degenerate model output. The LiteRT-LM backend strategy was updated to prioritize CPU over GPU on Tensor devices to avoid OpenCL-related failures, and a new debug UI component replaces toasts for tracking AI activity. Feedback was provided regarding offloading native resource cleanup to a background thread to prevent potential blocking of the calling thread.

gemini-code-assist · 2026-04-29T17:10:02Z

+                        "blocklisting; next request will reload on remaining backends.",
+                )
+                runtimeBlockedBackends += brokenBackend
+                runCatching { engine?.close() }


The engine?.close() call should be offloaded to Dispatchers.IO to avoid blocking the calling thread (which might be the main thread) with potentially heavy native resource cleanup. This would be consistent with how the engine is closed in loadInternalLocked and unload.

Suggested change

runCatching { engine?.close() }

withContext(Dispatchers.IO) { runCatching { engine?.close() } }

biafra23 added 20 commits April 29, 2026 12:04

Drop dbg: prefix from debug provenance tag

527f0fa

Include model name in debug provenance tag

7e91f44

gemini-code-assist Bot reviewed Apr 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip LLM for description when the page already provides one#13

Skip LLM for description when the page already provides one#13
biafra23 wants to merge 20 commits into
mainfrom
claude/description-short-circuit

biafra23 commented Apr 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	runCatching { engine?.close() }
	withContext(Dispatchers.IO) { runCatching { engine?.close() } }

Conversation

biafra23 commented Apr 29, 2026

Summary

Why this matters per-runtime

What stays unchanged

Branched off claude/ai-activity-status-line

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Branched off `claude/ai-activity-status-line`