Local speech, video transcription, and voice mode #4
Open
stevederico wants to merge 46 commits into
Updated the README to include a new screenshot section with a table layout.
…-4 Mini support, privacy manifest, corrupt file detection, download persistence, updated llama.cpp engine, and new README following readme-standards
- Auto-download default SmolLM3 when no models are installed
- Fix catalog sync for on-disk GGUF files; add Phi-4 Mini to catalog
- Make loadModel async/throws with published modelLoadError
- Throw on llama_decode failures and surface errors in chat
- Restore conversation KV cache after title generation via encodePrompt
- Fix model picker active check; Gemma/Phi RAM gates; privacy link
Replace older SmolLM3, Phi-4, LFM2.5, and Ministral entries with Qwen3.5, Gemma 3, Qwen3-4B, and Llama 3.2 3B. Keep Gemma 4 E2B as default. Update RAM gates, README, and license acknowledgements.
Fix review issues and refresh 2026 on-device model catalog
Introduces a Speech module with on-device-only recognition guards, audio extraction/chunking for long videos, and live mic voice input. Video import attaches transcripts to new chats for grounded LLM Q&A.
- Checkpointed transcription jobs with background task + chunk resume
- Unload LLM during transcribe; reload after
- TranscriptView with copy/share; attachment banner
- On-device TTS for replies (Settings toggle)
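A minimal sketch of how chunk-level checkpointing with resume could work. This is not the PR's actual code: `TranscriptionCheckpoint`, `runTranscriptionJob`, and the `save`/`transcribe` closures are hypothetical names introduced for illustration, and the real job presumably also persists the checkpoint to disk and runs inside a background task.

```swift
import Foundation

/// Hypothetical checkpoint record; Codable so it could be persisted between launches.
struct TranscriptionCheckpoint: Codable, Equatable {
    var completedChunks: Int = 0
    var chunkTexts: [String] = []
}

/// Transcribes the remaining chunks, saving the checkpoint after each one.
/// If `transcribe` throws, the checkpoint still reflects every chunk that
/// finished, so a later call resumes at the first unfinished chunk.
func runTranscriptionJob(
    chunkCount: Int,
    checkpoint: inout TranscriptionCheckpoint,
    save: (TranscriptionCheckpoint) -> Void,
    transcribe: (Int) throws -> String
) rethrows {
    // Nothing left to do if a previous run already finished every chunk.
    guard checkpoint.completedChunks < chunkCount else { return }

    for index in checkpoint.completedChunks..<chunkCount {
        let text = try transcribe(index)
        checkpoint.chunkTexts.append(text)
        checkpoint.completedChunks = index + 1
        save(checkpoint) // assumption: the caller writes this to durable storage
    }
}
```

Saving after every chunk trades a little I/O for the guarantee that an interrupted job repeats at most one chunk of work.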
Restore com.example.silo bundle ID and empty DEVELOPMENT_TEAM (Xcode had written personal team/bixbyapps IDs locally). Gitignore local signing artifacts and document optional Signing.local.xcconfig for device builds.
Default download and model resolution use LFM2.5 (~1.2 GB) on sim, ignore files over 1.6 GB, auto-bootstrap when only Gemma is present, and cap context/threads with clearer load errors.
Skip requiresOnDeviceRecognition guard in sim so video transcription can be tested.
Persist failureMessage with a red banner, mirror to videoImportError under the input, and stop loadModel from wiping messages after transcription.
Stale prefix cache mismatched re-tokenized assistant history. Clear memory between turns, fall back to full prompt init on decode error, and retry once.
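The fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `PromptDecoder`, `decodeTurn`, and `FlakyDecoder` are hypothetical names, and the real code calls into llama.cpp rather than a protocol like this one.

```swift
import Foundation

struct DecodeError: Error {}

/// Hypothetical stand-in for the inference context.
protocol PromptDecoder {
    mutating func clearMemory()
    mutating func decode(_ tokens: [Int]) throws
}

/// Tries an incremental decode that relies on the cached prefix. On failure,
/// clears the cache and retries exactly once with the full prompt; a second
/// failure propagates to the caller.
func decodeTurn<D: PromptDecoder>(
    _ decoder: inout D,
    fullPrompt: [Int],
    newTokens: [Int]
) throws {
    do {
        try decoder.decode(newTokens)
    } catch {
        decoder.clearMemory()
        try decoder.decode(fullPrompt)
    }
}

/// Test double that fails a fixed number of times before succeeding.
struct FlakyDecoder: PromptDecoder {
    var failuresLeft: Int
    var clearCount = 0
    var decoded: [[Int]] = []

    mutating func clearMemory() { clearCount += 1 }

    mutating func decode(_ tokens: [Int]) throws {
        if failuresLeft > 0 {
            failuresLeft -= 1
            throw DecodeError()
        }
        decoded.append(tokens)
    }
}
```

Limiting the retry to a single attempt keeps a persistently failing decode from looping; the error then surfaces to the caller.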
Use partial results and task completion fallback, dictation hint, modern audio track loading, en-US retry, and clearer no-speech error guidance.
Show VideoTranscriptBanner immediately on video selection with preparing, transcribing, ready, and failed states; dismiss only via X (clears attachment).
- Use latest Unsloth Q4_K_XL QAT GGUF (2.6 GiB) for better on-device quality and lower memory
- Keep legacy non-QAT Q4 and Q8 as download options
- Update RAM requirements and model list
- Improve filename parser to cleanly display QAT models as 'Gemma 4'
- Update README and CHANGELOG
Strip [MM:SS] prefixes from each line instead of dropping timestamped lines entirely. Share, copy, and word count use the displayed text.
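One way to strip the prefixes while keeping the rest of each line, shown as a hedged sketch. The function names and the exact regular expression are assumptions for illustration; the PR's own parser may handle more timestamp formats.

```swift
import Foundation

/// Removes a leading "[MM:SS]" (or "[H:MM:SS]") stamp from every line,
/// keeping the spoken text that follows it. Lines without a stamp are unchanged.
func stripTimestampPrefixes(_ transcript: String) -> String {
    // Anchored at the start of each line because lines are processed one at a time.
    let pattern = #"^\s*\[\d{1,2}(?::\d{2}){1,2}\]\s*"#
    return transcript
        .components(separatedBy: "\n")
        .map { $0.replacingOccurrences(of: pattern, with: "", options: .regularExpression) }
        .joined(separator: "\n")
}

/// Counts whitespace-separated words in the text that is actually displayed.
func displayedWordCount(_ text: String) -> Int {
    text.split(whereSeparator: { $0.isWhitespace }).count
}
```

Running the word count over the stripped text rather than the raw transcript keeps the number consistent with what share and copy produce.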
…pt UI Integrate whisper.xcframework for on-device video/voice transcription with model unload/reload around speech jobs. Generate video poster thumbnails for the chat attachment bubble and progress banner. Load large transcripts off the main thread via TranscriptSheetLoader and cache character counts. Adopt iOS 18 AVAssetExportSession.export(to:as:) and fix remaining build warnings.
Vendor whisper.cpp at 99613cb for local builds and reference. Include plan-whisper-cpp-integration.md and todo.md for the integration checklist.
Do not commit Apple Development Team IDs in the open-source project.
Reconnect branch history after LFS rewrite so the feature branch can merge via PR.
Summary
Adds on-device video and voice transcription using whisper.cpp, with UI for importing videos, viewing transcripts, and chatting over transcript context.
Highlights
- …whisper.xcframework), checkpointed background jobs, model unload/reload around speech work
- …DEVELOPMENT_TEAM in project; local signing via Xcode or gitignored Config/Signing.local.xcconfig
Notes for reviewers
- …Silo/models/whisper/*.bin) is tracked with Git LFS (~181 MB). Clone with git lfs install.
- whisper.cpp source is vendored for reference/local builds; runtime uses whisper.xcframework.
- …master after an LFS history rewrite (merge commit at tip).
Test plan