Local speech, video transcription, and voice mode #4

Open
stevederico wants to merge 46 commits into master from feature/local-speech-video-voice
Conversation

@stevederico

Owner

Summary

Adds on-device video and voice transcription using whisper.cpp, with UI for importing videos, viewing transcripts, and chatting over transcript context.

Highlights

  • Video transcription — whisper.cpp integration (whisper.xcframework), checkpointed background jobs, model unload/reload around speech work
  • Transcript UI — sticky progress banner, tappable chat bubble with video thumbnail, async transcript sheet, optional timestamp hiding
  • Voice mode — hold-to-talk, live partial transcript in composer, auto-reload GGUF after speech unload
  • Models — Gemma 4 E2B QAT default; Whisper small model via Git LFS; Manage Models download paths for STT
  • Open-source signing — empty DEVELOPMENT_TEAM in project; local signing via Xcode or gitignored Config/Signing.local.xcconfig
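
The "model unload/reload around speech work" item can be pictured as a small wrapper that frees the LLM before transcription and restores it afterwards. The following is a minimal sketch under stated assumptions, not the PR's actual code: `LanguageModelHost` and `withModelUnloaded` are hypothetical names, and the real implementation is presumably async.

```swift
/// Hypothetical interface; the PR's real model-manager type is not shown here.
protocol LanguageModelHost: AnyObject {
    var isLoaded: Bool { get }
    func unload()
    func reload() throws
}

/// Runs `work` with the language model unloaded, then reloads it.
/// If `work` throws, a best-effort reload happens and the original error is rethrown.
func withModelUnloaded<T>(_ host: LanguageModelHost, _ work: () throws -> T) throws -> T {
    let wasLoaded = host.isLoaded
    if wasLoaded { host.unload() }          // free memory for the speech model

    let result: T
    do {
        result = try work()
    } catch {
        if wasLoaded { try? host.reload() } // do not mask the speech error
        throw error
    }

    if wasLoaded { try host.reload() }      // surface reload failures to the caller
    return result
}
```

Keeping the reload in one place means every speech path (video import or hold-to-talk) leaves the chat model in the state it found it.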

Notes for reviewers

  • Whisper STT model (Silo/models/whisper/*.bin) is tracked with Git LFS (~181 MB). Run git lfs install before cloning, or git lfs pull afterwards, so the model file is downloaded instead of left as a pointer.
  • whisper.cpp source is vendored for reference/local builds; runtime uses whisper.xcframework.
  • Branch was reconnected to master after an LFS history rewrite (merge commit at tip).
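
A clone made without Git LFS leaves a small text pointer where the Whisper model should be. A guard like the one below could catch that before the file is handed to whisper.cpp. This is an illustrative sketch only: `looksLikeGitLFSPointer` is a hypothetical helper and is not claimed to exist in the PR.

```swift
import Foundation

/// Returns true when `data` looks like a Git LFS pointer file
/// rather than real model weights.
func looksLikeGitLFSPointer(_ data: Data) -> Bool {
    // Every LFS pointer file starts with this version line.
    let signature = Data("version https://git-lfs.github.com/spec/v1".utf8)
    // Pointer files are small text files; the LFS spec keeps them under 1024 bytes.
    return data.count < 1024 && data.starts(with: signature)
}
```

Reading only the first kilobyte of the .bin file is enough to run this check, so it stays cheap even for the full-size model.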

Test plan

  • Clone with Git LFS; build and run on device
  • Import video from Photos; confirm transcription completes and thumbnail appears
  • Tap transcript bubble/banner; sheet opens without UI stall
  • Toggle timestamps in transcript viewer
  • Voice hold-to-talk and send message
  • Chat with transcript context grounded in video content
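
For the timestamp toggle, the commit notes describe stripping the [MM:SS] prefix from each line rather than dropping timestamped lines. A minimal sketch of that behavior follows, assuming one timestamp at the start of a line; the function name `strippingTimestamps` is hypothetical.

```swift
import Foundation

/// Removes a leading [MM:SS] prefix from every line and keeps the rest of the line.
func strippingTimestamps(from transcript: String) -> String {
    transcript
        .components(separatedBy: "\n")
        .map { line in
            // `^` anchors to the start of the line, so bracketed times
            // that appear mid-sentence are left alone.
            line.replacingOccurrences(
                of: #"^\s*\[\d{1,3}:\d{2}\]\s*"#,
                with: "",
                options: .regularExpression
            )
        }
        .joined(separator: "\n")
}
```

Running copy, share, and word count on the output of this function keeps them consistent with what the viewer displays.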

stevederico and others added 30 commits January 31, 2025 09:40
Updated the README to include a new screenshot section with a table layout.
…-4 Mini support, privacy manifest, corrupt file detection, download persistence, updated llama.cpp engine, and new README following readme-standards
- Auto-download default SmolLM3 when no models are installed
- Fix catalog sync for on-disk GGUF files; add Phi-4 Mini to catalog
- Make loadModel async/throws with published modelLoadError
- Throw on llama_decode failures and surface errors in chat
- Restore conversation KV cache after title generation via encodePrompt
- Fix model picker active check; Gemma/Phi RAM gates; privacy link
Replace older SmolLM3, Phi-4, LFM2.5, and Ministral entries with
Qwen3.5, Gemma 3, Qwen3-4B, and Llama 3.2 3B. Keep Gemma 4 E2B as
default. Update RAM gates, README, and license acknowledgements.
Fix review issues and refresh 2026 on-device model catalog
Introduces a Speech module with on-device-only recognition guards,
audio extraction/chunking for long videos, and live mic voice input.
Video import attaches transcripts to new chats for grounded LLM Q&A.
- Checkpointed transcription jobs with background task + chunk resume
- Unload LLM during transcribe; reload after
- TranscriptView with copy/share; attachment banner
- On-device TTS for replies (Settings toggle)
Restore com.example.silo bundle ID and empty DEVELOPMENT_TEAM (Xcode
had written personal team/bixbyapps IDs locally). Gitignore local signing
artifacts and document optional Signing.local.xcconfig for device builds.
Default download and model resolution use LFM2.5 (~1.2 GB) on sim,
ignore files over 1.6 GB, auto-bootstrap when only Gemma is present,
and cap context/threads with clearer load errors.
Skip requiresOnDeviceRecognition guard in sim so video transcription can be tested.
Persist failureMessage with a red banner, mirror to videoImportError under
the input, and stop loadModel from wiping messages after transcription.
Stale prefix cache mismatched re-tokenized assistant history. Clear memory
between turns, fall back to full prompt init on decode error, and retry once.
Use partial results and task completion fallback, dictation hint, modern
audio track loading, en-US retry, and clearer no-speech error guidance.
Show VideoTranscriptBanner immediately on video selection with preparing,
transcribing, ready, and failed states; dismiss only via X (clears attachment).
- Use latest Unsloth Q4_K_XL QAT GGUF (2.6 GiB) for better on-device quality and lower memory
- Keep legacy non-QAT Q4 and Q8 as download options
- Update RAM requirements and model list
- Improve filename parser to cleanly display QAT models as 'Gemma 4'
- Update README and CHANGELOG
Strip [MM:SS] prefixes from each line instead of dropping timestamped
lines entirely. Share, copy, and word count use the displayed text.
…pt UI

Integrate whisper.xcframework for on-device video/voice transcription with
model unload/reload around speech jobs. Generate video poster thumbnails for
the chat attachment bubble and progress banner. Load large transcripts off the
main thread via TranscriptSheetLoader and cache character counts. Adopt iOS 18
AVAssetExportSession.export(to:as:) and fix remaining build warnings.
Vendor whisper.cpp at 99613cb for local builds and reference. Include
plan-whisper-cpp-integration.md and todo.md for the integration checklist.
Do not commit Apple Development Team IDs in the open-source project.
Reconnect branch history after LFS rewrite so the feature branch can merge via PR.
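
The "checkpointed transcription jobs with background task + chunk resume" commit suggests a job that records progress after each audio chunk so an interrupted run can continue where it stopped. The sketch below shows one way that could work; `TranscriptionCheckpoint`, `resumeTranscription`, and the closure parameters are hypothetical names, not taken from the PR.

```swift
import Foundation

/// Hypothetical checkpoint record; field names are illustrative.
struct TranscriptionCheckpoint: Codable, Equatable {
    var chunkCount: Int
    var completedTexts: [String]   // one entry per finished chunk, in order

    var nextChunkIndex: Int { completedTexts.count }
    var isFinished: Bool { completedTexts.count >= chunkCount }
}

/// Transcribes the remaining chunks and persists the checkpoint after each one,
/// so a later call can pick up from the last saved state.
func resumeTranscription(
    from checkpoint: TranscriptionCheckpoint,
    transcribeChunk: (Int) throws -> String,
    save: (TranscriptionCheckpoint) throws -> Void
) throws -> TranscriptionCheckpoint {
    var state = checkpoint
    while !state.isFinished {
        let text = try transcribeChunk(state.nextChunkIndex)
        state.completedTexts.append(text)
        try save(state)            // checkpoint before moving to the next chunk
    }
    return state
}
```

Because the checkpoint is Codable, it can be written to disk as JSON and reloaded after the app is suspended or relaunched.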