Local speech, video transcription, and voice mode by stevederico · Pull Request #4 · stevederico/silo

stevederico · 2026-06-08T22:20:08Z

Summary

Adds on-device video and voice transcription using whisper.cpp, with UI for importing videos, viewing transcripts, and chatting over transcript context.

Highlights

- Video transcription: whisper.cpp integration (whisper.xcframework), checkpointed background jobs, model unload/reload around speech work
- Transcript UI: sticky progress banner, tappable chat bubble with video thumbnail, async transcript sheet, optional timestamp hiding
- Voice mode: hold-to-talk, live partial transcript in composer, auto-reload GGUF after speech unload
- Models: Gemma 4 E2B QAT default; Whisper small model via Git LFS; Manage Models download paths for STT
- Open-source signing: empty DEVELOPMENT_TEAM in project; local signing via Xcode or gitignored Config/Signing.local.xcconfig

Notes for reviewers

- The Whisper STT model (Silo/models/whisper/*.bin) is tracked with Git LFS (~181 MB). Run git lfs install before cloning so the model file is downloaded rather than left as an LFS pointer.
- The whisper.cpp source is vendored for reference and local builds; the runtime uses whisper.xcframework.
- The branch was reconnected to master after an LFS history rewrite (merge commit at tip).

Test plan

- Clone with Git LFS; build and run on device
- Import video from Photos; confirm transcription completes and thumbnail appears
- Tap transcript bubble/banner; sheet opens without UI stall
- Toggle timestamps in transcript viewer
- Voice hold-to-talk and send message
- Chat with transcript context grounded in video content

- Auto-download default SmolLM3 when no models are installed
- Fix catalog sync for on-disk GGUF files; add Phi-4 Mini to catalog
- Make loadModel async/throws with published modelLoadError
- Throw on llama_decode failures and surface errors in chat
- Restore conversation KV cache after title generation via encodePrompt
- Fix model picker active check; Gemma/Phi RAM gates; privacy link

- Use latest Unsloth Q4_K_XL QAT GGUF (2.6 GiB) for better on-device quality and lower memory
- Keep legacy non-QAT Q4 and Q8 as download options
- Update RAM requirements and model list
- Improve filename parser to cleanly display QAT models as 'Gemma 4'
- Update README and CHANGELOG

…pt UI Integrate whisper.xcframework for on-device video/voice transcription with model unload/reload around speech jobs. Generate video poster thumbnails for the chat attachment bubble and progress banner. Load large transcripts off the main thread via TranscriptSheetLoader and cache character counts. Adopt iOS 18 AVAssetExportSession.export(to:as:) and fix remaining build warnings.

stevederico and others added 30 commits January 31, 2025 09:40

Initial commit

01357e9

1.0

47b5e7b

Update README.md

4c213f2

added example

1f0016f

Update README.md

cb000f0

Update README.md

c7585e1

Update README.md

5ff7965

2.0.0

d58a1d1

update README

d467320

2.0.1

4f53aa5

2.0.2

956e469

Update README.md

51b80a0

Revise README with new screenshot display

49aed51

Updated the README to include a new screenshot section with a table layout.

2.1.0 Sync from private: streaming markdown renderer, Gemma 4 and Phi-4 Mini support, privacy manifest, corrupt file detection, download persistence, updated llama.cpp engine, and new README following readme-standards

e146de3

2.1.1 Swap README demo to UNLIMITED screenshot

2b4f366

Default first-run download to Gemma 4 E2B Instruct Q4 instead of SmolLM3

0ec4d01

Refresh recommended models for 2026 mobile on-device use

94643ff

Replace older SmolLM3, Phi-4, LFM2.5, and Ministral entries with Qwen3.5, Gemma 3, Qwen3-4B, and Llama 3.2 3B. Keep Gemma 4 E2B as default. Update RAM gates, README, and license acknowledgements.

Trim recommended model catalog

88645f7

Remove Phi-4 Mini from recommended models

51d1bb4

Remove Llama 3.2 and Gemma 3 from model catalog

69023be

Enforce 12-month max age on recommended model catalog

1c0eea0
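
A 12-month cutoff like the one this commit enforces can be written as a simple date filter. The sketch below is illustrative only: `CatalogEntry`, `releaseDate`, and `recentEntries` are hypothetical names, not the project's actual types.

```swift
import Foundation

// Hypothetical catalog entry; the real catalog type is not shown in this PR.
struct CatalogEntry {
    let name: String
    let releaseDate: Date
}

/// Keeps only the entries released within `maxAgeMonths` of `now`.
func recentEntries(
    _ entries: [CatalogEntry],
    asOf now: Date = Date(),
    maxAgeMonths: Int = 12,
    calendar: Calendar = Calendar(identifier: .gregorian)
) -> [CatalogEntry] {
    // If the cutoff cannot be computed, keep everything rather than hide models.
    guard let cutoff = calendar.date(byAdding: .month, value: -maxAgeMonths, to: now) else {
        return entries
    }
    return entries.filter { $0.releaseDate >= cutoff }
}
```

Passing `now` and `calendar` as parameters keeps the filter deterministic under test.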

Merge pull request #2 from stevederico/fix/review-issues

3856a15

Fix review issues and refresh 2026 on-device model catalog

Add on-device Apple Speech for video transcription and voice mode

f294802

Introduces a Speech module with on-device-only recognition guards, audio extraction/chunking for long videos, and live mic voice input. Video import attaches transcripts to new chats for grounded LLM Q&A.

Phase 2: transcript viewer, background jobs, TTS, model unload

58856ae

- Checkpointed transcription jobs with background task + chunk resume
- Unload LLM during transcribe; reload after
- TranscriptView with copy/share; attachment banner
- On-device TTS for replies (Settings toggle)
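
One way to picture checkpointing with chunk resume: save the finished chunks after each step, and on restart continue from the first chunk that is missing. This is a minimal sketch under assumptions. `TranscriptionCheckpoint`, `runJob`, and the closure parameters are hypothetical and do not reflect the app's real job model.

```swift
import Foundation

// Hypothetical checkpoint, assumed to be persisted after every finished chunk.
struct TranscriptionCheckpoint: Codable, Equatable {
    var totalChunks: Int
    var completedTexts: [String]   // text of each finished chunk, in order

    var nextChunkIndex: Int { completedTexts.count }
    var isFinished: Bool { completedTexts.count >= totalChunks }
}

/// Transcribes the remaining chunks, saving progress after each one.
/// If `transcribeChunk` throws, the checkpoint still holds every chunk
/// finished so far, so a later call resumes where this one stopped.
func runJob(
    checkpoint: inout TranscriptionCheckpoint,
    transcribeChunk: (Int) throws -> String,
    save: (TranscriptionCheckpoint) -> Void
) rethrows {
    while !checkpoint.isFinished {
        let text = try transcribeChunk(checkpoint.nextChunkIndex)
        checkpoint.completedTexts.append(text)
        save(checkpoint)
    }
}
```

Because the checkpoint is `Codable`, `save` could encode it with `JSONEncoder` and write it to disk, though the storage format used by the app is not stated in the PR.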

Fix voice mode: show live transcript in input and send on mic stop

68d6157

Fix voice send: use input text fallback, surface model errors

87d8d65

Auto-reload GGUF after speech unload; fix resume model URL

03df6d0

Replace film/mic buttons with plus menu for video and voice

d37c186

Add hold-to-talk on input field with 450ms delay

71360ef

stevederico added 16 commits June 4, 2026 18:03

Improve model load errors; cap simulator context; explain Gemma sim limits

e36d7d1

Open Photos picker directly for video import; remove Files chooser screen

635e3b2

Keep open-source signing defaults in repo

f661173

Restore com.example.silo bundle ID and empty DEVELOPMENT_TEAM (Xcode had written personal team/bixbyapps IDs locally). Gitignore local signing artifacts and document optional Signing.local.xcconfig for device builds.

Prefer LFM2.5 on Simulator; skip oversized GGUF loads

868cc62

Default download and model resolution use LFM2.5 (~1.2 GB) on sim, ignore files over 1.6 GB, auto-bootstrap when only Gemma is present, and cap context/threads with clearer load errors.

Allow Speech recognition on Simulator when on-device assets missing

079da0a

Skip requiresOnDeviceRecognition guard in sim so video transcription can be tested.

Show transcription errors after job ends; keep chat on model reload

cb6932c

Persist failureMessage with a red banner, mirror to videoImportError under the input, and stop loadModel from wiping messages after transcription.

Wrap simulator bootstrap call in #if targetEnvironment(simulator)

97b0114

Fix decode failures after voice/chat by resetting KV cache each turn

9b7d15b

Stale prefix cache mismatched re-tokenized assistant history. Clear memory between turns, fall back to full prompt init on decode error, and retry once.
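
The recovery strategy described here (reset between turns, then a single retry from a clean state) can be sketched as follows. `InferenceEngine`, `DecodeError`, and `generateReply` are hypothetical stand-ins for the app's wrapper around llama.cpp. They are not llama.cpp API.

```swift
// Hypothetical error and engine abstraction; the real wrapper is not shown in this PR.
struct DecodeError: Error {}

protocol InferenceEngine: AnyObject {
    func clearMemory()
    func decode(fullPrompt: String) throws -> String
}

/// Runs one chat turn: clear cached state, decode, and retry once from a
/// clean state if decoding fails. A second failure propagates to the caller.
func generateReply<E: InferenceEngine>(using engine: E, fullPrompt: String) throws -> String {
    engine.clearMemory()   // avoid reusing a stale prefix cache from the previous turn
    do {
        return try engine.decode(fullPrompt: fullPrompt)
    } catch {
        engine.clearMemory()   // drop any partial state left by the failed attempt
        return try engine.decode(fullPrompt: fullPrompt)   // single retry with the full prompt
    }
}
```

Limiting the retry to one attempt keeps a persistent failure visible to the chat UI instead of looping.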

Improve video transcription when on-device speech returns empty

1e58328

Use partial results and task completion fallback, dictation hint, modern audio track loading, en-US retry, and clearer no-speech error guidance.

Sticky video transcript banner from pick until user dismisses

6d0056f

Show VideoTranscriptBanner immediately on video selection with preparing, transcribing, ready, and failed states; dismiss only via X (clears attachment).
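
The four banner states named above map naturally onto an enum. The sketch below is an assumption about how such a state could be modeled. `VideoTranscriptBannerState`, its associated values, and the title strings are hypothetical, and only the state names come from the commit message.

```swift
// Hypothetical state model; only the four state names come from the commit message.
enum VideoTranscriptBannerState: Equatable {
    case preparing
    case transcribing(progress: Double)   // assumed to be in 0.0...1.0
    case ready
    case failed(message: String)

    /// Text shown in the banner for each state (wording is illustrative).
    var title: String {
        switch self {
        case .preparing:
            return "Preparing video"
        case .transcribing(let progress):
            return "Transcribing \(Int((progress * 100).rounded()))%"
        case .ready:
            return "Transcript ready"
        case .failed(let message):
            return "Transcription failed: \(message)"
        }
    }
}
```

Keeping `failed` as a state of its own, rather than hiding the banner on error, matches the sticky behavior described: the banner stays until the user dismisses it.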

Fix transcript viewer hiding all text when timestamps are off

d5ca314

Strip [MM:SS] prefixes from each line instead of dropping timestamped lines entirely. Share, copy, and word count use the displayed text.
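
The fix can be illustrated with a per-line prefix strip. This is a minimal sketch, not the project's code: the function name `strippingTimestamps` is hypothetical, and the pattern also accepts an `[HH:MM:SS]` form as an assumption.

```swift
import Foundation

/// Removes a leading "[MM:SS]" (or "[HH:MM:SS]") prefix from every line while
/// keeping the spoken text. Lines without a prefix pass through unchanged.
func strippingTimestamps(from transcript: String) -> String {
    transcript
        .components(separatedBy: "\n")
        .map { line in
            line.replacingOccurrences(
                of: #"^\s*\[\d{1,2}:\d{2}(?::\d{2})?\]\s*"#,
                with: "",
                options: .regularExpression
            )
        }
        .joined(separator: "\n")
}
```

Share, copy, and word count would then operate on the returned string, so they match what the viewer displays.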

Add whisper.cpp source tree and integration planning notes

83af5fb

Vendor whisper.cpp at 99613cb for local builds and reference. Include plan-whisper-cpp-integration.md and todo.md for the integration checklist.

Bump Xcode LastUpgradeCheck; keep DEVELOPMENT_TEAM empty

1a16da7

Do not commit Apple Development Team IDs in the open-source project.

Merge feature/local-speech-video-voice onto master

a5cd1e9

Reconnect branch history after LFS rewrite so the feature branch can merge via PR.

stevederico wants to merge 46 commits into master from feature/local-speech-video-voice
