Real-time transcription + diarization overhaul (Phase 1: streaming captions) by mrjeeves · Pull Request #212 · mrjeeves/MyOwnLLM

mrjeeves · 2026-05-29T15:05:41Z

Why

The live transcription path felt ~8 s slow because it was an offline design in disguise: fixed 8 s, non-overlapping chunks decoded statelessly off disk, with nothing shown until a whole window finished. This PR rebuilds it around true streaming — interim hypotheses that refine within a few hundred ms and finalize on a pause (iPhone-style) — plus speaker attribution on the live path.

Phase 1 — landed (streaming live transcription)

| Area | What |
|---|---|
| Streaming core | `LocalAgreement-2`, `StreamWindow`, `SilenceEndpointer`: pure logic, 14 unit tests |
| Tokens / caps | shared `AsrToken` + `AsrSegment.tokens`; per-tier window/hop caps; Moonshine + Parakeet emit tokens |
| Streaming loop | `run_streaming_loop`: rolling window → per-hop decode → LocalAgreement → interim→final captions + diarized speaker at each endpoint; `EmittedSegment.seg_id`/`partial` (2 loop tests via fake backend + `CaptureSink`) |
| Live paths flipped | both `run_session` (mic) and `run_remote_session` (mesh host) now stream off the channel, with no disk shards |
| UI | interim caption rendered tentatively (dimmed/italic) then finalized in place; mesh forwards confirmed-only |

Verification (headless): `cargo fmt --check`, `cargo clippy --all-targets` (no new warnings), `cargo test` (154 passed), `pnpm check` (0 errors). The disk-ingest writer is now unused (both live paths are in-memory) and marked `allow(dead_code)`; `run_upload`/`run_drain` still read the disk format.

⚠️ Needs real-mic validation — the audio/GUI/ONNX path can't run in this headless environment. Everything is compile/lint/unit-verified, but the live behavior (latency feel, caption refinement, speaker labels) needs a human at a mic. Build + record to confirm.

Remaining

- Silero VAD endpointing: an upgrade over the current RMS-based endpointer. Needs a new model artifact + ONNX wrapper; best done where it can be run/tuned. (The RMS endpointer is a working stand-in until then.)
- Phase 2 (accuracy): window shrink, beam-on-final for Moonshine, VAD-trimmed spans, optional cache-aware Parakeet export.
- Phase 3 (running speaker profile): EMA centroids + persistent cross-session SpeakerRegistry + the cold-start re-label via the seg-id upsert (directly addresses the "doesn't build a running profile" complaint; largely unit-testable).

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ

Phase 1 foundation for the low-latency live-transcription overhaul. The live mic path will decode overlapping windows of audio; this turns that stream of whole-window hypotheses into a stable confirmed prefix (final captions) plus a refining interim tail (live captions).

New pure-logic module `asr::streaming` (no cpal/ORT/clock deps, so it unit-tests like `diarize::cluster`):

- LocalAgreement: confirms a token once two consecutive hypotheses agree on it (LocalAgreement-2); exposes the interim tail and a finalize-on-endpoint that resets for the next utterance.
- StreamWindow: rolling 16 kHz context buffer with a context cap and confirmed-boundary trimming that keeps a short lookback.
- SilenceEndpointer: trailing-silence endpointer reusing the RMS notion of the existing gate, so short utterances finalize on a pause without yet pulling in a neural VAD.

14 unit tests; fmt + clippy clean. The module is consumed by the worker in a follow-up that switches run_session off the 8 s disk-shard loop; until then it carries a crate-style allow(dead_code), matching the existing not-yet-wired AsrCaps streaming fields.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
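As an illustration of the LocalAgreement-2 rule described above, here is a minimal sketch over plain string tokens. It is not the PR's `asr::streaming` code: the method names (`push`, `interim`, `finalize`) and the assumption that each hypothesis covers the whole current utterance (so it begins with the already-confirmed prefix) are hypothetical simplifications.

```rust
/// Hypothetical sketch of LocalAgreement-2; not the PR's actual type.
#[derive(Default)]
pub struct LocalAgreement {
    /// Tokens confirmed so far for the current utterance.
    confirmed: Vec<String>,
    /// Unconfirmed tail of the previous hypothesis.
    prev_tail: Vec<String>,
}

impl LocalAgreement {
    /// Feed the newest hypothesis for the whole utterance.
    /// Returns the tokens that became confirmed on this call.
    /// Assumption: the hypothesis starts with the confirmed prefix.
    pub fn push(&mut self, hypothesis: &[String]) -> Vec<String> {
        // Only the part past the confirmed prefix is still undecided.
        let tail: &[String] = hypothesis.get(self.confirmed.len()..).unwrap_or(&[]);
        // A token is confirmed once two consecutive hypotheses agree on it,
        // i.e. it lies in the common prefix of the previous and current tails.
        let agree = tail
            .iter()
            .zip(self.prev_tail.iter())
            .take_while(|(a, b)| a == b)
            .count();
        let newly: Vec<String> = tail[..agree].to_vec();
        self.confirmed.extend(newly.iter().cloned());
        self.prev_tail = tail[agree..].to_vec();
        newly
    }

    /// The refining interim tail (not yet confirmed).
    pub fn interim(&self) -> &[String] {
        &self.prev_tail
    }

    /// On an endpoint: return confirmed + interim text and reset
    /// for the next utterance.
    pub fn finalize(&mut self) -> Vec<String> {
        let mut out = std::mem::take(&mut self.confirmed);
        out.append(&mut self.prev_tail);
        out
    }
}
```

With successive hypotheses "the cat", "the cat sat", "the cat sat on", this sketch confirms nothing on the first call, "the cat" on the second, and "sat" on the third, leaving "on" as the interim tail until `finalize` is called.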

Phase 1 scaffolding the rolling-window loop will consume next.

- Promote streaming.rs's StreamToken to a shared `AsrToken` on the trait (text + chunk-relative end time) and add `AsrSegment.tokens` (serde-default empty, so the disk-shard path and existing JSON consumers are unaffected). The streaming module now feeds `AsrSegment.tokens` straight into LocalAgreement with no conversion.
- Add live-path geometry to AsrCaps: window_seconds / hop_seconds / max_context_seconds. Moonshine 4.0/0.5/8.0 (overlap + LocalAgreement lets the old 8 s disk window shrink); Parakeet 2.0/0.3/6.0.
- Populate tokens: Moonshine via AsrToken::words_uniform (no per-token timing, so end times are distributed); Parakeet word-level with real frame times via pieces_to_tokens, with the no-timestamps export falling back to words_uniform.

Additive only: the live loop that reads these lands next. fmt + clippy clean; full test suite green (152 passed).

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
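To make "end times are distributed" concrete, the following sketch spreads word end times evenly across a chunk. The commit names `AsrToken` and `words_uniform`, but the field names and the signature used here are assumptions, not the PR's definitions.

```rust
/// Hypothetical shape: the commit only says the token carries text plus a
/// chunk-relative end time. Field names here are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct AsrToken {
    pub text: String,
    /// End time in seconds, relative to the start of the decoded chunk.
    pub end: f32,
}

impl AsrToken {
    /// Split `text` on whitespace and give word i (0-based) of n the end
    /// time `chunk_seconds * (i + 1) / n`. Signature is an assumption.
    pub fn words_uniform(text: &str, chunk_seconds: f32) -> Vec<AsrToken> {
        let words: Vec<&str> = text.split_whitespace().collect();
        let n = words.len() as f32;
        words
            .iter()
            .enumerate()
            .map(|(i, w)| AsrToken {
                text: w.to_string(),
                // Never evaluated when there are no words, so no division by zero.
                end: chunk_seconds * (i as f32 + 1.0) / n,
            })
            .collect()
    }
}
```

Under these assumptions, four words over a 4.0 s chunk get end times of 1.0, 2.0, 3.0 and 4.0 s. The times are only approximate, which is presumably why the Parakeet path prefers real frame times when the export provides them.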

Phase 1: the rolling-window loop the streaming core feeds.

`run_streaming_loop` is the live-path counterpart to ingest_loop + the disk-shard poll loop in run_session: it keeps a rolling StreamWindow in RAM, re-decodes the current utterance every caps.hop_seconds, runs LocalAgreement-2 over the hypotheses, and emits an interim caption (partial=true) that refines hop-to-hop, finalizing (partial=false) on a speech pause detected by SilenceEndpointer. One seg_id per utterance lets the UI replace the live line in place; a forced cut bounds a never-pausing monologue at the context cap, and a drain-time finalize avoids losing in-flight text on Stop.

Protocol: EmittedSegment gains seg_id + partial (serde-default/skip, so the disk-shard path and existing JSON consumers are unaffected).

The loop is decoupled from cpal (it drains the sample channel and writes through FrameSink), so it's unit-tested here with a scripted fake backend + CaptureSink: a refine-then-finalize run and an all-silence run. fmt + clippy clean; suite green (154 passed).

Not yet wired into run_session (that behavioural switch needs real-mic validation) and speaker attribution over this path is a follow-up, hence allow(dead_code) for now.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
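The pause detection the loop relies on could look roughly like the trailing-silence endpointer below. This is a sketch under assumptions: the threshold, the frame-based `push` API and the constructor parameters are hypothetical, and the real `SilenceEndpointer` reuses the existing gate's RMS logic, which is not shown in this PR page.

```rust
/// Hypothetical RMS-based trailing-silence endpointer; not the PR's type.
pub struct SilenceEndpointer {
    rms_threshold: f32,
    /// Trailing silence (in samples) required to declare an endpoint.
    required_silence: usize,
    trailing_silence: usize,
    heard_speech: bool,
}

impl SilenceEndpointer {
    pub fn new(rms_threshold: f32, silence_seconds: f32, sample_rate: u32) -> Self {
        Self {
            rms_threshold,
            required_silence: (silence_seconds * sample_rate as f32) as usize,
            trailing_silence: 0,
            heard_speech: false,
        }
    }

    /// Feed one frame of samples. Returns true when an utterance ends,
    /// i.e. speech was heard and enough trailing silence has accumulated.
    pub fn push(&mut self, frame: &[f32]) -> bool {
        if frame.is_empty() {
            return false;
        }
        let mean_sq = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
        if mean_sq.sqrt() >= self.rms_threshold {
            // Speech: remember it and restart the silence count.
            self.heard_speech = true;
            self.trailing_silence = 0;
            return false;
        }
        if !self.heard_speech {
            // Leading silence never produces an endpoint.
            return false;
        }
        self.trailing_silence += frame.len();
        if self.trailing_silence >= self.required_silence {
            // Reset so the next utterance starts clean.
            self.heard_speech = false;
            self.trailing_silence = 0;
            return true;
        }
        false
    }
}
```

Requiring speech before counting silence is what keeps an all-silence run from emitting empty finals, which matches the second loop test the commit mentions.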

Frontend half of the streaming live path.

EmittedSegment gains seg_id + partial (mirroring the Rust protocol). The state store routes a partial segment to a single transient `interimSegment`, rendered tentatively (dimmed/italic) beneath the confirmed transcript and never persisted, while final segments append to liveSegments exactly as the disk-shard path always has. When an utterance finalizes (a non-partial segment with the same seg_id), the interim line clears and the text lands as a normal turn. liveDelta (the Talking Points consumer) now projects confirmed text only, so it doesn't churn on interim refinements.

Mesh: forward confirmed segments only. Interim captions stay a local affordance so remote transcripts don't churn (interim-over-mesh later).

pnpm check clean (0 errors). The disk-shard path is unchanged (no seg_id/partial → append + persist as before). Live rendering needs real-app/mic validation once run_session is switched onto the loop.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
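The routing rule above can be modelled in a few lines. The real store is frontend code; the sketch below is written in Rust only to stay consistent with the backend side, and every type and method name in it (`TranscriptState`, `apply`, `live_delta`) is a hypothetical stand-in for `interimSegment`, `liveSegments` and `liveDelta`.

```rust
/// Hypothetical mirror of the protocol fields named in the commit.
#[derive(Debug, Clone, PartialEq)]
pub struct EmittedSegment {
    /// Absent on the disk-shard path.
    pub seg_id: Option<u64>,
    pub partial: bool,
    pub text: String,
}

/// Hypothetical model of the frontend state store.
#[derive(Default)]
pub struct TranscriptState {
    /// Confirmed transcript (persisted in the real app).
    pub live_segments: Vec<EmittedSegment>,
    /// Single transient interim line (never persisted).
    pub interim_segment: Option<EmittedSegment>,
}

impl TranscriptState {
    pub fn apply(&mut self, seg: EmittedSegment) {
        if seg.partial {
            // Interim refinements replace the one transient line in place.
            self.interim_segment = Some(seg);
            return;
        }
        // A final segment with the same seg_id clears the interim line.
        let supersedes = seg.seg_id.is_some()
            && self
                .interim_segment
                .as_ref()
                .map_or(false, |i| i.seg_id == seg.seg_id);
        if supersedes {
            self.interim_segment = None;
        }
        // Finals (including disk-shard segments without seg_id) append.
        self.live_segments.push(seg);
    }

    /// Confirmed text only, so consumers don't churn on interim updates.
    pub fn live_delta(&self) -> String {
        self.live_segments
            .iter()
            .map(|s| s.text.as_str())
            .collect::<Vec<_>>()
            .join(" ")
    }
}
```

Keeping the interim line out of `live_segments` means that persistence and the confirmed-text projection need no special handling for partial segments.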

Flip run_session off the 8 s disk-shard pipeline onto run_streaming_loop: audio now streams straight off the cpal channel through the rolling window + per-hop decode + LocalAgreement, emitting interim → final captions instead of waiting ~8 s for a whole chunk to land. No disk shards on the live path (a crashed live session is no longer drain-recoverable, which is fine for a live feature); a late cpal capture error is surfaced at teardown.

run_streaming_loop now takes the diarizer: at each endpoint it runs the finalized utterance audio through diarize and attaches the dominant speaker (max-overlap, mirroring join_segments) to the final segment. Interim captions carry no speaker; it settles when the line finalizes.

Compiles clean; clippy quiet; suite green (154). run_upload / run_drain stay on the disk path. Needs real-mic validation (can't run audio headless). The streaming endpointer is RMS-based for now; Silero VAD and the run_remote_session flip are still to come.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
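A max-overlap speaker pick of the kind mentioned above might be sketched as follows. `SpeakerTurn` and `dominant_speaker` are hypothetical names, and the actual `join_segments` logic may differ (for example in how ties are broken).

```rust
use std::collections::BTreeMap;

/// Hypothetical diarizer output: one speaker active over [start, end) seconds.
pub struct SpeakerTurn {
    pub speaker: u32,
    pub start: f32,
    pub end: f32,
}

/// Return the speaker whose turns overlap the segment [seg_start, seg_end)
/// the most in total, or None if no turn overlaps it at all.
pub fn dominant_speaker(turns: &[SpeakerTurn], seg_start: f32, seg_end: f32) -> Option<u32> {
    // BTreeMap keeps iteration order deterministic across runs.
    let mut overlap: BTreeMap<u32, f32> = BTreeMap::new();
    for t in turns {
        // Length of the intersection of the turn and the segment.
        let o = t.end.min(seg_end) - t.start.max(seg_start);
        if o > 0.0 {
            *overlap.entry(t.speaker).or_insert(0.0) += o;
        }
    }
    overlap
        .into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(speaker, _)| speaker)
}
```

Summing overlap per speaker, rather than taking the single longest turn, lets a speaker with several short turns still win the utterance.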

Flip run_remote_session onto run_streaming_loop, same as the local mic path. The only difference is the audio source (the peer's PCM, fed via feed_remote_audio into rx). The loop now ends on rx disconnect when end_remote_audio drops the inbox tx, matching the old inbox-closed exit. Mesh peers already receive confirmed-only segments (interim filtered in mesh-transcribe.ts), so remote transcripts don't churn.

With both live paths off disk, the disk-ingest writer (ingest_loop, write_f32_chunk, MAX_BACKLOG_SECONDS) is unused: marked allow(dead_code) and kept as the reference shard format (run_drain still reads it for crashed pre-flip sessions).

Compiles clean; clippy quiet; suite green (154).

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
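The "ends on rx disconnect" behaviour follows from how `std::sync::mpsc` channels work: once every sender is dropped, the receiver drains what is buffered and then reports a disconnect. A minimal sketch, assuming the loop polls with a timeout per hop (the function name and the sample-counting body are hypothetical stand-ins for the real decode work):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical stand-in for the loop's receive side: consume PCM chunks
/// until every sender has been dropped, then return the sample count.
pub fn drain_until_disconnect(rx: Receiver<Vec<f32>>, hop: Duration) -> usize {
    let mut total = 0;
    loop {
        match rx.recv_timeout(hop) {
            // New audio: the real loop would push it into the rolling window.
            Ok(chunk) => total += chunk.len(),
            // No audio this hop: the real loop could still re-decode here.
            Err(RecvTimeoutError::Timeout) => continue,
            // All senders dropped (e.g. the inbox tx): exit cleanly.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    total
}
```

Chunks already queued when the sender is dropped are still delivered before `Disconnected` is returned, so a clean exit on disconnect does not discard buffered audio.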

claude added 3 commits May 29, 2026 15:45

mrjeeves force-pushed the claude/dreamy-cray-UMaAC branch from 339128d to 7ed89e3 Compare May 29, 2026 15:46

claude added 3 commits May 29, 2026 15:53

mrjeeves wants to merge 6 commits into main from claude/dreamy-cray-UMaAC

2 participants
