Real-time transcription + diarization overhaul (Phase 1: streaming captions) #212
Open
mrjeeves wants to merge 6 commits into
Phase 1 foundation for the low-latency live-transcription overhaul.
The live mic path will decode overlapping windows of audio; this turns
that stream of whole-window hypotheses into a stable confirmed prefix
(final captions) plus a refining interim tail (live captions).
New pure-logic module `asr::streaming` (no cpal/ORT/clock deps, so it
unit-tests like `diarize::cluster`):
- LocalAgreement: confirms a token once two consecutive hypotheses
agree on it (LocalAgreement-2); exposes the interim tail and a
finalize-on-endpoint that resets for the next utterance.
- StreamWindow: rolling 16 kHz context buffer with a context cap and
confirmed-boundary trimming that keeps a short lookback.
- SilenceEndpointer: trailing-silence endpointer reusing the RMS
notion of the existing gate, so short utterances finalize on a
pause without yet pulling in a neural VAD.
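The LocalAgreement-2 rule described in the list above can be sketched as follows. This is a minimal illustration, not the code in `asr::streaming`: the method names (`push`, `interim`, `finalize`) and the use of plain strings as tokens are assumptions made for the example.

```rust
/// Hypothetical sketch of LocalAgreement-2: a token is confirmed once two
/// consecutive whole-utterance hypotheses agree on it at the same position.
#[derive(Default)]
pub struct LocalAgreement {
    confirmed: Vec<String>, // stable prefix (final caption text so far)
    previous: Vec<String>,  // the last hypothesis, kept for comparison
}

impl LocalAgreement {
    /// Feed the newest hypothesis; returns the tokens it newly confirmed.
    pub fn push(&mut self, hypothesis: &[&str]) -> Vec<String> {
        let start = self.confirmed.len();
        let mut newly = Vec::new();
        // Walk forward from the confirmed prefix while both hypotheses agree.
        // Already-confirmed tokens are never revisited in this sketch.
        for (prev, cur) in self
            .previous
            .iter()
            .skip(start)
            .zip(hypothesis.iter().skip(start))
        {
            if prev.as_str() != *cur {
                break;
            }
            newly.push(prev.clone());
        }
        self.confirmed.extend(newly.iter().cloned());
        self.previous = hypothesis.iter().map(|t| t.to_string()).collect();
        newly
    }

    /// The unconfirmed tail of the latest hypothesis (the live caption).
    pub fn interim(&self) -> &[String] {
        let start = self.confirmed.len().min(self.previous.len());
        &self.previous[start..]
    }

    /// On an endpoint: emit confirmed prefix + remaining tail, then reset
    /// so the next utterance starts from an empty state.
    pub fn finalize(&mut self) -> Vec<String> {
        let mut out = std::mem::take(&mut self.confirmed);
        let start = out.len().min(self.previous.len());
        out.extend(self.previous.drain(start..));
        self.previous.clear();
        out
    }
}
```

With hypotheses `the cat`, `the cat sat`, `the cat sat down` pushed in order, the second push confirms `the cat`, the third confirms `sat`, and `down` stays in the interim tail until `finalize` is called.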
14 unit tests; fmt + clippy clean. The module is consumed by the worker
in a follow-up that switches run_session off the 8 s disk-shard loop;
until then it carries a crate-style allow(dead_code), matching the
existing not-yet-wired AsrCaps streaming fields.
https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
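As an illustration of the trailing-silence idea behind SilenceEndpointer, here is a minimal sketch of an RMS-based endpointer. The constructor parameters, the `push_frame` name, and the rule that an endpoint can only fire after some speech has been heard are assumptions for the example; the PR does not show the real thresholds or API.

```rust
/// Hypothetical RMS trailing-silence endpointer (names and thresholds are
/// illustrative, not taken from the PR).
pub struct SilenceEndpointer {
    rms_threshold: f32,              // frames at or above this count as speech
    required_silence_samples: usize, // trailing silence needed to fire
    trailing_silence: usize,         // silent samples seen since last speech
    heard_speech: bool,              // only endpoint after some speech
}

impl SilenceEndpointer {
    pub fn new(rms_threshold: f32, silence_seconds: f32, sample_rate: u32) -> Self {
        Self {
            rms_threshold,
            required_silence_samples: (silence_seconds * sample_rate as f32) as usize,
            trailing_silence: 0,
            heard_speech: false,
        }
    }

    /// Feed one frame of mono samples; returns true when an endpoint fires.
    pub fn push_frame(&mut self, frame: &[f32]) -> bool {
        if frame.is_empty() {
            return false;
        }
        let mean_square = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
        if mean_square.sqrt() >= self.rms_threshold {
            // Speech: remember it and restart the silence counter.
            self.heard_speech = true;
            self.trailing_silence = 0;
            return false;
        }
        if !self.heard_speech {
            // Leading silence never produces an endpoint.
            return false;
        }
        self.trailing_silence += frame.len();
        if self.trailing_silence >= self.required_silence_samples {
            // Pause detected: fire once and reset for the next utterance.
            self.heard_speech = false;
            self.trailing_silence = 0;
            return true;
        }
        false
    }
}
```

Requiring speech before an endpoint keeps an all-silence stream from emitting empty utterances, which is the behaviour one would want from the all-silence case the loop tests mention.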
Phase 1 scaffolding the rolling-window loop will consume next.

- Promote streaming.rs's StreamToken to a shared `AsrToken` on the trait
(text + chunk-relative end time) and add `AsrSegment.tokens`
(serde-default empty, so the disk-shard path and existing JSON
consumers are unaffected). The streaming module now feeds
`AsrSegment.tokens` straight into LocalAgreement with no conversion.
- Add live-path geometry to AsrCaps: window_seconds / hop_seconds /
max_context_seconds. Moonshine 4.0/0.5/8.0 (overlap + LocalAgreement
lets the old 8 s disk window shrink); Parakeet 2.0/0.3/6.0.
- Populate tokens: Moonshine via AsrToken::words_uniform (no per-token
timing, so end times are distributed); Parakeet word-level with real
frame times via pieces_to_tokens, with the no-timestamps export
falling back to words_uniform.

Additive only: the live loop that reads these lands next.
fmt + clippy clean; full test suite green (152 passed).
https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
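To make the "end times are distributed" fallback concrete, here is one way a `words_uniform`-style helper could spread end times evenly across a chunk. The field names, the signature, and the even-spacing rule are assumptions; the commit only says that `AsrToken` carries text plus a chunk-relative end time.

```rust
/// Hypothetical shape of the shared token: text plus a chunk-relative end
/// time. Field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct AsrToken {
    pub text: String,
    pub end_seconds: f32,
}

impl AsrToken {
    /// Split `text` on whitespace and assign evenly spaced end times over
    /// `duration_seconds`, for backends with no per-token timing.
    pub fn words_uniform(text: &str, duration_seconds: f32) -> Vec<AsrToken> {
        let words: Vec<&str> = text.split_whitespace().collect();
        let n = words.len();
        words
            .into_iter()
            .enumerate()
            .map(|(i, w)| AsrToken {
                text: w.to_string(),
                // The last word ends exactly at the chunk duration.
                end_seconds: duration_seconds * (i + 1) as f32 / n as f32,
            })
            .collect()
    }
}
```

Evenly spaced times are only a rough proxy, but they are enough to order tokens and to trim a window at a confirmed boundary when the model provides nothing better.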
Phase 1: the rolling-window loop the streaming core feeds.

`run_streaming_loop` is the live-path counterpart to ingest_loop + the disk-shard poll loop in run_session: it keeps a rolling StreamWindow in RAM, re-decodes the current utterance every caps.hop_seconds, runs LocalAgreement-2 over the hypotheses, and emits an interim caption (partial=true) that refines hop-to-hop, finalizing (partial=false) on a speech pause detected by SilenceEndpointer. One seg_id per utterance lets the UI replace the live line in place; a forced cut bounds a never-pausing monologue at the context cap, and a drain-time finalize avoids losing in-flight text on Stop.

Protocol: EmittedSegment gains seg_id + partial (serde-default/skip, so the disk-shard path and existing JSON consumers are unaffected).

The loop is decoupled from cpal (it drains the sample channel and writes through FrameSink), so it's unit-tested here with a scripted fake backend + CaptureSink: a refine-then-finalize run and an all-silence run.

fmt + clippy clean; suite green (154 passed). Not yet wired into run_session (that behavioural switch needs real-mic validation), and speaker attribution over this path is a follow-up, hence allow(dead_code) for now.
https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
339128d to
7ed89e3
Frontend half of the streaming live path.

EmittedSegment gains seg_id + partial (mirroring the Rust protocol). The state store routes a partial segment to a single transient `interimSegment`, rendered tentatively (dimmed/italic) beneath the confirmed transcript and never persisted, while final segments append to liveSegments exactly as the disk-shard path always has. When an utterance finalizes (a non-partial segment with the same seg_id), the interim line clears and the text lands as a normal turn. liveDelta (the Talking Points consumer) now projects confirmed text only, so it doesn't churn on interim refinements.

Mesh: forward confirmed segments only. Interim captions stay a local affordance so remote transcripts don't churn (interim-over-mesh later).

pnpm check clean (0 errors). The disk-shard path is unchanged (no seg_id/partial → append + persist as before). Live rendering needs real-app/mic validation once run_session is switched onto the loop.
https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
Flip run_session off the 8 s disk-shard pipeline onto run_streaming_loop: audio now streams straight off the cpal channel through the rolling window + per-hop decode + LocalAgreement, emitting interim → final captions instead of waiting ~8 s for a whole chunk to land. No disk shards on the live path (a crashed live session is no longer drain-recoverable, which is fine for a live feature); a late cpal capture error is surfaced at teardown.

run_streaming_loop now takes the diarizer: at each endpoint it runs the finalized utterance audio through diarize and attaches the dominant speaker (max-overlap, mirroring join_segments) to the final segment. Interim captions carry no speaker; it settles when the line finalizes.

Compiles clean; clippy quiet; suite green (154). run_upload / run_drain stay on the disk path. Needs real-mic validation (can't run audio headless). The streaming endpointer is RMS-based for now; Silero VAD and the run_remote_session flip are still to come.
https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
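The "dominant speaker (max-overlap)" step can be illustrated with a small sketch. The `SpeakerTurn` type, the `dominant_speaker` function, and the tie-breaking behaviour are assumptions for the example; the PR does not show how `join_segments` or the diarizer output are actually shaped.

```rust
use std::collections::BTreeMap;

/// Hypothetical diarizer output: one speaker turn with start/end in seconds.
#[derive(Debug, Clone)]
pub struct SpeakerTurn {
    pub speaker: u32,
    pub start: f32,
    pub end: f32,
}

/// Pick the speaker whose turns overlap the segment [seg_start, seg_end]
/// for the longest total time. Returns None when nothing overlaps.
pub fn dominant_speaker(turns: &[SpeakerTurn], seg_start: f32, seg_end: f32) -> Option<u32> {
    let mut totals: BTreeMap<u32, f32> = BTreeMap::new();
    for turn in turns {
        // Length of the intersection of the turn and the segment.
        let overlap = turn.end.min(seg_end) - turn.start.max(seg_start);
        if overlap > 0.0 {
            *totals.entry(turn.speaker).or_insert(0.0) += overlap;
        }
    }
    // On an exact tie, this sketch returns the highest speaker id, because
    // max_by keeps the last maximum and the map iterates in key order.
    totals
        .into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(speaker, _)| speaker)
}
```

Summing overlap per speaker, rather than taking the single longest turn, keeps a speaker who talks in several short bursts from losing to one slightly longer interjection.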
Flip run_remote_session onto run_streaming_loop, same as the local mic path. The only difference is the audio source (the peer's PCM, fed via feed_remote_audio into rx). The loop now ends on rx disconnect when end_remote_audio drops the inbox tx, matching the old inbox-closed exit. Mesh peers already receive confirmed-only segments (interim filtered in mesh-transcribe.ts), so remote transcripts don't churn.

With both live paths off disk, the disk-ingest writer (ingest_loop, write_f32_chunk, MAX_BACKLOG_SECONDS) is unused. It is marked allow(dead_code) and kept as the reference shard format (run_drain still reads it for crashed pre-flip sessions).

Compiles clean; clippy quiet; suite green (154).
https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
Why
The live transcription path felt ~8 s slow because it was an offline design in disguise: fixed 8 s, non-overlapping chunks decoded statelessly off disk, with nothing shown until a whole window finished. This PR rebuilds it around true streaming — interim hypotheses that refine within a few hundred ms and finalize on a pause (iPhone-style) — plus speaker attribution on the live path.
Phase 1 — landed (streaming live transcription)
- `LocalAgreement-2`, `StreamWindow`, `SilenceEndpointer`: pure logic, 14 unit tests
- `AsrToken` + `AsrSegment.tokens`; per-tier window/hop caps; Moonshine + Parakeet emit tokens
- `run_streaming_loop`: rolling window → per-hop decode → LocalAgreement → interim→final captions + diarized speaker at each endpoint; `EmittedSegment.seg_id`/`partial` (2 loop tests via fake backend + `CaptureSink`)
- `run_session` (mic) and `run_remote_session` (mesh host) now stream off the channel, with no disk shards

Verification (headless):
`cargo fmt --check`, `cargo clippy --all-targets` (no new warnings), `cargo test` 154 passed, `pnpm check` 0 errors. The disk-ingest writer is now unused (both live paths are in-memory) and marked `allow(dead_code)`; `run_upload`/`run_drain` still read the disk format.

Remaining
- `SpeakerRegistry` + the cold-start re-label via the seg-id upsert (directly addresses the "doesn't build a running profile" complaint; largely unit-testable).

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ