Real-time transcription + diarization overhaul (Phase 1: streaming captions)#212

Open
mrjeeves wants to merge 6 commits into main from claude/dreamy-cray-UMaAC

@mrjeeves mrjeeves commented May 29, 2026

Why

The live transcription path felt ~8 s slow because it was an offline design in disguise: fixed 8 s, non-overlapping chunks decoded statelessly off disk, with nothing shown until a whole window finished. This PR rebuilds it around true streaming — interim hypotheses that refine within a few hundred ms and finalize on a pause (iPhone-style) — plus speaker attribution on the live path.

Phase 1 — landed (streaming live transcription)

| Area | What |
| --- | --- |
| Streaming core | LocalAgreement-2, StreamWindow, SilenceEndpointer (pure logic, 14 unit tests) |
| Tokens / caps | shared AsrToken + AsrSegment.tokens; per-tier window/hop caps; Moonshine + Parakeet emit tokens |
| Streaming loop | run_streaming_loop: rolling window → per-hop decode → LocalAgreement → interim→final captions + diarized speaker at each endpoint; EmittedSegment.seg_id/partial (2 loop tests via fake backend + CaptureSink) |
| Live paths flipped | both run_session (mic) and run_remote_session (mesh host) now stream off the channel; no disk shards |
| UI | interim caption rendered tentatively (dimmed/italic) then finalized in place; mesh forwards confirmed-only |

Verification (headless): cargo fmt --check, cargo clippy --all-targets (no new warnings), cargo test 154 passed, pnpm check 0 errors. The disk-ingest writer is now unused (both live paths are in-memory) and marked allow(dead_code); run_upload/run_drain still read the disk format.

⚠️ Needs real-mic validation — the audio/GUI/ONNX path can't run in this headless environment. Everything is compile/lint/unit-verified, but the live behavior (latency feel, caption refinement, speaker labels) needs a human at a mic. Build + record to confirm.

Remaining

  • Silero VAD endpointing — upgrade over the current RMS-based endpointer. Needs a new model artifact + ONNX wrapper; best done where it can be run/tuned. (The RMS endpointer is a working stand-in until then.)
  • Phase 2 — accuracy: window shrink, beam-on-final for Moonshine, VAD-trimmed spans, optional cache-aware Parakeet export.
  • Phase 3 — running speaker profile: EMA centroids + persistent cross-session SpeakerRegistry + the cold-start re-label via the seg-id upsert (directly addresses the "doesn't build a running profile" complaint; largely unit-testable).

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ

claude added 3 commits May 29, 2026 15:45
Phase 1 foundation for the low-latency live-transcription overhaul.
The live mic path will decode overlapping windows of audio; this turns
that stream of whole-window hypotheses into a stable confirmed prefix
(final captions) plus a refining interim tail (live captions).

New pure-logic module `asr::streaming` (no cpal/ORT/clock deps, so it
unit-tests like `diarize::cluster`):

  - LocalAgreement: confirms a token once two consecutive hypotheses
    agree on it (LocalAgreement-2); exposes the interim tail and a
    finalize-on-endpoint that resets for the next utterance.
  - StreamWindow: rolling 16 kHz context buffer with a context cap and
    confirmed-boundary trimming that keeps a short lookback.
  - SilenceEndpointer: trailing-silence endpointer reusing the RMS
    notion of the existing gate, so short utterances finalize on a
    pause without yet pulling in a neural VAD.
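
To make the LocalAgreement-2 rule above concrete, here is a minimal sketch, not the PR's actual code. The `Token` and `LocalAgreement2` types and their method names are hypothetical stand-ins, and the sketch assumes each hypothesis covers the whole current utterance and that an already-confirmed prefix is never revisited.

```rust
/// Hypothetical token type for illustration only (the PR's real type differs).
#[derive(Clone, Debug, PartialEq)]
pub struct Token {
    pub text: String,
    pub end_s: f32,
}

/// LocalAgreement-2 sketch: a token is confirmed once two consecutive
/// hypotheses agree on it at the same position.
#[derive(Default)]
pub struct LocalAgreement2 {
    confirmed: Vec<Token>,
    prev: Vec<Token>, // the previous full hypothesis
}

impl LocalAgreement2 {
    /// Feed the latest whole-utterance hypothesis; returns newly confirmed tokens.
    pub fn push(&mut self, hyp: Vec<Token>) -> Vec<Token> {
        let start = self.confirmed.len();
        let mut n = start;
        // Extend the agreed prefix while this and the previous hypothesis match.
        while n < hyp.len() && n < self.prev.len() && hyp[n].text == self.prev[n].text {
            n += 1;
        }
        let newly: Vec<Token> = hyp.get(start..n).map(|s| s.to_vec()).unwrap_or_default();
        self.confirmed.extend(newly.iter().cloned());
        self.prev = hyp;
        newly
    }

    /// The unconfirmed tail of the latest hypothesis (the interim caption).
    pub fn interim(&self) -> &[Token] {
        self.prev.get(self.confirmed.len()..).unwrap_or(&[])
    }

    /// On an endpoint: emit confirmed + interim tail, then reset for the next utterance.
    pub fn finalize(&mut self) -> Vec<Token> {
        let mut out = std::mem::take(&mut self.confirmed);
        if let Some(tail) = self.prev.get(out.len()..) {
            out.extend_from_slice(tail);
        }
        self.prev.clear();
        out
    }
}
```

Feeding "hello word" and then "hello world again" confirms only "hello"; "world again" stays interim until a later hypothesis agrees with it or the endpoint finalizes the utterance.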

14 unit tests; fmt + clippy clean. The module is consumed by the worker
in a follow-up that switches run_session off the 8 s disk-shard loop;
until then it carries a crate-style allow(dead_code), matching the
existing not-yet-wired AsrCaps streaming fields.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
Phase 1 scaffolding the rolling-window loop will consume next.

- Promote streaming.rs's StreamToken to a shared `AsrToken` on the
  trait (text + chunk-relative end time) and add `AsrSegment.tokens`
  (serde-default empty, so the disk-shard path and existing JSON
  consumers are unaffected). The streaming module now feeds
  `AsrSegment.tokens` straight into LocalAgreement with no conversion.
- Add live-path geometry to AsrCaps: window_seconds / hop_seconds /
  max_context_seconds. Moonshine 4.0/0.5/8.0 (overlap + LocalAgreement
  lets the old 8 s disk window shrink); Parakeet 2.0/0.3/6.0.
- Populate tokens: Moonshine via AsrToken::words_uniform (no per-token
  timing, so end times are distributed); Parakeet word-level with real
  frame times via pieces_to_tokens, with the no-timestamps export
  falling back to words_uniform.
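
As a rough illustration of the uniform-timing fallback described above, the sketch below spreads word end times evenly across a chunk. The struct fields and the free-function signature are assumptions made for the example; the PR's `AsrToken::words_uniform` may take different arguments.

```rust
/// Assumed shape: text plus a chunk-relative end time in seconds.
#[derive(Clone, Debug, PartialEq)]
pub struct AsrToken {
    pub text: String,
    pub end_s: f32,
}

/// Split `text` on whitespace and distribute end times evenly over
/// `duration_s`, for backends that expose no per-token timing.
pub fn words_uniform(text: &str, duration_s: f32) -> Vec<AsrToken> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let n = words.len();
    words
        .into_iter()
        .enumerate()
        .map(|(i, w)| AsrToken {
            text: w.to_string(),
            // Word i (0-based) ends at fraction (i + 1) / n of the chunk.
            end_s: duration_s * (i as f32 + 1.0) / n as f32,
        })
        .collect()
}
```

With four words over a 2 s chunk, the end times come out as 0.5, 1.0, 1.5, and 2.0 s. These are estimates, so anything that trims audio at a token boundary should probably keep some lookback.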

Additive only — the live loop that reads these lands next. fmt +
clippy clean; full test suite green (152 passed).

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
Phase 1 — the rolling-window loop the streaming core feeds.

`run_streaming_loop` is the live-path counterpart to ingest_loop + the
disk-shard poll loop in run_session: it keeps a rolling StreamWindow in
RAM, re-decodes the current utterance every caps.hop_seconds, runs
LocalAgreement-2 over the hypotheses, and emits an interim caption
(partial=true) that refines hop-to-hop, finalizing (partial=false) on a
speech pause detected by SilenceEndpointer. One seg_id per utterance lets
the UI replace the live line in place; a forced cut bounds a never-
pausing monologue at the context cap, and a drain-time finalize avoids
losing in-flight text on Stop.
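
The pause detection that triggers finalization could look roughly like the sketch below: a trailing-silence counter over per-frame RMS. The constructor parameters, threshold semantics, and method names are assumptions for illustration, not the PR's `SilenceEndpointer` API.

```rust
/// Sketch of an RMS-based trailing-silence endpointer (hypothetical API).
pub struct SilenceEndpointer {
    rms_threshold: f32,
    min_silence_samples: usize,
    trailing_silence: usize,
    heard_speech: bool,
}

impl SilenceEndpointer {
    pub fn new(sample_rate_hz: u32, rms_threshold: f32, min_silence_s: f32) -> Self {
        Self {
            rms_threshold,
            min_silence_samples: (min_silence_s * sample_rate_hz as f32) as usize,
            trailing_silence: 0,
            heard_speech: false,
        }
    }

    /// Feed one frame of samples; returns true when an utterance endpoint fires.
    pub fn push(&mut self, frame: &[f32]) -> bool {
        if frame.is_empty() {
            return false;
        }
        let rms = (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt();
        if rms >= self.rms_threshold {
            // Speech: remember it and restart the silence run.
            self.heard_speech = true;
            self.trailing_silence = 0;
            return false;
        }
        if !self.heard_speech {
            // Leading silence never produces an endpoint.
            return false;
        }
        self.trailing_silence += frame.len();
        if self.trailing_silence >= self.min_silence_samples {
            // Endpoint reached: reset for the next utterance.
            self.heard_speech = false;
            self.trailing_silence = 0;
            true
        } else {
            false
        }
    }
}
```

Requiring speech before counting silence keeps an all-silence run from emitting empty final segments, which matches the all-silence test case mentioned in this commit.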

Protocol: EmittedSegment gains seg_id + partial (serde-default/skip, so
the disk-shard path and existing JSON consumers are unaffected).

The loop is decoupled from cpal — it drains the sample channel and writes
through FrameSink — so it's unit-tested here with a scripted fake backend
+ CaptureSink: a refine-then-finalize run and an all-silence run. fmt +
clippy clean; suite green (154 passed).

Not yet wired into run_session (that behavioural switch needs real-mic
validation) and speaker attribution over this path is a follow-up — hence
allow(dead_code) for now.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
@mrjeeves mrjeeves force-pushed the claude/dreamy-cray-UMaAC branch from 339128d to 7ed89e3 Compare May 29, 2026 15:46
claude added 3 commits May 29, 2026 15:53
Frontend half of the streaming live path.

EmittedSegment gains seg_id + partial (mirroring the Rust protocol). The
state store routes a partial segment to a single transient
`interimSegment` — rendered tentatively (dimmed/italic) beneath the
confirmed transcript and never persisted — while final segments append
to liveSegments exactly as the disk-shard path always has. When an
utterance finalizes (a non-partial segment with the same seg_id), the
interim line clears and the text lands as a normal turn. liveDelta (the
Talking Points consumer) now projects confirmed text only, so it doesn't
churn on interim refinements.

Mesh: forward confirmed segments only — interim captions stay a local
affordance so remote transcripts don't churn (interim-over-mesh later).

pnpm check clean (0 errors). The disk-shard path is unchanged (no
seg_id/partial → append + persist as before). Live rendering needs
real-app/mic validation once run_session is switched onto the loop.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
Flip run_session off the 8 s disk-shard pipeline onto run_streaming_loop:
audio now streams straight off the cpal channel through the rolling
window + per-hop decode + LocalAgreement, emitting interim → final
captions instead of waiting ~8 s for a whole chunk to land. No disk
shards on the live path (a crashed live session is no longer drain-
recoverable — fine for a live feature); a late cpal capture error is
surfaced at teardown.
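
For readers unfamiliar with the rolling window mentioned here, the following is a minimal sketch of a capped context buffer that trims at the confirmed boundary while keeping a short lookback. All names and parameters are hypothetical; the PR's `StreamWindow` may be structured differently.

```rust
/// Sketch of a rolling audio context buffer (hypothetical API).
pub struct StreamWindow {
    samples: Vec<f32>,
    max_samples: usize,
    lookback_samples: usize,
}

impl StreamWindow {
    pub fn new(sample_rate_hz: u32, max_context_s: f32, lookback_s: f32) -> Self {
        Self {
            samples: Vec::new(),
            max_samples: (max_context_s * sample_rate_hz as f32) as usize,
            lookback_samples: (lookback_s * sample_rate_hz as f32) as usize,
        }
    }

    /// Append new audio, dropping the oldest samples beyond the context cap.
    pub fn push(&mut self, chunk: &[f32]) {
        self.samples.extend_from_slice(chunk);
        if self.samples.len() > self.max_samples {
            let excess = self.samples.len() - self.max_samples;
            self.samples.drain(..excess);
        }
    }

    /// Drop audio before the confirmed boundary (a sample index relative to
    /// the window start), keeping `lookback_samples` of context before it.
    pub fn trim_to_confirmed(&mut self, confirmed_end_sample: usize) {
        let keep_from = confirmed_end_sample
            .saturating_sub(self.lookback_samples)
            .min(self.samples.len());
        self.samples.drain(..keep_from);
    }

    /// The audio the next per-hop decode would see.
    pub fn samples(&self) -> &[f32] {
        &self.samples
    }
}
```

The lookback means the decoder re-hears a little confirmed audio on each hop, which tends to give more stable hypotheses at the trimmed edge than cutting exactly at the boundary.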

run_streaming_loop now takes the diarizer: at each endpoint it runs the
finalized utterance audio through diarize and attaches the dominant
speaker (max-overlap, mirroring join_segments) to the final segment.
Interim captions carry no speaker — it settles when the line finalizes.
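
The max-overlap attribution described above can be sketched as follows. `SpeakerTurn` and `dominant_speaker` are illustrative names, assuming the diarizer yields speaker-labelled time spans for the utterance; this is not the code of `join_segments`.

```rust
use std::collections::BTreeMap;

/// Hypothetical diarizer output: one speaker-labelled time span.
pub struct SpeakerTurn {
    pub speaker: u32,
    pub start_s: f32,
    pub end_s: f32,
}

/// Return the speaker whose turns overlap the segment the most, or None
/// when no turn overlaps it at all.
pub fn dominant_speaker(turns: &[SpeakerTurn], seg_start: f32, seg_end: f32) -> Option<u32> {
    let mut overlap: BTreeMap<u32, f32> = BTreeMap::new();
    for t in turns {
        // Length of the intersection of the turn and the segment.
        let o = t.end_s.min(seg_end) - t.start_s.max(seg_start);
        if o > 0.0 {
            *overlap.entry(t.speaker).or_insert(0.0) += o;
        }
    }
    overlap
        .into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(speaker, _)| speaker)
}
```

Summing overlap per speaker, rather than picking the single longest turn, lets a speaker with several short turns win over one longer interjection.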

Compiles clean; clippy quiet; suite green (154). run_upload / run_drain
stay on the disk path. Needs real-mic validation (can't run audio
headless). The streaming endpointer is RMS-based for now; Silero VAD and
the run_remote_session flip are still to come.

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ
Flip run_remote_session onto run_streaming_loop, same as the local mic
path — the only difference is the audio source (the peer's PCM, fed via
feed_remote_audio into rx). The loop now ends on rx disconnect when
end_remote_audio drops the inbox tx, matching the old inbox-closed exit.
Mesh peers already receive confirmed-only segments (interim filtered in
mesh-transcribe.ts), so remote transcripts don't churn.

With both live paths off disk, the disk-ingest writer (ingest_loop,
write_f32_chunk, MAX_BACKLOG_SECONDS) is unused — marked allow(dead_code)
and kept as the reference shard format (run_drain still reads it for
crashed pre-flip sessions).

Compiles clean; clippy quiet; suite green (154).

https://claude.ai/code/session_01UTJ3ZYSrbrunWyZdqzDZfQ