fix(audio): reliable Whisper transcription + long-form + model-agnostic transcribe CLI (#119) by Pushkinist · Pull Request #123 · Pushkinist/rMLX

Pushkinist · 2026-06-17T10:00:24Z

Closes #119.

Problem

POST /v1/audio/transcriptions was unreliable on real speech — mostly empty
transcripts, sometimes wrong-script garbage, only occasionally correct. Model
load + mel front-end were fine; the defect was in greedy_decode's logit
handling (no in-loop suppression, timestamp tokens treated as a hard stop, raw
logits to the sampler). There was also no long-form support (audio >30s
silently truncated), no transcribe CLI, and the referenced smoke test did not
exist.

Root cause (deeper than the issue stated)

whisper-large-v3 has 100 language slots (<|en|>=50259 … <|yue|>=50358),
shifting every special token +1 vs the v1/v2 layout the constants assumed. So
TOK_TRANSCRIBE actually pointed at <|translate|> and the >= TOK_TIMESTAMP_BEGIN
halt fired on <|notimestamps|> → immediate stop (empty/garbage). Fixed the
special-token constants to the large-v3 layout and added faithful in-loop logit
filters.
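
The off-by-one can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the `SpecialTokens` struct and `special_tokens` helper are hypothetical names, not the crate's API, and it assumes the special tokens follow the language slots in the order translate, transcribe, startoflm, startofprev, nospeech, notimestamps, then the first timestamp token.

```rust
/// Hypothetical helper type: special-token ids for a multilingual Whisper
/// vocabulary, computed from the number of language slots.
#[derive(Debug, PartialEq)]
struct SpecialTokens {
    translate: u32,
    transcribe: u32,
    startofprev: u32,
    nospeech: u32,
    notimestamps: u32,
    timestamp_begin: u32,
}

/// First language slot, <|en|>.
const TOK_EN: u32 = 50259;

/// ASSUMPTION: specials directly follow the language block in a fixed order.
fn special_tokens(n_langs: u32) -> SpecialTokens {
    // First id after the last language slot.
    let translate = TOK_EN + n_langs;
    SpecialTokens {
        translate,
        transcribe: translate + 1,
        // translate + 2 is <|startoflm|>, not needed here.
        startofprev: translate + 3,
        nospeech: translate + 4,
        notimestamps: translate + 5,
        timestamp_begin: translate + 6,
    }
}
```

With 99 language slots (the v1/v2 layout) `transcribe` comes out as 50359, which is `translate` under the 100-slot large-v3 layout, and the old `timestamp_begin` of 50364 is the new `notimestamps`. That is the collision described above.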

Changes

P0 decode correctness — DecodeFilters (SuppressBlank + SuppressTokens
plus a faithful ApplyTimestampRules port) applied every step on a host f32 logit
vector; suppress set derived generally from the tokenizer (not a hardcoded id
list); large-v3 special-token constants corrected; timestamp tokens no longer
an unconditional halt.
P1 long-form — new rmlx_audio::transcribe::Transcriber: 30s sliding
windows, timestamp-driven seek, <|startofprev|> previous-text conditioning,
real cumulative per-segment timestamps, multi-segment vtt/srt/json/txt,
trailing-silence hallucination drop. Decode bounds derived from n_text_ctx
at runtime (prompt cap n/2-1, per-window budget min(n/2, n-prefix), plus a
decode-loop offset ceiling) so the positional table can never overflow.
P2 rmlx transcribe CLI — rmlx transcribe <audio> --model <snapshot> [--tokenizer --format vtt|srt|json|txt --language --translate --output],
arch-dispatched on config.json model_type (clean seam for future ASR;
Whisper today, not hardcoded). Decodes arbitrary containers to 16kHz mono
internally (stereo downmix + linear resampler; enabled existing symphonia
isomp4+aac features for m4a/MP4-AAC). HTTP endpoint routed through the same
long-form engine — one transcription core, not two.
P3 polish — npz alignment_heads dtype parsing; verbose_json now reports
real duration + segments.
Tests / env cleanup — removed the single-purpose RMLX_TEST_MODEL_WHISPER
env var in favor of RMLX_O_MODELS_ROOT auto-discovery (skips gracefully when
absent). Added a portable say-generated determinism smoke + a local
real-recording long-form WER regression that scans a gitignored fixtures dir
(any user drops their own audio + .transcript.vtt). Model-free regression
tests prove the positional-table bound across n_text_ctx sweeps.
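
For the 16kHz mono conversion mentioned under P2, a stereo downmix followed by a linear resampler might look roughly like the sketch below. The function names are hypothetical and this is not the code from the PR; it assumes interleaved f32 stereo samples and plain two-point interpolation.

```rust
/// Average interleaved stereo frames (L, R, L, R, ...) into mono samples.
/// A trailing unpaired sample, if any, is ignored.
fn downmix_stereo(interleaved: &[f32]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|frame| 0.5 * (frame[0] + frame[1]))
        .collect()
}

/// Resample by linear interpolation between neighbouring input samples.
fn resample_linear(input: &[f32], src_hz: u32, dst_hz: u32) -> Vec<f32> {
    if input.is_empty() || src_hz == 0 || dst_hz == 0 {
        return Vec::new();
    }
    let last = input.len() - 1;
    // Output length scales with the rate ratio (rounded down).
    let out_len = (input.len() as u64 * dst_hz as u64 / src_hz as u64) as usize;
    // Distance in input samples between consecutive output samples.
    let step = src_hz as f64 / dst_hz as f64;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let lo = (pos.floor() as usize).min(last);
            let hi = (lo + 1).min(last);
            let frac = (pos - lo as f64) as f32;
            input[lo] * (1.0 - frac) + input[hi] * frac
        })
        .collect()
}
```

Linear interpolation applies no low-pass filter, so downsampling content above the target Nyquist frequency will alias. Whether that matters for ASR accuracy is not something this sketch measures.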

Proof on real audio

Full 48-minute meeting recording, whisper-large-v3-mlx, temp=0:

Normalized WER = 0.0792 (684 word-edits / 8636 ref words) — gate ≤ 0.30.
Full coverage (last cue 48:20.858), run completes cleanly, no positional error.
Deterministic: two runs byte-identical (diff -q clean).
Decode ≈ 236–265s for 2900s audio → RTF ≈ 0.08–0.09 (~11× realtime).
say-clip smoke: WER 0.000, deterministic.
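
For readers who want to check the figures above, word error rate is word-level edit distance divided by the reference word count, and RTF is decode time divided by audio duration. The sketch below is an assumption-laden illustration, not the test code from this PR: the function names are hypothetical and text normalization (casing, punctuation) is left out entirely.

```rust
/// Word-level Levenshtein distance: the minimum number of substitutions,
/// insertions and deletions that turn `hypothesis` into `reference`.
fn word_edits(reference: &[&str], hypothesis: &[&str]) -> usize {
    // prev[j] = distance between the first i reference words and the
    // first j hypothesis words.
    let mut prev: Vec<usize> = (0..=hypothesis.len()).collect();
    for (i, r) in reference.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, h) in hypothesis.iter().enumerate() {
            let sub = prev[j] + usize::from(r != h);
            let del = prev[j + 1] + 1;
            let ins = cur[j] + 1;
            cur.push(sub.min(del).min(ins));
        }
        prev = cur;
    }
    prev[hypothesis.len()]
}

/// WER over whitespace-separated words (no normalization applied here).
fn wer(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        return if h.is_empty() { 0.0 } else { f64::INFINITY };
    }
    word_edits(&r, &h) as f64 / r.len() as f64
}

/// Real-time factor: values below 1.0 mean faster than realtime.
fn rtf(decode_secs: f64, audio_secs: f64) -> f64 {
    decode_secs / audio_secs
}
```

Plugging in the reported numbers, 684 / 8636 rounds to 0.0792, and 236 s to 265 s of decode over 2900 s of audio gives an RTF between roughly 0.081 and 0.091.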

make lint (-D warnings) and make test green. Reviewed by the rust-reviewer
agent; the one HIGH (positional overflow) and nits are fixed in the second commit.

🤖 Generated with Claude Code

…nscribe CLI (#119)

P0: decode correctness. The greedy decode loop applied no in-loop logit suppression, treated any timestamp token as a hard stop, and used v1/v2 special token ids. large-v3 has 100 language slots (<|en|>=50259 … <|yue|>=50358), which shifts every special up by one, so TOK_TRANSCRIBE pointed at <|translate|> and the timestamp-begin check fired on <|notimestamps|>, yielding empty/garbage transcripts. Fixes:
- Correct the special-token constants to the large-v3 layout (translate=50359, transcribe=50360, startofprev=50362, nospeech=50363, notimestamps=50364, timestamp_begin=50365); fix the language-detect range to 100 langs.
- Add DecodeFilters: SuppressBlank (first step), SuppressTokens (non-speech set derived generally from the tokenizer, not a hardcoded id list), and a faithful port of openai-whisper ApplyTimestampRules (pairing, monotonic, BOS, tie-break) applied at every decode step on a host f32 logit vector.

P1: long-form. New rmlx_audio::transcribe engine: sliding 30 s windows in timestamp mode with timestamp-driven seek, previous-text conditioning (<|startofprev|>), real per-segment times, and multi-segment vtt/srt/json/txt. Drops filler hallucinated in the zero-pad tail. The HTTP endpoint and the CLI share this one engine. verbose_json now reports real duration + segments.

P2: rmlx transcribe CLI. Arch-dispatched on config.json (whisper today, clean seam for future ASR). Decodes any container to 16 kHz mono internally (enabled symphonia isomp4+aac features so .m4a works; downmix + linear resample).

P3: npz alignment_heads dtype parsing (b1/u1/i1→U8, u4/i4→U32/I32); verbose_json duration.

Tests: replace the RMLX_TEST_MODEL_WHISPER env knob with RMLX_O_MODELS_ROOT auto-discovery; add DecodeFilters + npz-dtype unit tests and a real-model integration suite (say-clip determinism + full-file long-form WER) that scans a gitignored fixtures dir.

Full 48-min real meeting recording: normalized WER ≈ 0.079, 473 segments, RTF ≈ 0.09, deterministic at temp=0.

Docs: SERVER.md (long-form endpoint), CLI.md (rmlx transcribe), MODELS.md (special-token layout + RMLX_O_MODELS_ROOT auto-discovery), TESTING.md.
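
To make the in-loop suppression from the first commit concrete, here is a deliberately simplified sketch of per-step logit masking on a host f32 vector. It covers only the SuppressBlank and SuppressTokens ideas. The ApplyTimestampRules port is not reproduced, and the struct layout and field names are assumptions rather than the crate's actual DecodeFilters.

```rust
/// Simplified stand-in for the decode filters (hypothetical layout).
struct DecodeFilters {
    /// Token ids that are never allowed (e.g. a non-speech set).
    suppress: Vec<usize>,
    /// Token ids masked only on the first sampled step (blank/EOT style).
    blank: Vec<usize>,
}

impl DecodeFilters {
    /// Mask disallowed tokens in place before sampling at `step`.
    fn apply(&self, logits: &mut [f32], step: usize) {
        if step == 0 {
            for &id in &self.blank {
                mask(logits, id);
            }
        }
        for &id in &self.suppress {
            mask(logits, id);
        }
    }
}

/// Setting a logit to -inf gives the token zero probability after softmax.
/// Out-of-range ids are ignored.
fn mask(logits: &mut [f32], id: usize) {
    if let Some(l) = logits.get_mut(id) {
        *l = f32::NEG_INFINITY;
    }
}

/// Greedy pick over the filtered logits; returns None for an empty slice.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, l)| !l.is_nan())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

The point of filtering before the argmax is that a timestamp or special token winning on raw logits no longer ends decoding by accident. The filters decide what is legal at each step, and the sampler only sees what remains.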

…ixes)

Derive the previous-text prompt cap and per-window generation budget from n_text_ctx at runtime instead of fixed literals, so the decoder position can never reach n_text_ctx and overrun the positional-embedding table. Previously a full previous-text prompt (capped at n_text_ctx/2) plus a fixed 224-token budget could push the decoder offset to 452 > 448, making the positional slice error and abort the whole transcription.
- previous_text_cap = n_text_ctx/2 - 1 (matches openai-whisper).
- window_token_budget = min(n_text_ctx/2, n_text_ctx - prefix_len) so prefix_len + generated <= n_text_ctx always holds.
- greedy_decode gains a belt-and-suspenders guard: never request a positional row >= n_text_ctx.
- Model-free regression tests assert the worst-case (max prompt + max budget) decoder position stays in bounds across realistic n_text_ctx values.

Also: detect_language returns crate::tokenizer::TOK_EN instead of the bare 50259 literal (doc prose de-hardcoded); Transcriber gets a minimal Debug impl (config dims + suppress-set sizes) replacing the reason-less allow; drop a gratuitous per-window resolved_lang clone.
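
The bound can be checked without a model. The sketch below restates the two formulas from this commit and computes the worst-case decoder position. The count of four fixed prompt tokens (<|startofprev|>, <|startoftranscript|>, language, task) is an assumption, chosen because it reproduces the 452 figure quoted above. The function names mirror the commit text but are not taken from the crate.

```rust
/// Maximum number of previous-text tokens kept in the prompt.
fn previous_text_cap(n_text_ctx: usize) -> usize {
    (n_text_ctx / 2).saturating_sub(1)
}

/// Tokens that may be generated in one window, given the prefix length.
fn window_token_budget(n_text_ctx: usize, prefix_len: usize) -> usize {
    (n_text_ctx / 2).min(n_text_ctx.saturating_sub(prefix_len))
}

/// ASSUMPTION: four fixed tokens surround the previous text in the prompt.
const FIXED_PREFIX_TOKENS: usize = 4;

/// Largest decoder position reachable with a full prompt and a full budget.
fn worst_case_position(n_text_ctx: usize) -> usize {
    let prefix_len = previous_text_cap(n_text_ctx) + FIXED_PREFIX_TOKENS;
    prefix_len + window_token_budget(n_text_ctx, prefix_len)
}

/// The pre-fix behaviour: prompt capped at n/2 plus a fixed 224-token budget.
fn old_worst_case_position(n_text_ctx: usize) -> usize {
    n_text_ctx / 2 + FIXED_PREFIX_TOKENS + 224
}
```

For n_text_ctx = 448 the cap is 223, the prefix is 227 tokens, and the budget shrinks to min(224, 221) = 221, so the total lands exactly on 448. The old computation gives 224 + 4 + 224 = 452.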

Pushkinist added 2 commits June 17, 2026 16:11

Pushkinist merged commit 004df6d into main Jun 17, 2026
2 checks passed

Pushkinist deleted the fix/119-whisper-transcription branch June 17, 2026 10:06

Pushkinist merged 2 commits into main from fix/119-whisper-transcription
