fix(audio): reliable Whisper transcription + long-form + model-agnostic transcribe CLI (#119)#123

Merged
Pushkinist merged 2 commits into main from fix/119-whisper-transcription
Jun 17, 2026

@Pushkinist

Owner

Closes #119.

Problem

POST /v1/audio/transcriptions was unreliable on real speech — mostly empty
transcripts, sometimes wrong-script garbage, only occasionally correct. Model
load + mel front-end were fine; the defect was in greedy_decode's logit
handling (no in-loop suppression, timestamp tokens treated as a hard stop, raw
logits to the sampler). There was also no long-form support (audio >30s
silently truncated), no transcribe CLI, and the referenced smoke test did not
exist.

Root cause (deeper than the issue stated)

whisper-large-v3 has 100 language slots (<|en|>=50259 … <|yue|>=50358),
shifting every special token +1 vs the v1/v2 layout the constants assumed. So
TOK_TRANSCRIBE actually pointed at <|translate|> and the >= TOK_TIMESTAMP_BEGIN
halt fired on <|notimestamps|> → immediate stop (empty/garbage). Fixed the
special-token constants to the large-v3 layout and added faithful in-loop logit
filters.
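The off-by-one can be reproduced with a little arithmetic. The sketch below assumes the special tokens follow the language block in the usual Whisper order (translate, transcribe, startoflm, startofprev, nospeech, notimestamps, first timestamp) and that the v1/v2 layout has 99 language slots; `SpecialIds` and `special_ids` are illustrative names, not items from this crate.

```rust
/// Id of <|en|>, the first language token in the multilingual vocabulary.
pub const TOK_EN: u32 = 50259;

/// Special-token ids that sit directly after the language block.
/// Hypothetical struct for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SpecialIds {
    pub translate: u32,
    pub transcribe: u32,
    pub startofprev: u32,
    pub nospeech: u32,
    pub notimestamps: u32,
    pub timestamp_begin: u32,
}

/// Derive the ids from the number of language slots.
/// Assumption: the order after the languages is translate, transcribe,
/// startoflm, startofprev, nospeech, notimestamps, then timestamps.
pub fn special_ids(n_langs: u32) -> SpecialIds {
    let base = TOK_EN + n_langs; // first id after the language block
    SpecialIds {
        translate: base,
        transcribe: base + 1,
        // base + 2 is <|startoflm|>, unused here
        startofprev: base + 3,
        nospeech: base + 4,
        notimestamps: base + 5,
        timestamp_begin: base + 6,
    }
}
```

With 99 slots the transcribe id lands on what large-v3 (100 slots) calls `<|translate|>`, and the old timestamp threshold lands on `<|notimestamps|>`, which is the failure described above.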

Changes

  • P0 decode correctness: DecodeFilters (SuppressBlank + SuppressTokens +
    faithful ApplyTimestampRules) applied every step on a host f32 logit
    vector; suppress set derived generally from the tokenizer (not a hardcoded id
    list); large-v3 special-token constants corrected; timestamp tokens no longer
    an unconditional halt.
  • P1 long-form — new rmlx_audio::transcribe::Transcriber: 30s sliding
    windows, timestamp-driven seek, <|startofprev|> previous-text conditioning,
    real cumulative per-segment timestamps, multi-segment vtt/srt/json/txt,
    trailing-silence hallucination drop. Decode bounds derived from n_text_ctx
    at runtime (prompt cap n/2-1, per-window budget min(n/2, n-prefix), plus a
    decode-loop offset ceiling) so the positional table can never overflow.
  • P2 rmlx transcribe CLI: rmlx transcribe <audio> --model <snapshot> [--tokenizer --format vtt|srt|json|txt --language --translate --output],
    arch-dispatched on config.json model_type (clean seam for future ASR;
    Whisper today, not hardcoded). Decodes arbitrary containers to 16kHz mono
    internally (stereo downmix + linear resampler; enabled existing symphonia
    isomp4+aac features for m4a/MP4-AAC). HTTP endpoint routed through the same
    long-form engine — one transcription core, not two.
  • P3 polish — npz alignment_heads dtype parsing; verbose_json now reports
    real duration + segments.
  • Tests / env cleanup — removed the single-purpose RMLX_TEST_MODEL_WHISPER
    env var in favor of RMLX_O_MODELS_ROOT auto-discovery (skips gracefully when
    absent). Added a portable say-generated determinism smoke + a local
    real-recording long-form WER regression that scans a gitignored fixtures dir
    (any user drops their own audio + .transcript.vtt). Model-free regression
    tests prove the positional-table bound across n_text_ctx sweeps.
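As a rough illustration of the architecture seam in the P2 item, dispatch on `model_type` could look like the sketch below. It assumes the string has already been read from config.json and that Whisper snapshots report `"whisper"`; `AsrArch`, `UnsupportedArch`, and `dispatch_arch` are hypothetical names, not the CLI's real types.

```rust
/// ASR architectures the CLI knows how to run (Whisper only, per this PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AsrArch {
    Whisper,
}

/// Error carrying the unrecognised model_type string.
#[derive(Debug, PartialEq, Eq)]
pub struct UnsupportedArch(pub String);

/// Map a config.json `model_type` value to an architecture.
/// Assumption: Whisper snapshots use the value "whisper".
pub fn dispatch_arch(model_type: &str) -> Result<AsrArch, UnsupportedArch> {
    match model_type.trim().to_ascii_lowercase().as_str() {
        "whisper" => Ok(AsrArch::Whisper),
        other => Err(UnsupportedArch(other.to_string())),
    }
}
```

Adding a future architecture would then mean one new enum variant and one new match arm, with everything else left alone.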

Proof on real audio

Full 48-minute meeting recording, whisper-large-v3-mlx, temp=0:

  • Normalized WER = 0.0792 (684 word-edits / 8636 ref words) — gate ≤ 0.30.
  • Full coverage (last cue 48:20.858), run completes cleanly, no positional error.
  • Deterministic: two runs byte-identical (diff -q clean).
  • Decode ≈ 236–265s for 2900s audio → RTF ≈ 0.08–0.09 (~11× realtime).
  • say-clip smoke: WER 0.000, deterministic.
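For readers who want to check the figures, here is a minimal sketch of word-level WER (edit distance over whitespace-separated words, divided by the reference word count) and of the real-time factor. The PR does not spell out its text normalization, so none is applied here; `word_edits`, `wer`, and `rtf` are illustrative names rather than the test suite's functions.

```rust
/// Word-level Levenshtein distance (substitutions, insertions, deletions).
pub fn word_edits(reference: &[&str], hypothesis: &[&str]) -> usize {
    // prev[j] = edits needed to turn the first i reference words
    // into the first j hypothesis words.
    let mut prev: Vec<usize> = (0..=hypothesis.len()).collect();
    for (i, r) in reference.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, h) in hypothesis.iter().enumerate() {
            let sub = prev[j] + usize::from(r != h);
            let del = prev[j + 1] + 1;
            let ins = cur[j] + 1;
            cur.push(sub.min(del).min(ins));
        }
        prev = cur;
    }
    prev[hypothesis.len()]
}

/// WER = word edits / reference word count; None for an empty reference.
pub fn wer(reference: &str, hypothesis: &str) -> Option<f64> {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        return None;
    }
    Some(word_edits(&r, &h) as f64 / r.len() as f64)
}

/// Real-time factor: decode time divided by audio duration.
pub fn rtf(decode_secs: f64, audio_secs: f64) -> f64 {
    decode_secs / audio_secs
}
```

Plugging in the numbers above, 684 / 8636 ≈ 0.0792, and 236 s and 265 s over 2900 s of audio give RTF ≈ 0.081 and ≈ 0.091.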

make lint (-D warnings) and make test green. Reviewed by the rust-reviewer
agent; the one HIGH (positional overflow) and nits are fixed in the second commit.

🤖 Generated with Claude Code

…nscribe CLI (#119)

P0 — decode correctness. The greedy decode loop applied no in-loop logit
suppression, treated any timestamp token as a hard stop, and used v1/v2 special
token ids. large-v3 has 100 language slots (<|en|>=50259 … <|yue|>=50358), which
shifts every special up by one — so TOK_TRANSCRIBE pointed at <|translate|> and
the timestamp-begin check fired on <|notimestamps|>, yielding empty/garbage
transcripts. Fixes:
- Correct the special-token constants to the large-v3 layout (translate=50359,
  transcribe=50360, startofprev=50362, nospeech=50363, notimestamps=50364,
  timestamp_begin=50365); fix the language-detect range to 100 langs.
- Add DecodeFilters: SuppressBlank (first step), SuppressTokens (non-speech set
  derived generally from the tokenizer, not a hardcoded id list), and a faithful
  port of openai-whisper ApplyTimestampRules (pairing, monotonic, BOS, tie-break)
  applied at every decode step on a host f32 logit vector.
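A minimal sketch of the two simpler filters, operating on a host `f32` logit slice before a greedy argmax. This is not the crate's DecodeFilters API: the function names and signatures are illustrative, and the timestamp rules are left out for brevity.

```rust
/// Mask every id in `suppress` so that argmax can never select it.
/// Ids outside the vocabulary are ignored.
pub fn suppress_tokens(logits: &mut [f32], suppress: &[u32]) {
    for &id in suppress {
        if let Some(l) = logits.get_mut(id as usize) {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// On the first sampled step only, forbid blank tokens and end-of-text,
/// so decoding cannot start with whitespace or stop immediately.
pub fn suppress_blank(logits: &mut [f32], step: usize, blank_ids: &[u32], eot: u32) {
    if step == 0 {
        suppress_tokens(logits, blank_ids);
        suppress_tokens(logits, &[eot]);
    }
}

/// Greedy pick: index of the largest non-NaN logit (first one wins ties).
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        let better = match best {
            Some((_, b)) => v > b,
            None => true,
        };
        if better {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i)
}
```

Applying the masks inside the loop, before each argmax, is what stops a suppressed token from ever being chosen. Filtering the finished sequence afterwards would not have the same effect.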

P1 — long-form. New rmlx_audio::transcribe engine: sliding 30 s windows in
timestamp mode with timestamp-driven seek, previous-text conditioning
(<|startofprev|>), real per-segment times, and multi-segment vtt/srt/json/txt.
Drops filler hallucinated in the zero-pad tail. The HTTP endpoint and the CLI
share this one engine. verbose_json now reports real duration + segments.
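To show what "real per-segment times" means in practice, the sketch below adds a window's start offset to a window-relative segment time and renders a WebVTT cue. `vtt_timestamp` and `vtt_cue` are hypothetical helpers, and the rounding choice (nearest millisecond) is an assumption.

```rust
/// Format seconds as a WebVTT timestamp, hh:mm:ss.mmm.
pub fn vtt_timestamp(seconds: f64) -> String {
    // Clamp negatives to zero and round to the nearest millisecond.
    let total_ms = (seconds.max(0.0) * 1000.0).round() as u64;
    let ms = total_ms % 1000;
    let s = (total_ms / 1000) % 60;
    let m = (total_ms / 60_000) % 60;
    let h = total_ms / 3_600_000;
    format!("{h:02}:{m:02}:{s:02}.{ms:03}")
}

/// Render one cue. `rel_start` and `rel_end` are relative to the current
/// 30 s window; `window_start` is that window's offset in the whole file.
pub fn vtt_cue(window_start: f64, rel_start: f64, rel_end: f64, text: &str) -> String {
    format!(
        "{} --> {}\n{}\n",
        vtt_timestamp(window_start + rel_start),
        vtt_timestamp(window_start + rel_end),
        text.trim()
    )
}
```

For example, a segment from 1.5 s to 4.0 s in a window that starts at 30 s becomes a cue running from 00:00:31.500 to 00:00:34.000.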

P2 — rmlx transcribe CLI. Arch-dispatched on config.json (whisper today, clean
seam for future ASR). Decodes any container to 16 kHz mono internally (enabled
symphonia isomp4+aac features so .m4a works; downmix + linear resample).
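The downmix and linear resample steps can be sketched as follows, assuming interleaved `f32` samples as input. The function names are illustrative and container decoding (done with symphonia in the PR) is out of scope here. Plain linear interpolation applies no low-pass filter, so some aliasing is expected when downsampling.

```rust
/// Average interleaved channels into mono. A trailing partial frame is dropped.
pub fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    if channels == 0 {
        return Vec::new();
    }
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample by linear interpolation between neighbouring input samples.
pub fn resample_linear(input: &[f32], src_rate: u32, dst_rate: u32) -> Vec<f32> {
    if input.is_empty() || src_rate == 0 || dst_rate == 0 {
        return Vec::new();
    }
    let out_len = (input.len() as u64 * dst_rate as u64 / src_rate as u64) as usize;
    let step = src_rate as f64 / dst_rate as f64; // input samples per output sample
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let i0 = (pos.floor() as usize).min(last);
            let i1 = (i0 + 1).min(last); // clamp at the final sample
            let frac = (pos - i0 as f64) as f32;
            input[i0] * (1.0 - frac) + input[i1] * frac
        })
        .collect()
}
```

A stereo 48 kHz stream would go through `downmix_to_mono(&samples, 2)` and then `resample_linear(&mono, 48_000, 16_000)` to produce the 16 kHz mono signal the mel front-end expects.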

P3 — npz alignment_heads dtype parsing (b1/u1/i1→U8, u4/i4→U32/I32);
verbose_json duration.
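The dtype mapping can be pictured as a small lookup on the NPY header's `descr` string. This sketch assumes descriptors such as `|b1` or `<u4` and accepts only little-endian or byte-order-free prefixes; `Dtype` and `parse_npy_descr` are illustrative names, not the crate's.

```rust
/// Target element types named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dtype {
    U8,
    U32,
    I32,
}

/// Map an NPY `descr` such as "|b1" or "<u4" to a target dtype.
/// Assumption: big-endian (">") data is unsupported and yields None.
pub fn parse_npy_descr(descr: &str) -> Option<Dtype> {
    // Strip the byte-order marker: '|' (not applicable) or '<' (little-endian).
    let code = descr.trim_start_matches(|c: char| c == '<' || c == '|');
    match code {
        "b1" | "u1" | "i1" => Some(Dtype::U8),
        "u4" => Some(Dtype::U32),
        "i4" => Some(Dtype::I32),
        _ => None,
    }
}
```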

Tests: replace the RMLX_TEST_MODEL_WHISPER env knob with RMLX_O_MODELS_ROOT
auto-discovery; add DecodeFilters + npz-dtype unit tests and a real-model
integration suite (say-clip determinism + full-file long-form WER) that scans a
gitignored fixtures dir. Full 48-min real meeting recording: normalized
WER ≈ 0.079, 473 segments, RTF ≈ 0.09, deterministic at temp=0.

Docs: SERVER.md (long-form endpoint), CLI.md (rmlx transcribe), MODELS.md
(special-token layout + RMLX_O_MODELS_ROOT auto-discovery), TESTING.md.
…ixes)

Derive the previous-text prompt cap and per-window generation budget from
n_text_ctx at runtime instead of fixed literals, so the decoder position can
never reach n_text_ctx and overrun the positional-embedding table. Previously a
full previous-text prompt (capped at n_text_ctx/2) plus a fixed 224-token budget
could push the decoder offset to 452 > 448, making the positional slice error and
abort the whole transcription.

- previous_text_cap = n_text_ctx/2 - 1 (matches openai-whisper).
- window_token_budget = min(n_text_ctx/2, n_text_ctx - prefix_len) so
  prefix_len + generated <= n_text_ctx always holds.
- greedy_decode gains a belt-and-suspenders guard: never request a positional
  row >= n_text_ctx.
- Model-free regression tests assert the worst-case (max prompt + max budget)
  decoder position stays in bounds across realistic n_text_ctx values.
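The bound can be checked with a few lines of arithmetic. The two formulas are the ones listed above. `PROMPT_OVERHEAD = 4` is an assumption (`<|startofprev|>`, `<|startoftranscript|>`, language, and task tokens around the previous text), chosen because it reproduces the 452 figure; the function names are illustrative.

```rust
/// Assumed number of fixed tokens added around the previous-text prompt.
pub const PROMPT_OVERHEAD: usize = 4;

/// Cap on previous-text tokens: n_text_ctx/2 - 1.
pub fn previous_text_cap(n_text_ctx: usize) -> usize {
    (n_text_ctx / 2).saturating_sub(1)
}

/// Per-window generation budget: min(n_text_ctx/2, n_text_ctx - prefix_len).
pub fn window_token_budget(n_text_ctx: usize, prefix_len: usize) -> usize {
    (n_text_ctx / 2).min(n_text_ctx.saturating_sub(prefix_len))
}

/// Worst case before the fix: prompt capped at n_text_ctx/2 plus a fixed
/// 224-token budget, regardless of how long the prefix already was.
pub fn old_worst_case_offset(n_text_ctx: usize) -> usize {
    n_text_ctx / 2 + PROMPT_OVERHEAD + 224
}
```

For `n_text_ctx = 448` the old scheme reaches 224 + 4 + 224 = 452. The new one gives a 223-token cap, a 227-token prefix, and a 221-token budget, for exactly 448 positions (indices 0 through 447).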

Also: detect_language returns crate::tokenizer::TOK_EN instead of the bare 50259
literal (doc prose de-hardcoded); Transcriber gets a minimal Debug impl (config
dims + suppress-set sizes) replacing the reason-less allow; drop a gratuitous
per-window resolved_lang clone.
@Pushkinist Pushkinist merged commit 004df6d into main Jun 17, 2026
2 checks passed
@Pushkinist Pushkinist deleted the fix/119-whisper-transcription branch June 17, 2026 10:06

Development

Successfully merging this pull request may close these issues.

Whisper transcription unreliable: greedy_decode lacks in-loop logit suppression; long-form + transcribe CLI missing
