fix(audio): reliable Whisper transcription + long-form + model-agnostic transcribe CLI (#119) #123
Merged
…nscribe CLI (#119)

P0: decode correctness. The greedy decode loop applied no in-loop logit suppression, treated any timestamp token as a hard stop, and used v1/v2 special token ids. large-v3 has 100 language slots (<|en|>=50259 … <|yue|>=50358), which shifts every special up by one, so TOK_TRANSCRIBE pointed at <|translate|> and the timestamp-begin check fired on <|notimestamps|>, yielding empty/garbage transcripts. Fixes:
- Correct the special-token constants to the large-v3 layout (translate=50359, transcribe=50360, startofprev=50362, nospeech=50363, notimestamps=50364, timestamp_begin=50365); fix the language-detect range to 100 langs.
- Add DecodeFilters: SuppressBlank (first step), SuppressTokens (non-speech set derived generally from the tokenizer, not a hardcoded id list), and a faithful port of openai-whisper ApplyTimestampRules (pairing, monotonic, BOS, tie-break) applied at every decode step on a host f32 logit vector.

P1: long-form. New rmlx_audio::transcribe engine: sliding 30 s windows in timestamp mode with timestamp-driven seek, previous-text conditioning (<|startofprev|>), real per-segment times, and multi-segment vtt/srt/json/txt. Drops filler hallucinated in the zero-pad tail. The HTTP endpoint and the CLI share this one engine. verbose_json now reports real duration + segments.

P2: rmlx transcribe CLI. Arch-dispatched on config.json (whisper today, clean seam for future ASR). Decodes any container to 16 kHz mono internally (enabled symphonia isomp4+aac features so .m4a works; downmix + linear resample).

P3: npz alignment_heads dtype parsing (b1/u1/i1→U8, u4/i4→U32/I32); verbose_json duration.

Tests: replace the RMLX_TEST_MODEL_WHISPER env knob with RMLX_O_MODELS_ROOT auto-discovery; add DecodeFilters + npz-dtype unit tests and a real-model integration suite (say-clip determinism + full-file long-form WER) that scans a gitignored fixtures dir. Full 48-min real meeting recording: normalized WER ≈ 0.079, 473 segments, RTF ≈ 0.09, deterministic at temp=0.

Docs: SERVER.md (long-form endpoint), CLI.md (rmlx transcribe), MODELS.md (special-token layout + RMLX_O_MODELS_ROOT auto-discovery), TESTING.md.
…ixes)

Derive the previous-text prompt cap and per-window generation budget from n_text_ctx at runtime instead of fixed literals, so the decoder position can never reach n_text_ctx and overrun the positional-embedding table. Previously a full previous-text prompt (capped at n_text_ctx/2) plus a fixed 224-token budget could push the decoder offset to 452 > 448, making the positional slice error and abort the whole transcription.
- previous_text_cap = n_text_ctx/2 - 1 (matches openai-whisper).
- window_token_budget = min(n_text_ctx/2, n_text_ctx - prefix_len) so prefix_len + generated <= n_text_ctx always holds.
- greedy_decode gains a belt-and-suspenders guard: never request a positional row >= n_text_ctx.
- Model-free regression tests assert the worst-case (max prompt + max budget) decoder position stays in bounds across realistic n_text_ctx values.

Also: detect_language returns crate::tokenizer::TOK_EN instead of the bare 50259 literal (doc prose de-hardcoded); Transcriber gets a minimal Debug impl (config dims + suppress-set sizes) replacing the reason-less allow; drop a gratuitous per-window resolved_lang clone.
Closes #119.
Problem
POST /v1/audio/transcriptions was unreliable on real speech: mostly empty transcripts, sometimes wrong-script garbage, only occasionally correct. Model load + mel front-end were fine; the defect was in greedy_decode's logit handling (no in-loop suppression, timestamp tokens treated as a hard stop, raw logits to the sampler). There was also no long-form support (audio >30s silently truncated), no transcribe CLI, and the referenced smoke test did not exist.
Root cause (deeper than the issue stated)
whisper-large-v3 has 100 language slots (<|en|>=50259 … <|yue|>=50358), shifting every special token +1 vs the v1/v2 layout the constants assumed. So TOK_TRANSCRIBE actually pointed at <|translate|> and the >= TOK_TIMESTAMP_BEGIN halt fired on <|notimestamps|> → immediate stop (empty/garbage). Fixed the special-token constants to the large-v3 layout and added faithful in-loop logit filters.
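The +1 shift can be checked with a little arithmetic. The sketch below is illustrative only: `SpecialTokens` and `special_tokens` are hypothetical names, not the crate's API, and it assumes the specials follow the language block in the order translate, transcribe, startoflm, startofprev, nospeech, notimestamps, then the first timestamp token (the `startoflm` slot is an assumption inferred from the gap between transcribe=50360 and startofprev=50362).

```rust
/// Special-token ids for a multilingual Whisper vocabulary, derived from the
/// number of language slots. Hypothetical helper for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct SpecialTokens {
    pub translate: u32,
    pub transcribe: u32,
    pub startoflm: u32,
    pub startofprev: u32,
    pub nospeech: u32,
    pub notimestamps: u32,
    pub timestamp_begin: u32,
}

/// First language token, <|en|>, as stated in the PR description.
pub const FIRST_LANG: u32 = 50259;

pub fn special_tokens(n_langs: u32) -> SpecialTokens {
    // The task/control specials sit directly after the language block.
    let translate = FIRST_LANG + n_langs;
    SpecialTokens {
        translate,
        transcribe: translate + 1,
        startoflm: translate + 2, // assumed slot, see lead-in
        startofprev: translate + 3,
        nospeech: translate + 4,
        notimestamps: translate + 5,
        timestamp_begin: translate + 6,
    }
}
```

With 100 languages this reproduces the corrected constants (transcribe=50360, timestamp_begin=50365). With 99 languages, transcribe lands on 50359 and timestamp_begin on 50364, which are exactly the large-v3 ids of <|translate|> and <|notimestamps|>.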
Changes
P0 (decode correctness): DecodeFilters (SuppressBlank + SuppressTokens + ApplyTimestampRules) applied every step on a host f32 logit vector; suppress set derived generally from the tokenizer (not a hardcoded id list); large-v3 special-token constants corrected; timestamp tokens no longer an unconditional halt.
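To make the timestamp filtering concrete, here is a simplified sketch of the pairing, monotonicity, and first-step rules from openai-whisper's ApplyTimestampRules, operating on a host f32 logit slice. It is not the PR's DecodeFilters code: the function name and signature are hypothetical, and the tie-break step (comparing summed timestamp probability against the best text token) and the initial-timestamp cap are left out for brevity.

```rust
/// Mask logits in place so the next token obeys Whisper's timestamp grammar.
/// `sampled` holds only the tokens generated so far in this window (no prompt).
/// Assumes `eot < timestamp_begin < logits.len()`.
pub fn apply_timestamp_rules(
    logits: &mut [f32],
    sampled: &[u32],
    eot: u32,
    timestamp_begin: u32,
) {
    let is_ts = |t: u32| t >= timestamp_begin;
    let ts_begin = timestamp_begin as usize;

    let last_was_ts = sampled.last().map_or(false, |&t| is_ts(t));
    // With fewer than two sampled tokens, the penultimate slot counts as a timestamp.
    let penultimate_was_ts = sampled.len() < 2 || is_ts(sampled[sampled.len() - 2]);

    if last_was_ts {
        if penultimate_was_ts {
            // Two timestamps in a row: the next token must not be a timestamp.
            for l in &mut logits[ts_begin..] {
                *l = f32::NEG_INFINITY;
            }
        } else {
            // Text then a timestamp: the next token must be a timestamp or EOT.
            for l in &mut logits[..eot as usize] {
                *l = f32::NEG_INFINITY;
            }
        }
    }

    // Monotonicity: forbid timestamps earlier than the last one. Outside the
    // "closing a pair" case, also forbid the same value so segments are non-empty.
    if let Some(&last_ts) = sampled.iter().rev().find(|&&t| is_ts(t)) {
        let floor = if last_was_ts && !penultimate_was_ts {
            last_ts
        } else {
            last_ts + 1
        };
        let end = (floor as usize).min(logits.len());
        for l in &mut logits[ts_begin..end] {
            *l = f32::NEG_INFINITY;
        }
    }

    // First decode step: only a timestamp may be generated.
    if sampled.is_empty() {
        for l in &mut logits[..ts_begin] {
            *l = f32::NEG_INFINITY;
        }
    }
}
```

The key difference from the old behavior is that a timestamp token only narrows what may come next; it never ends decoding by itself.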
P1 (long-form): rmlx_audio::transcribe::Transcriber with 30s sliding windows, timestamp-driven seek, <|startofprev|> previous-text conditioning, real cumulative per-segment timestamps, multi-segment vtt/srt/json/txt, trailing-silence hallucination drop. Decode bounds derived from n_text_ctx at runtime (prompt cap n/2-1, per-window budget min(n/2, n-prefix), plus a decode-loop offset ceiling) so the positional table can never overflow.
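The bound arithmetic can be written down directly. This is a minimal sketch using the two formulas quoted above. The function names mirror the commit message, but the signatures are guesses, and `worst_case_prefix_len` is a hypothetical helper that assumes the prefix is <|startofprev|> plus the capped prompt plus a start-of-transcript sequence of known length.

```rust
/// Maximum number of previous-text tokens carried into the next window.
pub fn previous_text_cap(n_text_ctx: usize) -> usize {
    (n_text_ctx / 2).saturating_sub(1)
}

/// Tokens the decoder may generate in one window, given the prefix already
/// occupying `prefix_len` positions. Guarantees prefix_len + budget <= n_text_ctx
/// whenever prefix_len <= n_text_ctx.
pub fn window_token_budget(n_text_ctx: usize, prefix_len: usize) -> usize {
    (n_text_ctx / 2).min(n_text_ctx.saturating_sub(prefix_len))
}

/// Hypothetical worst-case prefix: <|startofprev|> + full prompt + SOT sequence.
pub fn worst_case_prefix_len(n_text_ctx: usize, sot_sequence_len: usize) -> usize {
    1 + previous_text_cap(n_text_ctx) + sot_sequence_len
}
```

For n_text_ctx = 448 and an assumed three-token start sequence, the worst-case prefix is 1 + 223 + 3 = 227 and the budget becomes min(224, 221) = 221, so decoding ends at position 448 at most. Under the old literals the same case gives 1 + 224 + 3 + 224 = 452, which is the overflow described in the second commit.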
P2 (CLI): rmlx transcribe CLI: rmlx transcribe <audio> --model <snapshot> [--tokenizer --format vtt|srt|json|txt --language --translate --output], arch-dispatched on config.json model_type (clean seam for future ASR; Whisper today, not hardcoded). Decodes arbitrary containers to 16kHz mono internally (stereo downmix + linear resampler; enabled existing symphonia isomp4+aac features for m4a/MP4-AAC). HTTP endpoint routed through the same long-form engine: one transcription core, not two.
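The downmix and resample steps mentioned above are small enough to sketch. This is an illustrative version assuming interleaved f32 samples. The function names are hypothetical and the PR's actual resampler may differ (for example in edge handling). Plain linear interpolation applies no anti-aliasing filter.

```rust
/// Average interleaved channels into a single mono channel.
pub fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio by linear interpolation between neighbouring samples.
pub fn resample_linear(input: &[f32], src_rate: u32, dst_rate: u32) -> Vec<f32> {
    if input.is_empty() || src_rate == dst_rate {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * dst_rate as u64 / src_rate as u64) as usize;
    let step = src_rate as f64 / dst_rate as f64; // source samples per output sample
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp at the end so the final output sample repeats the last input.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

A decoded stereo 44.1 kHz buffer would go through `downmix_to_mono(&samples, 2)` and then `resample_linear(&mono, 44_100, 16_000)` before the mel front-end.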
P3: npz alignment_heads dtype parsing; verbose_json now reports real duration + segments.
Tests: dropped the RMLX_TEST_MODEL_WHISPER env var in favor of RMLX_O_MODELS_ROOT auto-discovery (skips gracefully when absent). Added a portable say-generated determinism smoke + a local real-recording long-form WER regression that scans a gitignored fixtures dir (any user drops their own audio + .transcript.vtt). Model-free regression tests prove the positional-table bound across n_text_ctx sweeps.
Proof on real audio
Full 48-minute meeting recording, whisper-large-v3-mlx, temp=0: normalized WER ≈ 0.079, 473 segments, RTF ≈ 0.09, deterministic (diff -q clean).
say-clip smoke: WER 0.000, deterministic.
make lint (-D warnings) and make test green. Reviewed by the rust-reviewer agent; the one HIGH (positional overflow) and nits are fixed in the second commit.
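For readers unfamiliar with the metric, the WER figures above are word-level edit distance divided by the number of reference words. The sketch below shows one way to compute it. The normalization (lowercase, keep only alphanumeric characters) is an assumption, since the PR does not say how its "normalized WER" is normalized, and the function names are hypothetical.

```rust
/// Lowercase each word and keep only alphanumeric characters (assumed normalization).
fn normalize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| {
            w.chars()
                .filter(|c| c.is_alphanumeric())
                .flat_map(|c| c.to_lowercase())
                .collect::<String>()
        })
        .filter(|w| !w.is_empty())
        .collect()
}

/// Word error rate: (substitutions + deletions + insertions) / reference words.
pub fn wer(reference: &str, hypothesis: &str) -> f64 {
    let r = normalize(reference);
    let h = normalize(hypothesis);
    if r.is_empty() {
        return if h.is_empty() { 0.0 } else { 1.0 };
    }
    // Rolling-row Levenshtein distance over words.
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for (i, rw) in r.iter().enumerate() {
        let mut cur = vec![i + 1; h.len() + 1];
        for (j, hw) in h.iter().enumerate() {
            let substitute = prev[j] + usize::from(rw != hw);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur[j + 1] = substitute.min(delete).min(insert);
        }
        prev = cur;
    }
    prev[h.len()] as f64 / r.len() as f64
}
```

One substituted word in a four-word reference gives 0.25. Hypotheses that differ only in case or punctuation score 0.0 under this normalization.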
🤖 Generated with Claude Code