fix(gemma4): wire native audio input through serve/engine path (#122) by Pushkinist · Pull Request #125 · Pushkinist/rMLX

Pushkinist · 2026-06-17T12:02:20Z

Closes #122.

Problem

Gemma4 native audio input via /v1/chat/completions input_audio parts was
silently dropped. The endpoint parsed + bounds-checked input_audio, but the
audio was never encoded or injected — the model answered "no audio was provided".
The Conformer audio_tower / AudioEncoder (crates/rmlx-models/src/gemma4/audio/mod.rs)
were fully implemented in rmlx-models but had no caller in rmlx-server —
only the vision tower was wired. e4b/26b mxfp8 ship 751 audio_tower.* weights.

Fix (general — mirrors the vision-tower flow)

- Load: load_gemma4_audio_bundle loads the Conformer audio_tower, the embed_audio projector, and the USM feature extractor at startup when audio_tower.* weights are present, alongside the vision tower. A shared load_multimodal_embedder (used by both embed_vision and embed_audio) was extracted; the vision loader now calls it, behavior-identical.
- Inject: build_audio_prompt decodes input_audio (base64 → 16 kHz mono via the existing rmlx-audio symphonia/resampler path → log-mel), splices <|audio|> placeholders, and runs AudioEncoder; build_audio_inputs_embeds then scatters the audio soft tokens at the audio-token positions via slice_update. The placeholder count T_sub = num_output_frames(mel_frames) (two stride-2 SSCP convs) must equal the number of encoder output frames; a mismatch returns an Err (not a panic). audio_token_id comes from config.json (not hardcoded).
- Clear errors, no silent drops: a model with no audio tower returns a clear 503 (mirroring vision's "no vision tower"); a request combining an image AND audio in one turn is rejected with a clear error rather than dropping the audio (reject_combined_image_audio + tracing::warn); more than one audio clip is rejected.
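
The frame-count invariant and the scatter step can be illustrated with a minimal, model-free sketch. This is not the code in this PR: the real scatter runs on MLX arrays via slice_update, while the sketch uses plain `Vec<f32>` rows. The convolution geometry (kernel 3, padding 1) is an assumption, since the PR only says "two stride-2 SSCP convs", and `stride2_out_len` and `scatter_audio_embeds` are hypothetical names.

```rust
/// Output length of one stride-2 convolution.
/// ASSUMPTION: kernel 3, padding 1. The PR only states "two stride-2 SSCP convs".
fn stride2_out_len(n: usize) -> usize {
    // floor((n + 2*1 - 3) / 2) + 1, with the empty input mapped to 0.
    if n == 0 { 0 } else { (n - 1) / 2 + 1 }
}

/// Hypothetical stand-in for `AudioEncoder::num_output_frames`.
fn num_output_frames(mel_frames: usize) -> usize {
    stride2_out_len(stride2_out_len(mel_frames))
}

/// Overwrite the embedding rows at every audio-placeholder position with the
/// encoder output rows, in order. Returns Err (never panics) when the number
/// of placeholders does not match the number of encoder frames.
fn scatter_audio_embeds(
    token_ids: &[u32],
    embeds: &mut [Vec<f32>],
    audio: &[Vec<f32>],
    audio_token_id: u32,
) -> Result<(), String> {
    if token_ids.len() != embeds.len() {
        return Err(format!(
            "token count {} != embedding rows {}",
            token_ids.len(),
            embeds.len()
        ));
    }
    let positions: Vec<usize> = token_ids
        .iter()
        .enumerate()
        .filter_map(|(i, &id)| (id == audio_token_id).then_some(i))
        .collect();
    if positions.len() != audio.len() {
        return Err(format!(
            "audio placeholder count {} != encoder output frames {}",
            positions.len(),
            audio.len()
        ));
    }
    for (pos, row) in positions.iter().zip(audio) {
        embeds[*pos] = row.clone();
    }
    Ok(())
}
```

Returning an error on a count mismatch keeps a malformed prompt from writing audio rows into the wrong positions or leaving placeholder rows unfilled.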

Proof (real model, e4b mxfp8, single-MLX)

| input | before | after |
| --- | --- | --- |
| `say` "The launch is scheduled for Tuesday at noon." | "no audio was provided" | HTTP 200 → "The launch is scheduled for Tuesday at noon." (audio tower load line present, `prompt_len` 18→75, `audio_soft_tokens=55`) |
| `say` "The quick brown fox…" (re-confirm) | n/a | HTTP 200 → verbatim transcription |
| image-only (red PNG, "what color?") | Red | Red (vision path intact) |
| text-only ("2+2?") | 4 | 4 |
| audio on a text-only model (Bonsai) | silent "no audio" | HTTP 503 "no audio tower" |
| image + audio in one request | audio silently dropped | HTTP 503 "combined image + audio … send them in separate turns" |

make lint (-D warnings) + make test green (1098 passed); model-free unit
tests cover the soft-token frame-count invariant and the combined-input guard;
model-gated integration tests (e4b) cover the real audio path. Reviewed by the
rust-reviewer agent; the HIGH (combined-input silent drop) + doc nits fixed in
the second commit. No new deps, no new env vars.

🤖 Generated with Claude Code

The Gemma4 Conformer audio tower (load_audio_tower / AudioEncoder) was fully implemented in rmlx-models but had no caller in rmlx-server: the OpenAI endpoint parsed + bounds-checked input_audio parts, then dropped them, so the model hallucinated "no audio was provided".

Load the audio tower + embed_audio projector + USM feature extractor alongside the vision tower at startup (when audio_config + audio_tower.* weights are present), and route input_audio through: decode (rmlx-audio symphonia) -> 16 kHz mono -> log-mel front-end -> Conformer AudioEncoder -> embed_audio -> scatter at <|audio|> positions. Mirrors the vision-tower flow (build_inputs_embeds + generate_image fused embeds).

The prompt is spliced with <|audio> + T_sub x <|audio|> + <audio|>; T_sub comes from AudioEncoder::num_output_frames so the scatter aligns by construction. Submitting input_audio to a model without an audio tower now returns a clear 503 (no silent drop), mirroring vision.

- rmlx-models: build_audio_inputs_embeds + AudioEncoder::num_output_frames; extract load_multimodal_embedder (shared by embed_vision / embed_audio).
- rmlx-server: engine/audio.rs (AudioBundle, load + build_audio_prompt); ArchGenerator loads the audio bundle and routes the audio prompt.
- Tests: num_output_frames vs forward parity (model-free); model-gated end-to-end audio-prompt build + multi-clip rejection.
- Docs: SERVER.md multimodal content parts; MODELS.md Gemma4 audio input.

Fixes #122.
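
The splice described above can be sketched on raw token ids. This is an illustration only: `AudioTokens`, `splice_audio_placeholders`, the token id values, and the `insert_at` parameter are hypothetical, and the real build_audio_prompt takes its ids from the tokenizer and config.json. The arithmetic is consistent with the proof table: an 18-token prompt plus 55 soft tokens plus the two delimiters gives 75.

```rust
/// Hypothetical token ids for illustration; the real ids come from the
/// tokenizer and config.json.
struct AudioTokens {
    begin: u32, // <|audio>
    soft: u32,  // <|audio|>
    end: u32,   // <audio|>
}

/// Insert `<|audio>` + `t_sub` x `<|audio|>` + `<audio|>` into `prompt` at
/// `insert_at`. The insert position is an assumption of this sketch.
fn splice_audio_placeholders(
    prompt: &[u32],
    insert_at: usize,
    t_sub: usize,
    tok: &AudioTokens,
) -> Result<Vec<u32>, String> {
    if insert_at > prompt.len() {
        return Err(format!(
            "insert position {} is past the prompt end {}",
            insert_at,
            prompt.len()
        ));
    }
    let mut out = Vec::with_capacity(prompt.len() + t_sub + 2);
    out.extend_from_slice(&prompt[..insert_at]);
    out.push(tok.begin);
    out.extend(std::iter::repeat(tok.soft).take(t_sub));
    out.push(tok.end);
    out.extend_from_slice(&prompt[insert_at..]);
    Ok(out)
}
```

Because the number of soft tokens is taken from the same frame-count function the encoder uses, the positions found later for the scatter match the encoder output one-to-one.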

…audio

The audio branch was gated by `&& image_inputs.is_none()` and fell through to `None`, so a request carrying BOTH an image and an audio clip kept only the image: the audio was silently dropped with no error and no tracing, reintroducing the exact silent-drop class the audio wiring exists to eliminate.

Reject the unsupported combination with a CLEAR request-level error (`Error::Other`, surfaced as a proper HTTP error through the same channel as the other request rejections) plus a `tracing::warn!`. The decision is extracted into `reject_combined_image_audio` so the routing guard is covered by model-free unit tests (combined -> rejected; audio-only / image-only / text-only -> unchanged).

Also complete three truncated rustdoc phrases in the gemma4 audio/vision modules.
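
A routing guard of this kind might look like the sketch below. It is a hypothetical stand-in, not the signature of `reject_combined_image_audio`: `PromptRoute` and `route_prompt` are invented for illustration, the error is a plain `String` rather than `Error::Other`, and the `tracing::warn!` call is omitted to keep the example dependency-free.

```rust
/// Which prompt-building path a request takes (hypothetical enum).
#[derive(Debug, PartialEq)]
enum PromptRoute {
    TextOnly,
    Image,
    Audio,
}

/// Decide the route from which inputs are present. The combined case is an
/// explicit error instead of falling through to one of the other branches.
fn route_prompt(has_image: bool, has_audio: bool) -> Result<PromptRoute, String> {
    match (has_image, has_audio) {
        (true, true) => Err(
            "combined image + audio in one turn is not supported; send them in separate turns"
                .to_string(),
        ),
        (true, false) => Ok(PromptRoute::Image),
        (false, true) => Ok(PromptRoute::Audio),
        (false, false) => Ok(PromptRoute::TextOnly),
    }
}
```

Matching on the full `(has_image, has_audio)` pair makes every combination an explicit arm, so a new unsupported case cannot silently fall through, and the function can be unit-tested without loading a model.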

Pushkinist added 2 commits June 17, 2026 18:35

Pushkinist merged commit f2b970b into main Jun 17, 2026
2 checks passed

Pushkinist deleted the fix/122-gemma4-audio-input branch June 17, 2026 12:10
