fix(gemma4): wire native audio input through serve/engine path (#122)#125

Merged
Pushkinist merged 2 commits into main from fix/122-gemma4-audio-input on Jun 17, 2026
Conversation

@Pushkinist

Owner

Closes #122.

Problem

Gemma4 native audio input via /v1/chat/completions input_audio parts was
silently dropped. The endpoint parsed + bounds-checked input_audio, but the
audio was never encoded or injected — the model answered "no audio was provided".
The Conformer audio_tower / AudioEncoder (crates/rmlx-models/src/gemma4/audio/mod.rs)
were fully implemented in rmlx-models but had no caller in rmlx-server;
only the vision tower was wired. The e4b/26b mxfp8 checkpoints ship 751 audio_tower.* weights.

Fix (general — mirrors the vision-tower flow)

  • Load: load_gemma4_audio_bundle loads the Conformer audio_tower +
    embed_audio projector + USM feature extractor at startup when audio_tower.*
    weights are present, alongside the vision tower. Extracted a shared
    load_multimodal_embedder (used by both embed_vision and embed_audio) —
    the vision loader now calls it, behavior-identical.
  • Inject: build_audio_prompt decodes input_audio (base64 → 16kHz mono via
    the existing rmlx-audio symphonia/resampler path → log-mel), splices
    <|audio|> placeholders, runs AudioEncoder, and build_audio_inputs_embeds
    scatters the audio soft tokens at the audio-token positions via slice_update.
    The placeholder count T_sub = num_output_frames(mel_frames) (two stride-2
    SSCP convs) must equal the encoder output frames — enforced by an Err (not a
    panic) on mismatch. audio_token_id comes from config.json (not hardcoded).
  • Clear errors, no silent drops: a model with no audio tower returns a clear
    503 (mirroring vision's "no vision tower"); a request combining an image AND
    audio in one turn is rejected with a clear error rather than dropping the
    audio (reject_combined_image_audio + tracing::warn); a request with more than
    one audio clip is also rejected.
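
The frame-count invariant from the Inject step can be sketched as follows. This is a minimal illustration, not the code in rmlx-models: `ConvSpec`, `conv_out_len`, `check_frame_count`, and the kernel/padding values are assumptions made for the example (the PR only states that there are two stride-2 SSCP convs), and the real `AudioEncoder::num_output_frames` may use a different signature and different padding.

```rust
/// Hypothetical description of one subsampling convolution along the time axis.
/// The PR only says the SSCP front-end has two stride-2 convs; the kernel and
/// padding values are assumptions supplied by the caller of this sketch.
#[derive(Clone, Copy)]
pub struct ConvSpec {
    pub kernel: usize,
    pub stride: usize,
    pub pad: usize,
}

/// Standard convolution output length: floor((n + 2*pad - kernel) / stride) + 1.
/// Returns 0 when the padded input is shorter than the kernel.
pub fn conv_out_len(n: usize, c: ConvSpec) -> usize {
    let padded = n + 2 * c.pad;
    if padded < c.kernel {
        0
    } else {
        (padded - c.kernel) / c.stride + 1
    }
}

/// Number of encoder output frames (T_sub) for a given number of mel frames,
/// obtained by chaining the conv output-length formula over every conv.
pub fn num_output_frames(mel_frames: usize, convs: &[ConvSpec]) -> usize {
    convs.iter().fold(mel_frames, |n, &c| conv_out_len(n, c))
}

/// The invariant: the number of placeholders spliced into the prompt must equal
/// the number of frames the encoder actually produced. A mismatch is reported
/// as an Err instead of panicking.
pub fn check_frame_count(placeholders: usize, encoder_frames: usize) -> Result<(), String> {
    if placeholders == encoder_frames {
        Ok(())
    } else {
        Err(format!(
            "audio placeholder count {placeholders} != encoder output frames {encoder_frames}"
        ))
    }
}
```

With the assumed kernel 3 / stride 2 / padding 1 for both convs, 100 mel frames reduce to 50 and then to 25, so 25 placeholders would be spliced for that clip.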

Proof (real model, e4b mxfp8, single-MLX)

| input | before | after |
|---|---|---|
| say "The launch is scheduled for Tuesday at noon." | "no audio was provided" | HTTP 200 → "The launch is scheduled for Tuesday at noon." (audio tower load line present, prompt_len 18→75, audio_soft_tokens=55) |
| say "The quick brown fox…" (re-confirm) | | HTTP 200 → verbatim transcription |
| image-only (red PNG, "what color?") | Red | Red (vision path intact) |
| text-only ("2+2?") | 4 | 4 |
| audio on a text-only model (Bonsai) | silent "no audio" | HTTP 503 "no audio tower" |
| image + audio in one request | audio silently dropped | HTTP 503 "combined image + audio … send them in separate turns" |

make lint (-D warnings) + make test green (1098 passed); model-free unit
tests cover the soft-token frame-count invariant and the combined-input guard;
model-gated integration tests (e4b) cover the real audio path. Reviewed by the
rust-reviewer agent; the HIGH (combined-input silent drop) + doc nits fixed in
the second commit. No new deps, no new env vars.

🤖 Generated with Claude Code

The Gemma4 Conformer audio tower (load_audio_tower / AudioEncoder) was
fully implemented in rmlx-models but had no caller in rmlx-server: the
OpenAI endpoint parsed + bounds-checked input_audio parts, then dropped
them — the model hallucinated "no audio was provided".

Load the audio tower + embed_audio projector + USM feature extractor
alongside the vision tower at startup (when audio_config + audio_tower.*
weights are present), and route input_audio through:
decode (rmlx-audio symphonia) -> 16 kHz mono -> log-mel front-end ->
Conformer AudioEncoder -> embed_audio -> scatter at <|audio|> positions.
Mirrors the vision-tower flow (build_inputs_embeds + generate_image fused
embeds). The prompt is spliced with <|audio> + T_sub x <|audio|> +
<audio|>; T_sub comes from AudioEncoder::num_output_frames so the scatter
aligns by construction. Submitting input_audio to a model without an audio
tower now returns a clear 503 (no silent drop), mirroring vision.
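
The prompt layout described above (`<|audio>` + T_sub x `<|audio|>` + `<audio|>`) can be sketched as a plain token-id splice. This is an illustrative sketch only: `AudioTokens`, `splice_audio_placeholders`, and the `insert_at` parameter are hypothetical names, and the real `build_audio_prompt` may work at the text level rather than on token ids.

```rust
/// Hypothetical bundle of the three audio-related token ids. In the PR the
/// soft-token id (audio_token_id) is read from config.json, not hardcoded.
pub struct AudioTokens {
    pub begin: u32, // stands in for <|audio>
    pub soft: u32,  // stands in for <|audio|>
    pub end: u32,   // stands in for <audio|>
}

/// Insert `begin + t_sub * soft + end` into `prompt` at index `insert_at`.
/// Returns an Err rather than panicking when the index is out of range.
pub fn splice_audio_placeholders(
    prompt: &[u32],
    insert_at: usize,
    t_sub: usize,
    tok: &AudioTokens,
) -> Result<Vec<u32>, String> {
    if insert_at > prompt.len() {
        return Err(format!(
            "insert position {insert_at} is past the end of a {}-token prompt",
            prompt.len()
        ));
    }
    let mut out = Vec::with_capacity(prompt.len() + t_sub + 2);
    out.extend_from_slice(&prompt[..insert_at]);
    out.push(tok.begin);
    // One placeholder per encoder output frame, so the later scatter lines up.
    out.extend(std::iter::repeat(tok.soft).take(t_sub));
    out.push(tok.end);
    out.extend_from_slice(&prompt[insert_at..]);
    Ok(out)
}
```

Under this layout, an 18-token prompt plus 55 soft tokens plus the two delimiters gives 18 + 55 + 2 = 75 tokens, which is consistent with the prompt_len 18→75 and audio_soft_tokens=55 figures in the proof, though the real tokenization may differ.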

- rmlx-models: build_audio_inputs_embeds + AudioEncoder::num_output_frames;
  extract load_multimodal_embedder (shared by embed_vision / embed_audio).
- rmlx-server: engine/audio.rs (AudioBundle, load + build_audio_prompt);
  ArchGenerator loads the audio bundle and routes the audio prompt.
- Tests: num_output_frames vs forward parity (model-free); model-gated
  end-to-end audio-prompt build + multi-clip rejection.
- Docs: SERVER.md multimodal content parts; MODELS.md Gemma4 audio input.
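
The scatter step performed by build_audio_inputs_embeds can be illustrated on plain vectors. This is a simplified sketch under stated assumptions: the real code operates on MLX arrays via slice_update, whereas `scatter_audio_embeds` and its `Vec<Vec<f32>>` representation are hypothetical stand-ins chosen to keep the example dependency-free.

```rust
/// Overwrite the embedding row of every audio placeholder token with the
/// corresponding audio soft token, in order of appearance.
/// All checks run before any row is written, so an Err leaves the input untouched.
pub fn scatter_audio_embeds(
    token_ids: &[u32],
    text_embeds: &mut [Vec<f32>],
    audio_embeds: &[Vec<f32>],
    audio_token_id: u32,
) -> Result<(), String> {
    if token_ids.len() != text_embeds.len() {
        return Err("token ids and embedding rows differ in length".to_string());
    }
    // Collect the positions of the placeholder tokens.
    let mut positions = Vec::new();
    for (i, &id) in token_ids.iter().enumerate() {
        if id == audio_token_id {
            positions.push(i);
        }
    }
    // The frame-count invariant: one soft token per placeholder.
    if positions.len() != audio_embeds.len() {
        return Err(format!(
            "{} audio placeholders but {} audio soft tokens",
            positions.len(),
            audio_embeds.len()
        ));
    }
    // Validate the hidden size of every row before mutating anything.
    for (&pos, row) in positions.iter().zip(audio_embeds) {
        if row.len() != text_embeds[pos].len() {
            return Err("audio soft token has the wrong hidden size".to_string());
        }
    }
    for (&pos, row) in positions.iter().zip(audio_embeds) {
        text_embeds[pos].copy_from_slice(row);
    }
    Ok(())
}
```

Rows at non-audio positions are left as they were, so text tokens keep their ordinary embeddings.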

Fixes #122.
…audio

The audio branch was gated by `&& image_inputs.is_none()` and fell through to
`None`, so a request carrying BOTH an image and an audio clip kept only the
image — the audio was silently dropped with no error and no tracing,
reintroducing the exact silent-drop class the audio wiring exists to eliminate.

Reject the unsupported combination with a CLEAR request-level error
(`Error::Other`, surfaced as a proper HTTP error through the same channel as
the other request rejections) plus a `tracing::warn!`. The decision is
extracted into `reject_combined_image_audio` so the routing guard is covered by
model-free unit tests (combined -> rejected; audio-only / image-only /
text-only -> unchanged). Also complete three truncated rustdoc phrases in the
gemma4 audio/vision modules.
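
The routing guard can be sketched as a pure function, which is what makes it testable without a model. The signature below is an assumption for illustration: the real reject_combined_image_audio returns the crate's `Error::Other` and is paired with a `tracing::warn!`, both omitted here to keep the sketch dependency-free, and the error text is a placeholder rather than the server's actual message.

```rust
/// Hypothetical signature: reject a single turn that carries both an image and
/// an audio clip; leave every other combination unchanged.
pub fn reject_combined_image_audio(has_image: bool, has_audio: bool) -> Result<(), String> {
    if has_image && has_audio {
        // Placeholder message; the real server wording is not reproduced here.
        return Err("image and audio in one turn are not supported".to_string());
    }
    Ok(())
}
```

The four cases named in the commit message map directly onto the four input combinations: combined is rejected, while audio-only, image-only, and text-only pass through.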
@Pushkinist Pushkinist merged commit f2b970b into main Jun 17, 2026
2 checks passed
@Pushkinist Pushkinist deleted the fix/122-gemma4-audio-input branch June 17, 2026 12:10
Development

Successfully merging this pull request may close these issues.

Gemma4 audio input silently dropped: audio_tower never loaded in serve, input_audio is a no-op