fix(gemma4): wire native audio input through serve/engine path (#122) #125
Merged
The Gemma4 Conformer audio tower (load_audio_tower / AudioEncoder) was fully implemented in rmlx-models but had no caller in rmlx-server: the OpenAI endpoint parsed + bounds-checked input_audio parts, then dropped them, so the model hallucinated "no audio was provided".

Load the audio tower + embed_audio projector + USM feature extractor alongside the vision tower at startup (when audio_config + audio_tower.* weights are present), and route input_audio through: decode (rmlx-audio symphonia) -> 16 kHz mono -> log-mel front-end -> Conformer AudioEncoder -> embed_audio -> scatter at <|audio|> positions. Mirrors the vision-tower flow (build_inputs_embeds + generate_image fused embeds).

The prompt is spliced with <|audio> + T_sub x <|audio|> + <audio|>; T_sub comes from AudioEncoder::num_output_frames so the scatter aligns by construction. Submitting input_audio to a model without an audio tower now returns a clear 503 (no silent drop), mirroring vision.

- rmlx-models: build_audio_inputs_embeds + AudioEncoder::num_output_frames; extract load_multimodal_embedder (shared by embed_vision / embed_audio).
- rmlx-server: engine/audio.rs (AudioBundle, load + build_audio_prompt); ArchGenerator loads the audio bundle and routes the audio prompt.
- Tests: num_output_frames vs forward parity (model-free); model-gated end-to-end audio-prompt build + multi-clip rejection.
- Docs: SERVER.md multimodal content parts; MODELS.md Gemma4 audio input.

Fixes #122.
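The scatter step and its mismatch check can be illustrated with a minimal, model-free sketch. Everything here is an assumption made for illustration: `scatter_audio` is a hypothetical name, plain `Vec<Vec<f32>>` rows stand in for the MLX arrays the real `build_audio_inputs_embeds` operates on, and a `String` stands in for the crate's error type.

```rust
/// Hypothetical sketch: copy `audio` rows into `embeds` wherever the prompt
/// holds the audio placeholder token. Plain vectors stand in for MLX arrays.
fn scatter_audio(
    token_ids: &[u32],
    embeds: &mut [Vec<f32>],
    audio: &[Vec<f32>],
    audio_token_id: u32,
) -> Result<(), String> {
    if token_ids.len() != embeds.len() {
        return Err(format!(
            "token/embedding length mismatch: {} vs {}",
            token_ids.len(),
            embeds.len()
        ));
    }
    // Collect the positions of every audio placeholder in the prompt.
    let positions: Vec<usize> = token_ids
        .iter()
        .enumerate()
        .filter_map(|(i, &id)| (id == audio_token_id).then_some(i))
        .collect();
    // The placeholder count must equal the encoder's output frame count;
    // report a mismatch as an error instead of panicking.
    if positions.len() != audio.len() {
        return Err(format!(
            "placeholder count {} != encoder frames {}",
            positions.len(),
            audio.len()
        ));
    }
    for (pos, row) in positions.into_iter().zip(audio) {
        if row.len() != embeds[pos].len() {
            return Err(format!("hidden size mismatch at position {pos}"));
        }
        embeds[pos].clone_from(row);
    }
    Ok(())
}
```

Because the placeholder count is derived from the same frame-count function the encoder obeys, the error branch should be unreachable in practice; it exists so a regression surfaces as a request error rather than a crash.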
…audio

The audio branch was gated by `&& image_inputs.is_none()` and fell through to `None`, so a request carrying BOTH an image and an audio clip kept only the image. The audio was silently dropped with no error and no tracing, reintroducing the exact silent-drop class the audio wiring exists to eliminate.

Reject the unsupported combination with a CLEAR request-level error (`Error::Other`, surfaced as a proper HTTP error through the same channel as the other request rejections) plus a `tracing::warn!`. The decision is extracted into `reject_combined_image_audio` so the routing guard is covered by model-free unit tests (combined -> rejected; audio-only / image-only / text-only -> unchanged).

Also complete three truncated rustdoc phrases in the gemma4 audio/vision modules.
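A guard of this kind might look roughly like the following sketch. The signature is hypothetical: the real `reject_combined_image_audio` returns the crate's `Error::Other` and logs through `tracing::warn!`, whereas this version uses a `String` error and `eprintln!` so it compiles without dependencies.

```rust
/// Hypothetical signature for illustration only; the real function in
/// rmlx-server may take different arguments and returns `Error::Other`.
fn reject_combined_image_audio(has_image: bool, has_audio: bool) -> Result<(), String> {
    if has_image && has_audio {
        // Stand-in for the `tracing::warn!` call mentioned in the commit.
        eprintln!("warn: request combines image and audio input; rejecting");
        return Err(
            "combining image and audio input in one turn is not supported".to_string(),
        );
    }
    // Audio-only, image-only and text-only requests pass through unchanged.
    Ok(())
}
```

Keeping the decision in a pure function of two booleans is what lets the routing guard be unit-tested without loading a model.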
Closes #122.
Problem
Gemma4 native audio input via `/v1/chat/completions` `input_audio` parts was silently dropped. The endpoint parsed + bounds-checked `input_audio`, but the audio was never encoded or injected, so the model answered "no audio was provided".

The Conformer `audio_tower` / `AudioEncoder` (`crates/rmlx-models/src/gemma4/audio/mod.rs`) were fully implemented in `rmlx-models` but had no caller in `rmlx-server`; only the vision tower was wired. e4b/26b mxfp8 ship 751 `audio_tower.*` weights.

Fix (general; mirrors the vision-tower flow)
- `load_gemma4_audio_bundle` loads the Conformer `audio_tower` + `embed_audio` projector + USM feature extractor at startup when `audio_tower.*` weights are present, alongside the vision tower. Extracted a shared `load_multimodal_embedder` (used by both `embed_vision` and `embed_audio`); the vision loader now calls it, behavior-identical.
- `build_audio_prompt` decodes `input_audio` (base64 → 16 kHz mono via the existing `rmlx-audio` symphonia/resampler path → log-mel), splices `<|audio|>` placeholders, runs `AudioEncoder`, and `build_audio_inputs_embeds` scatters the audio soft tokens at the audio-token positions via `slice_update`.
- The placeholder count `T_sub = num_output_frames(mel_frames)` (two stride-2 SSCP convs) must equal the encoder output frames; this is enforced by an `Err` (not a panic) on mismatch.
- `audio_token_id` comes from `config.json` (not hardcoded).
- `input_audio` sent to a model without an audio tower returns a 503 (mirroring vision's "no vision tower"); a request combining an image AND audio in one turn is rejected with a clear error rather than dropping the audio (`reject_combined_image_audio` + `tracing::warn`); >1 audio clip is rejected.

Proof (real model, e4b mxfp8, single-MLX)
- `say` "The launch is scheduled for Tuesday at noon." (`prompt_len` 18→75, `audio_soft_tokens=55`)
- `say` "The quick brown fox…" (re-confirm)

`make lint` (`-D warnings`) + `make test` green (1098 passed); model-free unit tests cover the soft-token frame-count invariant and the combined-input guard; model-gated integration tests (e4b) cover the real audio path. Reviewed by the rust-reviewer agent; the HIGH (combined-input silent drop) + doc nits fixed in the second commit. No new deps, no new env vars.
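The numbers in the first proof run are consistent with the splice layout: 18 prompt tokens plus one begin marker, 55 soft tokens and one end marker gives 75. The sketch below reproduces that arithmetic and shows one way a frame-count function for two stride-2 convolutions could be written. The kernel size (3) and padding (1) are assumptions, not taken from the PR, and the token ids, `AudioTokenIds`, `conv_out_len` and `splice_audio_tokens` are hypothetical names for illustration.

```rust
/// Output length of a 1-D convolution along the time axis.
fn conv_out_len(n: usize, kernel: usize, stride: usize, padding: usize) -> usize {
    let padded = n + 2 * padding;
    if padded < kernel {
        0
    } else {
        (padded - kernel) / stride + 1
    }
}

/// Frame count after two stride-2 convolutions.
/// ASSUMPTION: kernel 3, padding 1; the real SSCP layers may differ.
fn num_output_frames(mel_frames: usize) -> usize {
    let after_first = conv_out_len(mel_frames, 3, 2, 1);
    conv_out_len(after_first, 3, 2, 1)
}

/// Hypothetical token ids; the real ids come from the model's config.
struct AudioTokenIds {
    begin: u32, // <|audio>
    soft: u32,  // <|audio|>
    end: u32,   // <audio|>
}

/// Insert begin marker + `t_sub` placeholders + end marker at index `at`.
fn splice_audio_tokens(prompt: &[u32], at: usize, t_sub: usize, ids: &AudioTokenIds) -> Vec<u32> {
    let at = at.min(prompt.len());
    let mut out = Vec::with_capacity(prompt.len() + t_sub + 2);
    out.extend_from_slice(&prompt[..at]);
    out.push(ids.begin);
    out.extend(std::iter::repeat(ids.soft).take(t_sub));
    out.push(ids.end);
    out.extend_from_slice(&prompt[at..]);
    out
}
```

With these assumptions, splicing 55 placeholders into an 18-token prompt yields 75 tokens, matching the reported `prompt_len` change.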
🤖 Generated with Claude Code