fix(model): detect audio format from magic bytes instead of hardcoding .mp3 #348

Open
Kota-Maeda wants to merge 1 commit into langgenius:main from Kota-Maeda:fix/speech2text-audio-format-detection

@Kota-Maeda

Closes #347

Summary

invoke_speech_to_text wrote the incoming audio to a
NamedTemporaryFile(suffix=".mp3") regardless of the real audio container.
Since OpenAI-compatible / Azure OpenAI Whisper endpoints infer the format from
the multipart filename extension, non-mp3 content (m4a/AAC, wav, ogg, flac,
webm) was mislabeled as .mp3 and rejected with "Invalid file format", an error
that wrongly blames the user's input file.

The model-invoke payload carries only raw bytes (no filename), so this detects
the container from the leading magic bytes and picks the correct suffix
(wav/flac/ogg/m4a/webm), falling back to .mp3 for unknown/undetectable
content. This fixes transcription for every speech2text provider that goes
through this shared dispatch.
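
For reviewers unfamiliar with container signatures, the detection described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function name `detect_audio_suffix` is hypothetical, and the real `_detect_audio_suffix` may check different offsets or more formats. Note that the `ftyp` and EBML signatures identify the container family only (MP4 and Matroska files match too), so mapping them to `.m4a` and `.webm` is an assumption made for the speech-to-text use case.

```python
def detect_audio_suffix(header: bytes) -> str:
    """Guess a file suffix from the leading bytes of an audio payload.

    Hypothetical sketch; the helper added by the PR may differ.
    """
    # RIFF container with a WAVE form type: "RIFF" at bytes 0-3, "WAVE" at 8-11.
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return ".wav"
    # Native FLAC streams begin with the marker "fLaC".
    if header[:4] == b"fLaC":
        return ".flac"
    # Every Ogg page begins with the capture pattern "OggS".
    if header[:4] == b"OggS":
        return ".ogg"
    # ISO base media files (m4a/mp4) carry the box type "ftyp" at bytes 4-7.
    if header[4:8] == b"ftyp":
        return ".m4a"
    # Matroska/WebM files begin with the EBML header ID 0x1A45DFA3.
    if header[:4] == b"\x1a\x45\xdf\xa3":
        return ".webm"
    # Unknown or too-short input keeps the previous behavior.
    return ".mp3"
```

Slicing never raises on short input, so an empty or truncated header simply falls through to the `.mp3` fallback.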

Changes

  • Add _detect_audio_suffix(header) that sniffs the container from magic bytes.
  • Use it in invoke_speech_to_text instead of the hardcoded .mp3 suffix
    (unhexlify the payload once and reuse it).
  • Add unit tests for the detection helper and an end-to-end test asserting the
    dispatch labels the temp file by format and writes the full payload.
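
The changes listed above amount to: decode the hex payload once, sniff only its first bytes, and label the file by the detected suffix. A minimal sketch of that flow, under stated assumptions: `stage_audio` and `wav_or_mp3` are hypothetical names, the toy detector recognizes WAV only, and `tempfile.mkstemp` stands in for the `NamedTemporaryFile` call used by the SDK.

```python
import binascii
import pathlib
import tempfile
from typing import Callable


def wav_or_mp3(header: bytes) -> str:
    """Toy detector for this sketch only: WAV, else the .mp3 fallback."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return ".wav"
    return ".mp3"


def stage_audio(
    hex_payload: str, detect: Callable[[bytes], str], directory: str
) -> pathlib.Path:
    """Decode a hex payload once and write it to a file labeled by format."""
    audio_bytes = binascii.unhexlify(hex_payload)  # decode once, reuse below
    suffix = detect(audio_bytes[:16])              # only the header is sniffed
    fd, name = tempfile.mkstemp(suffix=suffix, dir=directory)
    with open(fd, "wb") as handle:                 # closing handle closes fd
        handle.write(audio_bytes)                  # the full payload is written
    return pathlib.Path(name)
```

An end-to-end check in the spirit of the added test would assert both the suffix of the returned path and that the file content equals the original payload.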

Pull Request Checklist

Compatibility Check

  • I have checked whether this change affects the backward compatibility of the plugin declared in README.md
  • I have checked whether this change affects the forward compatibility of the plugin declared in README.md
  • If this change introduces a breaking change, I have discussed it with the project maintainer and specified the release version in the README.md — N/A, not a breaking change
  • I have described the compatibility impact and the corresponding version number in the PR description
  • I have checked whether the plugin version is updated in the README.md — N/A, this is an SDK fix with no plugin version in README.md

Compatibility impact: fully backward compatible. Unknown/undetectable audio
still uses .mp3, identical to the previous behavior; only previously-broken
non-mp3 inputs change. No SDK API, manifest, or schema change.

Available Checks

  • just build has passed
  • Relevant documentation has been updated (if necessary) — N/A, no docs/schema change

Note: As an external contributor I do not have permission to self-assign this PR or the linked issue (#347); please assign as appropriate.

…g .mp3

invoke_speech_to_text wrote incoming audio to a NamedTemporaryFile with a
hardcoded .mp3 suffix, so non-mp3 content (m4a/AAC, wav, ogg, flac, webm) was
labeled .mp3 and rejected by OpenAI/Azure Whisper with "Invalid file format".
The model-invoke payload carries only raw bytes (no filename), so detect the
container from the leading magic bytes and pick the matching suffix, falling
back to .mp3 for unknown content (backward compatible).

Adds unit tests for the detection helper and an end-to-end test asserting the
dispatch labels the temp file by format and writes the full payload.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces dynamic audio format detection for speech-to-text dispatching by sniffing the leading magic bytes of raw audio payloads (supporting WAV, FLAC, OGG, M4A, and WebM, with a fallback to MP3). It also adds comprehensive unit and end-to-end tests for this detection. Feedback was provided regarding a potential PermissionError on Windows when reopening a tempfile.NamedTemporaryFile while it is still open, suggesting a cross-platform workaround.

Comment on lines +567 to 571
audio_bytes = binascii.unhexlify(data.file)
suffix = _detect_audio_suffix(audio_bytes[:16])
with tempfile.NamedTemporaryFile(suffix=suffix, mode="wb", delete=True) as temp:
    temp.write(audio_bytes)
    temp.flush()

medium

On Windows platforms, attempting to open a file created by tempfile.NamedTemporaryFile a second time (via pathlib.Path(temp.name).open("rb")) while the temporary file object is still open will raise a PermissionError.

To ensure cross-platform compatibility (especially for Windows users/developers), we can use a custom context manager wrapper that leverages tempfile.TemporaryDirectory to manage the lifecycle of the temporary file, writing and closing the file immediately so it can be safely opened again.

        # Assumes pathlib and tempfile are imported, plus Any from typing.
        class WinSafeTempFile:
            def __init__(self, suffix: str) -> None:
                self._dir = tempfile.TemporaryDirectory()
                self.name = str(pathlib.Path(self._dir.name) / f"temp{suffix}")

            def __enter__(self) -> "WinSafeTempFile":
                return self

            def __exit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
                # Removes the directory and the file inside it.
                self._dir.cleanup()

            def write(self, b: bytes) -> None:
                # Writes and closes the file, so it can be reopened by name.
                pathlib.Path(self.name).write_bytes(b)

            def flush(self) -> None:
                # No-op: write_bytes has already closed the file.
                pass

        audio_bytes = binascii.unhexlify(data.file)
        suffix = _detect_audio_suffix(audio_bytes[:16])
        with WinSafeTempFile(suffix=suffix) as temp:
            temp.write(audio_bytes)
            temp.flush()
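
A lighter alternative to the wrapper class, offered as a sketch rather than a recommendation for this codebase: create the file with `delete=False`, close it before anyone reopens it, and remove it explicitly afterwards. The name `closed_temp_audio` is hypothetical. On Python 3.12 and later, `tempfile.NamedTemporaryFile(delete_on_close=False)` covers the same case without manual cleanup, but whether the SDK can rely on that version is an assumption to verify.

```python
import contextlib
import os
import tempfile
from collections.abc import Iterator


@contextlib.contextmanager
def closed_temp_audio(data: bytes, suffix: str) -> Iterator[str]:
    """Yield the path of a closed temp file holding data; remove it on exit."""
    temp = tempfile.NamedTemporaryFile(suffix=suffix, mode="wb", delete=False)
    try:
        with temp:
            temp.write(data)  # flushed and closed when this block exits
        yield temp.name       # already closed, so reopening works on Windows
    finally:
        # Clean up even if the caller raised or already removed the file.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(temp.name)
```

The caller would then open the yielded path in `"rb"` mode inside the `with` block, exactly where the current code reopens `temp.name`.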

Development

Successfully merging this pull request may close these issues.

Speech-to-text hardcodes ".mp3" temp file suffix, breaking m4a/wav/ogg transcription
