Speech-to-text hardcodes ".mp3" temp file suffix, breaking m4a/wav/ogg transcription #347

@Kota-Maeda

Summary

invoke_speech_to_text in src/dify_plugin/core/plugin_executor.py writes the
incoming audio to a NamedTemporaryFile(suffix=".mp3") regardless of the real
audio container, so non-mp3 content (m4a/AAC, wav, ogg, flac, webm) is sent
labeled as .mp3 and rejected by OpenAI-compatible / Azure OpenAI Whisper with:

Invalid file format. Supported formats: ['flac','m4a','mp3','mp4','mpeg','mpga','oga','ogg','wav','webm']

These formats are officially supported by Whisper, but the SDK silently
mislabels them. Because the error blames the input file, users wrongly think
their audio is corrupted when the defect is in the SDK.

Affected path

src/dify_plugin/core/plugin_executor.py: invoke_speech_to_text (the hardcoded suffix=".mp3").
This dispatch is shared by every speech2text provider, so it impacts all of them
(e.g. the openai_api_compatible model in dify-official-plugins).

Root cause

The model-invoke payload carries only the raw bytes (no filename/extension):
ModelInvokeSpeech2TextRequest.file is a hex string. Whisper endpoints infer the
format from the multipart filename extension, so the hardcoded .mp3 makes them
decode non-mp3 audio as mp3 and fail.
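
To make the failure mode concrete, here is a minimal sketch, not the SDK's actual code. The helper name write_temp_audio and the 12-byte fake WAV header are hypothetical; they only illustrate that a hex payload carries no filename, so the temp file ends in .mp3 whatever the bytes contain.

```python
import binascii
import os
import tempfile


def write_temp_audio(hex_payload: str, suffix: str = ".mp3") -> str:
    """Hypothetical stand-in for the hardcoded-suffix behavior described above."""
    data = binascii.unhexlify(hex_payload)  # payload is a hex string, no filename
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name


# First 12 bytes of a RIFF/WAVE file, hex-encoded the way the payload would be.
wav_header_hex = (b"RIFF" + b"\x24\x00\x00\x00" + b"WAVE").hex()

path = write_temp_audio(wav_header_hex)
print(os.path.basename(path).endswith(".mp3"))  # True, although the bytes are WAV
os.unlink(path)
```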

Proposed fix

Detect the container from the leading magic bytes and pick the matching suffix
(wav/flac/ogg/m4a/webm), falling back to .mp3 for unknown/undetectable content
so existing behavior is preserved (backward compatible).
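
A rough sketch of what such detection could look like, assuming the widely documented container signatures. The function name detect_audio_suffix is hypothetical and not part of the SDK. Two caveats are assumptions of this sketch: the EBML signature is shared by WebM and Matroska, and the "ftyp" box is shared by the whole MP4 family, so mapping them to .webm and .m4a is a simplification (Whisper accepts both m4a and mp4).

```python
def detect_audio_suffix(data: bytes, default: str = ".mp3") -> str:
    """Guess a file suffix from leading magic bytes (hypothetical helper)."""
    head = data[:16]
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return ".wav"
    if head[:4] == b"fLaC":
        return ".flac"
    if head[:4] == b"OggS":
        return ".ogg"
    if head[:4] == b"\x1a\x45\xdf\xa3":  # EBML header: WebM or Matroska
        return ".webm"
    if head[4:8] == b"ftyp":  # ISO base media file: m4a, mp4, and relatives
        return ".m4a"
    # Unknown or too-short content: keep the existing .mp3 behavior.
    return default


# The payload arrives as a hex string, so decode it before sniffing.
hex_payload = (b"fLaC" + b"\x00" * 12).hex()
suffix = detect_audio_suffix(bytes.fromhex(hex_payload))
print(suffix)  # .flac
```

The detected suffix would then replace the literal ".mp3" passed to NamedTemporaryFile, leaving the rest of the dispatch unchanged.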

Repro

Configure an OpenAI-compatible / Azure Whisper speech2text model and send a valid
.m4a (AAC) file → BadRequest: Invalid file format.
