Summary
invoke_speech_to_text in src/dify_plugin/core/plugin_executor.py writes the
incoming audio to a NamedTemporaryFile(suffix=".mp3") regardless of the real
audio container, so non-mp3 content (m4a/AAC, wav, ogg, flac, webm) is sent
labeled as .mp3 and rejected by OpenAI-compatible / Azure OpenAI Whisper with:
Invalid file format. Supported formats: ['flac','m4a','mp3','mp4','mpeg','mpga','oga','ogg','wav','webm']
These formats are officially supported by Whisper, but the SDK silently
mislabels them. Because the error blames the input file, users wrongly think
their audio is corrupted when the defect is in the SDK.
Affected path
src/dify_plugin/core/plugin_executor.py → invoke_speech_to_text (the hardcoded suffix=".mp3").
This dispatch is shared by every speech2text provider, so it impacts all of them
(e.g. the openai_api_compatible model in dify-official-plugins).
Root cause
The model-invoke payload carries only the raw bytes (no filename/extension):
ModelInvokeSpeech2TextRequest.file is a hex string. Whisper endpoints infer the
format from the multipart filename extension, so the hardcoded .mp3 makes them
decode non-mp3 audio as mp3 and fail.
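The mislabeling can be reproduced in isolation. The snippet below is a simplified sketch, not the actual SDK code: write_temp_audio is a hypothetical stand-in for the temp-file step inside invoke_speech_to_text, and the WAV header bytes are made up for illustration.

```python
import os
import tempfile


def write_temp_audio(file_hex: str) -> str:
    """Hypothetical stand-in for the temp-file step in invoke_speech_to_text."""
    # The request carries the audio as a hex string with no filename attached.
    audio = bytes.fromhex(file_hex)
    # The suffix is fixed, so the content never influences the file name.
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
        tmp.write(audio)
        return tmp.name


# Illustrative RIFF/WAVE header; clearly not mp3 data.
wav_header = b"RIFF\x24\x00\x00\x00WAVEfmt "
path = write_temp_audio(wav_header.hex())
print(os.path.basename(path))  # the name ends in .mp3 although the bytes are WAV
os.remove(path)
```

Any provider that uploads this file by name passes the .mp3 extension on to the endpoint.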
Proposed fix
Detect the container from the leading magic bytes and pick the matching suffix
(.wav/.flac/.ogg/.m4a/.webm), falling back to .mp3 for unknown or undetectable
content so existing behavior is preserved (backward compatible).
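A minimal sketch of such a detector, under stated assumptions: detect_audio_suffix is a hypothetical helper name, not an existing SDK function. Two of the mappings are simplifications. The "ftyp" box marks any ISO base media file, so video mp4 would also be labeled .m4a, and the EBML header is shared by Matroska and WebM.

```python
def detect_audio_suffix(data: bytes, default: str = ".mp3") -> str:
    """Guess a file suffix from leading magic bytes (hypothetical helper)."""
    head = data[:16]
    # RIFF container with a WAVE form type at offset 8.
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return ".wav"
    # Native FLAC stream marker.
    if head[:4] == b"fLaC":
        return ".flac"
    # Ogg page capture pattern.
    if head[:4] == b"OggS":
        return ".ogg"
    # ISO base media file: "ftyp" box type at offset 4 (assumed to be m4a here).
    if head[4:8] == b"ftyp":
        return ".m4a"
    # EBML header, used by Matroska and WebM (assumed to be webm here).
    if head[:4] == b"\x1a\x45\xdf\xa3":
        return ".webm"
    # Unknown or too-short content keeps the existing .mp3 behavior.
    return default
```

The temp file would then be created with suffix=detect_audio_suffix(audio) in place of the hardcoded value. mp3 data, including ID3-tagged files, matches none of the signatures and still gets .mp3.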
Repro
Configure an OpenAI-compatible / Azure Whisper speech2text model and send a valid
.m4a (AAC) file → BadRequest: Invalid file format.