feat(asr): Add Qwen ASR provider and complete settings integration · Docs/qwen asr availability #1029
witherleaves wants to merge 2 commits into WEIFENG2333:master
Conversation
Pull request overview
This PR adds a Qwen-ASR provider to the existing transcription (ASR) stack and wires it into task dispatch, the configuration system, and the settings UI, with accompanying documentation and unit tests. Both local (transformers) and served (vLLM/OpenAI-compatible) inference are supported.
Changes:
- Add the QwenASR implementation and TranscribeModelEnum.QWEN_ASR, and wire them into the transcribe dispatch
- Add Qwen-ASR config items and a settings-page widget, and map the UI config to TranscribeConfig in the task factory
- Add Qwen-ASR unit tests plus usage/research docs, and update the dependency list
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/test_asr/test_qwen_asr.py | Add unit-test coverage for QwenASR result normalization and segmentation logic |
| pyproject.toml | Add qwen-asr/demucs/torchcodec dependencies to support the new capability |
| docs/qwen-asr-fit-and-deployment-report.md | Add a Qwen ASR fit study and deployment recommendation report |
| docs/config/asr.md | Add Qwen-ASR configuration and parameter documentation |
| app/view/setting_interface.py | Add several translation/optimization model input cards to the settings page (coupling risk with this PR's main line) |
| app/thread/transcript_thread.py | Add (optional) demucs vocal-separation pre-processing and cleanup logic for Qwen-ASR in the transcription thread |
| app/core/task_factory.py | Inject Qwen-ASR-specific config when creating transcription tasks, and adjust the word-level timestamp toggle logic |
| app/core/entities.py | Add the QWEN_ASR enum, language capability mapping, and Qwen-ASR config fields/printing logic |
| app/core/asr/transcribe.py | Add a Qwen-ASR branch and default chunking parameters to the ASR instance factory |
| app/core/asr/qwen_asr.py | Add the QwenASR implementation (transformers and vLLM/OpenAI-compatible backends) |
| app/core/asr/asr_data.py | Fix/rewrite Windows long-path prefix handling |
| app/core/asr/init.py | Export QwenASR |
| app/components/transcription_setting_card.py | Wire QwenASRSettingWidget into the transcription settings switcher |
| app/components/QwenASRSettingWidget.py | Add the Qwen-ASR settings widget and a vLLM connectivity test thread |
| app/common/config.py | Add Qwen-ASR config items (backend/model/api/aligner, etc.) |
```python
api_kwargs: dict[str, Any] = {
    "model": self.model_name,
    "response_format": "verbose_json",
    "file": ("audio.mp3", self.file_binary or b"", "audio/mp3"),
    "timestamp_granularities": ["word", "segment"],
}
```
The vLLM/OpenAI request always uploads self.file_binary as ("audio.mp3", ..., "audio/mp3"). In the normal (non-chunked) path self.file_binary is read from a .wav temp file, so the filename/mime can be incorrect and cause server-side decoding issues. Consider using the real file path/extension when audio_input is a path (open the file), or at least set the filename/mime based on the actual source format.
| "GPUtil>=1.4.0", | ||
| "pillow>=12.0.0", | ||
| "fonttools>=4.61.1", | ||
| "qwen-asr", |
qwen-asr is added without any version constraint, but QwenASR explicitly branches on the qwen-asr>=0.0.6 API (Qwen3ASRModel). To avoid installs resolving to older/newer incompatible releases, please add a minimum version (and ideally upper bound if needed) that matches the APIs this implementation supports.
| "qwen-asr", | |
| "qwen-asr>=0.0.6", |
```python
self.splitModelCard = LineEditSettingCard(
    cfg.split_model,
    FIF.ALIGNMENT,
    self.tr("断句模型"),
    self.tr("字幕断句阶段使用的模型,留空则使用 LLM 主模型"),
    self.tr("例如: gpt-5-mini / deepseek-chat"),
    self.translateGroup,
)
```
cfg.split_model is referenced here, but there is no split_model ConfigItem defined in app/common/config.py (repo search shows no definition). This will raise an AttributeError when the settings UI initializes; please add the missing config item (and persistence key), or remove this card.
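One way to resolve this family of missing attributes (the same issue is flagged below for `optimize_model` and `translate_model`) is to declare the ConfigItems in `app/common/config.py`. A sketch, assuming the qfluentwidgets `ConfigItem(group, name, default)` pattern; the group and key names are illustrative, not taken from the repo:

```python
from qfluentwidgets import ConfigItem, QConfig


class Config(QConfig):
    ...
    # Illustrative sketch only: group ("Translate") and key names are assumed.
    # An empty default means "fall back to the main LLM model", per the UI text.
    split_model = ConfigItem("Translate", "SplitModel", "")
    optimize_model = ConfigItem("Translate", "OptimizeModel", "")
    translate_model = ConfigItem("Translate", "TranslateModel", "")
```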
```python
self.optimizeModelCard = LineEditSettingCard(
    cfg.optimize_model,
    FIF.EDIT,
    self.tr("优化模型"),
    self.tr("字幕优化阶段使用的模型,留空则使用 LLM 主模型"),
    self.tr("例如: gpt-5-mini / deepseek-v3"),
    self.translateGroup,
)
```
cfg.optimize_model is referenced here, but it is not defined in app/common/config.py (no matches for optimize_model). This will crash settings initialization with AttributeError; please introduce the corresponding ConfigItem/validator or remove the setting card.
```diff
-self.optimizeModelCard = LineEditSettingCard(
-    cfg.optimize_model,
-    FIF.EDIT,
-    self.tr("优化模型"),
-    self.tr("字幕优化阶段使用的模型,留空则使用 LLM 主模型"),
-    self.tr("例如: gpt-5-mini / deepseek-v3"),
-    self.translateGroup,
-)
```
```python
normalized: dict[str, Any] = dict(result.__dict__)
time_stamps = normalized.get("time_stamps")
if time_stamps is not None:
    items = getattr(time_stamps, "items", time_stamps)
```
_normalize_result() attempts to handle time_stamps being a dict by doing items = getattr(time_stamps, 'items', time_stamps), but for dicts this returns the .items method object (not the result of calling it). Iterating that will raise TypeError. Use time_stamps.items() when isinstance(time_stamps, dict) (or normalize to a list) before iterating.
```diff
-items = getattr(time_stamps, "items", time_stamps)
+if isinstance(time_stamps, dict):
+    # When time_stamps is a dict of segments, iterate over its values.
+    items = time_stamps.values()
+else:
+    # For other mappings/objects, prefer their .items() (if present and callable),
+    # otherwise assume they are already an iterable of items.
+    items = getattr(time_stamps, "items", time_stamps)
+    if callable(items):
+        items = items()
```
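The pitfall is easy to reproduce in isolation: on a dict, `getattr(obj, "items", obj)` returns the bound `.items` method object itself, not the key/value view, and the method object is not iterable:

```python
time_stamps = {"w0": (0.0, 1.2)}

# getattr returns the bound method object, not the result of calling it
items = getattr(time_stamps, "items", time_stamps)
print(callable(items))  # True: we got dict.items, not its items view

try:
    iter(items)
except TypeError:
    print("iterating the method object raises TypeError")
```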
```diff
 from app.core.entities import TranscribeModelEnum, TranscribeOutputFormatEnum, TranscribeTask
 from app.core.utils.logger import setup_logger
-from app.core.utils.video_utils import video2audio
+from app.core.utils.video_utils import separate_vocals_with_demucs, video2audio
```
separate_vocals_with_demucs is imported from app.core.utils.video_utils, but there is no such function defined in the repo (searching for def separate_vocals_with_demucs returns no results). This will raise ImportError at runtime; please either add the missing implementation/export, or remove this import until it exists.
```diff
-from app.core.utils.video_utils import separate_vocals_with_demucs, video2audio
+from app.core.utils.video_utils import video2audio
```
```python
if (
    self.task.transcribe_config.transcribe_model == TranscribeModelEnum.QWEN_ASR
    and self.task.transcribe_config.qwen_asr_vocal_separation
):
    self.progress.emit(12, self.tr("Qwen 人声分离中"))
    logger.info("Qwen ASR 启用人声分离(demucs)")
    asr_audio_path, demucs_temp_dir = separate_vocals_with_demucs(
        temp_audio_path
    )
```
This call relies on separate_vocals_with_demucs, but that function is not implemented anywhere in the codebase right now, so enabling qwen_asr_vocal_separation will crash during transcription. Please add the actual demucs separation helper (and ensure it returns the (asr_audio_path, temp_dir) tuple this code expects), or gate/remove the feature until it is available.
```python
self.translateModelCard = LineEditSettingCard(
    cfg.translate_model,
    FIF.LANGUAGE,
    self.tr("翻译模型"),
    self.tr("LLM 翻译阶段使用的模型,留空则使用 LLM 主模型"),
    self.tr("例如: gemini-2.0-flash / gpt-4o-mini"),
    self.translateGroup,
)
```
cfg.translate_model is referenced here, but there is no translate_model ConfigItem defined in app/common/config.py (repo search shows no definition). This will raise an AttributeError at runtime; please add the missing config item or remove this card from the UI.
```diff
-self.translateModelCard = LineEditSettingCard(
-    cfg.translate_model,
-    FIF.LANGUAGE,
-    self.tr("翻译模型"),
-    self.tr("LLM 翻译阶段使用的模型,留空则使用 LLM 主模型"),
-    self.tr("例如: gemini-2.0-flash / gpt-4o-mini"),
-    self.translateGroup,
-)
```
```python
with tempfile.NamedTemporaryFile(
    suffix=".wav", delete=False
```
When audio_input is bytes, this writes the raw bytes to a temp file with a .wav suffix. In this codebase ChunkedASR exports chunks as MP3 bytes, so for long audio this will often create a .wav file containing MP3 data, which can break downstream decoding. Consider writing with a suffix that matches the actual encoded bytes (e.g., .mp3 for ChunkedASR), or pass bytes directly if the qwen-asr API supports it.
```diff
-with tempfile.NamedTemporaryFile(
-    suffix=".wav", delete=False
+# Choose a file suffix that matches the actual encoded bytes.
+# In this codebase, ChunkedASR exports MP3 bytes when audio_input is bytes.
+suffix = ".mp3" if isinstance(self.audio_input, bytes) else ".wav"
+with tempfile.NamedTemporaryFile(
+    suffix=suffix, delete=False
```
|
Suggest also adding the Forced Alignment model from the Qwen3 ASR series: Qwen3-ForcedAligner-0.6B
It is already integrated in the code.