feat(asr): Add Qwen ASR provider and complete settings integration · Docs/qwen asr availability #1029
witherleaves wants to merge 2 commits into WEIFENG2333:master
Conversation
Pull request overview
This PR adds a Qwen-ASR provider to the existing transcription (ASR) stack and wires it into task dispatch, the configuration system, and the settings UI, with accompanying documentation and unit tests. Both local (transformers) and served (vLLM/OpenAI-compatible) inference are supported.
Changes:
- Add the QwenASR implementation and TranscribeModelEnum.QWEN_ASR, and wire them into the transcribe dispatch
- Add Qwen-ASR config items and a settings-page widget, and map the UI config to TranscribeConfig in the task factory
- Add Qwen-ASR unit tests plus usage/research docs, and update the dependency list
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/test_asr/test_qwen_asr.py | Add unit-test coverage for QwenASR result normalization and segmentation logic |
| pyproject.toml | Add qwen-asr/demucs/torchcodec dependencies to support the new capability |
| docs/qwen-asr-fit-and-deployment-report.md | Add a Qwen ASR fit study and deployment recommendation report |
| docs/config/asr.md | Add Qwen-ASR configuration and parameter documentation |
| app/view/setting_interface.py | Add several translation/optimization model input cards to the settings page (coupling risk with this PR's main line) |
| app/thread/transcript_thread.py | Add (optional) demucs vocal-separation pre-processing and cleanup logic for Qwen-ASR in the transcription thread |
| app/core/task_factory.py | Inject Qwen-ASR-specific config when creating transcription tasks, and adjust the word-level timestamp toggle logic |
| app/core/entities.py | Add the QWEN_ASR enum, language capability mapping, and Qwen-ASR config fields/printing logic |
| app/core/asr/transcribe.py | Add a Qwen-ASR branch and default chunking parameters to the ASR instance factory |
| app/core/asr/qwen_asr.py | Add the QwenASR implementation (transformers and vLLM/OpenAI-compatible backends) |
| app/core/asr/asr_data.py | Fix/rewrite Windows long-path prefix handling |
| app/core/asr/init.py | Export QwenASR |
| app/components/transcription_setting_card.py | Wire QwenASRSettingWidget into the transcription settings switcher |
| app/components/QwenASRSettingWidget.py | Add the Qwen-ASR settings widget and a vLLM connectivity test thread |
| app/common/config.py | Add Qwen-ASR config items (backend/model/api/aligner, etc.) |
```python
api_kwargs: dict[str, Any] = {
    "model": self.model_name,
    "response_format": "verbose_json",
    "file": ("audio.mp3", self.file_binary or b"", "audio/mp3"),
    "timestamp_granularities": ["word", "segment"],
}
```
The vLLM/OpenAI request always uploads self.file_binary as ("audio.mp3", ..., "audio/mp3"). In the normal (non-chunked) path self.file_binary is read from a .wav temp file, so the filename/mime can be incorrect and cause server-side decoding issues. Consider using the real file path/extension when audio_input is a path (open the file), or at least set the filename/mime based on the actual source format.
| "GPUtil>=1.4.0", | ||
| "pillow>=12.0.0", | ||
| "fonttools>=4.61.1", | ||
| "qwen-asr", |
qwen-asr is added without any version constraint, but QwenASR explicitly branches on the qwen-asr>=0.0.6 API (Qwen3ASRModel). To avoid installs resolving to older/newer incompatible releases, please add a minimum version (and ideally upper bound if needed) that matches the APIs this implementation supports.
| "qwen-asr", | |
| "qwen-asr>=0.0.6", |
```python
self.splitModelCard = LineEditSettingCard(
    cfg.split_model,
    FIF.ALIGNMENT,
    self.tr("断句模型"),
    self.tr("字幕断句阶段使用的模型,留空则使用 LLM 主模型"),
    self.tr("例如: gpt-5-mini / deepseek-chat"),
    self.translateGroup,
)
```
cfg.split_model is referenced here, but there is no split_model ConfigItem defined in app/common/config.py (repo search shows no definition). This will raise an AttributeError when the settings UI initializes; please add the missing config item (and persistence key), or remove this card.
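One way to resolve this family of missing attributes (the same issue is flagged below for `optimize_model` and `translate_model`) is to declare the ConfigItems in `app/common/config.py`. A sketch, assuming the qfluentwidgets `ConfigItem(group, name, default)` pattern; the group and key names are illustrative, not taken from the repo:

```python
from qfluentwidgets import ConfigItem, QConfig


class Config(QConfig):
    ...
    # Illustrative sketch only: group ("Translate") and key names are assumed.
    # An empty default means "fall back to the main LLM model", per the UI text.
    split_model = ConfigItem("Translate", "SplitModel", "")
    optimize_model = ConfigItem("Translate", "OptimizeModel", "")
    translate_model = ConfigItem("Translate", "TranslateModel", "")
```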
```python
self.optimizeModelCard = LineEditSettingCard(
    cfg.optimize_model,
    FIF.EDIT,
    self.tr("优化模型"),
    self.tr("字幕优化阶段使用的模型,留空则使用 LLM 主模型"),
    self.tr("例如: gpt-5-mini / deepseek-v3"),
    self.translateGroup,
)
```
cfg.optimize_model is referenced here, but it is not defined in app/common/config.py (no matches for optimize_model). This will crash settings initialization with AttributeError; please introduce the corresponding ConfigItem/validator or remove the setting card.
```diff
-self.optimizeModelCard = LineEditSettingCard(
-    cfg.optimize_model,
-    FIF.EDIT,
-    self.tr("优化模型"),
-    self.tr("字幕优化阶段使用的模型,留空则使用 LLM 主模型"),
-    self.tr("例如: gpt-5-mini / deepseek-v3"),
-    self.translateGroup,
-)
```
```python
normalized: dict[str, Any] = dict(result.__dict__)
time_stamps = normalized.get("time_stamps")
if time_stamps is not None:
    items = getattr(time_stamps, "items", time_stamps)
```
_normalize_result() attempts to handle time_stamps being a dict by doing items = getattr(time_stamps, 'items', time_stamps), but for dicts this returns the .items method object (not the result of calling it). Iterating that will raise TypeError. Use time_stamps.items() when isinstance(time_stamps, dict) (or normalize to a list) before iterating.
```diff
-items = getattr(time_stamps, "items", time_stamps)
+if isinstance(time_stamps, dict):
+    # When time_stamps is a dict of segments, iterate over its values.
+    items = time_stamps.values()
+else:
+    # For other mappings/objects, prefer their .items() (if present and callable),
+    # otherwise assume they are already an iterable of items.
+    items = getattr(time_stamps, "items", time_stamps)
+    if callable(items):
+        items = items()
```
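The pitfall is easy to reproduce in isolation: on a dict, `getattr(obj, "items", obj)` returns the bound `.items` method object itself, not the key/value view, and the method object is not iterable:

```python
time_stamps = {"w0": (0.0, 1.2)}

# getattr returns the bound method object, not the result of calling it
items = getattr(time_stamps, "items", time_stamps)
print(callable(items))  # True: we got dict.items, not its items view

try:
    iter(items)
except TypeError:
    print("iterating the method object raises TypeError")
```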
```diff
 from app.core.entities import TranscribeModelEnum, TranscribeOutputFormatEnum, TranscribeTask
 from app.core.utils.logger import setup_logger
-from app.core.utils.video_utils import video2audio
+from app.core.utils.video_utils import separate_vocals_with_demucs, video2audio
```
separate_vocals_with_demucs is imported from app.core.utils.video_utils, but there is no such function defined in the repo (searching for def separate_vocals_with_demucs returns no results). This will raise ImportError at runtime; please either add the missing implementation/export, or remove this import until it exists.
```diff
-from app.core.utils.video_utils import separate_vocals_with_demucs, video2audio
+from app.core.utils.video_utils import video2audio
```
```python
if (
    self.task.transcribe_config.transcribe_model == TranscribeModelEnum.QWEN_ASR
    and self.task.transcribe_config.qwen_asr_vocal_separation
):
    self.progress.emit(12, self.tr("Qwen 人声分离中"))
    logger.info("Qwen ASR 启用人声分离(demucs)")
    asr_audio_path, demucs_temp_dir = separate_vocals_with_demucs(
        temp_audio_path
    )
```
This call relies on separate_vocals_with_demucs, but that function is not implemented anywhere in the codebase right now, so enabling qwen_asr_vocal_separation will crash during transcription. Please add the actual demucs separation helper (and ensure it returns the (asr_audio_path, temp_dir) tuple this code expects), or gate/remove the feature until it is available.
```python
self.translateModelCard = LineEditSettingCard(
    cfg.translate_model,
    FIF.LANGUAGE,
    self.tr("翻译模型"),
    self.tr("LLM 翻译阶段使用的模型,留空则使用 LLM 主模型"),
    self.tr("例如: gemini-2.0-flash / gpt-4o-mini"),
    self.translateGroup,
)
```
cfg.translate_model is referenced here, but there is no translate_model ConfigItem defined in app/common/config.py (repo search shows no definition). This will raise an AttributeError at runtime; please add the missing config item or remove this card from the UI.
```diff
-self.translateModelCard = LineEditSettingCard(
-    cfg.translate_model,
-    FIF.LANGUAGE,
-    self.tr("翻译模型"),
-    self.tr("LLM 翻译阶段使用的模型,留空则使用 LLM 主模型"),
-    self.tr("例如: gemini-2.0-flash / gpt-4o-mini"),
-    self.translateGroup,
-)
```
```python
with tempfile.NamedTemporaryFile(
    suffix=".wav", delete=False
```
When audio_input is bytes, this writes the raw bytes to a temp file with a .wav suffix. In this codebase ChunkedASR exports chunks as MP3 bytes, so for long audio this will often create a .wav file containing MP3 data, which can break downstream decoding. Consider writing with a suffix that matches the actual encoded bytes (e.g., .mp3 for ChunkedASR), or pass bytes directly if the qwen-asr API supports it.
```diff
-with tempfile.NamedTemporaryFile(
-    suffix=".wav", delete=False
+# Choose a file suffix that matches the actual encoded bytes.
+# In this codebase, ChunkedASR exports MP3 bytes when audio_input is bytes.
+suffix = ".mp3" if isinstance(self.audio_input, bytes) else ".wav"
+with tempfile.NamedTemporaryFile(
+    suffix=suffix, delete=False
```
|
Suggest also adding the Forced Alignment model from the Qwen3 ASR series: Qwen3-ForcedAligner-0.6B
It is already integrated in the code.