
feat(asr): Add Qwen ASR provider and complete settings integration (docs/qwen-asr-availability) #1029

Open
witherleaves wants to merge 2 commits into WEIFENG2333:master from witherleaves:docs/qwen-asr-availability

Conversation

@witherleaves

Summary of changes

  • Add a Qwen ASR provider implementation
  • Wire the Qwen ASR option and its configuration items into the ASR settings UI
  • Hook Qwen ASR into the transcription pipeline and the task-factory dispatch logic
  • Update the ASR documentation and the dependency lock file
  • Add unit tests for Qwen ASR

Branch and commit

  • Branch: docs/qwen-asr-availability
  • Commit: cbfbf0f

Notes

  • This change focuses on integrating Qwen ASR and making it available; it does not include refactoring unrelated to the feature.

Copilot AI review requested due to automatic review settings March 5, 2026 07:26

Copilot AI left a comment


Pull request overview

This PR adds a Qwen-ASR provider to the existing transcription (ASR) stack and wires it into task dispatch, the configuration system, and the settings UI, with accompanying documentation and unit tests. It supports two inference modes: local (transformers) and served (vLLM/OpenAI-compatible).

Changes:

  • Add the QwenASR implementation and TranscribeModelEnum.QWEN_ASR, and wire them into transcribe dispatch
  • Add Qwen-ASR configuration items and a settings-page component, and map the UI configuration to TranscribeConfig in the task factory
  • Add Qwen-ASR unit tests and usage/research documentation, and update the dependency list
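The two inference modes described above suggest a small dispatch step between the UI setting and the ASR factory. A minimal sketch (the function name and accepted keys are illustrative, not taken from the PR):

```python
def select_backend(backend: str) -> str:
    """Normalize a user-facing backend setting to one of the two modes.

    Hypothetical helper; the real config keys in app/common/config.py
    may differ.
    """
    key = backend.strip().lower()
    if key in ("transformers", "local"):
        return "transformers"  # local inference via transformers
    if key in ("vllm", "openai", "api"):
        return "vllm"          # served, OpenAI-compatible endpoint
    raise ValueError(f"unknown Qwen ASR backend: {backend!r}")
```

Raising on unknown values (rather than silently defaulting) keeps a misconfigured settings file from quietly selecting the wrong backend.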

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

Per-file summary:

  • tests/test_asr/test_qwen_asr.py: Adds unit-test coverage for QwenASR result normalization and segmentation logic
  • pyproject.toml: Adds qwen-asr/demucs/torchcodec dependencies to support the new capabilities
  • docs/qwen-asr-fit-and-deployment-report.md: Adds a Qwen ASR fit study and deployment recommendation report
  • docs/config/asr.md: Adds Qwen-ASR configuration and parameter documentation
  • app/view/setting_interface.py: Adds several translation/optimization model input cards to the settings page (coupling risk with this PR's main line)
  • app/thread/transcript_thread.py: Adds optional demucs vocal-separation preprocessing and cleanup logic for Qwen-ASR in the transcription thread
  • app/core/task_factory.py: Injects Qwen-ASR-specific configuration when creating transcription tasks and adjusts the word-level timestamp toggle logic
  • app/core/entities.py: Adds the QWEN_ASR enum, the language-capability mapping, and Qwen-ASR config fields and printing logic
  • app/core/asr/transcribe.py: Adds a Qwen-ASR branch and default chunking parameters to the ASR instance-creation factory
  • app/core/asr/qwen_asr.py: Adds the QwenASR implementation (transformers and vLLM/OpenAI-compatible backends)
  • app/core/asr/asr_data.py: Fixes/rewrites the Windows long-path prefix handling logic
  • app/core/asr/__init__.py: Exports QwenASR
  • app/components/transcription_setting_card.py: Wires QwenASRSettingWidget into the transcription settings switcher component
  • app/components/QwenASRSettingWidget.py: Adds the Qwen-ASR settings-page component and a vLLM connectivity-test thread
  • app/common/config.py: Adds Qwen-ASR configuration items (backend, model, API, aligner, etc.)


Comment on lines +97 to +102
api_kwargs: dict[str, Any] = {
"model": self.model_name,
"response_format": "verbose_json",
"file": ("audio.mp3", self.file_binary or b"", "audio/mp3"),
"timestamp_granularities": ["word", "segment"],
}

Copilot AI Mar 5, 2026


The vLLM/OpenAI request always uploads self.file_binary as ("audio.mp3", ..., "audio/mp3"). In the normal (non-chunked) path self.file_binary is read from a .wav temp file, so the filename/mime can be incorrect and cause server-side decoding issues. Consider using the real file path/extension when audio_input is a path (open the file), or at least set the filename/mime based on the actual source format.

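One way to address this, as a sketch rather than the PR's actual code (`build_file_field` is a hypothetical helper), is to derive the filename and MIME type from the real source path when one is available, and wrap raw bytes as WAV, matching what the non-chunked path reads according to the review:

```python
import mimetypes
from pathlib import Path
from typing import Tuple, Union

def build_file_field(audio_input: Union[str, Path, None],
                     file_binary: bytes) -> Tuple[str, bytes, str]:
    """Build the (filename, data, mime) tuple for an OpenAI-compatible
    audio upload instead of hard-coding ("audio.mp3", ..., "audio/mp3")."""
    if isinstance(audio_input, (str, Path)):
        path = Path(audio_input)
        # Guess the MIME type from the real extension; fall back to a
        # generic binary type rather than a wrong audio type.
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        return (path.name, path.read_bytes(), mime)
    # Assumption from the review above: the non-chunked path holds WAV bytes.
    return ("audio.wav", file_binary or b"", "audio/wav")
```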
"GPUtil>=1.4.0",
"pillow>=12.0.0",
"fonttools>=4.61.1",
"qwen-asr",

Copilot AI Mar 5, 2026


qwen-asr is added without any version constraint, but QwenASR explicitly branches on the qwen-asr>=0.0.6 API (Qwen3ASRModel). To avoid installs resolving to older/newer incompatible releases, please add a minimum version (and ideally upper bound if needed) that matches the APIs this implementation supports.

Suggested change
"qwen-asr",
"qwen-asr>=0.0.6",

Comment on lines +119 to +126
self.splitModelCard = LineEditSettingCard(
cfg.split_model,
FIF.ALIGNMENT,
self.tr("断句模型"),
self.tr("字幕断句阶段使用的模型,留空则使用 LLM 主模型"),
self.tr("例如: gpt-5-mini / deepseek-chat"),
self.translateGroup,
)

Copilot AI Mar 5, 2026


cfg.split_model is referenced here, but there is no split_model ConfigItem defined in app/common/config.py (repo search shows no definition). This will raise an AttributeError when the settings UI initializes; please add the missing config item (and persistence key), or remove this card.

Comment on lines +127 to +134
self.optimizeModelCard = LineEditSettingCard(
cfg.optimize_model,
FIF.EDIT,
self.tr("优化模型"),
self.tr("字幕优化阶段使用的模型,留空则使用 LLM 主模型"),
self.tr("例如: gpt-5-mini / deepseek-v3"),
self.translateGroup,
)

Copilot AI Mar 5, 2026


cfg.optimize_model is referenced here, but it is not defined in app/common/config.py (no matches for optimize_model). This will crash settings initialization with AttributeError; please introduce the corresponding ConfigItem/validator or remove the setting card.

Suggested change
self.optimizeModelCard = LineEditSettingCard(
cfg.optimize_model,
FIF.EDIT,
self.tr("优化模型"),
self.tr("字幕优化阶段使用的模型,留空则使用 LLM 主模型"),
self.tr("例如: gpt-5-mini / deepseek-v3"),
self.translateGroup,
)

normalized: dict[str, Any] = dict(result.__dict__)
time_stamps = normalized.get("time_stamps")
if time_stamps is not None:
items = getattr(time_stamps, "items", time_stamps)

Copilot AI Mar 5, 2026


_normalize_result() attempts to handle time_stamps being a dict by doing items = getattr(time_stamps, 'items', time_stamps), but for dicts this returns the .items method object (not the result of calling it). Iterating that will raise TypeError. Use time_stamps.items() when isinstance(time_stamps, dict) (or normalize to a list) before iterating.

Suggested change
items = getattr(time_stamps, "items", time_stamps)
if isinstance(time_stamps, dict):
# When time_stamps is a dict of segments, iterate over its values.
items = time_stamps.values()
else:
# For other mappings/objects, prefer their .items() (if present and callable),
# otherwise assume they are already an iterable of items.
items = getattr(time_stamps, "items", time_stamps)
if callable(items):
items = items()

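The fix reduces to a small normalization step (hypothetical helper name) that handles dicts explicitly. The pitfall is that for a dict, `getattr(d, "items", d)` returns the bound `.items` method object, not the pairs, so iterating it raises TypeError:

```python
def normalize_items(time_stamps):
    """Normalize time_stamps into a list of entries.

    Dicts are iterated by value; objects exposing a callable .items()
    have it called; anything else is assumed to already be an iterable
    of entries.
    """
    if isinstance(time_stamps, dict):
        return list(time_stamps.values())
    items = getattr(time_stamps, "items", time_stamps)
    return list(items() if callable(items) else items)
```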
from app.core.entities import TranscribeModelEnum, TranscribeOutputFormatEnum, TranscribeTask
from app.core.utils.logger import setup_logger
from app.core.utils.video_utils import video2audio
from app.core.utils.video_utils import separate_vocals_with_demucs, video2audio

Copilot AI Mar 5, 2026


separate_vocals_with_demucs is imported from app.core.utils.video_utils, but there is no such function defined in the repo (searching for def separate_vocals_with_demucs returns no results). This will raise ImportError at runtime; please either add the missing implementation/export, or remove this import until it exists.

Suggested change
from app.core.utils.video_utils import separate_vocals_with_demucs, video2audio
from app.core.utils.video_utils import video2audio

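Until the helper lands, the import could be gated so the feature degrades instead of raising ImportError at module load. A sketch (the lazy resolver and fall-back-to-original-audio behavior are assumptions, not the PR's design):

```python
from typing import Callable, Optional, Tuple

SeparatorT = Callable[[str], Tuple[str, str]]

def resolve_vocal_separator() -> Optional[SeparatorT]:
    """Return the demucs helper if it exists, else None."""
    try:
        from app.core.utils.video_utils import separate_vocals_with_demucs
        return separate_vocals_with_demucs
    except ImportError:
        return None

def maybe_separate_vocals(audio_path: str,
                          separator: Optional[SeparatorT] = None):
    """Run vocal separation when available; otherwise return the original
    audio path with no temp dir, instead of crashing the transcription."""
    separator = separator or resolve_vocal_separator()
    if separator is None:
        return audio_path, None
    return separator(audio_path)
```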
Comment on lines +110 to +118
if (
self.task.transcribe_config.transcribe_model == TranscribeModelEnum.QWEN_ASR
and self.task.transcribe_config.qwen_asr_vocal_separation
):
self.progress.emit(12, self.tr("Qwen 人声分离中"))
logger.info("Qwen ASR 启用人声分离(demucs)")
asr_audio_path, demucs_temp_dir = separate_vocals_with_demucs(
temp_audio_path
)

Copilot AI Mar 5, 2026


This call relies on separate_vocals_with_demucs, but that function is not implemented anywhere in the codebase right now, so enabling qwen_asr_vocal_separation will crash during transcription. Please add the actual demucs separation helper (and ensure it returns the (asr_audio_path, temp_dir) tuple this code expects), or gate/remove the feature until it is available.

Comment on lines +135 to +142
self.translateModelCard = LineEditSettingCard(
cfg.translate_model,
FIF.LANGUAGE,
self.tr("翻译模型"),
self.tr("LLM 翻译阶段使用的模型,留空则使用 LLM 主模型"),
self.tr("例如: gemini-2.0-flash / gpt-4o-mini"),
self.translateGroup,
)

Copilot AI Mar 5, 2026


cfg.translate_model is referenced here, but there is no translate_model ConfigItem defined in app/common/config.py (repo search shows no definition). This will raise an AttributeError at runtime; please add the missing config item or remove this card from the UI.

Suggested change
self.translateModelCard = LineEditSettingCard(
cfg.translate_model,
FIF.LANGUAGE,
self.tr("翻译模型"),
self.tr("LLM 翻译阶段使用的模型,留空则使用 LLM 主模型"),
self.tr("例如: gemini-2.0-flash / gpt-4o-mini"),
self.translateGroup,
)

Comment on lines +329 to +330
with tempfile.NamedTemporaryFile(
suffix=".wav", delete=False

Copilot AI Mar 5, 2026


When audio_input is bytes, this writes the raw bytes to a temp file with a .wav suffix. In this codebase ChunkedASR exports chunks as MP3 bytes, so for long audio this will often create a .wav file containing MP3 data, which can break downstream decoding. Consider writing with a suffix that matches the actual encoded bytes (e.g., .mp3 for ChunkedASR), or pass bytes directly if the qwen-asr API supports it.

Suggested change
with tempfile.NamedTemporaryFile(
suffix=".wav", delete=False
# Choose a file suffix that matches the actual encoded bytes.
# In this codebase, ChunkedASR exports MP3 bytes when audio_input is bytes.
suffix = ".mp3" if isinstance(self.audio_input, bytes) else ".wav"
with tempfile.NamedTemporaryFile(
suffix=suffix, delete=False

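As an alternative to hard-coding the suffix, the bytes can be sniffed by their magic numbers. A sketch (hypothetical helper; it covers only the WAV and MP3 payloads this codebase produces):

```python
def guess_audio_suffix(data: bytes) -> str:
    """Pick a temp-file suffix from the magic bytes of an audio payload."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return ".wav"  # RIFF/WAVE container
    if data[:3] == b"ID3":
        return ".mp3"  # MP3 with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return ".mp3"  # raw MPEG frame sync word
    return ".bin"      # unknown: stay neutral rather than mislabel
```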
@lex0301

lex0301 commented Mar 5, 2026

Suggest also adding the Forced Alignment model from the Qwen3 ASR series, Qwen3-ForcedAligner-0.6B; the subtitle timestamps it produces will be more accurate.

@witherleaves
Author

> Suggest also adding the Forced Alignment model from the Qwen3 ASR series, Qwen3-ForcedAligner-0.6B; the subtitle timestamps it produces will be more accurate.

It is already integrated in the code.
