[bugfix]: fix answer extraction regex and evaluator bugs in MMLU-Pro by lvhua6352 · Pull Request #269 · AISBench/benchmark

lvhua6352 · 2026-05-02T11:06:01Z

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

Related Issue
Fixes #(issue ID) / Relates to #(issue ID)
Fixes #125

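🔍 Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

While using ais_bench to evaluate the accuracy of Qwen3-Next-80B-A3B-Instruct on the MMLU-Pro dataset, the results differed substantially from the official score. Investigation showed problems in the framework's answer extraction and matching logic.

Answer extraction regex mismatches: the regex (?i)ANSWER\s*:\s*([A-P]) used by match_answer_pattern is too permissive. It wrongly matches Final Answer: and Answer: inside the reasoning, so outputs in which the model gave the correct answer are judged wrong.
Evaluator logic flaws: MMLUProBaseEvaluator.is_equal() lacks an exact-match guard and uses an unbounded split('. '). As a result, a prediction consisting of the full option text is misjudged, and reference text containing multiple periods crashes with a ValueError.
Inaccurate file name: the 0-shot configuration uses a prompt template that contains a Chain-of-Thought (step-by-step reasoning) instruction (Think step by step before answering), but the original file name mmlu_pro_gen_0_shot_str.py does not reflect the CoT behavior, which makes it hard to tell apart from other plain 0-shot configurations.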

📝 Modification

Please briefly describe what modification is made in this PR.

Bug 1: Fix match_answer_pattern regex to use word boundaries (\b)

Root cause: the regex (?i)ANSWER\s*:\s*([A-P]) has no word-boundary protection, which causes two kinds of mismatch:

Case A: "Final Answer:" mismatch

Model output fragment:

### Final Answer:
ANSWER: E

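Matching process: (?i)ANSWER matches the substring Answer in Final Answer:, \s* consumes the newline after the colon, and ([A-P]) captures the first letter A of ANSWER: E on the next line, so the correct answer E is extracted as A.

Case B: "Answer:" inside the reasoning is captured early

While looking up the reference answer during its reasoning, the model outputs: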

What is released at the neuromuscular junction?
Answer: Acetylcholine
...
ANSWER: D

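The regex takes the A from the first Answer: Acetylcholine and ignores the final, correct ANSWER: D.

Fix: change the regex to (?i)\bANSWER\s*:\s*([A-P])\b

In Final Answer:\nANSWER: E, the A is followed by the letter N, which is not a word boundary, so it no longer matches.
In Answer: Acetylcholine, the A is followed by c (a word character), which is not a word boundary, so it no longer matches.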
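
The two cases above can be reproduced with Python's standard `re` module. This is a minimal sketch: the helper `extract` is a hypothetical stand-in for `match_answer_pattern`, whose real body is not shown in this PR description, and it assumes the first match is the one returned.

```python
import re

# Patterns quoted in the PR description.
OLD_PATTERN = r"(?i)ANSWER\s*:\s*([A-P])"
NEW_PATTERN = r"(?i)\bANSWER\s*:\s*([A-P])\b"


def extract(pattern, text):
    """Hypothetical stand-in for match_answer_pattern: return the first captured letter."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Case A: a "Final Answer:" heading followed by the real answer line.
case_a = "### Final Answer:\nANSWER: E"

# Case B: an "Answer:" inside the reasoning, before the real answer line.
case_b = (
    "What is released at the neuromuscular junction?\n"
    "Answer: Acetylcholine\n"
    "...\n"
    "ANSWER: D"
)

for name, text in (("case A", case_a), ("case B", case_b)):
    print(name, "old:", extract(OLD_PATTERN, text), "new:", extract(NEW_PATTERN, text))
```

In both cases the trailing `\b` does the work: the letter after the colon must be a complete word on its own, so the `A` of `ANSWER` or `Acetylcholine` is rejected and the search moves on to the final answer line.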

Bug 2: Fix MMLUProBaseEvaluator.is_equal()

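Case A: full option text in the model output is misjudged

When the model outputs the full answer directly, e.g. "A. gross error.", it should count as an exact match with the full reference text "A. gross error.\n". The original code runs refer.split('. ') right away and then compares the option letter and the option text separately. It cannot handle pred being equal to the whole refer, so a correct output is judged wrong.

Fix: add an exact-match check up front: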

if pred.strip() == refer.strip():
    return True

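Case B: reference text with multiple periods raises ValueError

Reference text such as "B. False, True. This is because..." contains more than one ". " separator. The original refer.split('. ') returns more than 2 elements, so refer_option, refer_string = ... raises ValueError: too many values to unpack.

Fix: change split('. ') to split('. ', 1) so the string is split only once.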
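
Both fixes can be illustrated together. The function below is only a sketch under stated assumptions: `is_equal_sketch` is a hypothetical name, and the final comparison (prediction equals either the option letter or the option text) is inferred from the description above rather than copied from `MMLUProBaseEvaluator.is_equal()`.

```python
def is_equal_sketch(pred: str, refer: str) -> bool:
    """Hypothetical reconstruction of the fixed comparison, not the actual evaluator code."""
    # Fix for case A: the full option text matches the full reference.
    if pred.strip() == refer.strip():
        return True
    try:
        # Fix for case B: maxsplit=1 keeps later ". " occurrences inside the option text.
        refer_option, refer_string = refer.split(". ", 1)
    except ValueError:
        # The reference has no ". " separator, so there is nothing to compare against.
        return False
    # Assumed comparison: accept either the option letter or the option text.
    return pred.strip() in (refer_option.strip(), refer_string.strip())


reference = "B. False, True. This is because..."

# The unbounded split yields three parts, which cannot be unpacked into two names.
try:
    option, text = reference.split(". ")
    unbounded_split_failed = False
except ValueError:
    unbounded_split_failed = True

print(unbounded_split_failed)
print(is_equal_sketch("A. gross error.", "A. gross error.\n"))
print(is_equal_sketch("B", reference))
```

Bounding the split at one also preserves the rest of the reference as a single string, so a prediction that repeats the option text alone can still be compared against it.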

3. Rename mmlu_pro_gen_0_shot_str.py to mmlu_pro_gen_0_shot_cot_str.py

The prompt template of the 0-shot configuration explicitly asks the model to reason step by step:

Think step by step before answering.

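This is a standard Chain-of-Thought (CoT) prompting strategy. Renaming the file to ..._cot_str.py reflects the configuration more accurately and distinguishes it from plain direct-answer (non-CoT) 0-shot configurations. All references were updated as well: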

all_dataset_configs.py
README.md
README_en.md
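
After a rename like this, one way to confirm that no reference to the old module name is left behind is a small scan of the checkout. The helper below is a hypothetical illustration and not part of ais_bench; the file suffixes it scans and the function name are assumptions.

```python
from pathlib import Path


def find_stale_references(root, old_name="mmlu_pro_gen_0_shot_str", suffixes=(".py", ".md")):
    """Return (path, line number) pairs for lines that still mention old_name."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            # The new name mmlu_pro_gen_0_shot_cot_str does not contain the old
            # name as a substring, so updated references are not reported.
            if old_name in line:
                hits.append((str(path), lineno))
    return hits

# Example usage from the repository root:
#   for path, lineno in find_stale_references("."):
#       print(f"{path}:{lineno}")
```

An empty result suggests that the configuration list and both README files were updated consistently.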

📐 Associated Test Results

Please provide links to the related test results, such as CI pipelines, test reports, etc.

The full evaluation results after the fix are shown below; they are broadly in line with the official score of 80.6.

dataset	metric	mode	vllm-api-stream-chat
mmlu_pro_math	accuracy	gen	90.45
mmlu_pro_physics	accuracy	gen	83.68
mmlu_pro_chemistry	accuracy	gen	79.59
mmlu_pro_law	accuracy	gen	60.85
mmlu_pro_engineering	accuracy	gen	56.76
mmlu_pro_other	accuracy	gen	76.08
mmlu_pro_economics	accuracy	gen	84.36
mmlu_pro_health	accuracy	gen	76.53
mmlu_pro_psychology	accuracy	gen	80.58
mmlu_pro_business	accuracy	gen	82.51
mmlu_pro_biology	accuracy	gen	89.82
mmlu_pro_philosophy	accuracy	gen	74.75
mmlu_pro_computer_science	accuracy	gen	83.66
mmlu_pro_history	accuracy	gen	69.29
mmlu_pro	naive_average	gen	77.78
mmlu_pro-weighted	weighted_average	gen	78.03
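
As a quick arithmetic check, the naive_average row can be recomputed as the unweighted mean of the 14 per-category accuracies in the table. This sketch only covers the naive average; the weighted average depends on per-category sample counts, which are not listed here, so it is not reproduced.

```python
# Per-category accuracies copied from the results table above.
scores = {
    "math": 90.45,
    "physics": 83.68,
    "chemistry": 79.59,
    "law": 60.85,
    "engineering": 56.76,
    "other": 76.08,
    "economics": 84.36,
    "health": 76.53,
    "psychology": 80.58,
    "business": 82.51,
    "biology": 89.82,
    "philosophy": 74.75,
    "computer_science": 83.66,
    "history": 69.29,
}

# Unweighted mean over categories.
naive_average = sum(scores.values()) / len(scores)
print(f"{naive_average:.2f}")  # 77.78
```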

⚠️ BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

⚠️ Performance degradation (Optional)

If the modification introduces performance degradation, please describe the impact of the performance degradation and the expected performance improvement.

🌟 Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

✅ Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

Suggested Reviewers: @xxx
Relevant Module Owners: @xxx
Other Collaboration Notes:

🌟 Useful CI Command

Command	Introduction
`/gemini review`	Performs a code review for the current pull request in its current state by Gemini.
`/gemini summary`	Provides a summary of the current pull request in its current state by Gemini.
`/gemini help`	Displays a list of available commands of Gemini.
`/readthedocs build`	Triggers a build of the documentation for the current pull request in its current state by Read the Docs.

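**PR Type**

- [x] Bugfix

**Related Issue**

N/A

## 🔍 Motivation

This PR fixes several critical bugs in the 0-shot CoT and 5-shot evaluation of the mmlu_pro dataset:

1. **Answer extraction regex mismatches**: the regex `(?i)ANSWER\s*:\s*([A-P])` used by `match_answer_pattern` is too permissive. It wrongly matches `Final Answer:` and `Answer:` inside the reasoning, so outputs in which the model gave the correct answer are judged wrong.
2. **Evaluator logic flaws**: `MMLUProBaseEvaluator.is_equal()` lacks an exact-match guard and uses an unbounded `split('. ')`. As a result, a prediction consisting of the full option text is misjudged, and reference text containing multiple periods crashes with a ValueError.
3. **Inaccurate file name**: the 0-shot configuration uses a prompt template that contains a Chain-of-Thought (step-by-step reasoning) instruction (`Think step by step before answering`), but the original file name `mmlu_pro_gen_0_shot_str.py` does not reflect the CoT behavior, which makes it hard to tell apart from other plain 0-shot configurations.

## 📝 Modification

### Bug 1: Fix match_answer_pattern regex to use word boundaries (\b)

**Root cause**: the regex `(?i)ANSWER\s*:\s*([A-P])` has no word-boundary protection, which causes two kinds of mismatch:

**Case A: "Final Answer:" mismatch (7 samples)**

Model output fragment:

```
### Final Answer:
ANSWER: E
```

Matching process: `(?i)ANSWER` matches the substring `Answer` in `Final Answer:`, `\s*` consumes the newline after the colon, and `([A-P])` captures the first letter `A` of `ANSWER: E` on the next line, so the correct answer E is extracted as A. Affected real test samples:

- engineering ID=5 (gold=B, framework extracted O, model actually output ANSWER: B)
- chemistry ID=4 (gold=E, framework extracted A, model actually output ANSWER: E)
- physics ID=9 (gold=J, framework extracted A, model actually output ANSWER: J)

**Case B: "Answer:" inside the reasoning is captured early (1 sample)**

While looking up the reference answer during its reasoning, the model outputs:

```
What is released at the neuromuscular junction?
Answer: Acetylcholine
...
ANSWER: D
```

The regex takes the `A` from the first `Answer: Acetylcholine` and ignores the final, correct `ANSWER: D`.

**Fix**: change the regex to `(?i)\bANSWER\s*:\s*([A-P])\b`

- Leading `\b`: `ANSWER` must start at a word boundary, so it cannot match inside a longer word. In `Final Answer:` the word `Answer` is preceded by a space, so the leading `\b` alone does not reject that heading.
- Trailing `\b`: in `Final Answer:\nANSWER: E` the candidate letter `A` is followed by `N`, and in `Answer: Acetylcholine` the `A` is followed by `c`. Both are word characters, not word boundaries, so neither matches any more.

**Verification**: 0-shot mmlu_pro accuracy rises from 76.43% to **82.14%**, and all 8 framework misjudgments are eliminated.

### Bug 2: Fix MMLUProBaseEvaluator.is_equal()

**Case A: full option text in the model output is misjudged**

When the model outputs the full answer directly, e.g. `"A. gross error."`, it should count as an exact match with the full reference text `"A. gross error.\n"`. The original code runs `refer.split('. ')` right away and then compares the option letter and the option text separately. It cannot handle pred being equal to the whole refer, so a correct output is judged wrong.

**Fix**: add an exact-match check up front:

```python
if pred.strip() == refer.strip():
    return True
```

**Case B: reference text with multiple periods raises ValueError**

Reference text such as `"B. False, True. This is because..."` contains more than one `. ` separator. The original `refer.split('. ')` returns more than 2 elements, so `refer_option, refer_string = ...` raises `ValueError: too many values to unpack`.

**Fix**: change `split('. ')` to `split('. ', 1)` so the string is split only once.

### 3. Rename mmlu_pro_gen_0_shot_str.py to mmlu_pro_gen_0_shot_cot_str.py

The prompt template of the 0-shot configuration explicitly asks the model to reason step by step:

```
Think step by step before answering.
```

This is a standard Chain-of-Thought (CoT) prompting strategy. Renaming the file to `..._cot_str.py` reflects the configuration more accurately and distinguishes it from plain direct-answer (non-CoT) 0-shot configurations. All references were updated as well:

- `all_dataset_configs.py`
- `README.md`
- `README_en.md`

## 📐 Associated Test Results

**0-shot mmlu_pro CoT** (10 prompts x 14 categories, max_out_len=4096, Qwen3-Next-80B-A3B-Instruct):

| Metric | Before Fix | After Fix |
|--------|-----------|-----------|
| naive_average | 76.43% | **82.14%** |
| Framework misjudgments | 8 | **0** |

**5-shot mmlu_pro** (10 prompts x 14 categories):

- Fixed complete-answer-text false negatives
- Fixed multi-sentence reference ValueError crashes

## ⚠️ BC-breaking (Optional)

None. The regex change is backward-compatible for models that already output `ANSWER: X` correctly.

## ✅ Checklist

- [x] The modification is covered by manual testing on real model inference.
- [x] All relevant documentation (README.md, README_en.md) has been updated.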

gemini-code-assist

Code Review

This pull request renames the MMLU-Pro 0-shot configuration to include 'cot' and updates the documentation and imports accordingly. The answer extraction regex is refined with word boundaries, and the evaluation logic in is_equal is improved with an exact match check and limited splitting. Feedback suggests making the string comparisons case-insensitive and narrowing the exception handling to improve robustness.

github-actions Bot added bugfix feature labels May 2, 2026

gemini-code-assist Bot reviewed May 2, 2026

Comment thread ais_bench/benchmark/datasets/mmlu_pro.py

github-actions Bot removed the feature label May 2, 2026

lvhua6352 wants to merge 1 commit into AISBench:master from lvhua6352:master
