[bugfix]: fix answer extraction regex and evaluator bugs in MMLU-Pro #269
Open
lvhua6352 wants to merge 1 commit into
**PR Type**
- [x] Bugfix
**Related Issue**
N/A
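## 🔍 Motivation
Fix several critical bugs affecting the mmlu_pro dataset in 0-shot CoT and 5-shot evaluation:
1. **Answer extraction regex mismatches**: the regex used by `match_answer_pattern`, `(?i)ANSWER\s*:\s*([A-P])`,
is too permissive. It wrongly matches `Final Answer:` and any `Answer:` inside the reasoning, so correct model outputs are scored as wrong.
2. **Evaluator logic defects**: `MMLUProBaseEvaluator.is_equal()` has no exact-match guard and uses an unbounded
`split('. ')`. As a result, a prediction that contains the full option text is scored as wrong, and reference text containing more than one `. ` crashes with a ValueError.
3. **Inaccurate file name**: the 0-shot config uses a prompt template with a Chain-of-Thought (step-by-step reasoning) instruction
(`Think step by step before answering`), but the original file name `mmlu_pro_gen_0_shot_str.py`
does not reflect the CoT behavior, which makes it hard to tell apart from plain 0-shot configs.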
## 📝 Modification
### Bug 1: Fix match_answer_pattern regex to use word boundaries (\b)
**Root cause**: the regex `(?i)ANSWER\s*:\s*([A-P])` has no word-boundary protection, which causes two kinds of mismatch:
**Case A: "Final Answer:" mismatch (7 samples)**
Model output fragment:
```
### Final Answer:
ANSWER: E
```
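Matching process: `(?i)ANSWER` matches the substring `Answer:` in `Final Answer:`, `\s*` consumes the newline, and
`([A-P])` captures the first letter `A` of `ANSWER: E` on the next line, so the correct answer E is extracted as A.
Affected real test samples:
- engineering ID=5 (gold=B, framework extracted O, model actually output ANSWER: B)
- chemistry ID=4 (gold=E, framework extracted A, model actually output ANSWER: E)
- physics ID=9 (gold=J, framework extracted A, model actually output ANSWER: J)
**Case B: "Answer:" inside the reasoning is captured early (1 sample)**
While searching for the standard answer during its reasoning, the model outputs: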
```
What is released at the neuromuscular junction?
Answer: Acetylcholine
...
ANSWER: D
```
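The regex takes the `A` from the first `Answer: Acetylcholine` and ignores the final, correct `ANSWER: D`.
**Fix**: change the regex to `(?i)\bANSWER\s*:\s*([A-P])\b`
- Leading `\b`: requires `ANSWER` to start at a word boundary, so it cannot match in the middle of a longer word. On its own it does not reject `Final Answer:`, because the space before `Answer` is already a word boundary.
- Trailing `\b`: in `Final Answer:\nANSWER: E` the captured `A` is followed by the letter `N`, and in `Answer: Acetylcholine` it is followed by `c`. Neither position is a word boundary, so both candidates are rejected and the search continues to the final `ANSWER: X` line.
**Verification**: 0-shot mmlu_pro accuracy rises from 76.43% to **82.14%**, and all 8 framework misjudgments are eliminated.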
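The two cases above can be reproduced with a minimal sketch. The helper `extract` is hypothetical and only stands in for whatever `match_answer_pattern` does internally; the assumption is that the framework takes the first match of the pattern.

```python
import re

# Patterns quoted from this PR description.
OLD_PATTERN = re.compile(r"(?i)ANSWER\s*:\s*([A-P])")
NEW_PATTERN = re.compile(r"(?i)\bANSWER\s*:\s*([A-P])\b")


def extract(pattern, text):
    """Hypothetical helper: return the first captured option letter, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Case A: a "Final Answer:" heading followed by the real answer line.
case_a = "### Final Answer:\nANSWER: E"

# Case B: an "Answer:" inside the reasoning, before the real answer line.
case_b = (
    "What is released at the neuromuscular junction?\n"
    "Answer: Acetylcholine\n"
    "...\n"
    "ANSWER: D"
)

for name, text in (("case A", case_a), ("case B", case_b)):
    print(name, "old:", extract(OLD_PATTERN, text), "new:", extract(NEW_PATTERN, text))
```

The trailing `\b` does the work in both cases. A standalone letter after an `Answer:` inside the reasoning (for example `Answer: A`) would still be captured first, so this sketch does not claim the new pattern is robust to every output format.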
### Bug 2: Fix MMLUProBaseEvaluator.is_equal()
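**Case A: a prediction containing the full option text is scored as wrong**
When the model outputs the complete answer, such as `"A. gross error."`, it should count as an exact match against the full reference text `"A. gross error.\n"`.
The original code runs `refer.split('. ')` and then compares the option letter and the option text separately,
so it cannot handle a pred that equals the entire refer, and a correct output is scored as wrong.
**Fix**: add an exact-match check up front: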
```python
if pred.strip() == refer.strip():
    return True
```
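**Case B: reference text with multiple periods raises ValueError**
Reference text such as `"B. False, True. This is because..."` contains more than one `. `, so
the original `refer.split('. ')` returns more than 2 elements and `refer_option, refer_string = ...`
raises `ValueError: too many values to unpack`.
**Fix**: change `split('. ')` to `split('. ', 1)` so the string is split only once.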
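Both fixes can be illustrated with a short sketch. `is_equal_sketch` is a hypothetical stand-in, not the actual body of `MMLUProBaseEvaluator.is_equal()`; it assumes the evaluator accepts a prediction that equals either the option letter or the option text, as described above.

```python
def is_equal_sketch(pred, refer):
    """Hypothetical comparison: exact-match guard first, then a single split."""
    # Fix for case A: a prediction equal to the whole reference is correct.
    if pred.strip() == refer.strip():
        return True
    try:
        # Fix for case B: maxsplit=1 always yields at most two parts.
        refer_option, refer_string = refer.split(". ", 1)
    except ValueError:
        # The reference has no ". " separator at all, so nothing to compare.
        return False
    pred = pred.strip()
    return pred == refer_option.strip() or pred == refer_string.strip()


multi = "B. False, True. This is because..."

# The unbounded split produces three parts, which cannot unpack into two names.
try:
    option, text = multi.split(". ")
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

print(unpack_error)
print(multi.split(". ", 1))
print(is_equal_sketch("A. gross error.", "A. gross error.\n"))
```

Catching only `ValueError` keeps the failure mode narrow, which is in line with the review feedback about narrowing the exception handling; whether the real evaluator should also compare case-insensitively is left open here.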
### 3. Rename mmlu_pro_gen_0_shot_str.py to mmlu_pro_gen_0_shot_cot_str.py
The prompt template of the 0-shot config explicitly asks the model to reason step by step:
```
Think step by step before answering.
```
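This is a standard Chain-of-Thought (CoT) prompting strategy. Renaming the file to `..._cot_str.py` reflects the config's
behavior more accurately and distinguishes it from direct-answer (non-CoT) 0-shot configs. All references were updated as well: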
- `all_dataset_configs.py`
- `README.md`
- `README_en.md`
## 📐 Associated Test Results
**0-shot mmlu_pro CoT** (10 prompts x 14 categories, max_out_len=4096, Qwen3-Next-80B-A3B-Instruct):
| Metric | Before Fix | After Fix |
|--------|-----------|-----------|
| naive_average | 76.43% | **82.14%** |
| Framework misjudgments | 8 | **0** |
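As a sanity check, the accuracy gain in the table is consistent with exactly 8 recovered samples. The sketch below assumes that "10 prompts x 14 categories" means 140 equally weighted samples, so that `naive_average` equals overall accuracy; the correct-answer counts are derived from the reported percentages and are not stated in the PR.

```python
# Assumption: 10 prompts per category, 14 categories, all weighted equally.
TOTAL = 10 * 14
MISJUDGED = 8  # framework misjudgments reported before the fix


def to_percent(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)


# Derive the number of correct samples from the reported "before" accuracy.
before_correct = round(76.43 / 100 * TOTAL)
after_correct = before_correct + MISJUDGED

print(before_correct, to_percent(before_correct, TOTAL))
print(after_correct, to_percent(after_correct, TOTAL))
```

Under these assumptions, recovering the 8 misjudged samples accounts for the whole change from 76.43% to 82.14%.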
**5-shot mmlu_pro** (10 prompts x 14 categories):
- Fixed complete-answer-text false negatives
- Fixed multi-sentence reference ValueError crashes
## ⚠️ BC-breaking (Optional)
None. The regex change is backward-compatible for models that already output `ANSWER: X` correctly.
## ✅ Checklist
- [x] The modification is covered by manual testing on real model inference.
- [x] All relevant documentation (README.md, README_en.md) has been updated.
Code Review
This pull request renames the MMLU-Pro 0-shot configuration to include 'cot' and updates the documentation and imports accordingly. The answer extraction regex is refined with word boundaries, and the evaluation logic in is_equal is improved with an exact match check and limited splitting. Feedback suggests making the string comparisons case-insensitive and narrowing the exception handling to improve robustness.
Related issue: Fixes #125
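Evaluating the accuracy of Qwen3-Next-80B-A3B-Instruct on the MMLU-Pro dataset with ais_bench produced results that differed substantially from the official score, which led to the discovery of problems in the framework's answer extraction and matching logic.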
After the fix, the full evaluation result is broadly consistent with the official score of 80.6.