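From aea0ea3bd0150f619d53dcc5663af74ecdc1b9ca Mon Sep 17 00:00:00 2001
From: Developer
Date: Fri, 1 May 2026 17:36:44 +0000
Subject: [PATCH] fix(mmlu_pro): fix answer extraction regex and evaluator bugs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

**PR Type**

- [x] Bugfix

**Related Issue**

N/A

## 🔍 Motivation

Fix several critical bugs in the 0-shot CoT and 5-shot evaluation of the mmlu_pro dataset:

1. **Answer-extraction regex false matches**: the regex `(?i)ANSWER\s*:\s*([A-P])` used by `match_answer_pattern` is too permissive. It matches `Final Answer:` and any `Answer:` that appears in the reasoning, so outputs in which the model gives the correct answer are scored as wrong.
2. **Evaluator logic defects**: `MMLUProBaseEvaluator.is_equal()` has no exact-match guard and uses an unbounded `split('. ')`. Outputs that contain the full option text are misjudged, and reference texts with more than one `. ` trigger a ValueError.
3. **Inaccurate file name**: the 0-shot config uses a Chain-of-Thought (step-by-step reasoning) prompt template (`Think step by step before answering`). The original file name `mmlu_pro_gen_0_shot_str.py` does not reflect the CoT behavior, which makes it hard to tell apart from plain 0-shot configs.

## 📝 Modification

### Bug 1: Fix match_answer_pattern regex to use word boundaries (\b)

**Root cause**: the regex `(?i)ANSWER\s*:\s*([A-P])` has no word-boundary protection, which causes two kinds of false match.

**Case A: "Final Answer:" false match (7 samples)**

Model output fragment:

```
### Final Answer:
ANSWER: E
```

How the regex matches: `(?i)ANSWER` matches the substring `Answer:` inside `Final Answer:`, `\s*` consumes the newline, and `([A-P])` captures the first letter `A` of `ANSWER: E` on the next line. The correct answer E is therefore extracted as A.

Affected real test samples:

- engineering ID=5 (gold=B, framework extracted O, model actually output ANSWER: B)
- chemistry ID=4 (gold=E, framework extracted A, model actually output ANSWER: E)
- physics ID=9 (gold=J, framework extracted A, model actually output ANSWER: J)

**Case B: "Answer:" in the reasoning is captured first (1 sample)**

While recalling the reference answer during reasoning, the model output:

```
What is released at the neuromuscular junction? Answer: Acetylcholine
...
ANSWER: D
```

The regex takes the `A` from the first `Answer: Acetylcholine` and ignores the final, correct `ANSWER: D`.

**Fix**: change the regex to `(?i)\bANSWER\s*:\s*([A-P])\b`

- Leading `\b`: `ANSWER` must start at a word boundary, so it no longer matches inside a longer word such as `FinalAnswer:`. In `Final Answer:` the space before `Answer` is itself a word boundary, so the leading `\b` alone does not reject that heading.
- Trailing `\b`: the captured letter must be a standalone token. In Case A the candidate `A` is followed by `N` (from `ANSWER`), and in Case B the `A` of `Acetylcholine` is followed by `c`. Both are word characters, so neither position is a word boundary, both false matches are rejected, and the search continues to the real `ANSWER: X` line.

**Verified effect**: 0-shot mmlu_pro accuracy rises from 76.43% to **82.14%**, and all 8 framework misjudgments are eliminated.

### Bug 2: Fix MMLUProBaseEvaluator.is_equal()

**Case A: model output with the full option text is misjudged**

When the model outputs the complete answer, such as `"A. gross error."`, it should count as an exact match for the full reference text `"A. gross error.\n"`. The original code runs `refer.split('. ')` immediately and then compares the option letter and the option text separately. It cannot handle a pred that equals the full refer, so a correct output is scored as wrong.

**Fix**: add an exact-match check up front:

```python
if pred.strip() == refer.strip():
    return True
```

**Case B: reference text with multiple periods triggers ValueError**

A reference text such as `"B. False, True. This is because..."` contains more than one `. `. The original `refer.split('. ')` returns more than 2 elements, so `refer_option, refer_string = ...` raises `ValueError: too many values to unpack`.

**Fix**: change `split('. ')` to `split('. ', 1)` so that the string is split only once.

### 3. Rename mmlu_pro_gen_0_shot_str.py to mmlu_pro_gen_0_shot_cot_str.py

The prompt template of the 0-shot config explicitly asks the model to reason step by step:

```
Think step by step before answering.
```

This is a standard Chain-of-Thought (CoT) prompting strategy. Renaming the file to `..._cot_str.py` describes the config more accurately and makes it easier to distinguish from 0-shot configs that answer directly (non-CoT). All references were updated as well:

- `all_dataset_configs.py`
- `README.md`
- `README_en.md`

## 📐 Associated Test Results

**0-shot mmlu_pro CoT** (10 prompts x 14 categories, max_out_len=4096, Qwen3-Next-80B-A3B-Instruct):

| Metric | Before Fix | After Fix |
|--------|-----------|-----------|
| naive_average | 76.43% | **82.14%** |
| Framework misjudgments | 8 | **0** |

**5-shot mmlu_pro** (10 prompts x 14 categories):

- Fixed complete-answer-text false negatives
- Fixed multi-sentence reference ValueError crashes

## ⚠️ BC-breaking (Optional)

None. The regex change is backward-compatible for models that already output `ANSWER: X` correctly.

## ✅ Checklist

- [x] The modification is covered by manual testing on real model inference.
- [x] All relevant documentation (README.md, README_en.md) has been updated.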
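
The two fixes described above can be reproduced with a short standalone sketch. It assumes that answer extraction takes the first `re.search` match of the pattern, which may differ from the framework's actual `match_answer_pattern` implementation. `first_answer` and `split_reference` are hypothetical helpers written for this illustration only, and the sample strings are taken from the cases in this description.

```python
import re

OLD_PATTERN = r'(?i)ANSWER\s*:\s*([A-P])'
NEW_PATTERN = r'(?i)\bANSWER\s*:\s*([A-P])\b'


def first_answer(pattern, text):
    """Return the first captured option letter, or None if nothing matches.

    Hypothetical helper: assumes first-match semantics via re.search.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else None


def split_reference(refer, maxsplit=-1):
    """Split a reference such as 'B. text' into (option, text).

    Hypothetical helper mirroring the unpacking done in is_equal().
    """
    option, text = refer.split('. ', maxsplit)
    return option, text


case_a = "### Final Answer:\nANSWER: E"
case_b = "Answer: Acetylcholine ...\nANSWER: D"

print(first_answer(OLD_PATTERN, case_a))  # -> A (first letter of "ANSWER")
print(first_answer(NEW_PATTERN, case_a))  # -> E
print(first_answer(OLD_PATTERN, case_b))  # -> A (first letter of "Acetylcholine")
print(first_answer(NEW_PATTERN, case_b))  # -> D

refer = "B. False, True. This is because..."
try:
    split_reference(refer)  # unbounded split yields three parts
except ValueError as exc:
    print("unbounded split failed:", exc)

print(split_reference(refer, 1))  # -> ('B', 'False, True. This is because...')
```

In both regex cases it is the trailing `\b` that rejects the early capture, because the candidate letter is followed by another word character.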
---
 ais_bench/benchmark/configs/datasets/mmlu_pro/README.md   | 4 ++--
 .../benchmark/configs/datasets/mmlu_pro/README_en.md      | 4 ++--
 ...o_gen_0_shot_str.py => mmlu_pro_gen_0_shot_cot_str.py} | 2 +-
 ais_bench/benchmark/datasets/mmlu_pro.py                  | 8 ++++++--
 ais_bench/configs/api_examples/all_dataset_configs.py     | 2 +-
 5 files changed, 12 insertions(+), 8 deletions(-)
 rename ais_bench/benchmark/configs/datasets/mmlu_pro/{mmlu_pro_gen_0_shot_str.py => mmlu_pro_gen_0_shot_cot_str.py} (97%)

diff --git a/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md b/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md
index 1f18821d..c1b88243 100644
--- a/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md
+++ b/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md
@@ -23,9 +23,9 @@ rm mmlu_pro.zip
 ```
 ## 可用数据集任务
-### mmlu_pro_gen_0_shot_str
+### mmlu_pro_gen_0_shot_cot_str
 #### 基本信息
 |任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
 | --- | --- | --- | --- | --- | --- |
-|mmlu_pro_gen_0_shot_str|mmlu-pro数据集生成式任务|pass@1|0-shot|字符串格式|[mmlu_pro_gen_0_shot_str.py](mmlu_pro_gen_0_shot_str.py)|
+|mmlu_pro_gen_0_shot_cot_str|mmlu-pro数据集生成式任务|pass@1|0-shot|字符串格式|[mmlu_pro_gen_0_shot_cot_str.py](mmlu_pro_gen_0_shot_cot_str.py)|
 |mmlu_pro_gen_5_shot_str|mmlu-pro数据集生成式任务|pass@1|0-shot|字符串格式|[mmlu_pro_gen_5_shot_str.py](mmlu_pro_gen_5_shot_str.py)|
diff --git a/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md b/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md
index 857b90f0..0c4681d9 100644
--- a/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md
@@ -23,11 +23,11 @@ rm mmlu_pro.zip
 ```
 ## Available Dataset Tasks
-### mmlu_pro_gen_0_shot_str
+### mmlu_pro_gen_0_shot_cot_str
 #### Basic Information
 | Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
 | --- | --- | --- | --- | --- | --- |
-| mmlu_pro_gen_0_shot_str | Generative task for the mmlu-pro dataset | pass@1 | 0-shot | String format | [mmlu_pro_gen_0_shot_str.py](mmlu_pro_gen_0_shot_str.py) |
+| mmlu_pro_gen_0_shot_cot_str | Generative task for the mmlu-pro dataset | pass@1 | 0-shot | String format | [mmlu_pro_gen_0_shot_cot_str.py](mmlu_pro_gen_0_shot_cot_str.py) |
 | mmlu_pro_gen_5_shot_str | Generative task for the mmlu-pro dataset | pass@1 | 5-shot | String format | [mmlu_pro_gen_5_shot_str.py](mmlu_pro_gen_5_shot_str.py) |
diff --git a/ais_bench/benchmark/configs/datasets/mmlu_pro/mmlu_pro_gen_0_shot_str.py b/ais_bench/benchmark/configs/datasets/mmlu_pro/mmlu_pro_gen_0_shot_cot_str.py
similarity index 97%
rename from ais_bench/benchmark/configs/datasets/mmlu_pro/mmlu_pro_gen_0_shot_str.py
rename to ais_bench/benchmark/configs/datasets/mmlu_pro/mmlu_pro_gen_0_shot_cot_str.py
index 3b8d3d2c..bb15912b 100644
--- a/ais_bench/benchmark/configs/datasets/mmlu_pro/mmlu_pro_gen_0_shot_str.py
+++ b/ais_bench/benchmark/configs/datasets/mmlu_pro/mmlu_pro_gen_0_shot_cot_str.py
@@ -44,7 +44,7 @@
         evaluator=dict(type=AccEvaluator),
         pred_postprocessor=dict(
             type=match_answer_pattern,
-            answer_pattern=r'(?i)ANSWER\s*:\s*([A-P])')
+            answer_pattern=r'(?i)\bANSWER\s*:\s*([A-P])\b')
     )
     mmlu_pro_datasets.append(
diff --git a/ais_bench/benchmark/datasets/mmlu_pro.py b/ais_bench/benchmark/datasets/mmlu_pro.py
index f27f63c7..92b65e6b 100644
--- a/ais_bench/benchmark/datasets/mmlu_pro.py
+++ b/ais_bench/benchmark/datasets/mmlu_pro.py
@@ -43,12 +43,16 @@ class MMLUProBaseEvaluator(BaseEvaluator):
     def is_equal(self, pred, refer):
         try:
-            refer_option, refer_string = refer.split('. ')
+            # Handle exact match first
+            if pred.strip() == refer.strip():
+                return True
+            # Limit split to 1 to avoid ValueError when refer contains multiple '. '
+            refer_option, refer_string = refer.split('. ', 1)
             if pred in CHOICES and refer_option == pred:
                 return True
             elif refer_string.strip() == pred:
                 return True
-            else :
+            else:
                 return False
         except Exception:
             pass
diff --git a/ais_bench/configs/api_examples/all_dataset_configs.py b/ais_bench/configs/api_examples/all_dataset_configs.py
index aa24c823..aa6cfedb 100644
--- a/ais_bench/configs/api_examples/all_dataset_configs.py
+++ b/ais_bench/configs/api_examples/all_dataset_configs.py
@@ -33,7 +33,7 @@
     from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_str import mmlu_datasets as mmlu_5_shot_str
     # mmlu_pro
-    from ais_bench.benchmark.configs.datasets.mmlu_pro.mmlu_pro_gen_0_shot_str import mmlu_pro_datasets as mmlu_pro_0_shot_str
+    from ais_bench.benchmark.configs.datasets.mmlu_pro.mmlu_pro_gen_0_shot_cot_str import mmlu_pro_datasets as mmlu_pro_0_shot_str
     from ais_bench.benchmark.configs.datasets.mmlu_pro.mmlu_pro_gen_5_shot_str import mmlu_pro_datasets as mmlu_pro_5_shot_str
     # boolq