Fix KeyError on category-5 (adversarial) QA: read ground truth from adversarial_answer #41
Open
kamaalg wants to merge 1 commit into
…dversarial_answer A clean checkout cannot evaluate any model on the category-5 (adversarial) questions. data/locomo10.json stores cat-5 ground truth under adversarial_answer, but the code reads 'answer'; 444 of 446 cat-5 questions have no 'answer' key, so evaluation crashes with KeyError: 'answer' at two sites (task_eval/evaluation.py scoring, and the cat-5 distractor blocks in gpt/claude/gemini/hf model utils). Read cat-5 ground truth defensively (answer -> adversarial_answer fallback); zero behavioural change for categories 1-4. Adds task_eval/test_cat5_eval.py (no network/keys), proven fails-then-passes.
Problem
A clean checkout cannot evaluate any model on the category-5 (adversarial) questions.
data/locomo10.json stores cat-5 ground truth under the key adversarial_answer, but the code reads 'answer'. 444 of 446 cat-5 questions have no 'answer' key, so evaluation crashes with KeyError: 'answer'. Category 5 is the 2nd-largest category, so the adversarial portion of the benchmark is unrunnable as shipped.
Two crash sites, both reproduced on live HEAD (3eb6f2c) with no API key:
1. task_eval/evaluation.py:200,202: answer = str(line['answer']) runs for every QA line before the cat-5 branch → KeyError.
2. gpt_utils.py, claude_utils.py, gemini_utils.py, hf_llm_utils.py build the multiple-choice distractor from qa['answer'] → KeyError.
README.MD documents only an 'answer' key, confirming the data/code/doc mismatch. The adversarial_answer is exactly the "tempting wrong" option that the (a) Not mentioned / (b) <distractor> multiple-choice prompt is designed to present.
Fix
Read cat-5 ground truth defensively: gold = line.get('answer', line.get('adversarial_answer')) in the scorer, and adv_answer = qa.get('adversarial_answer', qa.get('answer')) for the distractor in the four model utils. Zero behavioural change for categories 1–4. (+34 / −18 across 5 files.)
Test
Adds task_eval/test_cat5_eval.py (no network, no keys). On unmodified evaluation.py → KeyError: 'answer' (fails); with the fix → passes. Also verified: the fixed scorer runs all 1986 QA (incl. all 446 cat-5), and the fixed generation builds all 446 cat-5 prompts with 0 crashes.
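For reviewers, the fallback described under Fix can be sketched in isolation. This is a minimal illustration, not the code in the diff: the helper names gold_answer and distractor_answer and the two sample records are hypothetical, and only the two .get() expressions are taken from the description above.

```python
# Hypothetical helpers wrapping the two .get() expressions quoted in the PR description.

def gold_answer(line):
    """Scorer side: prefer 'answer', fall back to 'adversarial_answer' (cat-5)."""
    return line.get('answer', line.get('adversarial_answer'))


def distractor_answer(qa):
    """Model-utils side: prefer 'adversarial_answer', fall back to 'answer'."""
    return qa.get('adversarial_answer', qa.get('answer'))


# Made-up records shaped like the QA entries described above (not real dataset rows).
cat1_line = {'question': 'Where did they meet?', 'answer': 'Paris', 'category': 1}
cat5_line = {'question': 'What colour was the bike?',
             'adversarial_answer': 'red', 'category': 5}

# The original access pattern raises on a cat-5 record that has no 'answer' key.
try:
    str(cat5_line['answer'])
    crashed = False
except KeyError:
    crashed = True

# The defensive read returns a value for both kinds of record.
cat1_gold = gold_answer(cat1_line)        # unchanged path for categories 1-4
cat5_gold = gold_answer(cat5_line)        # falls back to adversarial_answer
cat5_distractor = distractor_answer(cat5_line)
```

Because dict.get returns None when neither key is present, a caller that wraps the result in str() would see the string 'None' rather than an exception; whether that case needs explicit handling depends on the data and is not covered by this sketch.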