Fix KeyError on category-5 (adversarial) QA: read ground truth from adversarial_answer #41

Open
kamaalg wants to merge 1 commit into snap-research:main from kamaalg:fix/cat5-adversarial-answer-key

kamaalg commented May 30, 2026

Problem

A clean checkout cannot evaluate any model on the category-5 (adversarial) questions. data/locomo10.json stores cat-5 ground truth under the key 'adversarial_answer', but the code reads 'answer'. 444 of 446 cat-5 questions have no 'answer' key, so evaluation crashes with KeyError: 'answer'. Category 5 is the second-largest category (446 of 1986 questions), so the adversarial portion of the benchmark is unrunnable as shipped.
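
The mismatch can be checked without running any model. The following is a minimal sketch, not code from this PR. It assumes data/locomo10.json is a list of samples, each carrying a 'qa' list whose entries have a 'category' field; that layout is an assumption, and count_missing_answer is a hypothetical helper name.

```python
import json
from pathlib import Path


def count_missing_answer(samples, category=5):
    """Return (missing, total) for QA entries of `category` that lack an 'answer' key.

    Assumes each sample is a dict with a 'qa' list of dicts (layout is an assumption).
    """
    total = 0
    missing = 0
    for sample in samples:
        for qa in sample.get("qa", []):
            if qa.get("category") != category:
                continue
            total += 1
            if "answer" not in qa:
                missing += 1
    return missing, total


if __name__ == "__main__":
    # Only runs against the real file when it is present in the working tree.
    path = Path("data/locomo10.json")
    if path.exists():
        samples = json.loads(path.read_text(encoding="utf-8"))
        print(count_missing_answer(samples))
```

If the layout assumption holds, the pair printed for the shipped data should agree with the 444 of 446 figure quoted above.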

Two crash sites, both reproduced on live HEAD (3eb6f2c) with no API key:

  • Scoring (task_eval/evaluation.py:200,202): answer = str(line['answer']) runs for every QA line, before the cat-5 branch is reached → KeyError.
  • Generation: the cat-5 distractor blocks in gpt_utils.py, claude_utils.py, gemini_utils.py, and hf_llm_utils.py build the multiple-choice distractor from qa['answer'] → KeyError.

README.MD documents only an 'answer' key, confirming the mismatch between data, code, and docs. The 'adversarial_answer' value is exactly the "tempting wrong" option that the cat-5 multiple-choice prompt, (a) Not mentioned / (b) <distractor>, is designed to present.

Fix

Read cat-5 ground truth defensively: gold = line.get('answer', line.get('adversarial_answer')) in the scorer, and adv_answer = qa.get('adversarial_answer', qa.get('answer')) for the distractor in the four model utils. Zero behavioural change for categories 1–4. (+34 / −18 across 5 files.)
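
The two lookups differ only in which key takes priority. A small self-contained sketch of both follows. The helper names gold_answer and distractor_answer are hypothetical wrappers added for illustration; the PR inlines the .get() expressions quoted above.

```python
def gold_answer(line):
    """Scorer side: prefer 'answer', fall back to 'adversarial_answer'."""
    return line.get("answer", line.get("adversarial_answer"))


def distractor_answer(qa):
    """Generation side: prefer 'adversarial_answer', fall back to 'answer'."""
    return qa.get("adversarial_answer", qa.get("answer"))


# Illustrative entries (made-up values, not taken from the dataset).
cat1_entry = {"category": 1, "answer": "Paris"}
cat5_entry = {"category": 5, "adversarial_answer": "a red bike"}

# Categories 1-4 carry 'answer', so the scorer lookup returns the same value as before.
assert gold_answer(cat1_entry) == cat1_entry["answer"]

# A cat-5 entry without 'answer' no longer raises KeyError in either lookup.
assert gold_answer(cat5_entry) == "a red bike"
assert distractor_answer(cat5_entry) == "a red bike"
```

Note that when neither key is present, both helpers return None rather than raising, so a caller that wraps the result in str() would get the string 'None'. Whether that case needs explicit handling depends on the data.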

Test

Adds task_eval/test_cat5_eval.py (no network, no keys). On unmodified evaluation.py the test fails with KeyError: 'answer'; with the fix it passes. Also verified: the fixed scorer runs all 1986 QA pairs (including all 446 cat-5), and the fixed generation code builds all 446 cat-5 prompts with 0 crashes.
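
The fails-then-passes pattern can be sketched in a few pytest-style functions. This is not the content of task_eval/test_cat5_eval.py, which is not shown here. The two read_gold_* functions are hypothetical stand-ins for the scorer's lookup before and after the fix, and the QA entries are made up.

```python
# Stand-in for the lookup on unmodified code: indexes 'answer' directly.
def read_gold_unfixed(line):
    return str(line["answer"])


# Stand-in for the lookup after the fix: falls back to 'adversarial_answer'.
def read_gold_fixed(line):
    return str(line.get("answer", line.get("adversarial_answer")))


# Illustrative QA entries (hypothetical values).
CAT5_LINE = {"category": 5, "adversarial_answer": "a red bike"}
CAT1_LINE = {"category": 1, "answer": "Paris"}


def test_unfixed_lookup_raises_on_cat5():
    try:
        read_gold_unfixed(CAT5_LINE)
    except KeyError as exc:
        # The missing key is reported as 'answer'.
        assert exc.args[0] == "answer"
    else:
        raise AssertionError("expected KeyError for a cat-5 line without 'answer'")


def test_fixed_lookup_reads_adversarial_answer():
    assert read_gold_fixed(CAT5_LINE) == "a red bike"


def test_other_categories_unchanged():
    # Both lookups agree whenever 'answer' is present.
    assert read_gold_fixed(CAT1_LINE) == read_gold_unfixed(CAT1_LINE)
```

Because the functions use plain assertions and no fixtures, they can be collected by pytest or called directly, and they need neither the dataset nor an API key.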

…dversarial_answer

A clean checkout cannot evaluate any model on the category-5 (adversarial)
questions. data/locomo10.json stores cat-5 ground truth under adversarial_answer,
but the code reads 'answer'; 444 of 446 cat-5 questions have no 'answer' key, so
evaluation crashes with KeyError: 'answer' at two sites (task_eval/evaluation.py
scoring, and the cat-5 distractor blocks in gpt/claude/gemini/hf model utils).

Read cat-5 ground truth defensively (answer -> adversarial_answer fallback);
zero behavioural change for categories 1-4. Adds task_eval/test_cat5_eval.py
(no network/keys), proven fails-then-passes.