Fix KeyError on category-5 (adversarial) QA: read ground truth from adversarial_answer by kamaalg · Pull Request #41 · snap-research/locomo

kamaalg · 2026-05-30T21:49:26Z

Problem

A clean checkout cannot evaluate any model on the category-5 (adversarial) questions. data/locomo10.json stores cat-5 ground truth under the key adversarial_answer, but the code reads answer. 444 of the 446 cat-5 questions have no answer key, so evaluation crashes with KeyError: 'answer'. Category 5 is the second-largest category, so the adversarial portion of the benchmark is unrunnable as shipped.
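For anyone who wants to check the mismatch themselves, here is a rough sketch of how the missing keys could be counted. It assumes the dataset is a list of samples, each holding a "qa" list whose items carry a "category" field; that layout is an assumption here, and the helper name count_missing_answer and the toy records are hypothetical stand-ins for the real file.

```python
def count_missing_answer(samples, category=5):
    """Return (missing, total) for QA items of the given category.

    Assumed layout (not confirmed here): each sample is a dict with a
    "qa" list, and each QA item is a dict with a "category" field.
    """
    total = 0
    missing = 0
    for sample in samples:
        for qa in sample.get("qa", []):
            if qa.get("category") != category:
                continue
            total += 1
            if "answer" not in qa:
                missing += 1
    return missing, total


# Toy records standing in for data/locomo10.json (hypothetical content).
toy_samples = [
    {"qa": [
        {"question": "Where did they meet?", "answer": "Paris", "category": 1},
        {"question": "What does Alex play?", "adversarial_answer": "violin", "category": 5},
    ]},
    {"qa": [
        {"question": "Did Sam move?", "answer": "No", "adversarial_answer": "Yes", "category": 5},
    ]},
]

missing, total = count_missing_answer(toy_samples)
print(f"{missing} of {total} cat-5 items lack an 'answer' key")
```

Against the real file, the same loop would be fed the result of json.load on data/locomo10.json, provided the layout assumption holds.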

Two crash sites, both reproduced on live HEAD (3eb6f2c) with no API key:

Scoring: task_eval/evaluation.py:200,202 — answer = str(line['answer']) runs for every QA line before the cat-5 branch → KeyError.
Generation: the cat-5 distractor blocks in gpt_utils.py, claude_utils.py, gemini_utils.py, hf_llm_utils.py build the multiple-choice distractor from qa['answer'] → KeyError.
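Both crash sites reduce to the same pattern: an unconditional qa['answer'] or line['answer'] lookup on a record that only has adversarial_answer. A minimal sketch of that failure, using a made-up record and a hypothetical helper rather than the repository's actual functions:

```python
# Hypothetical cat-5 record: ground truth sits under 'adversarial_answer'
# and there is no 'answer' key at all.
cat5_line = {
    "question": "What instrument does Alex play?",
    "adversarial_answer": "violin",
    "category": 5,
}


def read_answer_unconditionally(line):
    """Mimics the quoted pattern: str(line['answer']) before any category check."""
    return str(line["answer"])


crashed = False
missing_key = None
try:
    read_answer_unconditionally(cat5_line)
except KeyError as exc:
    crashed = True
    missing_key = exc.args[0]  # the key that was not found

print(crashed, missing_key)
```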

README.MD documents only an answer key, confirming the data/code/doc mismatch. The adversarial_answer value is exactly the "tempting wrong" option that the multiple-choice prompt, (a) Not mentioned / (b) <distractor>, is designed to present.

Fix

Read cat-5 ground truth defensively: gold = line.get('answer', line.get('adversarial_answer')) in the scorer, and adv_answer = qa.get('adversarial_answer', qa.get('answer')) for the distractor in the four model utils. Zero behavioural change for categories 1–4. (+34 / −18 across 5 files.)
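The two lookups can be sketched in isolation as below. The .get() expressions are the ones quoted above; the wrapper function names and the sample records are hypothetical and only there to make the snippet runnable.

```python
def gold_answer(line):
    # Scorer side: prefer 'answer', fall back to 'adversarial_answer'.
    return line.get("answer", line.get("adversarial_answer"))


def distractor_answer(qa):
    # Generation side: prefer 'adversarial_answer', fall back to 'answer'.
    return qa.get("adversarial_answer", qa.get("answer"))


# Hypothetical records covering the three shapes that matter.
cat1 = {"question": "Where did they meet?", "answer": "Paris", "category": 1}
cat5_only_adv = {"question": "What does Alex play?",
                 "adversarial_answer": "violin", "category": 5}
cat5_both = {"question": "Did Sam move?", "answer": "No",
             "adversarial_answer": "Yes", "category": 5}

for record in (cat1, cat5_only_adv, cat5_both):
    print(record["category"], gold_answer(record), distractor_answer(record))
```

Because dict.get never raises on a missing key, records from categories 1 to 4 that already carry answer resolve exactly as before, which is where the "zero behavioural change" claim comes from. Note the opposite precedence of the two helpers: for a record carrying both keys, the scorer sketch returns answer while the distractor sketch returns adversarial_answer.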

Test

Adds task_eval/test_cat5_eval.py (no network, no keys). On unmodified evaluation.py → KeyError: 'answer' (fails); with the fix → passes. Also verified: the fixed scorer runs all 1986 QA (incl. all 446 cat-5) and the fixed generation builds all 446 cat-5 prompts with 0 crashes.
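For readers who have not opened the test file, a fails-then-passes check of this kind could look roughly like the sketch below. This is not the content of task_eval/test_cat5_eval.py; the helper names and sample records are hypothetical, and the lookups are inlined so that nothing touches the network or needs an API key.

```python
def resolve_gold_unfixed(line):
    """Stand-in for the pre-fix scorer lookup."""
    return str(line["answer"])


def resolve_gold_fixed(line):
    """Stand-in for the post-fix scorer lookup."""
    return str(line.get("answer", line.get("adversarial_answer")))


# Hypothetical QA lines: one ordinary question and one cat-5 question
# that has no 'answer' key.
SAMPLE_QA = [
    {"question": "Where did they meet?", "answer": "Paris", "category": 1},
    {"question": "What does Alex play?", "adversarial_answer": "violin", "category": 5},
]


def test_unfixed_lookup_fails_on_cat5():
    cat5_lines = [line for line in SAMPLE_QA if line["category"] == 5]
    for line in cat5_lines:
        try:
            resolve_gold_unfixed(line)
        except KeyError:
            continue  # expected: this is the crash being fixed
        raise AssertionError("expected KeyError on a cat-5 line")


def test_fixed_lookup_handles_every_line():
    resolved = [resolve_gold_fixed(line) for line in SAMPLE_QA]
    assert resolved == ["Paris", "violin"]


if __name__ == "__main__":
    test_unfixed_lookup_fails_on_cat5()
    test_fixed_lookup_handles_every_line()
    print("ok")
```

The functions follow the test_* naming convention, so a runner such as pytest would collect them, and the file can also be executed directly.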


kamaalg wants to merge 1 commit into snap-research:main from kamaalg:fix/cat5-adversarial-answer-key
