test: 36 tests for lib/eval/entity_coverage.py #21
Open
hai-pilgrim wants to merge 6 commits into
…ompaction
- install.sh now auto-configures ~/.claude/settings.json (creates pluginDirs entry, idempotent across create/add/already-exists cases)
- uninstall.sh now cleans up the settings.json pluginDirs entry
- Add compact-session.sh: self-contained script that finds JSONL, runs supercompact, backs up original, replaces, and reports results
- Simplify /supercompact command from 5-step multi-bash prompt to single script call with CLAUDE_PLUGIN_ROOT fallback to hardcoded install path
- Simplify PreCompact hook to backup-only (removes wasted supercompact run that Claude's LLM compaction immediately overwrites)
- Update README: accurate hook description, file tree with compact-session.sh, update/upgrade docs, standalone binary limitations clearly stated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
export_json: file creation, empty array, result structure (method, budget, model_key, composite, ndcg), speed/token counts, dimension scores with score and probe_count, multiple results, valid JSON. export_trace: file creation, path location, filename contains method/budget, JSON has method/budget, empty answers → empty entries, matching probe included, unmatched answer skipped, auto-creates trace dir.
Tests cover ProbeAnswer/JudgeResult dataclass defaults and field storage, ANSWER_MODELS/JUDGE_MODEL constants, generate_answers with empty probe set, score_answers missing-probe path (no API call), missing OPENROUTER_API_KEY error, and _score_one_answer JSON parsing, markdown fence stripping, score clamping, and bad-JSON fallback.
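The parsing behaviours listed above (markdown fence stripping, score clamping, bad-JSON fallback) can be illustrated with a minimal sketch. This is not the code under test: `parse_judge_reply`, the `"score"` key, the 0.0 to 1.0 range, and the fallback value are all assumptions made for illustration.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to avoid a literal fence
# Matches a reply that is entirely wrapped in one fenced block, e.g. a json fence.
_FENCE_RE = re.compile(
    r"^" + re.escape(FENCE) + r"[A-Za-z0-9_-]*\s*\n(.*?)\n?" + re.escape(FENCE) + r"\s*$",
    re.DOTALL,
)


def parse_judge_reply(raw, lo=0.0, hi=1.0, fallback=0.0):
    """Hypothetical helper: extract a clamped score from a judge reply."""
    text = raw.strip()
    match = _FENCE_RE.match(text)
    if match:
        # Keep only the fenced payload.
        text = match.group(1).strip()
    try:
        payload = json.loads(text)
        score = float(payload["score"])  # "score" key is an assumption
    except (ValueError, KeyError, TypeError):
        # Bad JSON, a missing key, or a non-numeric value: use the fallback.
        return fallback
    # Clamp into the allowed range.
    return max(lo, min(hi, score))
```

A fenced reply carrying an out-of-range score is unwrapped and clamped to `hi`, while a reply that is not JSON at all returns `fallback`.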
Tests cover DIFFICULTY_WEIGHTS constant, DimensionScore/AggregateResult dataclass defaults and fields, dimension_map property, _dcg (empty, single, sorting by weight, zero scores, position discounting), and aggregate() (empty answers, single/multiple models, score 0-1 normalisation, missing probes skipped, empty-dimension zero-mean, perfect/zero/partial NDCG).
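For readers unfamiliar with the metric, the `_dcg` and NDCG cases above correspond to a weight-sorted discounted cumulative gain. The sketch below uses the standard log2 position discount and assumes each item is a `(score, weight)` pair with scores in 0 to 1; the names `dcg` and `ndcg` and the exact gain formula are illustrative assumptions, not the module's actual implementation.

```python
import math


def dcg(items):
    """Discounted cumulative gain over (score, weight) pairs.

    Assumption: items are ranked by weight (heaviest first) and the gain
    at each rank is score * weight, discounted by log2(rank + 2).
    """
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    return sum(
        score * weight / math.log2(rank + 2)
        for rank, (score, weight) in enumerate(ranked)
    )


def ndcg(items):
    """Normalise DCG by the ideal DCG, where every score is 1.0."""
    ideal = dcg([(1.0, weight) for _, weight in items])
    return dcg(items) / ideal if ideal > 0 else 0.0
```

Under these assumptions an empty list gives a DCG of 0.0, all-perfect scores give an NDCG of 1.0, all-zero scores give 0.0, and mixed scores land strictly between the two.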
Tests cover DIFFICULTY_WEIGHTS constant, ProbeCoverage/DimensionCoverage/EvidenceCoverageResult dataclasses, dimension_map property, to_dict keys, _dcg (empty, single, zero-score, weight-sorted), and compute_evidence_coverage (empty probe set, probe with no evidence_turns skipped, full coverage → 1.0, zero coverage → 0.0, partial coverage value, kept/dropped lists, multi-probe mean, NDCG perfect/zero/partial).
Tests cover EntitySet (total_count, all_entities, default dict), ENTITY_TYPES constants (presence, positive weights), extract_entities for exceptions, URLs, ports (with range filtering), file paths, CamelCase class names, pip/npm packages, and HTTP status codes. Also covers compute_coverage (empty-suffix → 1.0, empty-kept → 0.0, identical sets → 1.0, breakdown structure, half coverage, type mismatch → 0.0, weighted vs unweighted divergence).
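The weighted vs unweighted divergence case is easiest to see with numbers. The sketch below is an illustration only: `coverage_sketch`, the dict-of-sets input shape, the return order `(weighted, unweighted, breakdown)`, and the example weights are assumptions, not the signature of `compute_coverage`.

```python
def coverage_sketch(suffix, kept, weights):
    """Hypothetical coverage calculation over entity sets keyed by type.

    suffix:  entities that should survive, as {type: set_of_entities}
    kept:    entities actually present, in the same shape
    weights: per-type importance, assumed positive
    """
    total = sum(len(entities) for entities in suffix.values())
    if total == 0:
        # Nothing to cover counts as full coverage.
        return 1.0, 1.0, {}

    breakdown = {}
    hits = 0
    weighted_hits = 0.0
    weighted_total = 0.0
    for etype, wanted in suffix.items():
        if not wanted:
            continue
        # An entity only counts if it was kept under the same type.
        found = wanted & kept.get(etype, set())
        weight = weights.get(etype, 1.0)
        breakdown[etype] = {"found": len(found), "total": len(wanted)}
        hits += len(found)
        weighted_hits += weight * len(found)
        weighted_total += weight * len(wanted)

    weighted = weighted_hits / weighted_total if weighted_total > 0 else 0.0
    return weighted, hits / total, breakdown
```

With one `file_path` (weight 3.0) kept and one `url` (weight 1.0) dropped, unweighted coverage is 0.5 while weighted coverage is 0.75, which is the kind of divergence by type importance the tests check for.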
Summary
- `tests/test_eval_entity_coverage.py` with 36 pytest tests for `lib/eval/entity_coverage.py`
- `EntitySet` dataclass: `total_count` property, `all_entities()` method, default empty dict
- `ENTITY_TYPES` constant: required keys present, all weights positive, `file_path`/`error` at highest weight
- `extract_entities` for: exceptions (`ValueError`, `ModuleNotFoundError`), HTTPS/HTTP URLs, port numbers (colon and keyword forms, range filtering), absolute file paths, CamelCase class names, pip/npm packages, HTTP status codes (404, 500)
- `compute_coverage`: empty-suffix → `(1.0, 1.0, {})`, empty-kept → `0.0`, identical sets → `1.0`, breakdown structure, half-covered, type-mismatch → `0.0`, weighted vs unweighted divergence by type importance

Test plan
- `uv run pytest tests/test_eval_entity_coverage.py`

🤖 Opened by hai-pilgrim as part of the Pilgrim wandering-agent contribution run.