test: 11 tests for lib/eval/report.py print_results (rich terminal output) by hai-pilgrim · Pull Request #34 · heiervang-technologies/supercompact

hai-pilgrim · 2026-03-29T04:01:25Z

Summary

- Adds tests/test_eval_report_print.py with 11 pytest tests for the print_results function in lib/eval/report.py
- Patches report_mod.console with a Rich Console backed by io.StringIO to capture terminal output without side effects
- Tests cover: method name in output, dimension names, composite score formatting, multiple results, NDCG row presence, zero-probe "—" placeholder, budget column, and return value
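The capture approach described in the summary can be sketched with the standard library alone. This is a minimal illustration, not the PR's test code: `StubConsole` stands in for a Rich `Console(file=io.StringIO())`, and `report_mod`, the toy `print_results`, the result dictionary keys, and the three-decimal format are all assumptions made for the example.

```python
import io
from types import SimpleNamespace
from unittest.mock import patch


class StubConsole:
    """Stand-in for a Rich Console: writes printed text to `file`."""

    def __init__(self, file):
        self.file = file

    def print(self, *objects):
        self.file.write(" ".join(str(o) for o in objects) + "\n")


# Hypothetical stand-in for a module that owns a module-level console.
report_mod = SimpleNamespace(console=None)

PLACEHOLDER = "\u2014"  # the dash placeholder shown for zero-probe rows


def print_results(results):
    """Toy printer: one line per result, placeholder when there are no probes."""
    for r in results:
        score = f"{r['composite']:.3f}" if r["probe_count"] else PLACEHOLDER
        report_mod.console.print(r["method"], score)


def capture(results):
    """Swap in a StringIO-backed console, run the printer, return the text."""
    buf = io.StringIO()
    with patch.object(report_mod, "console", StubConsole(file=buf)):
        print_results(results)
    return buf.getvalue()
```

Because `patch.object` restores the original attribute on exit, each test gets a fresh buffer and nothing is written to the real terminal.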

Test plan

🤖 Pilgrim wandering — Claude Code

…ompaction
- install.sh now auto-configures ~/.claude/settings.json (creates pluginDirs entry, idempotent across create/add/already-exists cases)
- uninstall.sh now cleans up the settings.json pluginDirs entry
- Add compact-session.sh: self-contained script that finds JSONL, runs supercompact, backs up original, replaces, and reports results
- Simplify /supercompact command from 5-step multi-bash prompt to single script call with CLAUDE_PLUGIN_ROOT fallback to hardcoded install path
- Simplify PreCompact hook to backup-only (removes wasted supercompact run that Claude's LLM compaction immediately overwrites)
- Update README: accurate hook description, file tree with compact-session.sh, update/upgrade docs, standalone binary limitations clearly stated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

export_json: file creation, empty array, result structure (method, budget, model_key, composite, ndcg), speed/token counts, dimension scores with score and probe_count, multiple results, valid JSON.

export_trace: file creation, path location, filename contains method/budget, JSON has method/budget, empty answers → empty entries, matching probe included, unmatched answer skipped, auto-creates trace dir.

Tests cover ProbeAnswer/JudgeResult dataclass defaults and field storage, ANSWER_MODELS/JUDGE_MODEL constants, generate_answers with empty probe set, score_answers missing-probe path (no API call), missing OPENROUTER_API_KEY error, and _score_one_answer JSON parsing, markdown fence stripping, score clamping, and bad-JSON fallback.
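The parsing behaviours listed for _score_one_answer (JSON parsing, markdown fence stripping, score clamping, bad-JSON fallback) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the function in lib/: the names `strip_fence` and `parse_score`, the `"score"` key, the explicit `lo`/`hi` bounds, and the choice to fall back to `lo` are all hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to keep this block readable
FENCED = re.compile(rf"^{FENCE}[a-zA-Z]*\s*(.*?)\s*{FENCE}$", re.DOTALL)


def strip_fence(text: str) -> str:
    """Return the body of a fenced reply, or the stripped text if unfenced."""
    stripped = text.strip()
    match = FENCED.match(stripped)
    return match.group(1) if match else stripped


def parse_score(raw: str, lo: float, hi: float) -> float:
    """Parse a judge reply and clamp its score into [lo, hi].

    Assumption: the reply is a JSON object with a numeric "score" field,
    and anything unparseable falls back to `lo`.
    """
    try:
        data = json.loads(strip_fence(raw))
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        return lo
    return max(lo, min(hi, score))
```

Catching `ValueError` also covers `json.JSONDecodeError`, so a reply that is not JSON at all takes the same fallback path as one with a missing or non-numeric score.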

Tests cover DIFFICULTY_WEIGHTS constant, DimensionScore/AggregateResult dataclass defaults and fields, dimension_map property, _dcg (empty, single, sorting by weight, zero scores, position discounting), and aggregate() (empty answers, single/multiple models, score 0-1 normalisation, missing probes skipped, empty-dimension zero-mean, perfect/zero/partial NDCG).
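For readers unfamiliar with the DCG cases named above (empty, single, sorting by weight, position discounting, perfect/zero/partial NDCG), here is one plausible formulation. It is a sketch only: representing each item as a `(score, weight)` pair, ranking by weight, and using the standard log2 discount are assumptions, and the real _dcg may differ.

```python
import math
from typing import List, Tuple

Item = Tuple[float, float]  # (score in [0, 1], difficulty weight)


def dcg(items: List[Item]) -> float:
    """Weighted DCG: rank items by weight, discount each by log2(rank + 1)."""
    ranked = sorted(items, key=lambda sw: sw[1], reverse=True)
    return sum(
        score * weight / math.log2(position + 2)
        for position, (score, weight) in enumerate(ranked)
    )


def ndcg(items: List[Item]) -> float:
    """DCG divided by the DCG of a perfect run (every score equal to 1.0)."""
    ideal = dcg([(1.0, weight) for _, weight in items])
    return dcg(items) / ideal if ideal > 0 else 0.0
```

Under this formulation an empty list yields 0.0, a single item contributes its full `score * weight` because the first discount is `log2(2) = 1`, and later positions contribute progressively less.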

Tests cover DIFFICULTY_WEIGHTS constant, ProbeCoverage/DimensionCoverage/ EvidenceCoverageResult dataclasses, dimension_map property, to_dict keys, _dcg (empty, single, zero-score, weight-sorted), and compute_evidence_coverage (empty probe set, probe with no evidence_turns skipped, full coverage → 1.0, zero coverage → 0.0, partial coverage value, kept/dropped lists, multi-probe mean, NDCG perfect/zero/partial).

Tests cover EntitySet (total_count, all_entities, default dict), ENTITY_TYPES constants (presence, positive weights), extract_entities for exceptions, URLs, ports (with range filtering), file paths, CamelCase class names, pip/npm packages, and HTTP status codes. Also covers compute_coverage (empty-suffix → 1.0, empty-kept → 0.0, identical sets → 1.0, breakdown structure, half coverage, type mismatch → 0.0, weighted vs unweighted divergence).
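The coverage edge cases above (empty suffix → 1.0, empty kept → 0.0, type mismatch → 0.0, weighted vs unweighted divergence) follow naturally from a per-type set intersection. The sketch below assumes entities are stored as a mapping from type name to a set of strings; the regexes, type names, and function names are illustrative and not taken from the library.

```python
import re
from typing import Dict, Optional, Set

URL_RE = re.compile(r"https?://[^\s)>\"']+")
PORT_RE = re.compile(r":(\d{1,5})\b")

Entities = Dict[str, Set[str]]


def extract(text: str) -> Entities:
    """Pull URLs and port numbers out of text, dropping ports outside 1..65535."""
    ports = {p for p in PORT_RE.findall(text) if 1 <= int(p) <= 65535}
    return {"url": set(URL_RE.findall(text)), "port": ports}


def coverage(
    reference: Entities,
    kept: Entities,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted fraction of reference entities also present in `kept`.

    Entities only match within the same type, and an empty reference
    counts as fully covered.
    """
    total = 0.0
    found = 0.0
    for etype, entities in reference.items():
        weight = 1.0 if weights is None else weights.get(etype, 1.0)
        total += weight * len(entities)
        found += weight * len(entities & kept.get(etype, set()))
    return found / total if total else 1.0
```

With a URL kept and a port dropped, the unweighted coverage is 0.5, while weighting URLs three times as heavily as ports raises it to 0.75, which is the kind of divergence the weighted tests check for.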

Tests cover Turn dataclass (defaults, custom, append), _is_user_message (string content, text block, tool_result block, non-user type, source UUID injection, empty list), extract_text (string content, multiple records joined, empty turn, text/thinking/tool_use/tool_result blocks, input truncation at 500 chars, nested tool_result list, non-dict block skipped, multiple blocks concatenated), and parse_jsonl (empty file, single user, user+assistant, sequential indexing).

Tests cover ScoredTurn dataclass, build_query (single turn, last-3 slicing, fewer-than-3, max_chars truncation, empty input, separator), random_scores (count, ScoredTurn type, 0-1 range, token lookup, missing → 0), Probe dataclass (fields, default evidence_turns/difficulty), ProbeSet (defaults, to_dict structure, from_dict roundtrip, missing optional fields), and _format_turns_for_prompt (content included, header format, ordering, truncation, empty list).
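The build_query cases (last-3 slicing, fewer-than-3, max_chars truncation, empty input, separator) suggest a small join-and-trim helper. The version below is a guess at the shape, not the library function: the default `max_chars`, the newline separator, and trimming from the front so the most recent text survives are all assumptions.

```python
from typing import List


def build_query_sketch(
    texts: List[str],
    last_n: int = 3,
    max_chars: int = 2000,
    sep: str = "\n",
) -> str:
    """Join the last `last_n` texts and keep at most `max_chars` characters.

    Assumption: when the joined text is too long, the oldest characters
    are dropped so the tail of the conversation is preserved.
    """
    joined = sep.join(texts[-last_n:])
    return joined[-max_chars:]
```

Slicing with `texts[-last_n:]` handles the fewer-than-three and empty cases without special branches, since a short list is returned whole and an empty list joins to an empty string.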

Tests cover SelectionResult defaults, select_turns empty inputs, user turns always kept (even over budget), user token accumulation, short system turns always kept within threshold, last system turn always kept regardless of budget, high-score system turn kept within budget, token sum accounting, budget stored in result, and kept_turns returned in original index order.
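The selection rules exercised by these tests can be pictured as a two-pass procedure: force-keep some turns, then spend the remaining budget on the best-scoring system turns. The sketch below is an assumption-laden illustration rather than the actual select_turns; the `TurnInfo` fields, the `short_threshold` value, and the greedy fill order are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnInfo:
    index: int
    kind: str  # "user" or "system"
    tokens: int
    score: float = 0.0


def select_sketch(
    turns: List[TurnInfo], budget: int, short_threshold: int = 20
) -> List[TurnInfo]:
    """Keep forced turns first, then fill the budget by descending score."""
    system = [t for t in turns if t.kind == "system"]
    last_system = system[-1].index if system else None

    kept = set()
    used = 0
    # Pass 1: user turns, short turns, and the last system turn are always
    # kept, even if that alone exceeds the budget.
    for t in turns:
        if t.kind == "user" or t.tokens <= short_threshold or t.index == last_system:
            kept.add(t.index)
            used += t.tokens

    # Pass 2: remaining system turns, best score first, while they still fit.
    for t in sorted(system, key=lambda t: t.score, reverse=True):
        if t.index not in kept and used + t.tokens <= budget:
            kept.add(t.index)
            used += t.tokens

    # Return in original order, regardless of the order turns were picked.
    return [t for t in turns if t.index in kept]
```

Filtering the original list at the end, instead of collecting turns as they are chosen, is what guarantees the kept turns come back in original index order.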

Tests cover write_summary_text (file creation, empty → empty file, User/Assistant labels, turn index in output, multiple turns with separator, long text truncated, blank turn skipped), write_compacted_jsonl (file creation, empty → empty file, each line valid JSON, records in turn order, multiple records per turn), and write_scores_csv (file creation, header row, score precision, kept flag True/False, sorted by turn index).
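As a rough picture of what the write_scores_csv checks imply (header row, score precision, kept flag True/False, rows sorted by turn index), consider the sketch below. The column names, the four-decimal precision, and the tuple row format are assumptions chosen for illustration; it writes to any text file object so it can be tested against `io.StringIO`.

```python
import csv
import io
from typing import Iterable, Set, TextIO, Tuple

Row = Tuple[int, float, int]  # (turn_index, score, tokens)


def write_scores_csv_sketch(
    rows: Iterable[Row], kept_indices: Set[int], out: TextIO
) -> None:
    """Write a header plus one row per turn, ordered by turn index."""
    writer = csv.writer(out)
    writer.writerow(["turn_index", "score", "tokens", "kept"])
    for index, score, tokens in sorted(rows):
        # Fixed precision keeps the file diff-friendly; the kept flag is
        # written as the literal text True or False.
        writer.writerow([index, f"{score:.4f}", tokens, index in kept_indices])


buf = io.StringIO()
write_scores_csv_sketch([(2, 0.5, 20), (1, 0.25, 10)], {2}, buf)
lines = buf.getvalue().splitlines()
```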

… dedup_scores)

Tests cover SuffixAutomaton invariants (initial state, extend adds states, match_repeated_length returns one per char, all zeros for unique chars, > 0 for repeated text), _turn_unique_ratio (empty text → 1.0, unique → 1.0, repeated → lower, result in [0,1]), and dedup_scores (empty system → [], returns ScoredTurn list, scores in [0,1], tokens from token_counts, one result per system turn, repeated content scores < 1.0).

Tests cover empty system turns → [], empty all turns → [], returns ScoredTurn list, scores normalized to [0,1], max score exactly 1.0, tokens from token_counts, rare entity yields higher score than common entity, and no-entity turns handled gracefully.

Tests cover empty system turns → [], empty all turns → [], returns ScoredTurn list, scores normalized to [0,1], max score exactly 1.0, one result per system turn, tokens wired from token_counts, exclusive entities score ≥ shared entities, no-entity plain text handled gracefully.

marksverdhei and others added 19 commits February 13, 2026 13:59

test: 16 tests for lib/scorer.py pure helpers (_format_instruct, _last_token_pool)

91b2f25

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: 20 tests for lib/llm_compact.py (make_synthetic_turn, API key guard, constants)

da0c5d8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: 23 tests for lib/pareto.py (METHOD_STYLES, plot_entity_coverage, plot_type_breakdown)

b142006

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: 21 tests for lib/llama_embed.py (constants, _instruct, LlamaEmbedScorer)

a1b46f0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: 12 tests for lib/tokenizer.py (estimate_tokens, turn_tokens)

a84643d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: 11 tests for lib/eval/report.py print_results (rich terminal output)

7d5e7a8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marksverdhei force-pushed the main branch from bb4fb00 to 2a4c770 Compare April 1, 2026 09:58

hai-pilgrim wants to merge 19 commits into heiervang-technologies:main from hai-pilgrim:test/eval-report-print
