test: 11 tests for lib/eval/report.py print_results (rich terminal output)#34

Open
hai-pilgrim wants to merge 19 commits into heiervang-technologies:main from hai-pilgrim:test/eval-report-print


@hai-pilgrim

Summary

  • Adds tests/test_eval_report_print.py with 11 pytest tests for the print_results function in lib/eval/report.py
  • Patches report_mod.console with a Rich Console backed by io.StringIO to capture terminal output without side effects
  • Tests cover: method name in output, dimension names, composite score formatting, multiple results, NDCG row presence, zero-probe "—" placeholder, budget column, and return value
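The console-patching approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `report_mod` here is a hypothetical stand-in for `lib/eval/report.py`, the `print_results` body and the result dictionary keys are invented for the example, and the `capture` helper is a name introduced only for this sketch. If Rich is not installed, a minimal stand-in `Console` is used so the sketch still runs.

```python
import io
import types
from unittest import mock

try:
    from rich.console import Console  # real Rich console, if available
except ImportError:
    class Console:
        """Minimal stand-in (assumption) so the sketch runs without Rich."""

        def __init__(self, file=None, **_kwargs):
            self.file = file

        def print(self, *objects):
            self.file.write(" ".join(str(o) for o in objects) + "\n")


# Hypothetical stand-in for lib/eval/report.py: a module-level `console`
# that `print_results` writes through. The real module is not shown here.
report_mod = types.SimpleNamespace(console=Console(file=io.StringIO()))


def print_results(results):
    """Illustrative only: print one line per result, return None."""
    for r in results:
        report_mod.console.print(f"{r['method']} composite={r['composite']:.2f}")


report_mod.print_results = print_results


def capture(results):
    """Swap in a StringIO-backed console, call print_results, return output."""
    buf = io.StringIO()
    patched = Console(file=buf, width=120)
    # patch.object restores the original console when the block exits,
    # so nothing leaks to the real terminal or into other tests.
    with mock.patch.object(report_mod, "console", patched):
        returned = report_mod.print_results(results)
    return buf.getvalue(), returned


if __name__ == "__main__":
    out, ret = capture([{"method": "my_method", "composite": 0.75}])
    assert "my_method" in out
    assert ret is None
```

Because the buffer is not a TTY, Rich emits plain text without ANSI colour codes, which is what makes simple substring assertions on the captured output practical.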

Test plan

  • Empty results list does not crash
  • Method name appears in output
  • Dimension names (e.g. error_solution) appear in output
  • Composite score is formatted in output
  • Multiple methods all shown
  • Table title includes "Evaluation", "LLM", or "Judge"
  • Zero-probe-count dimension shows "—" or "-"
  • print_results returns None
  • Budget appears in column header
  • NDCG row present in output

🤖 Pilgrim wandering — Claude Code

marksverdhei and others added 19 commits February 13, 2026 13:59
…ompaction

- install.sh now auto-configures ~/.claude/settings.json (creates pluginDirs
  entry, idempotent across create/add/already-exists cases)
- uninstall.sh now cleans up the settings.json pluginDirs entry
- Add compact-session.sh: self-contained script that finds JSONL, runs
  supercompact, backs up original, replaces, and reports results
- Simplify /supercompact command from 5-step multi-bash prompt to single
  script call with CLAUDE_PLUGIN_ROOT fallback to hardcoded install path
- Simplify PreCompact hook to backup-only (removes wasted supercompact run
  that Claude's LLM compaction immediately overwrites)
- Update README: accurate hook description, file tree with compact-session.sh,
  update/upgrade docs, standalone binary limitations clearly stated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
export_json: file creation, empty array, result structure (method, budget,
model_key, composite, ndcg), speed/token counts, dimension scores with score
and probe_count, multiple results, valid JSON.
export_trace: file creation, path location, filename contains method/budget,
JSON has method/budget, empty answers → empty entries, matching probe included,
unmatched answer skipped, auto-creates trace dir.
Tests cover ProbeAnswer/JudgeResult dataclass defaults and field storage,
ANSWER_MODELS/JUDGE_MODEL constants, generate_answers with empty probe set,
score_answers missing-probe path (no API call), missing OPENROUTER_API_KEY error,
and _score_one_answer JSON parsing, markdown fence stripping, score clamping,
and bad-JSON fallback.
Tests cover DIFFICULTY_WEIGHTS constant, DimensionScore/AggregateResult dataclass
defaults and fields, dimension_map property, _dcg (empty, single, sorting by weight,
zero scores, position discounting), and aggregate() (empty answers, single/multiple
models, score 0-1 normalisation, missing probes skipped, empty-dimension zero-mean,
perfect/zero/partial NDCG).
Tests cover DIFFICULTY_WEIGHTS constant, ProbeCoverage/DimensionCoverage/
EvidenceCoverageResult dataclasses, dimension_map property, to_dict keys,
_dcg (empty, single, zero-score, weight-sorted), and compute_evidence_coverage
(empty probe set, probe with no evidence_turns skipped, full coverage → 1.0,
zero coverage → 0.0, partial coverage value, kept/dropped lists, multi-probe
mean, NDCG perfect/zero/partial).
Tests cover EntitySet (total_count, all_entities, default dict), ENTITY_TYPES
constants (presence, positive weights), extract_entities for exceptions, URLs,
ports (with range filtering), file paths, CamelCase class names, pip/npm
packages, and HTTP status codes. Also covers compute_coverage (empty-suffix
→ 1.0, empty-kept → 0.0, identical sets → 1.0, breakdown structure, half
coverage, type mismatch → 0.0, weighted vs unweighted divergence).
Tests cover Turn dataclass (defaults, custom, append), _is_user_message
(string content, text block, tool_result block, non-user type, source UUID
injection, empty list), extract_text (string content, multiple records joined,
empty turn, text/thinking/tool_use/tool_result blocks, input truncation at 500
chars, nested tool_result list, non-dict block skipped, multiple blocks
concatenated), and parse_jsonl (empty file, single user, user+assistant,
sequential indexing).
Tests cover ScoredTurn dataclass, build_query (single turn, last-3 slicing,
fewer-than-3, max_chars truncation, empty input, separator), random_scores
(count, ScoredTurn type, 0-1 range, token lookup, missing → 0), Probe dataclass
(fields, default evidence_turns/difficulty), ProbeSet (defaults, to_dict structure,
from_dict roundtrip, missing optional fields), and _format_turns_for_prompt
(content included, header format, ordering, truncation, empty list).
Tests cover SelectionResult defaults, select_turns empty inputs, user turns
always kept (even over budget), user token accumulation, short system turns
always kept within threshold, last system turn always kept regardless of budget,
high-score system turn kept within budget, token sum accounting, budget stored
in result, and kept_turns returned in original index order.
Tests cover write_summary_text (file creation, empty → empty file, User/Assistant
labels, turn index in output, multiple turns with separator, long text truncated,
blank turn skipped), write_compacted_jsonl (file creation, empty → empty file, each
line valid JSON, records in turn order, multiple records per turn), and write_scores_csv
(file creation, header row, score precision, kept flag True/False, sorted by turn
index).
… dedup_scores)

Tests cover SuffixAutomaton invariants (initial state, extend adds states,
match_repeated_length returns one per char, all zeros for unique chars, > 0
for repeated text), _turn_unique_ratio (empty text → 1.0, unique → 1.0,
repeated → lower, result in [0,1]), and dedup_scores (empty system → [],
returns ScoredTurn list, scores in [0,1], tokens from token_counts, one result
per system turn, repeated content scores < 1.0).
Tests cover empty system turns → [], empty all turns → [], returns ScoredTurn
list, scores normalized to [0,1], max score exactly 1.0, tokens from token_counts,
rare entity yields higher score than common entity, and no-entity turns handled
gracefully.
Tests cover empty system turns → [], empty all turns → [], returns ScoredTurn
list, scores normalized to [0,1], max score exactly 1.0, one result per system
turn, tokens wired from token_counts, exclusive entities score ≥ shared entities,
no-entity plain text handled gracefully.
…t_token_pool)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uard, constants)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, plot_type_breakdown)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edScorer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tput)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2 participants