Add benchmark diagnostics metrics by SPerekrestova · Pull Request #72 · SPerekrestova/pillchecker-api

SPerekrestova · 2026-05-22T21:31:43Z

Summary

Adds benchmark-only per-record diagnostics for component timings, RxNorm calls, DDInter/OpenFDA pair attempts, NER record counts, and pipeline errors.
Adds aggregate overall/timing/RxNorm/interaction/error rollups plus schema and summary output updates.
Documents that DDInter/OpenFDA diagnostics are source-routing evidence, not clinical accuracy claims without reviewed labels.

Test Plan

rtk uv run python -m py_compile eval/run_benchmark.py eval/metrics/ner.py eval/metrics/linking.py eval/metrics/interactions.py eval/metrics/pipeline.py
rtk uv run pytest tests/eval tests/test_download_interaction_db.py tests/test_tier1_benchmark_workflow.py -q
rtk uv run pytest tests/ --ignore=tests/test_rxnorm_client.py -q
rtk git diff --check

SPerekrestova · 2026-05-22T21:35:27Z

Code review

A few concrete issues worth addressing before merge.

Bugs

1. slowest_component / slowest_component_ms are degenerate (eval/run_benchmark.py:79-95)

ELAPSED_KEYS contains the aggregate keys analyze, interactions, and total. Per the README these intentionally overlap with the components (analyze ⊇ OCR+NER+RxNorm; interactions ⊇ DDInter+OpenFDA+severity; total ⊇ everything plus overhead). Taking max() over all of them therefore almost always returns total:

timings["slowest_component_ms"] = round(max(timings.get(key, 0.0) for key in ELAPSED_KEYS), 3)
...
def slowest_component(self) -> str:
    return max(ELAPSED_KEYS, key=lambda key: self.elapsed_ms.get(key, 0.0))

Consequence: slowest_component_ms ≡ total (so the field is redundant), slowest_component is essentially always "total", and the aggregate timing.slowest_component_counts / summary.slowest_component collapse to {"total": N} — none of these surface anything useful. The existing test only asserts membership in the full set, so it can't catch this. Exclude analyze, interactions, total from the max.

2. critical_path is max() of components, not the path length (eval/run_benchmark.py:90)

timings["critical_path"] = round(max((timings.get(key, 0.0) for key in critical_keys), default=0.0), 3)

In a sequential pipeline the critical path duration is the sum of the dependent stages, not the largest single stage. As written this is just "slowest of these seven components"; either rename it (e.g. slowest_critical_component_ms) or switch to sum(...).

3. Same exception recorded multiple times in pipeline_errors

When something inside _resolve_pair raises, record_error is called three times for the same exception:

_async_wrapper's finally block: trace.record_error(key, error) (e.g. "openfda"),
_resolve_pair_wrapper's except block: trace.record_error("interactions", exc),
_run_one's except block: trace.record_error("interactions", exc).

pipeline.errors() aggregates errors_ (the run-level list, deduped by raise site), but the per-record pipeline_errors list — which is reported in predictions and feeds the errors rollup downstream consumers may build — gets the same failure 2-3 times with different stage values. At minimum, drop the duplicate record_error call from _resolve_pair_wrapper (the outer wrappers already cover it), or pick one canonical recording site.

4. results carries linking and rxnorm as aliases of the same object (eval/run_benchmark.py:474-482)

linking_results = linking_metrics.compute(predictions, records)
results = {
    ...
    "linking": linking_results,
    "rxnorm": linking_results,
    ...
}

Both keys point to the exact same dict — every consumer now has to decide which to use, the schema makes both required with identical fields, and summary_metrics does an awkward results.get("rxnorm", results.get("linking", {})) fallback that can never hit the fallback. If the intent is to rename linking → rxnorm, just rename; if both must coexist for back-compat, split them so each carries only its relevant fields.

5. build_manifest keyword arg ordering (eval/run_benchmark.py:710-720)

def build_manifest(
    *,
    ...
    sample_size: int,
    output_prefix: str,
    concurrency: int | None = None,
    results: dict,
    random_seed: int | str | None = None,
    ...

Python permits defaulted keyword-only args before required ones, but interleaving them like this (concurrency default sandwiched between required output_prefix and results) is confusing and easy to break on the next addition. Move concurrency after the required block.

Minor

pipeline.timing() uses prediction.get("component_timings_ms", prediction.get("elapsed_ms", {})) — the inner get always runs even when the outer key is present; minor wastefulness.
top_unknown_pairs keys by the raw (drug_a, drug_b) tuple, so ("A","B") and ("B","A") count separately across records. Normalize to a canonical order before tallying if you want true pair counts.
New RxNorm wrapping (approximate_term, search_by_name, get_drug_details) all increment elapsed_ms["rxnorm"], which silently changes the meaning of that bucket vs. prior runs. Worth calling out in the README so the next-vs-previous comparison isn't misread.

Generated by Claude Code

SPerekrestova · 2026-05-22T21:49:56Z

Code review

Reviewed the diff against main. One concrete bug that affects the new diagnostic output, plus one observation worth weighing.

Bug: `get_drug_details` pollutes `canonicalization_collisions`

In eval/run_benchmark.py, _record_rxnorm_attempt is reused for all four wrapped RxNorm methods and uses query = args[0] if args else None. For get_rxcui, approximate_term, and search_by_name, args[0] is a drug name — fine. But rxnorm_client.get_drug_details(rxcui) takes a RxCUI string as args[0] (see app/services/drug_analyzer.py:156). So each successful get_drug_details attempt is recorded with query="<rxcui>" and rxcui="<rxcui>".

In eval/metrics/linking.py::_rxnorm_diagnostics:

queries_by_rxcui[str(rxcui)].add(query)
...
if len({query.casefold() for query in queries if query}) > 1

The rxcui string ends up in queries_by_rxcui[rxcui] alongside legitimate name queries that resolved to the same rxcui (e.g. {"Advil", "5640"}), tripping the >1‑distinct check and reporting the rxcui itself as a "canonicalization collision." The collisions table will surface spurious entries any time the rxnorm_fallback path fires.

Two ways to fix: special‑case get_drug_details in _record_rxnorm_attempt (e.g. don't set query, or store the input under a different key), or filter method == "get_drug_details" out of the queries_by_rxcui accumulation in _rxnorm_diagnostics.

Observation: trace-level error dedup drops stage info

BenchmarkTrace.record_error dedupes on (exc.__class__.__name__, str(exc)) for the whole record. When the same exception bubbles up through multiple stages (component → analyze / interactions), only the first stage is kept. The test test_trace_pipeline_errors_dedupe_same_exception enshrines this, so it's intentional — but if two unrelated stages happen to raise the same class+message, the second one is silently dropped. If you want to keep that signal, include stage in the signature.

Otherwise LGTM — schema/test coverage is solid and the new instrumentation cleans up the wrapper signatures nicely.

Generated by Claude Code

Add benchmark diagnostics metrics

ac47d1d

Address benchmark diagnostics review

ca23aa2

SPerekrestova self-assigned this May 22, 2026

Fix RxNorm diagnostics collisions

5b61a96

SPerekrestova merged commit e9f2464 into main May 22, 2026
5 checks passed

SPerekrestova deleted the codex/benchmark-metrics-expansion branch May 22, 2026 21:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add benchmark diagnostics metrics#72

Add benchmark diagnostics metrics#72
SPerekrestova merged 3 commits into
mainfrom
codex/benchmark-metrics-expansion

SPerekrestova commented May 22, 2026

Uh oh!

SPerekrestova commented May 22, 2026

Uh oh!

SPerekrestova commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SPerekrestova commented May 22, 2026

Summary

Test Plan

Uh oh!

SPerekrestova commented May 22, 2026

Code review

Bugs

Minor

Uh oh!

SPerekrestova commented May 22, 2026

Code review

Bug: get_drug_details pollutes canonicalization_collisions

Observation: trace-level error dedup drops stage info

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug: `get_drug_details` pollutes `canonicalization_collisions`