Skip to content

Add benchmark diagnostics metrics#72

Merged
SPerekrestova merged 3 commits into
mainfrom
codex/benchmark-metrics-expansion
May 22, 2026
Merged

Add benchmark diagnostics metrics#72
SPerekrestova merged 3 commits into
mainfrom
codex/benchmark-metrics-expansion

Conversation

@SPerekrestova
Copy link
Copy Markdown
Owner

Summary

  • Adds benchmark-only per-record diagnostics for component timings, RxNorm calls, DDInter/OpenFDA pair attempts, NER record counts, and pipeline errors.
  • Adds aggregate overall/timing/RxNorm/interaction/error rollups plus schema and summary output updates.
  • Documents that DDInter/OpenFDA diagnostics are source-routing evidence, not clinical accuracy claims without reviewed labels.

Test Plan

  • rtk uv run python -m py_compile eval/run_benchmark.py eval/metrics/ner.py eval/metrics/linking.py eval/metrics/interactions.py eval/metrics/pipeline.py
  • rtk uv run pytest tests/eval tests/test_download_interaction_db.py tests/test_tier1_benchmark_workflow.py -q
  • rtk uv run pytest tests/ --ignore=tests/test_rxnorm_client.py -q
  • rtk git diff --check

Copy link
Copy Markdown
Owner Author

Code review

A few concrete issues worth addressing before merge.

Bugs

1. slowest_component / slowest_component_ms are degenerate (eval/run_benchmark.py:79-95)

ELAPSED_KEYS contains the aggregate keys analyze, interactions, and total. Per the README these intentionally overlap with the components (analyze ⊇ OCR+NER+RxNorm; interactions ⊇ DDInter+OpenFDA+severity; total ⊇ everything plus overhead). Taking max() over all of them therefore almost always returns total:

timings["slowest_component_ms"] = round(max(timings.get(key, 0.0) for key in ELAPSED_KEYS), 3)
...
def slowest_component(self) -> str:
    return max(ELAPSED_KEYS, key=lambda key: self.elapsed_ms.get(key, 0.0))

Consequence: slowest_component_mstotal (so the field is redundant), slowest_component is essentially always "total", and the aggregate timing.slowest_component_counts / summary.slowest_component collapse to {"total": N} — none of these surface anything useful. The existing test only asserts membership in the full set, so it can't catch this. Exclude analyze, interactions, total from the max.

2. critical_path is max() of components, not the path length (eval/run_benchmark.py:90)

timings["critical_path"] = round(max((timings.get(key, 0.0) for key in critical_keys), default=0.0), 3)

In a sequential pipeline the critical path duration is the sum of the dependent stages, not the largest single stage. As written this is just "slowest of these seven components"; either rename it (e.g. slowest_critical_component_ms) or switch to sum(...).

3. Same exception recorded multiple times in pipeline_errors

When something inside _resolve_pair raises, record_error is called three times for the same exception:

  • _async_wrapper's finally block: trace.record_error(key, error) (e.g. "openfda"),
  • _resolve_pair_wrapper's except block: trace.record_error("interactions", exc),
  • _run_one's except block: trace.record_error("interactions", exc).

pipeline.errors() aggregates errors_ (the run-level list, deduped by raise site), but the per-record pipeline_errors list — which is reported in predictions and feeds the errors rollup downstream consumers may build — gets the same failure 2-3 times with different stage values. At minimum, drop the duplicate record_error call from _resolve_pair_wrapper (the outer wrappers already cover it), or pick one canonical recording site.

4. results carries linking and rxnorm as aliases of the same object (eval/run_benchmark.py:474-482)

linking_results = linking_metrics.compute(predictions, records)
results = {
    ...
    "linking": linking_results,
    "rxnorm": linking_results,
    ...
}

Both keys point to the exact same dict — every consumer now has to decide which to use, the schema makes both required with identical fields, and summary_metrics does an awkward results.get("rxnorm", results.get("linking", {})) fallback that can never hit the fallback. If the intent is to rename linkingrxnorm, just rename; if both must coexist for back-compat, split them so each carries only its relevant fields.

5. build_manifest keyword arg ordering (eval/run_benchmark.py:710-720)

def build_manifest(
    *,
    ...
    sample_size: int,
    output_prefix: str,
    concurrency: int | None = None,
    results: dict,
    random_seed: int | str | None = None,
    ...

Python permits defaulted keyword-only args before required ones, but interleaving them like this (concurrency default sandwiched between required output_prefix and results) is confusing and easy to break on the next addition. Move concurrency after the required block.

Minor

  • pipeline.timing() uses prediction.get("component_timings_ms", prediction.get("elapsed_ms", {})) — the inner get always runs even when the outer key is present; minor wastefulness.
  • top_unknown_pairs keys by the raw (drug_a, drug_b) tuple, so ("A","B") and ("B","A") count separately across records. Normalize to a canonical order before tallying if you want true pair counts.
  • New RxNorm wrapping (approximate_term, search_by_name, get_drug_details) all increment elapsed_ms["rxnorm"], which silently changes the meaning of that bucket vs. prior runs. Worth calling out in the README so the next-vs-previous comparison isn't misread.

Generated by Claude Code

@SPerekrestova SPerekrestova self-assigned this May 22, 2026
Copy link
Copy Markdown
Owner Author

Code review

Reviewed the diff against main. One concrete bug that affects the new diagnostic output, plus one observation worth weighing.

Bug: get_drug_details pollutes canonicalization_collisions

In eval/run_benchmark.py, _record_rxnorm_attempt is reused for all four wrapped RxNorm methods and uses query = args[0] if args else None. For get_rxcui, approximate_term, and search_by_name, args[0] is a drug name — fine. But rxnorm_client.get_drug_details(rxcui) takes a RxCUI string as args[0] (see app/services/drug_analyzer.py:156). So each successful get_drug_details attempt is recorded with query="<rxcui>" and rxcui="<rxcui>".

In eval/metrics/linking.py::_rxnorm_diagnostics:

queries_by_rxcui[str(rxcui)].add(query)
...
if len({query.casefold() for query in queries if query}) > 1

The rxcui string ends up in queries_by_rxcui[rxcui] alongside legitimate name queries that resolved to the same rxcui (e.g. {"Advil", "5640"}), tripping the >1‑distinct check and reporting the rxcui itself as a "canonicalization collision." The collisions table will surface spurious entries any time the rxnorm_fallback path fires.

Two ways to fix: special‑case get_drug_details in _record_rxnorm_attempt (e.g. don't set query, or store the input under a different key), or filter method == "get_drug_details" out of the queries_by_rxcui accumulation in _rxnorm_diagnostics.

Observation: trace-level error dedup drops stage info

BenchmarkTrace.record_error dedupes on (exc.__class__.__name__, str(exc)) for the whole record. When the same exception bubbles up through multiple stages (component → analyze / interactions), only the first stage is kept. The test test_trace_pipeline_errors_dedupe_same_exception enshrines this, so it's intentional — but if two unrelated stages happen to raise the same class+message, the second one is silently dropped. If you want to keep that signal, include stage in the signature.

Otherwise LGTM — schema/test coverage is solid and the new instrumentation cleans up the wrapper signatures nicely.


Generated by Claude Code

@SPerekrestova SPerekrestova merged commit e9f2464 into main May 22, 2026
5 checks passed
@SPerekrestova SPerekrestova deleted the codex/benchmark-metrics-expansion branch May 22, 2026 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant