
Add LoCoMo and BEAM benchmark harnesses #204

Merged
ved015 merged 5 commits into main from benchmark/locomo-beam-python-20260529 on May 29, 2026

Conversation

Contributor

ved015 commented May 29, 2026

Summary

  • add modular Python-only benchmark harnesses for LoCoMo and BEAM 1M
  • add shared benchmark helpers for downloads, JSON/JSONL/parquet loading, local metrics, and XMem API calls
  • document smoke checks, required keys, outputs, and artifact ignore paths

References

  • LoCoMo: snap-research/locomo data/locomo10.json format with conversation sessions and annotated qa
  • BEAM: Hugging Face Mohammadta/BEAM parquet splits, defaulting to 1M

Validation

  • python -m compileall benchmarks/common benchmarks/locomo benchmarks/beam
  • python -m benchmarks.locomo.run --dataset-path /private/tmp/locomo-mini.json --dry-run --output-dir /private/tmp/locomo-mini-out-2
  • python -m benchmarks.beam.run --dataset-path /private/tmp/beam-mini.json --dry-run --output-dir /private/tmp/beam-mini-out-2
  • git diff --check

No real XMem API benchmark run was performed.

Contributor

github-actions Bot commented May 29, 2026

Warnings
⚠️

📦 This PR changes 1970 lines (additions + deletions). Large PRs are harder to review thoroughly — consider splitting it.

Messages
📖

✅ Targeting main. Please squash commits before merging to keep the git history clean.

Generated by 🚫 dangerJS against f6fd5cd

@github-actions
Contributor

✅ Staging Deployment Report

| Item | Value |
| --- | --- |
| Branch | benchmark/locomo-beam-python-20260529 |
| Commit | 5b32162 |
| Environment | Staging |
| Health | http://3.6.255.148:8001/health |
| API Docs | http://3.6.255.148:8001/docs |
| Smoke Tests | success |

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.


gemini-code-assist Bot left a comment

Code Review

This pull request introduces benchmark harnesses for the Python XMem API, adding support for the BEAM and LoCoMo datasets. It establishes a shared common module with utilities for file I/O, evaluation metrics, and an asynchronous HTTP client, alongside dataset-specific loaders, runners, and documentation. The review feedback highlights several opportunities to improve performance and robustness, such as optimizing literal coercion by attempting JSON parsing first, refactoring the token F1 metric to use collections.Counter for linear-time complexity, preventing the swallowing of API error messages by delaying raise_for_status(), and deduplicating session numbers during LoCoMo parsing.
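The `collections.Counter` suggestion for token F1 can be illustrated with a minimal sketch. It assumes SQuAD-style normalisation (lowercasing, punctuation and article removal); the function names and the exact normalisation rules are illustrative and are not taken from `benchmarks/common/metrics.py`.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 using a multiset (Counter) intersection for the overlap."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Assumption: two empty answers match, one empty answer is a miss.
        return float(pred_tokens == ref_tokens)
    # Counter & Counter keeps each token min(pred_count, ref_count) times,
    # so the overlap is computed in time linear in the number of tokens.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the intersection is a multiset, a prediction that repeats a correct token several times is not credited more than once per occurrence in the reference.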

Comment thread benchmarks/beam/dataset.py
Comment thread benchmarks/common/xmem.py Outdated
Comment thread benchmarks/common/metrics.py
Comment thread benchmarks/locomo/dataset.py Outdated

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR introduces modular Python benchmark harnesses for the LoCoMo and BEAM 1M datasets alongside shared utilities (common/io.py, common/metrics.py, common/xmem.py) used by both harnesses. The addition also ships a GPT-4o LLM-as-judge evaluator for BEAM.

  • New harnesses: benchmarks/locomo/ and benchmarks/beam/ each add config, dataset loading, runner orchestration, and a CLI entry point mirroring the existing longmemeval structure.
  • Shared layer: benchmarks/common/ consolidates download, JSON/JSONL/Parquet IO, local metrics (token-F1, contains, exact), and an async httpx-based XMem API client with retry/backoff and v2 job polling.
  • BEAM evaluator: benchmarks/beam/evaluate.py adds an OpenAI LLM-as-judge pass-rate pipeline driven by per-question rubric items from the dataset.
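The job-polling behaviour mentioned for the shared client can be sketched with the standard library alone. This is a hypothetical illustration, not the code in `benchmarks/common/xmem.py`: `poll_until_done`, the injected `fetch_status` coroutine, the `failed` status, and the backoff parameters are all assumptions. Only the `succeeded` status appears in the PR's sequence diagram.

```python
import asyncio
from typing import Any, Awaitable, Callable

TERMINAL_OK = {"succeeded"}   # status shown in the PR's sequence diagram
TERMINAL_FAILED = {"failed"}  # assumed failure status, not confirmed by the PR


async def poll_until_done(
    fetch_status: Callable[[str], Awaitable[dict[str, Any]]],
    status_url: str,
    *,
    interval: float = 1.0,
    max_interval: float = 10.0,
    timeout: float = 300.0,
) -> dict[str, Any]:
    """Poll a job status URL with exponential backoff until it finishes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = interval
    while True:
        payload = await fetch_status(status_url)
        status = payload.get("status")
        if status in TERMINAL_OK:
            return payload
        if status in TERMINAL_FAILED:
            raise RuntimeError(f"job failed: {payload}")
        if loop.time() + delay > deadline:
            raise TimeoutError(f"job at {status_url} did not finish in {timeout}s")
        await asyncio.sleep(delay)
        # Double the wait after each attempt, capped at max_interval.
        delay = min(delay * 2, max_interval)
```

Injecting `fetch_status` keeps the loop independent of any particular HTTP library, so it can be exercised in a dry run with a fake coroutine.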

Confidence Score: 5/5

Safe to merge; all changes are additive benchmark tooling with no effect on production code paths.

The PR adds entirely new benchmark harnesses that are never imported by the main application. The shared utilities are well-structured, and the download fix (atomic temp-file rename) and ingest-pair fix (non-overlapping role-aware pairs) from prior review rounds are both present in this revision. The two findings here are quality concerns in the benchmark scoring layer (redundant API calls and a generic fallback rubric), neither of which affects the correctness of the main codebase.

benchmarks/locomo/runner.py (redundant per-question ingest) and benchmarks/beam/evaluate.py (generic fallback rubric criterion) warrant a second look before running full-scale benchmark evaluations.
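The atomic temp-file rename pattern credited above can be sketched as follows. This is a minimal illustration, not the code in `benchmarks/common/io.py`: `atomic_write` is a hypothetical helper that takes an iterable of byte chunks (for example, a streamed HTTP body) so that it stays independent of any download library.

```python
import contextlib
import os
import tempfile
from pathlib import Path
from typing import Iterable


def atomic_write(destination: Path, chunks: Iterable[bytes]) -> Path:
    """Write chunks to a temp file, then atomically move it into place."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so that os.replace
    # never has to cross a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=destination.parent, prefix=destination.name + ".", suffix=".part"
    )
    try:
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
        # Only a fully written file ever appears under the final name.
        os.replace(tmp_name, destination)
    except BaseException:
        # Clean up the partial file and leave any existing destination intact.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return destination
```

If the chunk source fails midway, the previous contents of the destination (if any) are untouched, so an interrupted download cannot leave a truncated dataset on disk.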

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarks/locomo/runner.py | Orchestrates per-example ingest+retrieve; user_id is derived from question_id, causing full re-ingest of shared session history for every QA item in a sample. |
| benchmarks/locomo/dataset.py | Parses LoCoMo sessions and builds ingest items; exchange-pair logic correctly uses role-aware non-overlapping pairs after the previous fix. |
| benchmarks/beam/runner.py | Orchestrates BEAM per-example ingest+retrieve with balanced question-type sampling; same per-question user_id re-ingest pattern as LoCoMo. |
| benchmarks/beam/evaluate.py | LLM-as-judge BEAM evaluator; falls back to a generic rubric criterion when both rubric list and reference_answer are empty, which can silently inflate pass rates. |
| benchmarks/common/xmem.py | Async XMem HTTP client with retry/backoff, job polling, and proper error propagation; logic is correct. |
| benchmarks/common/io.py | Atomic download via temp-file rename, JSON/JSONL/Parquet readers, append-JSONL writer; all issues from prior review are addressed. |
| benchmarks/common/metrics.py | Local token-F1, exact-match, and contains metrics with article normalisation; straightforward and correct. |
| benchmarks/beam/dataset.py | Parses BEAM parquet rows including probing-question normalisation; exchange-pair logic is non-overlapping after previous fix. |

Sequence Diagram

sequenceDiagram
    participant CLI as run.py (CLI)
    participant Runner as Runner (beam/locomo)
    participant DS as dataset.py
    participant XMem as XMemApiClient
    participant API as XMem HTTP API

    CLI->>Runner: run()
    Runner->>DS: load_examples(path)
    DS-->>Runner: [Example...]
    Runner->>DS: select_examples / _sample_examples
    DS-->>Runner: [filtered Examples]

    loop For each Example
        Runner->>DS: build_ingest_items(example, user_id)
        DS-->>Runner: [IngestItem...]
        loop Batches of batch_size
            Runner->>XMem: batch_ingest_v2(items)
            XMem->>API: POST /v2/memory/batch-ingest
            API-->>XMem: "{status_url}"
            Runner->>XMem: poll_job(status_url)
            XMem->>API: GET status_url (repeated)
            API-->>XMem: "{status: succeeded}"
        end
        Runner->>XMem: "retrieve({query, user_id, top_k})"
        XMem->>API: POST /v1/memory/retrieve
        API-->>XMem: "{answer, sources, confidence}"
        XMem-->>Runner: ApiCallResult
        Runner->>Runner: score_answer + append_jsonl
    end

    Runner->>Runner: summarize_results + write_json

Reviews (5): Last reviewed commit: "Fix BEAM sampling and evaluation"

Comment thread benchmarks/beam/dataset.py
Comment thread benchmarks/locomo/dataset.py
Comment thread benchmarks/common/io.py
Member

Addressed the useful bot feedback in 843c2a0:

  • benchmarks/common/io.py: downloads now write to a temporary file and atomically replace the final path only after a complete download.
  • benchmarks/common/xmem.py: API error payloads are parsed before raise_for_status() so benchmark users see XMem's error message when available.
  • benchmarks/beam/dataset.py: BEAM string payloads now try json.loads() before falling back to ast.literal_eval().
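The `json.loads()` then `ast.literal_eval()` fallback order described in the last bullet can be sketched like this. The helper name `coerce_literal` and the choice to return the original string when both parsers fail are assumptions made for illustration, not details confirmed by the PR.

```python
import ast
import json
from typing import Any


def coerce_literal(value: Any) -> Any:
    """Turn a string payload into a Python object, trying JSON first."""
    if not isinstance(value, str):
        return value
    text = value.strip()
    if not text:
        return value
    try:
        # JSON is the common case and is cheaper and stricter to parse.
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    try:
        # Fallback for Python-repr payloads such as "{'a': (1, 2)}".
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Assumption: unparseable text is passed through unchanged.
        return value
```

Trying JSON first means well-formed JSON payloads never reach the Python literal parser, which only handles the rows that were serialised with `repr()`-style quoting.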

Validation rerun locally:

  • python -m compileall benchmarks/common benchmarks/locomo benchmarks/beam
  • git diff --check
  • line-length scan for benchmark Python files
  • BEAM literal parser smoke check
  • LoCoMo dry run with a local mini dataset
  • BEAM dry run with a local mini dataset

@github-actions
Contributor

✅ Staging Deployment Report

| Item | Value |
| --- | --- |
| Branch | benchmark/locomo-beam-python-20260529 |
| Commit | 54f5eb4 |
| Environment | Staging |
| Health | http://3.6.255.148:8001/health |
| API Docs | http://3.6.255.148:8001/docs |
| Smoke Tests | success |

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

@github-actions
Contributor

✅ Staging Deployment Report

| Item | Value |
| --- | --- |
| Branch | benchmark/locomo-beam-python-20260529 |
| Commit | 6bd5f4f |
| Environment | Staging |
| Health | http://3.6.255.148:8001/health |
| API Docs | http://3.6.255.148:8001/docs |
| Smoke Tests | success |

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

Member

Pushed a second review batch in 553540c:

  • Fixed BEAM and LoCoMo ingest builders to emit non-overlapping complete exchange pairs instead of sliding windows.
  • Added role-aware pairing when assistant-like roles are present, with stride-2 fallback for speaker-labeled dialogs.
  • Deduplicated parsed LoCoMo session numbers before sorting.
  • Switched token_f1 overlap to Counter multiset intersection.
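The pairing behaviour described in this batch (role-aware pairs when assistant-like roles exist, otherwise a stride-2 walk) can be sketched as below. The turn schema (`role` and `text` keys), the set of assistant-like role names, and the handling of consecutive user turns are assumptions for illustration; the actual builders in `benchmarks/beam/dataset.py` and `benchmarks/locomo/dataset.py` may differ.

```python
from typing import Dict, List, Tuple

Turn = Dict[str, str]

# Assumed set of assistant-like role labels; not taken from the PR.
ASSISTANT_ROLES = {"assistant", "ai", "bot"}


def _is_assistant(turn: Turn) -> bool:
    return turn.get("role", "").lower() in ASSISTANT_ROLES


def exchange_pairs(turns: List[Turn]) -> List[Tuple[Turn, Turn]]:
    """Return non-overlapping (prompt, reply) pairs; unpaired turns are dropped."""
    pairs: List[Tuple[Turn, Turn]] = []
    if any(_is_assistant(t) for t in turns):
        # Role-aware mode: pair each assistant turn with the latest
        # unanswered non-assistant turn before it.
        pending = None
        for turn in turns:
            if _is_assistant(turn):
                if pending is not None:
                    pairs.append((pending, turn))
                    pending = None
            else:
                pending = turn
        return pairs
    # Speaker-labeled dialog (no assistant role): step through in strides
    # of two so that no turn appears in more than one pair.
    for i in range(0, len(turns) - 1, 2):
        pairs.append((turns[i], turns[i + 1]))
    return pairs
```

In contrast, a sliding window over the same turns would emit `(t0, t1), (t1, t2), ...` and ingest every interior turn twice.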

Validation rerun locally:

  • python -m compileall benchmarks/common benchmarks/locomo benchmarks/beam
  • git diff --check
  • line-length scan for benchmark Python files
  • exchange pairing + metrics smoke check
  • LoCoMo dry run with a local mini dataset
  • BEAM dry run with a local mini dataset

@github-actions
Contributor

✅ Staging Deployment Report

| Item | Value |
| --- | --- |
| Branch | benchmark/locomo-beam-python-20260529 |
| Commit | f79bfd7 |
| Environment | Staging |
| Health | http://3.6.255.148:8001/health |
| API Docs | http://3.6.255.148:8001/docs |
| Smoke Tests | success |

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

Member

Added the changelog entry in 6338c22.

Final workflow status on the latest PR branch is green:

  • Danger PR Review Bot: success
  • Deploy PR to Staging: success
  • Security Scan: success
  • Test Suite: success
  • PR Labeler: success

Only the large-PR-size warning remains, which is expected for adding two benchmark harnesses plus docs.

Member

Pushed f6fd5cd with final BEAM runner fixes after stopping the live benchmark run per request.

What changed:

  • Fixed BEAM parsing for the official probing_questions structure, including ideal_response, rubric, and the outer BEAM ability category names.
  • Added --sample-percent-per-question-type, --sample-min-per-question-type, and --sample-seed so users can run balanced 1%/10% category slices.
  • Added benchmarks.beam.evaluate for rubric-based OpenAI judge evaluation and evaluation_summary.json pass-rate reporting.
  • Updated the BEAM README with balanced-slice and judge-evaluation usage.
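The balanced per-question-type slicing enabled by the three new flags can be sketched as follows. This is a hypothetical illustration: the `question_type` field name, the floor-based rounding, and the function name are assumptions, and the runner's actual selection rule may differ.

```python
import math
import random
from collections import defaultdict
from typing import Any, Dict, List


def sample_per_question_type(
    examples: List[Dict[str, Any]],
    percent: float,
    min_per_type: int = 1,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Pick roughly `percent` of each question type, at least `min_per_type`."""
    rng = random.Random(seed)  # a fixed seed makes the slice reproducible
    by_type: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for example in examples:
        by_type[example["question_type"]].append(example)

    selected: List[Dict[str, Any]] = []
    # Iterate types in sorted order so the result does not depend on input order
    # of the categories themselves.
    for question_type in sorted(by_type):
        group = by_type[question_type]
        target = max(min_per_type, math.floor(len(group) * percent / 100))
        selected.extend(rng.sample(group, min(target, len(group))))
    return selected
```

The per-type minimum guarantees that small categories still appear in a low-percentage slice, which keeps every category represented in the resulting summary.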

Validation:

  • python -m compileall benchmarks/common benchmarks/beam
  • git diff --check
  • line-length scan for benchmark Python files
  • BEAM 1M dry run with --sample-percent-per-question-type 1: selected 10 examples, exactly 1 from each of the 10 BEAM categories, 11,048 ingest pairs.

No real benchmark results were committed or posted. The live 1% run was interrupted before any result row/evaluation summary was produced.

@github-actions
Contributor

✅ Staging Deployment Report

| Item | Value |
| --- | --- |
| Branch | benchmark/locomo-beam-python-20260529 |
| Commit | f3ebd32 |
| Environment | Staging |
| Health | http://3.6.255.148:8001/health |
| API Docs | http://3.6.255.148:8001/docs |
| Smoke Tests | success |

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

ved015 merged commit 89ac998 into main on May 29, 2026
15 checks passed

2 participants