Add LoCoMo and BEAM benchmark harnesses by ved015 · Pull Request #204 · XortexAI/XMem

ved015 · 2026-05-29T08:43:43Z

Summary

Add modular Python-only benchmark harnesses for LoCoMo and BEAM 1M.
Add shared benchmark helpers for downloads, JSON/JSONL/Parquet loading, local metrics, and XMem API calls.
Document smoke checks, required keys, outputs, and artifact ignore paths.

References

LoCoMo: the data/locomo10.json format from snap-research/locomo, with conversation sessions and annotated QA pairs.
BEAM: Parquet splits from the Hugging Face dataset Mohammadta/BEAM, defaulting to the 1M split.

Validation

python -m compileall benchmarks/common benchmarks/locomo benchmarks/beam
python -m benchmarks.locomo.run --dataset-path /private/tmp/locomo-mini.json --dry-run --output-dir /private/tmp/locomo-mini-out-2
python -m benchmarks.beam.run --dataset-path /private/tmp/beam-mini.json --dry-run --output-dir /private/tmp/beam-mini-out-2
git diff --check

No real XMem API benchmark run was performed.

github-actions · 2026-05-29T08:44:06Z

	Warnings
⚠️	📦 This PR changes 1970 lines (additions + deletions). Large PRs are harder to review thoroughly — consider splitting it.

	Messages
📖	✅ Targeting `main`. Please squash commits before merging to keep the git history clean.

Generated by 🚫 dangerJS against f6fd5cd

github-actions · 2026-05-29T08:44:45Z

✅ Staging Deployment Report

Item	Value
Branch	`benchmark/locomo-beam-python-20260529`
Commit	`5b32162`
Environment	Staging
Health	http://3.6.255.148:8001/health
API Docs	http://3.6.255.148:8001/docs
Smoke Tests	success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

gemini-code-assist

Code Review

This pull request introduces benchmark harnesses for the Python XMem API, adding support for the BEAM and LoCoMo datasets. It establishes a shared common module with utilities for file I/O, evaluation metrics, and an asynchronous HTTP client, alongside dataset-specific loaders, runners, and documentation. The review feedback highlights several opportunities to improve performance and robustness, such as optimizing literal coercion by attempting JSON parsing first, refactoring the token F1 metric to use collections.Counter for linear-time complexity, preventing the swallowing of API error messages by delaying raise_for_status(), and deduplicating session numbers during LoCoMo parsing.
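The Counter-based token F1 suggested in the review can be sketched as follows. This is a minimal illustration, not the code in benchmarks/common/metrics.py: the `normalize` helper and its exact rules (lowercasing, punctuation stripping, dropping English articles) are assumptions, as is the handling of empty inputs.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace.

    These normalisation rules are an assumption for illustration.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Assumed convention: both empty is a match, one empty is a miss.
        return float(pred_tokens == ref_tokens)
    # Counter & Counter keeps the minimum count per token, which is the
    # multiset intersection; building both counters is linear in token count.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

The multiset intersection matters for repeated tokens: a prediction that repeats a word three times only gets credit for as many occurrences as the reference contains.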

greptile-apps · 2026-05-29T08:49:22Z

Greptile Summary

This PR introduces modular Python benchmark harnesses for the LoCoMo and BEAM 1M datasets alongside shared utilities (common/io.py, common/metrics.py, common/xmem.py) used by both harnesses. The addition also ships a GPT-4o LLM-as-judge evaluator for BEAM.

New harnesses: benchmarks/locomo/ and benchmarks/beam/ each add config, dataset loading, runner orchestration, and a CLI entry point mirroring the existing longmemeval structure.
Shared layer: benchmarks/common/ consolidates download, JSON/JSONL/Parquet IO, local metrics (token-F1, contains, exact), and an async httpx-based XMem API client with retry/backoff and v2 job polling.
BEAM evaluator: benchmarks/beam/evaluate.py adds an OpenAI LLM-as-judge pass-rate pipeline driven by per-question rubric items from the dataset.

Confidence Score: 5/5

Safe to merge; all changes are additive benchmark tooling with no effect on production code paths.

The PR adds entirely new benchmark harnesses that are never imported by the main application. The shared utilities are well-structured, and the download fix (atomic temp-file rename) and ingest-pair fix (non-overlapping role-aware pairs) from prior review rounds are present in this revision. The two findings here are quality concerns in the benchmark scoring layer (redundant API calls and a generic fallback rubric), neither of which affects the correctness of the main codebase.
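For reference, the atomic temp-file rename pattern mentioned above can be sketched with the standard library alone. The function name `atomic_write` and the `.part` suffix are hypothetical; the actual helper in benchmarks/common/io.py may differ.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def atomic_write(chunks: Iterable[bytes], destination: Path) -> Path:
    """Stream chunks to a temp file, then swap it into place in one step."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so the final
    # os.replace stays on one filesystem, where it is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
        os.replace(tmp_name, destination)
    except BaseException:
        # A failed or interrupted download must not leave a partial file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return destination
```

With this shape, an interrupted download leaves any previously cached dataset untouched, so a later run never parses a truncated file.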

benchmarks/locomo/runner.py (redundant per-question ingest) and benchmarks/beam/evaluate.py (generic fallback rubric criterion) warrant a second look before running full-scale benchmark evaluations.

Important Files Changed

Filename	Overview
benchmarks/locomo/runner.py	Orchestrates per-example ingest+retrieve; user_id is derived from question_id, causing full re-ingest of shared session history for every QA item in a sample.
benchmarks/locomo/dataset.py	Parses LoCoMo sessions and builds ingest items; exchange-pair logic correctly uses role-aware non-overlapping pairs after the previous fix.
benchmarks/beam/runner.py	Orchestrates BEAM per-example ingest+retrieve with balanced question-type sampling; same per-question user_id re-ingest pattern as LoCoMo.
benchmarks/beam/evaluate.py	LLM-as-judge BEAM evaluator; falls back to a generic rubric criterion when both rubric list and reference_answer are empty, which can silently inflate pass rates.
benchmarks/common/xmem.py	Async XMem HTTP client with retry/backoff, job polling, and proper error propagation; logic is correct.
benchmarks/common/io.py	Atomic download via temp-file rename, JSON/JSONL/Parquet readers, append-JSONL writer; all issues from prior review are addressed.
benchmarks/common/metrics.py	Local token-F1, exact-match, and contains metrics with article normalisation; straightforward and correct.
benchmarks/beam/dataset.py	Parses BEAM parquet rows including probing-question normalisation; exchange-pair logic is non-overlapping after previous fix.

Sequence Diagram

sequenceDiagram
    participant CLI as run.py (CLI)
    participant Runner as Runner (beam/locomo)
    participant DS as dataset.py
    participant XMem as XMemApiClient
    participant API as XMem HTTP API

    CLI->>Runner: run()
    Runner->>DS: load_examples(path)
    DS-->>Runner: [Example...]
    Runner->>DS: select_examples / _sample_examples
    DS-->>Runner: [filtered Examples]

    loop For each Example
        Runner->>DS: build_ingest_items(example, user_id)
        DS-->>Runner: [IngestItem...]
        loop Batches of batch_size
            Runner->>XMem: batch_ingest_v2(items)
            XMem->>API: POST /v2/memory/batch-ingest
            API-->>XMem: "{status_url}"
            Runner->>XMem: poll_job(status_url)
            XMem->>API: GET status_url (repeated)
            API-->>XMem: "{status: succeeded}"
        end
        Runner->>XMem: "retrieve({query, user_id, top_k})"
        XMem->>API: POST /v1/memory/retrieve
        API-->>XMem: "{answer, sources, confidence}"
        XMem-->>Runner: ApiCallResult
        Runner->>Runner: score_answer + append_jsonl
    end

    Runner->>Runner: summarize_results + write_json
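The poll_job step in the diagram can be illustrated with a small synchronous sketch. It is an assumption-heavy simplification: the real client is described as async and httpx-based, whereas here `fetch_status` is a hypothetical callable standing in for the GET on status_url, the `"failed"` status value is assumed, and the delay and timeout defaults are made up. Only the `"succeeded"` status comes from the diagram.

```python
import time
from typing import Any, Callable, Mapping


def poll_job(
    fetch_status: Callable[[], Mapping[str, Any]],
    *,
    timeout_s: float = 300.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Call fetch_status until the job succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        payload = fetch_status()
        status = payload.get("status")
        if status == "succeeded":
            return payload
        if status == "failed":  # assumed terminal failure status
            raise RuntimeError(f"ingest job failed: {payload}")
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        sleep(delay)
        # Exponential backoff, capped so long jobs are still checked regularly.
        delay = min(delay * 2, max_delay_s)
```

Injecting `sleep` keeps the loop testable without real waiting.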

Reviews (5): Last reviewed commit: "Fix BEAM sampling and evaluation"

ishaanxgupta · 2026-05-29T08:52:54Z

Addressed the useful bot feedback in 843c2a0:

benchmarks/common/io.py: downloads now write to a temporary file and atomically replace the final path only after a complete download.
benchmarks/common/xmem.py: API error payloads are parsed before raise_for_status() so benchmark users see XMem's error message when available.
benchmarks/beam/dataset.py: BEAM string payloads now try json.loads() before falling back to ast.literal_eval().
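The JSON-first parsing order described in the last item can be sketched like this. The function name `coerce_literal` and the choice to return the original string when both parsers fail are assumptions for illustration, not necessarily what benchmarks/beam/dataset.py does.

```python
import ast
import json
from typing import Any


def coerce_literal(value: Any) -> Any:
    """Parse a string payload as JSON first, then as a Python literal."""
    if not isinstance(value, str):
        return value
    text = value.strip()
    if not text:
        return value
    try:
        # json.loads is the fast path and covers well-formed JSON payloads.
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    try:
        # Fallback for Python-repr payloads such as single-quoted dicts.
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Assumed behaviour: leave unparseable text unchanged.
        return value
```

For example, `coerce_literal('{"a": [1, 2]}')` goes through the JSON path, while `coerce_literal("{'a': (1, 2)}")` is not valid JSON and falls through to `ast.literal_eval`.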

Validation rerun locally:

python -m compileall benchmarks/common benchmarks/locomo benchmarks/beam
git diff --check
line-length scan for benchmark Python files
BEAM literal parser smoke check
LoCoMo dry run with a local mini dataset
BEAM dry run with a local mini dataset

github-actions · 2026-05-29T08:53:42Z

✅ Staging Deployment Report

Item	Value
Branch	`benchmark/locomo-beam-python-20260529`
Commit	`54f5eb4`
Environment	Staging
Health	http://3.6.255.148:8001/health
API Docs	http://3.6.255.148:8001/docs
Smoke Tests	success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

github-actions · 2026-05-29T09:01:52Z

✅ Staging Deployment Report

Item	Value
Branch	`benchmark/locomo-beam-python-20260529`
Commit	`6bd5f4f`
Environment	Staging
Health	http://3.6.255.148:8001/health
API Docs	http://3.6.255.148:8001/docs
Smoke Tests	success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

ishaanxgupta · 2026-05-29T09:02:09Z

Pushed a second review batch in 553540c:

Fixed BEAM and LoCoMo ingest builders to emit non-overlapping complete exchange pairs instead of sliding windows.
Added role-aware pairing when assistant-like roles are present, with stride-2 fallback for speaker-labeled dialogs.
Deduplicated parsed LoCoMo session numbers before sorting.
Switched token_f1 overlap to Counter multiset intersection.
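To make the pairing change concrete, here is a minimal sketch of non-overlapping exchange pairs with a role-aware path and a stride-2 fallback. The turn shape (`role`/`content` keys), the `ASSISTANT_ROLES` set, and the function name are hypothetical, and the handling of consecutive user turns is one possible choice rather than the harness's confirmed behaviour.

```python
from typing import Dict, List, Optional, Tuple

Turn = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}
ASSISTANT_ROLES = {"assistant", "ai", "bot"}  # hypothetical role names


def build_exchange_pairs(turns: List[Turn]) -> List[Tuple[Turn, Turn]]:
    """Return complete, non-overlapping (prompt, reply) pairs."""
    pairs: List[Tuple[Turn, Turn]] = []
    if any(t["role"].lower() in ASSISTANT_ROLES for t in turns):
        # Role-aware path: pair each assistant turn with the latest
        # unanswered non-assistant turn; each turn is used at most once.
        pending: Optional[Turn] = None
        for turn in turns:
            if turn["role"].lower() in ASSISTANT_ROLES:
                if pending is not None:
                    pairs.append((pending, turn))
                    pending = None
            else:
                pending = turn
        return pairs
    # Stride-2 fallback for speaker-labeled dialogs: (0,1), (2,3), ...
    # A trailing unpaired turn is dropped because the pair is incomplete.
    for i in range(0, len(turns) - 1, 2):
        pairs.append((turns[i], turns[i + 1]))
    return pairs
```

A sliding window would instead emit (0,1), (1,2), (2,3), ingesting most turns twice; the stride-2 and role-aware paths ingest each turn at most once.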

Validation rerun locally:

python -m compileall benchmarks/common benchmarks/locomo benchmarks/beam
git diff --check
line-length scan for benchmark Python files
exchange pairing + metrics smoke check
LoCoMo dry run with a local mini dataset
BEAM dry run with a local mini dataset

github-actions · 2026-05-29T09:09:45Z

✅ Staging Deployment Report

Item	Value
Branch	`benchmark/locomo-beam-python-20260529`
Commit	`f79bfd7`
Environment	Staging
Health	http://3.6.255.148:8001/health
API Docs	http://3.6.255.148:8001/docs
Smoke Tests	success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

ishaanxgupta · 2026-05-29T09:14:48Z

Added the changelog entry in 6338c22.

Final workflow status on the latest PR branch is green:

Danger PR Review Bot: success
Deploy PR to Staging: success
Security Scan: success
Test Suite: success
PR Labeler: success

Only the large-PR-size warning remains, which is expected for adding two benchmark harnesses plus docs.

ishaanxgupta · 2026-05-29T09:32:11Z

Pushed f6fd5cd with final BEAM runner fixes after stopping the live benchmark run per request.

What changed:

Fixed BEAM parsing for the official probing_questions structure, including ideal_response, rubric, and the outer BEAM ability category names.
Added --sample-percent-per-question-type, --sample-min-per-question-type, and --sample-seed so users can run balanced 1%/10% category slices.
Added benchmarks.beam.evaluate for rubric-based OpenAI judge evaluation and evaluation_summary.json pass-rate reporting.
Updated the BEAM README with balanced-slice and judge-evaluation usage.
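The balanced-slice flags can be illustrated with a small sketch of per-category sampling. The `question_type` key, the function name, and the use of a ceiling when converting the percentage to a count are assumptions; the actual runner may round differently.

```python
import math
import random
from collections import defaultdict
from typing import Any, Dict, List


def sample_per_question_type(
    examples: List[Dict[str, Any]],
    percent: float,
    minimum: int = 1,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Pick roughly `percent` of each question type, at least `minimum`."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for example in examples:
        groups[example["question_type"]].append(example)
    rng = random.Random(seed)  # a fixed seed makes the slice reproducible
    selected: List[Dict[str, Any]] = []
    for question_type in sorted(groups):  # stable order across runs
        bucket = groups[question_type]
        target = max(minimum, math.ceil(len(bucket) * percent / 100))
        target = min(target, len(bucket))  # never ask for more than exist
        selected.extend(rng.sample(bucket, target))
    return selected
```

Sampling within each category, rather than across the whole dataset, is what keeps small slices from omitting rare question types.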

Validation:

python -m compileall benchmarks/common benchmarks/beam
git diff --check
line-length scan for benchmark Python files
BEAM 1M dry run with --sample-percent-per-question-type 1: selected 10 examples, exactly 1 from each of the 10 BEAM categories, 11,048 ingest pairs.

No real benchmark results were committed or posted. The live 1% run was interrupted before any result row/evaluation summary was produced.

github-actions · 2026-05-29T09:32:56Z

✅ Staging Deployment Report

Item	Value
Branch	`benchmark/locomo-beam-python-20260529`
Commit	`f3ebd32`
Environment	Staging
Health	http://3.6.255.148:8001/health
API Docs	http://3.6.255.148:8001/docs
Smoke Tests	success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

Add LoCoMo and BEAM benchmark harnesses

eb4aa06

ved015 requested a review from ishaanxgupta as a code owner May 29, 2026 08:43

ved015 temporarily deployed to staging May 29, 2026 08:43 — with GitHub Actions Inactive

github-actions Bot temporarily deployed to staging May 29, 2026 08:43 Inactive

ved015 temporarily deployed to staging May 29, 2026 08:44 — with GitHub Actions Inactive

gemini-code-assist Bot reviewed May 29, 2026

Comment thread benchmarks/beam/dataset.py

Comment thread benchmarks/common/xmem.py Outdated

Comment thread benchmarks/common/metrics.py

Comment thread benchmarks/locomo/dataset.py Outdated

greptile-apps Bot reviewed May 29, 2026

Comment thread benchmarks/beam/dataset.py

Comment thread benchmarks/locomo/dataset.py

Comment thread benchmarks/common/io.py

Address benchmark review feedback

843c2a0

ved015 temporarily deployed to staging May 29, 2026 08:52 — with GitHub Actions Inactive

github-actions Bot temporarily deployed to staging May 29, 2026 08:52 Inactive

ved015 temporarily deployed to staging May 29, 2026 08:53 — with GitHub Actions Inactive

Fix benchmark ingest pairing

553540c

ved015 temporarily deployed to staging May 29, 2026 09:00 — with GitHub Actions Inactive

github-actions Bot temporarily deployed to staging May 29, 2026 09:00 Inactive

ved015 temporarily deployed to staging May 29, 2026 09:01 — with GitHub Actions Inactive

Document benchmark additions in changelog

6338c22

ved015 temporarily deployed to staging May 29, 2026 09:08 — with GitHub Actions Inactive

github-actions Bot added the docs label May 29, 2026

github-actions Bot temporarily deployed to staging May 29, 2026 09:08 Inactive

ved015 temporarily deployed to staging May 29, 2026 09:09 — with GitHub Actions Inactive

Fix BEAM sampling and evaluation

f6fd5cd

ved015 temporarily deployed to staging May 29, 2026 09:31 — with GitHub Actions Inactive

github-actions Bot temporarily deployed to staging May 29, 2026 09:32 Inactive

ved015 temporarily deployed to staging May 29, 2026 09:32 — with GitHub Actions Inactive

ved015 merged commit 89ac998 into main May 29, 2026
15 checks passed
