Add LongMemEval benchmark harness by ved015 · Pull Request #203 · XortexAI/XMem

ved015 · 2026-05-29T07:55:59Z

Summary

add a Python-only LongMemEval benchmark harness under benchmarks/longmemeval
include dataset loading, XMem API client, orchestration, local metrics, all-category runner, and usage docs
keep benchmark data/results/outputs ignored while tracking source files

Validation

python -m compileall benchmarks/longmemeval
python -m benchmarks.longmemeval.run_all_categories --dataset-path benchmarks/longmemeval/data/longmemeval_s_cleaned.json --output-root /private/tmp/xmem-lme-pr-dry --dry-run
git diff --cached --check before commit

github-actions · 2026-05-29T07:56:20Z

	Warnings
⚠️	📦 This PR changes 1540 lines (additions + deletions). Large PRs are harder to review thoroughly — consider splitting it.

	Messages
📖	📝 No `CHANGELOG.md` update detected. If this PR introduces a user-visible change, please add an entry.
📖	✅ Targeting `main`. Please squash commits before merging to keep the git history clean.

Generated by 🚫 dangerJS against ce8be33

github-actions · 2026-05-29T07:57:03Z

✅ Staging Deployment Report

Item	Value
Branch	`benchmark/longmemeval-python-20260529-fda8`
Commit	`4fdc2ad`
Environment	Staging
Health	http://3.6.255.148:8001/health
API Docs	http://3.6.255.148:8001/docs
Smoke Tests	success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

gemini-code-assist

Code Review

This pull request introduces a new LongMemEval benchmark harness for the Python XMem API, including an async HTTP client, dataset parsing, local metrics evaluation, and runners for sequential or parallel execution across categories. The review feedback highlights several critical issues: a potential crash in log parsing within update_state_from_line, incorrect path routing for absolute URLs in the HTTP client, premature truncation of the merged predictions file during validation, and the risk of indefinite hangs when downloading datasets using urllib without a timeout.

greptile-apps · 2026-05-29T08:00:39Z

Greptile Summary

This PR adds a Python-only LongMemEval benchmark harness under benchmarks/longmemeval/, alongside a minor deploy-workflow improvement that lets DEPLOY_REF be a branch, tag, or commit SHA instead of only a branch name.

Benchmark harness (client.py, runner.py, dataset.py, metrics.py, config.py, run.py, run_all_categories.py): implements dataset loading, XMem API ingestion/retrieval, token-F1 metrics, resume support, and a parallel category launcher that spawns one child run.py process per question_type.
Previous issues addressed: urllib.request.urlretrieve is replaced with httpx.stream, and _request_path now correctly passes absolute status_url values through without prepending a slash.
One remaining quality issue: merge_predictions opens the merged output file in write mode before verifying all six category prediction files exist; if any are absent the merged file is left partially written on disk.

Confidence Score: 5/5

Safe to merge; the harness is additive (no changes to production service code), and the previously reported issues with urlretrieve and the absolute status_url path are both resolved in this revision.

The only new finding is a write-ordering issue in merge_predictions where the output file is created before missing inputs are checked, which is easily fixed and only reachable after all six category subprocesses have already succeeded. The deploy-workflow change is a clean improvement for tag/SHA deployments. No correctness or data-loss issues were found on the hot paths.

benchmarks/longmemeval/run_all_categories.py (merge_predictions ordering) and benchmarks/longmemeval/runner.py (item.dict on frozen dataclass, flagged in previous review).

Important Files Changed

Filename	Overview
.github/workflows/deploy-staging.yml	Minor deploy-script improvement: renames PR_BRANCH to DEPLOY_REF and adds git show-ref guard so tags/commit SHAs check out in detached-HEAD mode instead of failing on non-branch refs.
.gitignore	Adds negation rules to track benchmark source files while still ignoring data/, results/, and outputs/ artifacts under benchmarks/longmemeval/.
benchmarks/longmemeval/client.py	Async httpx client with retry logic; _request_path now correctly handles absolute URLs (http/https prefix check), addressing the previously reported status_url mangling issue.
benchmarks/longmemeval/config.py	Frozen dataclass BenchmarkConfig holding all runtime parameters; api_key property reads from environment at call time. No issues.
benchmarks/longmemeval/dataset.py	Dataset loading and normalization; download_dataset now uses httpx.stream (previously reported urlretrieve deprecation is fixed). Robust format-sniffing for JSON/JSONL variants.
benchmarks/longmemeval/metrics.py	Token-F1, exact-match, contains metrics plus JSONL read/write helpers. Token overlap calculation is correct. No issues.
benchmarks/longmemeval/runner.py	Main benchmark orchestration; uses item.dict on a frozen dataclass (previously flagged) instead of dataclasses.asdict; otherwise ingest/retrieve/resume logic is sound.
benchmarks/longmemeval/run.py	CLI entrypoint wiring argparse to BenchmarkConfig and LongMemEvalRunner. Clean and straightforward.
benchmarks/longmemeval/run_all_categories.py	Parallel category runner; merge_predictions opens the output file before verifying all category prediction files exist, leaving a partially-written merged file on disk if any are absent. Progress line parser is safe thanks to the inner "/" guard.

Sequence Diagram

sequenceDiagram
    participant User
    participant run_all_categories as run_all_categories.py
    participant run as run.py (×6 child procs)
    participant runner as LongMemEvalRunner
    participant client as XMemApiClient
    participant API as XMem HTTP API

    User->>run_all_categories: python -m benchmarks.longmemeval.run_all_categories
    run_all_categories->>run_all_categories: load_examples + validate_independence
    run_all_categories->>run_all_categories: spawn asyncio tasks (semaphore-limited)

    loop for each question_type category (up to max_parallel_categories)
        run_all_categories->>run: asyncio.create_subprocess_exec
        run->>runner: LongMemEvalRunner(config).run()
        loop for each example
            runner->>client: batch_ingest_v1/v2(items)
            client->>API: POST /v1/memory/batch-ingest or /v2/memory/batch-ingest
            API-->>client: ApiCallResult (v2: includes status_url)
            opt "ingest_api_version == v2"
                loop poll until terminal status
                    client->>API: GET status_url
                    API-->>client: job status
                end
            end
            runner->>client: retrieve(query, user_id, top_k)
            client->>API: POST /v1/memory/retrieve
            API-->>client: answer + sources
            runner->>runner: append_jsonl(results.jsonl + predictions.jsonl)
        end
        run-->>run_all_categories: stdout lines (progress updates)
        run_all_categories->>run_all_categories: update_state_from_line(state, line)
    end

    run_all_categories->>run_all_categories: merge_predictions(output_root)
    run_all_categories->>User: merged predictions.jsonl + final status

_{Reviews (4): Last reviewed commit: "Fix staging deploy for force-pushed PR b..." | Re-trigger Greptile}

ishaanxgupta · 2026-05-29T08:09:53Z

Addressed the useful bot review items in 2f8d90c:

handled absolute status_url values in the benchmark HTTP client
made progress-line parsing ignore non-progress bracketed logs safely
replaced urllib.request.urlretrieve with httpx.stream(..., follow_redirects=True)
removed the unused Path import from the runner

Re-ran local validation: compileall, git diff --check, focused parser/path checks, and the all-category dry run.

ishaanxgupta · 2026-05-29T08:16:41Z

Fixed the staging deploy failure shown in the workflow logs in ce8be33.

Root cause: the EC2 checkout had the PR branch locally, then the PR branch was force-pushed/amended. The workflow used git pull origin "$PR_BRANCH", so Git refused because local and remote had diverged and no pull strategy was configured.

Change: staging deploy now fetches the requested ref and checks out the fetched remote state directly with git checkout -B "$DEPLOY_REF" "origin/$DEPLOY_REF" for branches, falling back to detached FETCH_HEAD for non-branch refs. That avoids merge/rebase prompts during ephemeral staging deploys.

github-actions · 2026-05-29T08:17:31Z

✅ Staging Deployment Report

Item	Value
Branch	`benchmark/longmemeval-python-20260529-fda8`
Commit	`d9542c4`
Environment	Staging
Health	http://3.6.255.148:8001/health
API Docs	http://3.6.255.148:8001/docs
Smoke Tests	success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

ishaanxgupta · 2026-05-29T08:26:12Z

/promote

…on (#203)

ved015 requested a review from ishaanxgupta as a code owner May 29, 2026 07:55

ved015 temporarily deployed to staging May 29, 2026 07:56 — with GitHub Actions Inactive

github-actions Bot temporarily deployed to staging May 29, 2026 07:56 Inactive

ved015 temporarily deployed to staging May 29, 2026 07:56 — with GitHub Actions Inactive

github-code-quality Bot found potential problems May 29, 2026

View reviewed changes

Comment thread benchmarks/longmemeval/runner.py Fixed

gemini-code-assist Bot reviewed May 29, 2026

View reviewed changes

Comment thread benchmarks/longmemeval/run_all_categories.py

Comment thread benchmarks/longmemeval/client.py Outdated

Comment thread benchmarks/longmemeval/run_all_categories.py

Comment thread benchmarks/longmemeval/dataset.py

Add LongMemEval benchmark harness

8cbf57c

greptile-apps Bot reviewed May 29, 2026

View reviewed changes

Comment thread benchmarks/longmemeval/runner.py

Comment thread benchmarks/longmemeval/dataset.py

ved015 force-pushed the benchmark/longmemeval-python-20260529-fda8 branch from 2596f28 to 8cbf57c Compare May 29, 2026 08:01

ved015 had a problem deploying to staging May 29, 2026 08:01 — with GitHub Actions Failure

github-actions Bot had a problem deploying to staging May 29, 2026 08:01 Failure

ved015 had a problem deploying to staging May 29, 2026 08:03 — with GitHub Actions Failure

github-actions Bot had a problem deploying to staging May 29, 2026 08:03 Failure

greptile-apps Bot reviewed May 29, 2026

View reviewed changes

Comment thread benchmarks/longmemeval/client.py

Address LongMemEval benchmark review comments

2f8d90c

ved015 had a problem deploying to staging May 29, 2026 08:09 — with GitHub Actions Failure

github-actions Bot had a problem deploying to staging May 29, 2026 08:09 Failure

Fix staging deploy for force-pushed PR branches

ce8be33

ved015 temporarily deployed to staging May 29, 2026 08:16 — with GitHub Actions Inactive

github-actions Bot temporarily deployed to staging May 29, 2026 08:16 Inactive

ved015 temporarily deployed to staging May 29, 2026 08:17 — with GitHub Actions Inactive

ishaanxgupta approved these changes May 29, 2026

View reviewed changes

github-actions Bot closed this May 29, 2026

github-actions Bot added a commit that referenced this pull request May 29, 2026

chore: promote benchmark/longmemeval-python-20260529-fda8 to producti…

11ea976

…on (#203)

Conversation

ved015 commented May 29, 2026

Summary

Validation

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 29, 2026

✅ Staging Deployment Report

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ishaanxgupta commented May 29, 2026

Uh oh!

ishaanxgupta commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

✅ Staging Deployment Report

Uh oh!

ishaanxgupta commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 29, 2026 •

edited

Loading

greptile-apps Bot commented May 29, 2026 •

edited

Loading