Skip to content

Add LongMemEval benchmark harness#203

Closed
ved015 wants to merge 3 commits into
mainfrom
benchmark/longmemeval-python-20260529-fda8
Closed

Add LongMemEval benchmark harness#203
ved015 wants to merge 3 commits into
mainfrom
benchmark/longmemeval-python-20260529-fda8

Conversation

@ved015
Copy link
Copy Markdown
Contributor

@ved015 ved015 commented May 29, 2026

Summary

  • add a Python-only LongMemEval benchmark harness under benchmarks/longmemeval
  • include dataset loading, XMem API client, orchestration, local metrics, all-category runner, and usage docs
  • keep benchmark data/results/outputs ignored while tracking source files

Validation

  • python -m compileall benchmarks/longmemeval
  • python -m benchmarks.longmemeval.run_all_categories --dataset-path benchmarks/longmemeval/data/longmemeval_s_cleaned.json --output-root /private/tmp/xmem-lme-pr-dry --dry-run
  • git diff --cached --check before commit

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

Warnings
⚠️

📦 This PR changes 1540 lines (additions + deletions). Large PRs are harder to review thoroughly — consider splitting it.

Messages
📖

📝 No CHANGELOG.md update detected. If this PR introduces a user-visible change, please add an entry.

📖

✅ Targeting main. Please squash commits before merging to keep the git history clean.

Generated by 🚫 dangerJS against ce8be33

@github-actions
Copy link
Copy Markdown
Contributor

✅ Staging Deployment Report

Item Value
Branch benchmark/longmemeval-python-20260529-fda8
Commit 4fdc2ad
Environment Staging
Health http://3.6.255.148:8001/health
API Docs http://3.6.255.148:8001/docs
Smoke Tests success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

Comment thread benchmarks/longmemeval/runner.py Fixed
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new LongMemEval benchmark harness for the Python XMem API, including an async HTTP client, dataset parsing, local metrics evaluation, and runners for sequential or parallel execution across categories. The review feedback highlights several critical issues: a potential crash in log parsing within update_state_from_line, incorrect path routing for absolute URLs in the HTTP client, premature truncation of the merged predictions file during validation, and the risk of indefinite hangs when downloading datasets using urllib without a timeout.

Comment thread benchmarks/longmemeval/run_all_categories.py
Comment thread benchmarks/longmemeval/client.py Outdated
Comment thread benchmarks/longmemeval/run_all_categories.py
Comment thread benchmarks/longmemeval/dataset.py
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR adds a Python-only LongMemEval benchmark harness under benchmarks/longmemeval/, alongside a minor deploy-workflow improvement that lets DEPLOY_REF be a branch, tag, or commit SHA instead of only a branch name.

  • Benchmark harness (client.py, runner.py, dataset.py, metrics.py, config.py, run.py, run_all_categories.py): implements dataset loading, XMem API ingestion/retrieval, token-F1 metrics, resume support, and a parallel category launcher that spawns one child run.py process per question_type.
  • Previous issues addressed: urllib.request.urlretrieve is replaced with httpx.stream, and _request_path now correctly passes absolute status_url values through without prepending a slash.
  • One remaining quality issue: merge_predictions opens the merged output file in write mode before verifying all six category prediction files exist; if any are absent the merged file is left partially written on disk.

Confidence Score: 5/5

Safe to merge; the harness is additive (no changes to production service code), and the previously reported issues with urlretrieve and the absolute status_url path are both resolved in this revision.

The only new finding is a write-ordering issue in merge_predictions where the output file is created before missing inputs are checked, which is easily fixed and only reachable after all six category subprocesses have already succeeded. The deploy-workflow change is a clean improvement for tag/SHA deployments. No correctness or data-loss issues were found on the hot paths.

benchmarks/longmemeval/run_all_categories.py (merge_predictions ordering) and benchmarks/longmemeval/runner.py (item.dict on frozen dataclass, flagged in previous review).

Important Files Changed

Filename Overview
.github/workflows/deploy-staging.yml Minor deploy-script improvement: renames PR_BRANCH to DEPLOY_REF and adds git show-ref guard so tags/commit SHAs check out in detached-HEAD mode instead of failing on non-branch refs.
.gitignore Adds negation rules to track benchmark source files while still ignoring data/, results/, and outputs/ artifacts under benchmarks/longmemeval/.
benchmarks/longmemeval/client.py Async httpx client with retry logic; _request_path now correctly handles absolute URLs (http/https prefix check), addressing the previously reported status_url mangling issue.
benchmarks/longmemeval/config.py Frozen dataclass BenchmarkConfig holding all runtime parameters; api_key property reads from environment at call time. No issues.
benchmarks/longmemeval/dataset.py Dataset loading and normalization; download_dataset now uses httpx.stream (previously reported urlretrieve deprecation is fixed). Robust format-sniffing for JSON/JSONL variants.
benchmarks/longmemeval/metrics.py Token-F1, exact-match, contains metrics plus JSONL read/write helpers. Token overlap calculation is correct. No issues.
benchmarks/longmemeval/runner.py Main benchmark orchestration; uses item.dict on a frozen dataclass (previously flagged) instead of dataclasses.asdict; otherwise ingest/retrieve/resume logic is sound.
benchmarks/longmemeval/run.py CLI entrypoint wiring argparse to BenchmarkConfig and LongMemEvalRunner. Clean and straightforward.
benchmarks/longmemeval/run_all_categories.py Parallel category runner; merge_predictions opens the output file before verifying all category prediction files exist, leaving a partially-written merged file on disk if any are absent. Progress line parser is safe thanks to the inner "/" guard.

Sequence Diagram

sequenceDiagram
    participant User
    participant run_all_categories as run_all_categories.py
    participant run as run.py (×6 child procs)
    participant runner as LongMemEvalRunner
    participant client as XMemApiClient
    participant API as XMem HTTP API

    User->>run_all_categories: python -m benchmarks.longmemeval.run_all_categories
    run_all_categories->>run_all_categories: load_examples + validate_independence
    run_all_categories->>run_all_categories: spawn asyncio tasks (semaphore-limited)

    loop for each question_type category (up to max_parallel_categories)
        run_all_categories->>run: asyncio.create_subprocess_exec
        run->>runner: LongMemEvalRunner(config).run()
        loop for each example
            runner->>client: batch_ingest_v1/v2(items)
            client->>API: POST /v1/memory/batch-ingest or /v2/memory/batch-ingest
            API-->>client: ApiCallResult (v2: includes status_url)
            opt "ingest_api_version == v2"
                loop poll until terminal status
                    client->>API: GET status_url
                    API-->>client: job status
                end
            end
            runner->>client: retrieve(query, user_id, top_k)
            client->>API: POST /v1/memory/retrieve
            API-->>client: answer + sources
            runner->>runner: append_jsonl(results.jsonl + predictions.jsonl)
        end
        run-->>run_all_categories: stdout lines (progress updates)
        run_all_categories->>run_all_categories: update_state_from_line(state, line)
    end

    run_all_categories->>run_all_categories: merge_predictions(output_root)
    run_all_categories->>User: merged predictions.jsonl + final status
Loading

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Reviews (4): Last reviewed commit: "Fix staging deploy for force-pushed PR b..." | Re-trigger Greptile

Comment thread benchmarks/longmemeval/runner.py
Comment thread benchmarks/longmemeval/dataset.py
Comment thread benchmarks/longmemeval/client.py
Copy link
Copy Markdown
Member

Addressed the useful bot review items in 2f8d90c:

  • handled absolute status_url values in the benchmark HTTP client
  • made progress-line parsing ignore non-progress bracketed logs safely
  • replaced urllib.request.urlretrieve with httpx.stream(..., follow_redirects=True)
  • removed the unused Path import from the runner

Re-ran local validation: compileall, git diff --check, focused parser/path checks, and the all-category dry run.

Copy link
Copy Markdown
Member

Fixed the staging deploy failure shown in the workflow logs in ce8be33.

Root cause: the EC2 checkout had the PR branch locally, then the PR branch was force-pushed/amended. The workflow used git pull origin "$PR_BRANCH", so Git refused because local and remote had diverged and no pull strategy was configured.

Change: staging deploy now fetches the requested ref and checks out the fetched remote state directly with git checkout -B "$DEPLOY_REF" "origin/$DEPLOY_REF" for branches, falling back to detached FETCH_HEAD for non-branch refs. That avoids merge/rebase prompts during ephemeral staging deploys.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Staging Deployment Report

Item Value
Branch benchmark/longmemeval-python-20260529-fda8
Commit d9542c4
Environment Staging
Health http://3.6.255.148:8001/health
API Docs http://3.6.255.148:8001/docs
Smoke Tests success

🟢 Staging is live and healthy! Test your changes at the staging URL above.

Ready to ship? Comment /promote on this PR to merge to main and deploy to production.

@ishaanxgupta
Copy link
Copy Markdown
Member

/promote

@github-actions github-actions Bot closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants