Add LongMemEval benchmark harness #203
✅ Staging Deployment Report
🟢 Staging is live and healthy! Test your changes at the staging URL above. Ready to ship? Comment |
Code Review
This pull request introduces a new LongMemEval benchmark harness for the Python XMem API, including an async HTTP client, dataset parsing, local metrics evaluation, and runners for sequential or parallel execution across categories. The review feedback highlights several critical issues: a potential crash in log parsing within update_state_from_line, incorrect path routing for absolute URLs in the HTTP client, premature truncation of the merged predictions file during validation, and the risk of indefinite hangs when downloading datasets using urllib without a timeout.
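The absolute-URL routing issue mentioned above can be illustrated with a short sketch. The function name `resolve_request_url` and the example hosts are hypothetical; the PR's own helper is `_request_path`, and its real signature is not shown in this thread. The sketch assumes the client receives either a relative API path or a full `status_url` from the API.

```python
def resolve_request_url(base_url: str, path_or_url: str) -> str:
    """Return an absolute URL for a relative API path or an already-absolute URL.

    Hypothetical helper: it only illustrates the http/https prefix check
    described in the review, not the PR's actual implementation.
    """
    # An absolute URL (for example a status_url returned by the API)
    # is passed through unchanged instead of being appended to the base URL.
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url
    # A relative path is joined to the base URL with exactly one slash.
    return base_url.rstrip("/") + "/" + path_or_url.lstrip("/")
```

Without the prefix check, a `status_url` that is already absolute would be concatenated onto the base URL and produce a malformed request target.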
| Filename | Overview |
|---|---|
| .github/workflows/deploy-staging.yml | Minor deploy-script improvement: renames PR_BRANCH to DEPLOY_REF and adds git show-ref guard so tags/commit SHAs check out in detached-HEAD mode instead of failing on non-branch refs. |
| .gitignore | Adds negation rules to track benchmark source files while still ignoring data/, results/, and outputs/ artifacts under benchmarks/longmemeval/. |
| benchmarks/longmemeval/client.py | Async httpx client with retry logic; _request_path now correctly handles absolute URLs (http/https prefix check), addressing the previously reported status_url mangling issue. |
| benchmarks/longmemeval/config.py | Frozen dataclass BenchmarkConfig holding all runtime parameters; api_key property reads from environment at call time. No issues. |
| benchmarks/longmemeval/dataset.py | Dataset loading and normalization; download_dataset now uses httpx.stream (previously reported urlretrieve deprecation is fixed). Robust format-sniffing for JSON/JSONL variants. |
| benchmarks/longmemeval/metrics.py | Token-F1, exact-match, contains metrics plus JSONL read/write helpers. Token overlap calculation is correct. No issues. |
| benchmarks/longmemeval/runner.py | Main benchmark orchestration; uses item.__dict__ on a frozen dataclass (previously flagged) instead of dataclasses.asdict; otherwise ingest/retrieve/resume logic is sound. |
| benchmarks/longmemeval/run.py | CLI entrypoint wiring argparse to BenchmarkConfig and LongMemEvalRunner. Clean and straightforward. |
| benchmarks/longmemeval/run_all_categories.py | Parallel category runner; merge_predictions opens the output file before verifying all category prediction files exist, leaving a partially-written merged file on disk if any are absent. Progress line parser is safe thanks to the inner "/" guard. |
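One possible way to avoid the partially written merged file noted for `merge_predictions` is to check every input first and only then write through a temporary file. This is a sketch under assumptions, not the PR's code: the `categories` parameter and the `<output_root>/<category>/predictions.jsonl` layout are hypothetical, since the thread only shows the call `merge_predictions(output_root)`.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def merge_predictions(output_root: Path, categories: Iterable[str]) -> Path:
    """Merge per-category predictions.jsonl files into one file.

    Assumed layout (hypothetical): <output_root>/<category>/predictions.jsonl
    """
    sources = [output_root / name / "predictions.jsonl" for name in categories]

    # Validate all inputs before touching the output file.
    missing = [str(path) for path in sources if not path.is_file()]
    if missing:
        raise FileNotFoundError("missing prediction files: " + ", ".join(missing))

    merged = output_root / "predictions.jsonl"
    # Write to a temporary file in the same directory, then rename it into
    # place, so a failure mid-merge never leaves a truncated merged file.
    fd, tmp_name = tempfile.mkstemp(dir=output_root, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            for source in sources:
                with source.open(encoding="utf-8") as handle:
                    for line in handle:
                        if line.strip():
                            out.write(line.rstrip("\n") + "\n")
        os.replace(tmp_name, merged)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return merged
```

With this ordering, a missing category file raises before any output is created, and an existing merged file from an earlier run is left untouched.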
Sequence Diagram
sequenceDiagram
participant User
participant run_all_categories as run_all_categories.py
participant run as run.py (×6 child procs)
participant runner as LongMemEvalRunner
participant client as XMemApiClient
participant API as XMem HTTP API
User->>run_all_categories: python -m benchmarks.longmemeval.run_all_categories
run_all_categories->>run_all_categories: load_examples + validate_independence
run_all_categories->>run_all_categories: spawn asyncio tasks (semaphore-limited)
loop for each question_type category (up to max_parallel_categories)
run_all_categories->>run: asyncio.create_subprocess_exec
run->>runner: LongMemEvalRunner(config).run()
loop for each example
runner->>client: batch_ingest_v1/v2(items)
client->>API: POST /v1/memory/batch-ingest or /v2/memory/batch-ingest
API-->>client: ApiCallResult (v2: includes status_url)
opt "ingest_api_version == v2"
loop poll until terminal status
client->>API: GET status_url
API-->>client: job status
end
end
runner->>client: retrieve(query, user_id, top_k)
client->>API: POST /v1/memory/retrieve
API-->>client: answer + sources
runner->>runner: append_jsonl(results.jsonl + predictions.jsonl)
end
run-->>run_all_categories: stdout lines (progress updates)
run_all_categories->>run_all_categories: update_state_from_line(state, line)
end
run_all_categories->>run_all_categories: merge_predictions(output_root)
run_all_categories->>User: merged predictions.jsonl + final status
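The semaphore-limited fan-out in the diagram can be sketched roughly as follows. The names `run_category`, `run_all`, and `on_line` are hypothetical, and the child command is a placeholder that prints a single made-up progress line; in the PR the child process is `run.py` and each line is handled by `update_state_from_line`.

```python
import asyncio
import sys
from typing import Callable, Dict, List, Sequence


async def run_category(
    category: str,
    command: Sequence[str],
    semaphore: asyncio.Semaphore,
    on_line: Callable[[str, str], None],
) -> int:
    """Run one child process and forward its stdout lines to a callback."""
    # The semaphore caps how many child processes run at the same time.
    async with semaphore:
        proc = await asyncio.create_subprocess_exec(
            *command, stdout=asyncio.subprocess.PIPE
        )
        assert proc.stdout is not None
        async for raw in proc.stdout:
            on_line(category, raw.decode("utf-8", errors="replace").rstrip())
        return await proc.wait()


async def run_all(
    categories: List[str],
    max_parallel: int,
    on_line: Callable[[str, str], None],
) -> Dict[str, int]:
    """Start one task per category and return each child's exit code."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def one(category: str):
        # Placeholder child command (assumption): the real harness
        # would launch run.py for this category instead.
        command = [
            sys.executable,
            "-c",
            "import sys; print(sys.argv[1] + ' 1/1')",
            category,
        ]
        return category, await run_category(category, command, semaphore, on_line)

    return dict(await asyncio.gather(*(one(c) for c in categories)))
```

A caller could run it with `asyncio.run(run_all(["a", "b", "c"], 2, print))`; all tasks are created up front, but at most two child processes exist at once.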
Reviews (4): Last reviewed commit: "Fix staging deploy for force-pushed PR b..."
2596f28 to 8cbf57c
Addressed the useful bot review items in
Re-ran local validation:
Fixed the staging deploy failure shown in the workflow logs in
Root cause: the EC2 checkout had the PR branch locally, then the PR branch was force-pushed/amended. The workflow used
Change: staging deploy now fetches the requested ref and checks out the fetched remote state directly with
/promote