Add LoCoMo and BEAM benchmark harnesses #204
✅ Staging Deployment Report
🟢 Staging is live and healthy! Test your changes at the staging URL above. Ready to ship? Comment |
Code Review
This pull request introduces benchmark harnesses for the Python XMem API, adding support for the BEAM and LoCoMo datasets. It establishes a shared common module with utilities for file I/O, evaluation metrics, and an asynchronous HTTP client, alongside dataset-specific loaders, runners, and documentation. The review feedback highlights several opportunities to improve performance and robustness, such as optimizing literal coercion by attempting JSON parsing first, refactoring the token F1 metric to use collections.Counter for linear-time complexity, preventing the swallowing of API error messages by delaying raise_for_status(), and deduplicating session numbers during LoCoMo parsing.
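The `collections.Counter` suggestion for token F1 can be sketched as follows. This is a minimal illustration, not the code in `benchmarks/common/metrics.py`: the function names `normalize` and `token_f1` and the punctuation stripping are assumptions. Only the article normalisation and the use of `Counter` come from the review text.

```python
import re
import string
from collections import Counter

# Assumed normalisation: lowercase, drop punctuation, drop English articles.
_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Return the normalised token list for a prediction or reference."""
    text = text.lower().translate(_PUNCT_TABLE)
    text = _ARTICLES.sub(" ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 computed in time linear in the number of tokens."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection keeps min(count_pred, count_ref) per token,
    # which avoids the quadratic cost of repeated list membership checks.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The multiset intersection also handles repeated tokens correctly: a token that appears twice in the prediction but once in the reference contributes one match, not two.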
|
| Filename | Overview |
|---|---|
| benchmarks/locomo/runner.py | Orchestrates per-example ingest+retrieve; user_id is derived from question_id, causing full re-ingest of shared session history for every QA item in a sample. |
| benchmarks/locomo/dataset.py | Parses LoCoMo sessions and builds ingest items; exchange-pair logic correctly uses role-aware non-overlapping pairs after the previous fix. |
| benchmarks/beam/runner.py | Orchestrates BEAM per-example ingest+retrieve with balanced question-type sampling; same per-question user_id re-ingest pattern as LoCoMo. |
| benchmarks/beam/evaluate.py | LLM-as-judge BEAM evaluator; falls back to a generic rubric criterion when both rubric list and reference_answer are empty, which can silently inflate pass rates. |
| benchmarks/common/xmem.py | Async XMem HTTP client with retry/backoff, job polling, and proper error propagation; logic is correct. |
| benchmarks/common/io.py | Atomic download via temp-file rename, JSON/JSONL/Parquet readers, append-JSONL writer; all issues from prior review are addressed. |
| benchmarks/common/metrics.py | Local token-F1, exact-match, and contains metrics with article normalisation; straightforward and correct. |
| benchmarks/beam/dataset.py | Parses BEAM parquet rows including probing-question normalisation; exchange-pair logic is non-overlapping after previous fix. |
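One way to avoid the per-question re-ingest flagged for both runners is to key `user_id` on the sample rather than the question and ingest each sample once. The sketch below is hypothetical: the `Example` fields, `user_id_for`, `run_examples`, and the `ingest`/`retrieve` callables are illustrative names, not the PR's actual interfaces. It also assumes that retrieval does not modify stored memory, so questions from the same sample can safely share one ingested history.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Example:
    sample_id: str    # hypothetical: identifies the shared session history
    question_id: str  # hypothetical: identifies one QA item within the sample
    question: str


def user_id_for(example: Example, run_id: str) -> str:
    """Derive the user_id from the sample, so all its questions share it."""
    return f"{run_id}:{example.sample_id}"


def run_examples(
    examples: Iterable[Example],
    run_id: str,
    ingest: Callable[[Example, str], None],
    retrieve: Callable[[str, str], Any],
) -> list[Any]:
    """Ingest each sample's history once, then answer every question on it."""
    ingested: set[str] = set()
    results = []
    for example in examples:
        user_id = user_id_for(example, run_id)
        if user_id not in ingested:
            ingest(example, user_id)  # only the first question pays this cost
            ingested.add(user_id)
        results.append(retrieve(example.question, user_id))
    return results
```

Including a run identifier in the `user_id` keeps separate benchmark runs from reading each other's memories. Whether isolating each question is instead a deliberate design choice is for the PR author to confirm.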
Sequence Diagram
sequenceDiagram
    participant CLI as run.py (CLI)
    participant Runner as Runner (beam/locomo)
    participant DS as dataset.py
    participant XMem as XMemApiClient
    participant API as XMem HTTP API
    CLI->>Runner: run()
    Runner->>DS: load_examples(path)
    DS-->>Runner: [Example...]
    Runner->>DS: select_examples / _sample_examples
    DS-->>Runner: [filtered Examples]
    loop For each Example
        Runner->>DS: build_ingest_items(example, user_id)
        DS-->>Runner: [IngestItem...]
        loop Batches of batch_size
            Runner->>XMem: batch_ingest_v2(items)
            XMem->>API: POST /v2/memory/batch-ingest
            API-->>XMem: "{status_url}"
            Runner->>XMem: poll_job(status_url)
            XMem->>API: GET status_url (repeated)
            API-->>XMem: "{status: succeeded}"
        end
        Runner->>XMem: "retrieve({query, user_id, top_k})"
        XMem->>API: POST /v1/memory/retrieve
        API-->>XMem: "{answer, sources, confidence}"
        XMem-->>Runner: ApiCallResult
        Runner->>Runner: score_answer + append_jsonl
    end
    Runner->>Runner: summarize_results + write_json
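The repeated `GET status_url` step in the diagram can be sketched as an async polling loop with capped exponential backoff. This is an illustration under stated assumptions, not the client in `benchmarks/common/xmem.py`. The `fetch_status` callable stands in for the HTTP call, `JobFailed` is a hypothetical exception, and the terminal states `failed` and `cancelled` are guesses; only `succeeded` appears in the diagram.

```python
import asyncio
from typing import Any, Awaitable, Callable


class JobFailed(RuntimeError):
    """Hypothetical error for a job that ended in a terminal failure state."""


async def poll_job(
    fetch_status: Callable[[str], Awaitable[dict[str, Any]]],
    status_url: str,
    *,
    interval: float = 1.0,
    max_interval: float = 10.0,
    timeout: float = 300.0,
) -> dict[str, Any]:
    """Poll status_url until the job succeeds, fails, or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = interval
    while True:
        payload = await fetch_status(status_url)
        status = payload.get("status")
        if status == "succeeded":
            return payload
        if status in {"failed", "cancelled"}:  # assumed terminal states
            raise JobFailed(f"job at {status_url} ended with status {status!r}")
        if loop.time() >= deadline:
            raise TimeoutError(f"job at {status_url} still {status!r} after {timeout}s")
        await asyncio.sleep(delay)
        # Double the wait after each poll, capped at max_interval.
        delay = min(delay * 2, max_interval)
```

Passing the fetch function in as a parameter lets the loop be exercised with a stub in tests, without a live API.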
Reviews (5): Last reviewed commit: "Fix BEAM sampling and evaluation"
|
Addressed the useful bot feedback in
Validation rerun locally:
|
Pushed a second review batch in
Validation rerun locally:
|
Added the changelog entry in
Final workflow status on the latest PR branch is green:
Only the large-PR-size warning remains, which is expected for adding two benchmark harnesses plus docs.
|
Pushed
What changed:
Validation:
No real benchmark results were committed or posted. The live 1% run was interrupted before any result row/evaluation summary was produced.
Summary
References
Validation
No real XMem API benchmark run was performed.