Hi, thanks for open-sourcing this work.
I have been trying to reproduce the LoCoMo results reported in Table 3, especially for the smaller Qwen models (e.g., Qwen2.5-1.5B / 3B and Qwen3-1.7B / 8B). However, the results I obtain are not consistent with the numbers reported in the paper.
I would like to ask whether you could share more implementation details for the evaluations of the other methods in Table 3 (e.g., LoCoMo, ReadAgent, MemoryBank, MemGPT, A-Mem, LightMem, Mem0), because:
- My reproduced numbers differ noticeably from those reported in the paper.
- As far as I can tell, other papers do not report these baseline results under these exact model settings, so it is hard to verify what setup was used.
Would it be possible to clarify or release the following details?
- Whether these baselines were re-implemented or run directly from the official repos
- Decoding settings for all evaluated models:
  - temperature / top-p / max tokens
  - whether reasoning / thinking mode was enabled or disabled
- Retrieval settings:
  - retrieve_k
  - token budget
- Evaluation details:
  - the exact F1 / BLEU computation script
  - whether any post-processing was applied to model outputs
- Serving / inference details:
  - local inference vs. API
  - any model-specific adjustments for Qwen2.5/Qwen3
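For concreteness, this is the token-level F1 I am currently using on my side (standard SQuAD-style normalization: lowercasing, stripping punctuation and articles). The function names are my own, and this may well differ from the script behind Table 3 — which is exactly why I am asking:

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a prediction and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    if not pred_tokens or not gold_tokens:
        # Both empty -> exact match (1.0); only one empty -> 0.0.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Even small differences here (e.g. whether articles are stripped, or whether F1 is taken as the max over multiple gold answers) can shift the reported numbers by several points, so knowing the exact variant would help a lot.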
If possible, it would be very helpful if you could release the exact scripts/configs used to produce Table 3.
Thanks again — I think this would greatly improve the reproducibility of the paper.