Hi, thanks for open-sourcing this work.
I have been trying to reproduce the LoCoMo results reported in Table 3, especially for the smaller Qwen models (e.g., Qwen2.5-1.5B / 3B and Qwen3-1.7B / 8B). However, the results I obtain are not consistent with the numbers reported in the paper.
I would like to ask whether you could share more implementation details for the evaluations of the other methods in Table 3 (e.g., LoCoMo, ReadAgent, MemoryBank, MemGPT, A-Mem, LightMem, Mem0), because:
- My reproduced numbers differ noticeably from those reported in the paper.
- As far as I can tell, other papers do not report these baseline results under these exact model settings, so it is hard to verify what setup was used.
Would it be possible to clarify or release the following details?
- Whether these baselines were re-implemented or run directly from the official repos
- Decoding settings for all evaluated models:
  - temperature / top-p / max tokens
  - whether reasoning / thinking mode was enabled or disabled
- Retrieval settings:
  - retrieve_k
  - token budget
- Evaluation details:
  - the exact F1 / BLEU computation script
  - whether any post-processing was applied to model outputs
- Serving / inference details:
  - local inference vs. API
  - any model-specific adjustments for Qwen2.5/Qwen3
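For concreteness, this is the token-level F1 I am currently using on my side (standard SQuAD-style normalization: lowercasing, stripping punctuation and articles). The function names are my own, and this may well differ from the script behind Table 3 — which is exactly why I am asking:

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a prediction and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    if not pred_tokens or not gold_tokens:
        # Both empty -> exact match (1.0); only one empty -> 0.0.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Even small differences here (e.g. whether articles are stripped, or whether F1 is taken as the max over multiple gold answers) can shift the reported numbers by several points, so knowing the exact variant would help a lot.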
If possible, it would be very helpful if you could release the exact scripts/configs used to produce Table 3.
Thanks again — I think this would greatly improve the reproducibility of the paper.