Request for more evaluation details for Table 3 on LoCoMo #47

@zyzyzy526


Hi, thanks for open-sourcing this work.

I have been trying to reproduce the LoCoMo results reported in Table 3, especially for the smaller Qwen models (e.g., Qwen2.5-1.5B / 3B and Qwen3-1.7B / 8B). However, the results I obtain are not consistent with the numbers reported in the paper.

I would like to ask whether you could share more implementation details for the evaluations of the other methods in Table 3 (e.g., LoCoMo, ReadAgent, MemoryBank, MemGPT, A-Mem, LightMem, Mem0), because:

  1. My reproduced numbers differ noticeably from those reported in the paper.
  2. As far as I can tell, other papers do not report these baselines under these exact model settings, so it is hard to verify what setup was used.

Would it be possible to clarify or release the following details?

  • Whether these baselines were re-implemented or run directly from the official repos
  • Decoding settings for all evaluated models:
    • temperature / top-p / max tokens
    • whether reasoning / thinking mode was enabled or disabled
  • Retrieval settings:
    • retrieve_k
    • token budget
  • Evaluation details:
    • the exact F1 / BLEU computation script
    • whether any post-processing was applied to model outputs
  • Serving / inference details:
    • local inference vs. API
    • any model-specific adjustments for Qwen2.5 / Qwen3
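For context, in my reproduction attempt I assumed the standard SQuAD-style token-level F1 sketched below (lowercase, strip punctuation and articles, then compare token multisets). If Table 3 uses a different normalization, that alone could explain part of the gap. The function names here are mine, not from your repo:

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles (a/an/the), and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    # If either side is empty after normalization, F1 is 1 only if both are.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Even knowing whether you average F1 over multiple gold answers (max vs. mean) would help pin down the metric.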

It would also be very helpful if the exact scripts/configs used to produce Table 3 could be released.

Thanks again — I think this would greatly improve the reproducibility of the paper.
