GLM-4.7 Browsecomp/HLE Reproducibility | 结果复现 #120

@panademo

Description

Hi Zhipu team, thank you so much for open-sourcing such impressive models and sharing your research!
Just a question regarding the reproducibility of the GLM-4.7 search-agent benchmarks: how can the BrowseComp and HLE (with tools) evaluation results be replicated? Is the search-agent and "context management" framework you used for the BrowseComp/HLE evaluation open-source, or do you plan to open-source it?
May I also ask the same question about the code agent used for SWE-Bench Verified/Multilingual and Terminal Bench 2.0? Are there any plans to open-source that code-agent framework?
Thanks again for your great work! 🙏
