[feature] Support run terminal-bench-2.0 with harbor#314
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the Harbor benchmark integration, adding HarborTask for running evaluation jobs and HarborSummarizer for parsing and displaying results. It also updates documentation for swe_bench to install mini-swe-agent from source, and adds an example configuration for terminal-bench-2. The feedback highlights several critical and medium-severity improvements for HarborTask, including ensuring progress bar support when resuming jobs, safely registering signal handlers in multi-threaded environments, using a threading.Event to prevent thread leaks/blocking in the progress monitor, and refactoring nested configuration access using inherited properties. Additionally, it is recommended to remove unused code and avoid hardcoding version strings in the summarizer.
| if self.work_dir: | ||
| details_dir = Path(self.work_dir) / "details" | ||
| config_path = details_dir / "config.json" | ||
| if config_path.exists(): | ||
| return self._resume_job(details_dir) |
There was a problem hiding this comment.
[reply] AISBench在子进程而非多线程环境中
| def _handle_sigterm(signum, frame): | ||
| raise KeyboardInterrupt | ||
|
|
||
| signal.signal(signal.SIGTERM, _handle_sigterm) |
There was a problem hiding this comment.
在 _run_with_tqdm 中直接调用 signal.signal(signal.SIGTERM, _handle_sigterm)。如果 ais_bench 在多线程环境(例如 ThreadPoolExecutor)中运行此任务,非主线程调用 signal.signal 会抛出 ValueError: signal only works in main thread 异常导致程序崩溃。
建议使用 try-except ValueError 保护该调用,或者确保仅在主线程中注册信号。
def _handle_sigterm(signum, frame):
raise KeyboardInterrupt
try:
signal.signal(signal.SIGTERM, _handle_sigterm)
except ValueError:
passThere was a problem hiding this comment.
[reply] AISBench在子进程而非多线程环境中
| for model_abbr in self.model_abbrs: | ||
| if metric is None: | ||
| for k in parsed_results.get(model_abbr, {}).get(dataset_abbr, {}).keys(): | ||
| row = [dataset_abbr, 'a39421', k, dataset_eval_mode.get(dataset_abbr, 'gen')] |
There was a problem hiding this comment.
【review】这里硬编码了版本号 'a39421'。建议从配置中动态获取版本号,或者使用默认值 '-',以提高代码的可维护性。
| def parse_kwargs(kwargs_list: list[str] | None) -> dict[str, Any]: | ||
| if not kwargs_list: | ||
| return {} | ||
| result = {} | ||
| for kwarg in kwargs_list: | ||
| if "=" in kwarg: | ||
| key, value = kwarg.split("=", 1) | ||
| result[key] = value | ||
| return result |
There was a problem hiding this comment.
【review】函数 parse_kwargs 被定义了但从未在代码中被调用。建议将其删除以保持代码整洁
| def _set_api_key(self): | ||
| api_key = self.cfg["models"][0].get("api_key") | ||
| if api_key is None: | ||
| api_key = DEFAULT_FAKE_API_KEY | ||
| os.environ["OPENAI_API_KEY"] = api_key |
There was a problem hiding this comment.
在 _set_api_key 中,使用了 self.cfg["models"][0]。由于基类 BaseTask 已经定义了 self.model_cfg = cfg["models"][0],建议直接使用 self.model_cfg,提高代码可读性。
| def _set_api_key(self): | |
| api_key = self.cfg["models"][0].get("api_key") | |
| if api_key is None: | |
| api_key = DEFAULT_FAKE_API_KEY | |
| os.environ["OPENAI_API_KEY"] = api_key | |
| def _set_api_key(self): | |
| api_key = self.model_cfg.get("api_key") | |
| if api_key is None: | |
| api_key = DEFAULT_FAKE_API_KEY | |
| os.environ["OPENAI_API_KEY"] = api_key |
There was a problem hiding this comment.
[review]在 _set_api_key 中,使用了 self.cfg["models"][0]。由于基类 BaseTask 已经定义了 self.model_cfg = cfg["models"][0],建议直接使用 self.model_cfg,提高代码可读性。
| self.out_dir = osp.join( | ||
| self.work_dir, self.output_subdir, self.cfg["models"][0]["abbr"] | ||
| ) | ||
| mkdir_or_exist(osp.join(self.out_dir, self.cfg["datasets"][0][0]["abbr"])) | ||
| self.out_detail_dir = osp.join( | ||
| self.out_dir, | ||
| self.cfg["datasets"][0][0]["abbr"], | ||
| ) | ||
| mkdir_or_exist(Path(self.out_detail_dir)) |
There was a problem hiding this comment.
在 _prepare_out_dir 中,使用了多层嵌套索引 self.cfg["models"][0] 和 self.cfg["datasets"][0][0]。由于基类 BaseTask 已经在 __init__ 中初始化了 self.model_cfg 和 self.dataset_cfgs,建议直接使用 self.model_cfg 和 self.dataset_cfgs[0],这样代码更简洁且不易出错。
| self.out_dir = osp.join( | |
| self.work_dir, self.output_subdir, self.cfg["models"][0]["abbr"] | |
| ) | |
| mkdir_or_exist(osp.join(self.out_dir, self.cfg["datasets"][0][0]["abbr"])) | |
| self.out_detail_dir = osp.join( | |
| self.out_dir, | |
| self.cfg["datasets"][0][0]["abbr"], | |
| ) | |
| mkdir_or_exist(Path(self.out_detail_dir)) | |
| self.out_dir = osp.join( | |
| self.work_dir, self.output_subdir, self.model_cfg["abbr"] | |
| ) | |
| mkdir_or_exist(osp.join(self.out_dir, self.dataset_cfgs[0]["abbr"])) | |
| self.out_detail_dir = osp.join( | |
| self.out_dir, | |
| self.dataset_cfgs[0]["abbr"], | |
| ) | |
| mkdir_or_exist(Path(self.out_detail_dir)) |
There was a problem hiding this comment.
[review]在 _prepare_out_dir 中,使用了多层嵌套索引 self.cfg["models"][0] 和 self.cfg["datasets"][0][0]。由于基类 BaseTask 已经在 init 中初始化了 self.model_cfg 和 self.dataset_cfgs,建议直接使用 self.model_cfg 和 self.dataset_cfgs[0],这样代码更简洁且不易出错。
| if args.get("verifier_env"): | ||
| config.verifier.env.update(parse_env_vars(args["verifier_env"])) | ||
|
|
||
| reuse_timestamp = None |
There was a problem hiding this comment.
[review]变量 reuse_timestamp 被赋值为 None 但在后续代码中从未被使用。建议将其删除。
There was a problem hiding this comment.
[review]变量 reuse_timestamp 被赋值为 None 但在后续代码中从未被使用。建议将其删除。
| def monitor_progress(): | ||
| nonlocal completed | ||
| while True: | ||
| if self.job and self.job.job_dir: | ||
| trial_count = len(list(self.job.job_dir.glob("trial_*"))) | ||
| if trial_count > completed: | ||
| pbar.update(trial_count - completed) | ||
| completed = trial_count | ||
| if self.task_state_manager: | ||
| self.task_state_manager.update_task_state({ | ||
| "finish_count": completed, | ||
| }) | ||
| time.sleep(0.5) | ||
| if completed >= total_tasks: | ||
| pbar.update(total_tasks - pbar.n) | ||
| pbar.close() | ||
| break | ||
|
|
||
| monitor_thread = threading.Thread(target=monitor_progress, daemon=True) | ||
| monitor_thread.start() |
There was a problem hiding this comment.
[reply] 在 _run_with_tqdm 中,monitor_progress 线程通过 completed >= total_tasks 来判断是否退出。如果任务因异常提前结束,或者最终生成的 trial 数量少于 total_tasks,该线程将无法自行退出,导致主线程在 monitor_thread.join(timeout=5) 处不必要地阻塞 5 秒。
建议引入一个 threading.Event(例如 stop_event)来控制线程退出,在 finally 块中将其 set,使监控线程能够立即安全退出。
PR Type / PR类型
Related Issue | 关联 Issue
Relates to #(待填写 issue 编号)
🔍 Motivation / 变更动机
将
harbor run能力接入 AISBench benchmark 框架,使 benchmark 能够运行 harbor 评测任务(如 terminal-bench-2)。核心目标:
harbor run命令行参数一一对应--reuse断点续测📝 Modification / 修改内容
新增文件
ais_bench/benchmark/tasks/custom_tasks/harbor_task.pyais_bench/configs/agent_example/harbor_terminal_bench_2_task.pyais_bench/summarizers/harbor.pyais_bench/summarizers/__init__.pyais_bench/tasks/custom_tasks/__init__.pyais_bench/docs/source_zh_cn/extended_benchmark/agent/harbor_design.md核心功能
参数映射
models中配置datasets中配置进度监控
断点续测
details/config.json是否存在HarborSummarizer
示例配置
落盘路径结构
📐 Associated Test Results / 关联测试结果
测试场景:
无向后不兼容变更。
无明显性能下降。
🌟 Use cases (Optional) / 使用案例(可选)
基本执行
cd ais_bench ais_bench ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py --max-num-workers 3断点续测
ais_bench ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py --max-num-workers 3 --reuse {时间戳}输出示例
✅ Checklist / 检查列表
Before PR:
After PR:
👥 Collaboration Info / 协作信息
🌟 Useful CI Command / 实用的CI命令
/gemini review/gemini summary/gemini help/readthedocs build