[feature] Support run terminal-bench-2.0 with harbor by SJTUyh · Pull Request #314 · AISBench/benchmark

SJTUyh · 2026-05-30T01:53:31Z

PR Type

Related Issue
Relates to # (issue number to be filled in)

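🔍 Motivation

Integrate harbor run into the AISBench benchmark framework so that the benchmark can run harbor evaluation tasks (such as terminal-bench-2).

Core goals:

Parameters in the benchmark config file map one-to-one to harbor run command-line arguments
Output files are saved under the path specified by the benchmark (following the tau2 bench layout)
Execution progress is shown in real time via tqdm
Interrupted runs can be resumed via --reuse
HarborSummarizer provides custom summary output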

📝 Modification

New files

File path	Description
`ais_bench/benchmark/tasks/custom_tasks/harbor_task.py`	HarborTask implementation
`ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py`	Benchmark config file template
`ais_bench/summarizers/harbor.py`	HarborSummarizer custom summarizer
`ais_bench/summarizers/__init__.py`	Registers HarborSummarizer
`ais_bench/tasks/custom_tasks/__init__.py`	Registers HarborTask
`ais_bench/docs/source_zh_cn/extended_benchmark/agent/harbor_design.md`	Detailed design document

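Core features

Parameter mapping
- Agent settings (agent_name, model_names, agent_kwargs) are configured in models
- Job settings (n_attempts, n_concurrent_trials, path, etc.) are configured in datasets
Progress monitoring
- tqdm shows execution progress in real time
- TaskStateManager updates the task state
Resuming interrupted runs
- Automatically checks whether details/config.json exists
- If it exists, the run is resumed automatically
HarborSummarizer
- Prints harbor-specific metrics (reward_distribution, exception_distribution, pass@k)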
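
As a rough illustration of the distribution metrics listed above, the sketch below aggregates per-trial records into a reward distribution, an exception distribution, and an average score. The record keys `reward` and `exception` and the helper name `summarize_trials` are assumptions made for this example; the actual schema of harbor's result.json and the internals of HarborSummarizer are not shown in this PR description.

```python
from collections import Counter


def summarize_trials(trials):
    """Aggregate per-trial records into distribution-style summaries.

    Each record is assumed (hypothetically) to look like
    {"reward": float, "exception": str | None}.
    """
    # Count how many trials ended with each reward value.
    rewards = Counter(t["reward"] for t in trials)
    # Count exceptions by type, ignoring trials that raised nothing.
    exceptions = Counter(t["exception"] for t in trials if t.get("exception"))
    avg = sum(t["reward"] for t in trials) / len(trials) if trials else 0.0
    return {
        "reward_distribution": dict(rewards),
        "exception_distribution": dict(exceptions),
        "avg_score": avg,
    }
```

A trial that times out would typically contribute both a 0.0 reward and an exception entry, which is why the two distributions are counted independently.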

Example configuration

models = [
    dict(
        abbr="terminus-2",
        agent_name="terminus-2",
        model_names=["hosted_vllm/qwen3"],
        agent_kwargs={
            "api_base": "http://192.168.9.103:2498/v1",
            "model_info": {
                "max_input_tokens": 128000,
                "max_output_tokens": 4096,
            },
        },
    )
]

datasets = []
datasets.append(
    dict(
        abbr='harbor_terminal-bench-2',
        args=dict(
            path="/path/to/terminal-bench-2/",
            n_concurrent_trials=5,
            environment_type="docker",
            environment_delete=False,
        ),
    )
)

summarizer = dict(
    attr="accuracy",
    type=HarborSummarizer,
)
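
The example configuration keeps agent settings in models and job settings in datasets. A minimal sketch of how a task might pull the two groups apart is shown below; `split_harbor_settings` is a hypothetical helper written for this description, not the code in harbor_task.py.

```python
def split_harbor_settings(model_cfg, dataset_cfg):
    """Return (agent_settings, job_settings) from one model and one dataset entry.

    Hypothetical helper: only the key names are taken from the PR description.
    """
    agent_keys = ("agent_name", "model_names", "agent_kwargs")
    # Agent-side settings live on the model entry.
    agent = {k: model_cfg[k] for k in agent_keys if k in model_cfg}
    # Job-side settings live under the dataset entry's "args" dict.
    job = dict(dataset_cfg.get("args", {}))
    return agent, job


model_cfg = dict(abbr="terminus-2", agent_name="terminus-2",
                 model_names=["hosted_vllm/qwen3"], agent_kwargs={})
dataset_cfg = dict(abbr="harbor_terminal-bench-2",
                   args=dict(path="/path/to/terminal-bench-2/", n_concurrent_trials=5))
agent, job = split_harbor_settings(model_cfg, dataset_cfg)
```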

Output directory structure

outputs/default/{timestamp}/
├── results/
│   └── {model_abbr}/
│       └── {dataset_abbr}/
│           ├── details/                 # raw harbor results
│           │   ├── config.json
│           │   ├── result.json
│           │   └── trial_*/
│           └── {dataset_abbr}.json   # accuracy results
└── summary/
    └── summary_*.txt
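
For reference, the layout above can be expressed as a small path helper. This is only a sketch: `result_paths` is a hypothetical name, and the timestamp value used in the usage line is a placeholder, since the real timestamp format is not stated here.

```python
from pathlib import PurePosixPath


def result_paths(work_dir, model_abbr, dataset_abbr):
    """Build the result paths shown in the directory tree (hypothetical helper)."""
    base = PurePosixPath(work_dir) / "results" / model_abbr / dataset_abbr
    details = base / "details"
    return {
        "details": details,                        # raw harbor results
        "config": details / "config.json",         # presence triggers a resumed run
        "result": details / "result.json",
        "accuracy": base / f"{dataset_abbr}.json",  # accuracy results
    }


paths = result_paths("outputs/default/TIMESTAMP", "terminus-2", "harbor_terminal-bench-2")
```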

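📐 Associated Test Results

Link to test results (to be filled in)

Test scenarios:

Single-task run (1 task, 1 trial)
Concurrent multi-task run (n_concurrent_trials > 1)
Resuming an interrupted run (continue with --reuse after an interruption)
Multiple attempts (n_attempts > 1, to verify pass@k)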

⚠️ BC-breaking (Optional)

No backward-incompatible changes.

⚠️ Performance degradation (Optional)

No noticeable performance degradation.

🌟 Use cases (Optional)

Basic run

cd ais_bench
ais_bench ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py --max-num-workers 3

Resuming an interrupted run

ais_bench ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py --max-num-workers 3 --reuse {timestamp}

Example output

============================================================
Dataset: harbor_terminal-bench-2
Model: terminus-2
============================================================
Total Count: 74
Errors: 54
Avg Score: 0.045

Reward Distribution:
+--------+-------+
|  Score | Count |
+========+=======+
|    0.0 |    70 |
+--------+-------+
|    1.0 |     4 |
+--------+-------+

Exception Distribution:
+----------------------------+-------+
| Exception                  | Count |
+============================+=======+
| AgentTimeoutError          |    39 |
+----------------------------+-------+
| AgentSetupTimeoutError     |    13 |
+----------------------------+-------+
| InternalServerError        |     2 |
+----------------------------+-------+

Pass@k:
+----+-----------+
| k  | Pass Rate |
+====+===========+
|  1 |    0.0541 |
+----+-----------+
|  2 |    0.0811 |
+----+-----------+
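
For readers unfamiliar with the Pass@k table, the sketch below uses the commonly cited unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of attempts for a task and c the number that succeeded. Whether HarborSummarizer computes pass@k exactly this way is an assumption; the function names are illustrative only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task with n attempts, c of them successful."""
    if k > n:
        raise ValueError("k must not exceed the number of attempts n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_task, k):
    """Average pass@k over tasks; per_task is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)
```

With two attempts per task, a task that succeeds once contributes 0.5 to pass@1 and 1.0 to pass@2, which is why the pass rate grows with k.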

✅ Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
The modification is fully covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

Suggested Reviewers: @xxx
Relevant Module Owners: @xxx
Other Collaboration Notes:

🌟 Useful CI Command

Command	Introduction
`/gemini review`	Performs a code review for the current pull request in its current state by Gemini.
`/gemini summary`	Provides a summary of the current pull request in its current state by Gemini.
`/gemini help`	Displays a list of available commands of Gemini.
`/readthedocs build`	Triggers a build of the documentation for the current pull request in its current state by Read the Docs.

gemini-code-assist

Code Review

This pull request introduces the Harbor benchmark integration, adding HarborTask for running evaluation jobs and HarborSummarizer for parsing and displaying results. It also updates documentation for swe_bench to install mini-swe-agent from source, and adds an example configuration for terminal-bench-2. The feedback highlights several critical and medium-severity improvements for HarborTask, including ensuring progress bar support when resuming jobs, safely registering signal handlers in multi-threaded environments, using a threading.Event to prevent thread leaks/blocking in the progress monitor, and refactoring nested configuration access using inherited properties. Additionally, it is recommended to remove unused code and avoid hardcoding version strings in the summarizer.

gemini-code-assist · 2026-05-30T01:54:47Z

+        if self.work_dir:
+            details_dir = Path(self.work_dir) / "details"
+            config_path = details_dir / "config.json"
+            if config_path.exists():
+                return self._resume_job(details_dir)


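When resuming, _resume_job is called directly and in turn calls job.run() directly, so no tqdm progress bar is shown during a resumed run.
Consider refactoring _resume_job and _run_with_tqdm so that a resumed run also goes through _run_with_tqdm, with the progress bar's completed count initialized from the number of existing trial_* directories, so that progress is displayed correctly when resuming.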
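
One possible shape for that initialization is sketched below. `count_completed_trials` is a hypothetical helper; it assumes each trial leaves a trial_* directory under details/, and it cannot tell a finished trial from one that was interrupted midway, which a real implementation may need to handle.

```python
from pathlib import Path


def count_completed_trials(details_dir):
    """Count existing trial_* directories under a harbor details directory."""
    details = Path(details_dir)
    if not details.is_dir():
        return 0  # fresh run: nothing to resume from
    return sum(1 for p in details.glob("trial_*") if p.is_dir())
```

The count could then seed the progress bar, for example `tqdm(total=total_tasks, initial=count_completed_trials(details_dir))`, so a resumed run starts from the existing position rather than from zero.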

[reply] AISBench runs tasks in subprocesses, not in a multi-threaded environment.

gemini-code-assist · 2026-05-30T01:54:47Z

+        def _handle_sigterm(signum, frame):
+            raise KeyboardInterrupt
+
+        signal.signal(signal.SIGTERM, _handle_sigterm)


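_run_with_tqdm calls signal.signal(signal.SIGTERM, _handle_sigterm) directly. If ais_bench runs this task in a multi-threaded environment (for example, a ThreadPoolExecutor), calling signal.signal from a non-main thread raises ValueError: signal only works in main thread, which crashes the program.
Consider guarding the call with try-except ValueError, or making sure the signal handler is registered only in the main thread.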

def _handle_sigterm(signum, frame):
    raise KeyboardInterrupt

try:
    signal.signal(signal.SIGTERM, _handle_sigterm)
except ValueError:
    pass
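
The second option from the comment, registering only in the main thread, could look roughly like this. `install_sigterm_handler` is a hypothetical wrapper for illustration; it returns whether the handler was actually installed so the caller can decide how to react.

```python
import signal
import threading


def install_sigterm_handler():
    """Register a SIGTERM handler only when called from the main thread."""

    def _handle_sigterm(signum, frame):
        # Convert SIGTERM into KeyboardInterrupt so cleanup code can run.
        raise KeyboardInterrupt

    if threading.current_thread() is not threading.main_thread():
        return False  # signal.signal would raise ValueError here
    signal.signal(signal.SIGTERM, _handle_sigterm)
    return True
```

Compared with the try-except form, the explicit check makes the skipped registration visible to the caller instead of silently swallowing the error.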

[reply] AISBench runs tasks in subprocesses, not in a multi-threaded environment.

gemini-code-assist · 2026-05-30T01:54:48Z

+            for model_abbr in self.model_abbrs:
+                if metric is None:
+                    for k in parsed_results.get(model_abbr, {}).get(dataset_abbr, {}).keys():
+                        row = [dataset_abbr, 'a39421', k, dataset_eval_mode.get(dataset_abbr, 'gen')]


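The version string 'a39421' is hardcoded here. Consider reading the version from the configuration, or using '-' as a default, to make the code easier to maintain.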


gemini-code-assist · 2026-05-30T01:54:48Z

+def parse_kwargs(kwargs_list: list[str] | None) -> dict[str, Any]:
+    if not kwargs_list:
+        return {}
+    result = {}
+    for kwarg in kwargs_list:
+        if "=" in kwarg:
+            key, value = kwarg.split("=", 1)
+            result[key] = value
+    return result


The function parse_kwargs is defined but never called anywhere in the code. Consider removing it to keep the code clean.


gemini-code-assist · 2026-05-30T01:54:48Z

+    def _set_api_key(self):
+        api_key = self.cfg["models"][0].get("api_key")
+        if api_key is None:
+            api_key = DEFAULT_FAKE_API_KEY
+        os.environ["OPENAI_API_KEY"] = api_key


_set_api_key uses self.cfg["models"][0]. Since the base class BaseTask already defines self.model_cfg = cfg["models"][0], consider using self.model_cfg directly for readability.

Suggested change

-    def _set_api_key(self):
-        api_key = self.cfg["models"][0].get("api_key")
-        if api_key is None:
-            api_key = DEFAULT_FAKE_API_KEY
-        os.environ["OPENAI_API_KEY"] = api_key
+    def _set_api_key(self):
+        api_key = self.model_cfg.get("api_key")
+        if api_key is None:
+            api_key = DEFAULT_FAKE_API_KEY
+        os.environ["OPENAI_API_KEY"] = api_key


gemini-code-assist · 2026-05-30T01:54:48Z

+        self.out_dir = osp.join(
+            self.work_dir, self.output_subdir, self.cfg["models"][0]["abbr"]
+        )
+        mkdir_or_exist(osp.join(self.out_dir, self.cfg["datasets"][0][0]["abbr"]))
+        self.out_detail_dir = osp.join(
+            self.out_dir,
+            self.cfg["datasets"][0][0]["abbr"],
+        )
+        mkdir_or_exist(Path(self.out_detail_dir))


_prepare_out_dir uses the nested indexing self.cfg["models"][0] and self.cfg["datasets"][0][0]. Since the base class BaseTask already initializes self.model_cfg and self.dataset_cfgs in __init__, consider using self.model_cfg and self.dataset_cfgs[0] directly, which is more concise and less error-prone.

Suggested change

-        self.out_dir = osp.join(
-            self.work_dir, self.output_subdir, self.cfg["models"][0]["abbr"]
-        )
-        mkdir_or_exist(osp.join(self.out_dir, self.cfg["datasets"][0][0]["abbr"]))
-        self.out_detail_dir = osp.join(
-            self.out_dir,
-            self.cfg["datasets"][0][0]["abbr"],
-        )
-        mkdir_or_exist(Path(self.out_detail_dir))
+        self.out_dir = osp.join(
+            self.work_dir, self.output_subdir, self.model_cfg["abbr"]
+        )
+        mkdir_or_exist(osp.join(self.out_dir, self.dataset_cfgs[0]["abbr"]))
+        self.out_detail_dir = osp.join(
+            self.out_dir,
+            self.dataset_cfgs[0]["abbr"],
+        )
+        mkdir_or_exist(Path(self.out_detail_dir))


gemini-code-assist · 2026-05-30T01:54:48Z

+        if args.get("verifier_env"):
+            config.verifier.env.update(parse_env_vars(args["verifier_env"]))
+
+        reuse_timestamp = None


The variable reuse_timestamp is assigned None but never used afterwards. Consider removing it.


gemini-code-assist · 2026-05-30T01:54:48Z

+        def monitor_progress():
+            nonlocal completed
+            while True:
+                if self.job and self.job.job_dir:
+                    trial_count = len(list(self.job.job_dir.glob("trial_*")))
+                    if trial_count > completed:
+                        pbar.update(trial_count - completed)
+                        completed = trial_count
+                        if self.task_state_manager:
+                            self.task_state_manager.update_task_state({
+                                "finish_count": completed,
+                            })
+                time.sleep(0.5)
+                if completed >= total_tasks:
+                    pbar.update(total_tasks - pbar.n)
+                    pbar.close()
+                    break
+
+        monitor_thread = threading.Thread(target=monitor_progress, daemon=True)
+        monitor_thread.start()


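In _run_with_tqdm, the monitor_progress thread decides when to exit by checking completed >= total_tasks. If the job ends early because of an exception, or the number of trials finally produced is smaller than total_tasks, the thread never exits on its own, so the main thread blocks needlessly for 5 seconds at monitor_thread.join(timeout=5).
Consider introducing a threading.Event (for example, stop_event) to control thread exit and setting it in the finally block, so that the monitor thread can exit immediately and safely.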
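
A minimal sketch of that pattern follows. `start_monitor` and its parameters (`count_done`, `on_progress`) are hypothetical stand-ins for the trial_* directory count and the pbar.update call; the point is that `stop_event.wait(interval)` replaces `time.sleep(interval)` and returns as soon as the event is set.

```python
import threading


def start_monitor(count_done, on_progress, total, stop_event, interval=0.5):
    """Poll count_done() in a daemon thread until done or stop_event is set."""

    def _loop():
        completed = 0
        while True:
            current = count_done()
            if current > completed:
                on_progress(current - completed)  # e.g. pbar.update(delta)
                completed = current
            if completed >= total:
                break
            # Sleeps up to `interval` seconds, but wakes immediately on set().
            if stop_event.wait(interval):
                break

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```

The caller would set the event in a finally block after job.run() returns or raises, then join the thread, which should return promptly instead of waiting out the join timeout.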


SWE Bench doc fix

56f188a

SJTUyh had a problem deploying to smoke-test-approval May 30, 2026 01:53 — with GitHub Actions Error

github-actions Bot added the feature label May 30, 2026

gemini-code-assist Bot reviewed May 30, 2026

SJTUyh added 2 commits May 30, 2026 09:56

support run terminal-bench-2 with harbor

eb0c6a5

Merge branch 'master_center' into tb2_dev

7e2de7a

SJTUyh force-pushed the tb2_dev branch from a3e25dc to 7e2de7a Compare May 30, 2026 01:57

SJTUyh temporarily deployed to smoke-test-approval May 30, 2026 01:58 — with GitHub Actions Inactive

review fix

c009a48

SJTUyh had a problem deploying to smoke-test-approval May 30, 2026 02:13 — with GitHub Actions Error

add harbor dependencies

52e62ce

SJTUyh temporarily deployed to smoke-test-approval May 30, 2026 02:17 — with GitHub Actions Inactive

fix total score

5959db1

SJTUyh had a problem deploying to smoke-test-approval May 30, 2026 06:08 — with GitHub Actions Failure

fix total score

37f248c

SJTUyh temporarily deployed to smoke-test-approval May 30, 2026 06:20 — with GitHub Actions Inactive

fix total score

f340630

SJTUyh temporarily deployed to smoke-test-approval May 30, 2026 06:39 — with GitHub Actions Inactive
