[feature] Support run terminal-bench-2.0 with harbor #314

Open
SJTUyh wants to merge 8 commits into AISBench:master from SJTUyh:tb2_dev
SJTUyh (Collaborator) commented May 30, 2026

PR Type

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Relates to # (issue number to be filled in)


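🔍 Motivation

Integrate the harbor run capability into the AISBench benchmark framework so that benchmark can run harbor evaluation jobs (such as terminal-bench-2).

Core goals:

  1. Parameters in the benchmark config file map one-to-one to harbor run command-line arguments
  2. Output files are written under the path specified by benchmark (following the tau2 bench layout)
  3. Execution progress is shown in real time with tqdm
  4. Interrupted runs can be resumed with --reuse
  5. HarborSummarizer provides custom summary output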

📝 Modification

New files

| File path | Description |
| --- | --- |
| ais_bench/benchmark/tasks/custom_tasks/harbor_task.py | HarborTask task implementation |
| ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py | benchmark config file template |
| ais_bench/summarizers/harbor.py | HarborSummarizer custom summarizer |
| ais_bench/summarizers/__init__.py | Registers HarborSummarizer |
| ais_bench/tasks/custom_tasks/__init__.py | Registers HarborTask |
| ais_bench/docs/source_zh_cn/extended_benchmark/agent/harbor_design.md | Detailed design document |

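Core features

  1. Parameter mapping

    • Agent settings (agent_name, model_names, agent_kwargs) are configured in models
    • Job settings (n_attempts, n_concurrent_trials, path, etc.) are configured in datasets
  2. Progress monitoring

    • tqdm shows execution progress in real time
    • TaskStateManager updates the task state
  3. Resuming an interrupted run

    • Automatically checks whether details/config.json exists
    • If it exists, the run resumes automatically
  4. HarborSummarizer

    • Prints harbor-specific metrics (reward_distribution, exception_distribution, pass@k)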
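
For readers unfamiliar with pass@k, the metric named above is commonly computed with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n is the number of attempts per task and c is the number that succeeded. The sketch below assumes that estimator. This PR does not state how harbor or HarborSummarizer actually computes pass@k, and the helper names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them successful."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every sample of k attempts contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs (hypothetical input shape)."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)
```

With n_attempts = 2, a task that succeeded once contributes 0.5 to pass@1 and 1.0 to pass@2 under this estimator.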

Example configuration

models = [
    dict(
        abbr="terminus-2",
        agent_name="terminus-2",
        model_names=["hosted_vllm/qwen3"],
        agent_kwargs={
            "api_base": "http://192.168.9.103:2498/v1",
            "model_info": {
                "max_input_tokens": 128000,
                "max_output_tokens": 4096,
            },
        },
    )
]

datasets = []
datasets.append(
    dict(
        abbr='harbor_terminal-bench-2',
        args=dict(
            path="/path/to/terminal-bench-2/",
            n_concurrent_trials=5,
            environment_type="docker",
            environment_delete=False,
        ),
    )
)

summarizer = dict(
    attr="accuracy",
    type=HarborSummarizer,
)

Output directory layout

outputs/default/{timestamp}/
├── results/
│   └── {model_abbr}/
│       └── {dataset_abbr}/
│           ├── details/                 # raw harbor results
│           │   ├── config.json
│           │   ├── result.json
│           │   └── trial_*/
│           └── {dataset_abbr}.json   # accuracy results
└── summary/
    └── summary_*.txt

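📐 Associated Test Results

  • Test result link (to be filled in)

Test scenarios:

  1. Single-task run (1 task, 1 trial)
  2. Concurrent multi-task run (n_concurrent_trials > 1)
  3. Resume (continue with --reuse after an interruption)
  4. Multiple attempts (n_attempts > 1, to verify pass@k)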

⚠️ BC-breaking (Optional)

No backward-incompatible changes.


⚠️ Performance degradation (Optional)

No noticeable performance degradation.


🌟 Use cases (Optional)

Basic run

cd ais_bench
ais_bench ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py --max-num-workers 3

Resuming an interrupted run

ais_bench ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py --max-num-workers 3 --reuse {timestamp}

Sample output

============================================================
Dataset: harbor_terminal-bench-2
Model: terminus-2
============================================================
Total Count: 74
Errors: 54
Avg Score: 0.045

Reward Distribution:
+--------+-------+
|  Score | Count |
+========+=======+
|    0.0 |    70 |
+--------+-------+
|    1.0 |     4 |
+--------+-------+

Exception Distribution:
+----------------------------+-------+
| Exception                  | Count |
+============================+=======+
| AgentTimeoutError          |    39 |
+----------------------------+-------+
| AgentSetupTimeoutError     |    13 |
+----------------------------+-------+
| InternalServerError        |     2 |
+----------------------------+-------+

Pass@k:
+----+-----------+
| k  | Pass Rate |
+====+===========+
|  1 |    0.0541 |
+----+-----------+
|  2 |    0.0811 |
+----+-----------+

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is fully covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @xxx
  • Relevant Module Owners: @xxx
  • Other Collaboration Notes:

🌟 Useful CI Command

| Command | Introduction |
| --- | --- |
| /gemini review | Performs a code review for the current pull request in its current state by Gemini. |
| /gemini summary | Provides a summary of the current pull request in its current state by Gemini. |
| /gemini help | Displays a list of available commands of Gemini. |
| /readthedocs build | Triggers a build of the documentation for the current pull request in its current state by Read the Docs. |

@SJTUyh SJTUyh had a problem deploying to smoke-test-approval May 30, 2026 01:53 — with GitHub Actions Error
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the Harbor benchmark integration, adding HarborTask for running evaluation jobs and HarborSummarizer for parsing and displaying results. It also updates documentation for swe_bench to install mini-swe-agent from source, and adds an example configuration for terminal-bench-2. The feedback highlights several critical and medium-severity improvements for HarborTask, including ensuring progress bar support when resuming jobs, safely registering signal handlers in multi-threaded environments, using a threading.Event to prevent thread leaks/blocking in the progress monitor, and refactoring nested configuration access using inherited properties. Additionally, it is recommended to remove unused code and avoid hardcoding version strings in the summarizer.

Comment on lines +180 to +184
    if self.work_dir:
        details_dir = Path(self.work_dir) / "details"
        config_path = details_dir / "config.json"
        if config_path.exists():
            return self._resume_job(details_dir)
Contributor

high

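On resume, the code calls _resume_job directly, which in turn runs job.run() directly, so no tqdm progress bar is shown for resumed runs.
Consider refactoring _resume_job and _run_with_tqdm so that resumed runs also go through _run_with_tqdm, and initialize the progress bar's completed count from the number of existing trial_* entries so that progress is displayed correctly on resume.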
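
A minimal sketch of the counting step suggested above, assuming the job directory contains one trial_* subdirectory per trial as in the output layout described in this PR. The helper names `count_existing_trials` and `resume_progress` are hypothetical and not part of HarborTask.

```python
from pathlib import Path


def count_existing_trials(job_dir: Path) -> int:
    """Count trial_* subdirectories already present under a job directory."""
    if not job_dir.is_dir():
        return 0
    return sum(1 for p in job_dir.glob("trial_*") if p.is_dir())


def resume_progress(job_dir: Path, total_tasks: int) -> tuple[int, int]:
    """Return (initial, total) for a progress bar, clamping initial to total."""
    done = min(count_existing_trials(job_dir), total_tasks)
    return done, total_tasks
```

The returned pair could then be passed to tqdm as `tqdm(total=total, initial=done)`, so a resumed run starts the bar at the number of trials already on disk. Whether every existing trial_* directory represents a finished trial is an assumption that would need checking against harbor's behavior.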

Collaborator Author

[reply] AISBench runs tasks in subprocesses, not in a multi-threaded environment.

Comment on lines +282 to +285
    def _handle_sigterm(signum, frame):
        raise KeyboardInterrupt

    signal.signal(signal.SIGTERM, _handle_sigterm)
Contributor

high

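_run_with_tqdm calls signal.signal(signal.SIGTERM, _handle_sigterm) directly. If ais_bench runs this task in a multi-threaded environment (for example, ThreadPoolExecutor), calling signal.signal from a non-main thread raises ValueError: signal only works in main thread, which crashes the program.
Consider guarding the call with try-except ValueError, or making sure the handler is registered only in the main thread.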

        def _handle_sigterm(signum, frame):
            raise KeyboardInterrupt

        try:
            signal.signal(signal.SIGTERM, _handle_sigterm)
        except ValueError:
            pass

Collaborator Author

[reply] AISBench runs tasks in subprocesses, not in a multi-threaded environment.

    for model_abbr in self.model_abbrs:
        if metric is None:
            for k in parsed_results.get(model_abbr, {}).get(dataset_abbr, {}).keys():
                row = [dataset_abbr, 'a39421', k, dataset_eval_mode.get(dataset_abbr, 'gen')]
Contributor

medium

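The version string 'a39421' is hardcoded here. Consider reading the version from the configuration, or falling back to the default '-', to improve maintainability.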
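
One way to express the suggested fallback is sketched below. The `version` key and the helper name `resolve_version` are assumptions made for illustration; the PR does not show where a dataset version would be stored in the configuration.

```python
def resolve_version(dataset_cfg: dict, default: str = "-") -> str:
    """Return the configured version, or the default when none is set."""
    # "version" is a hypothetical key, not a confirmed AISBench config field.
    version = dataset_cfg.get("version")
    return str(version) if version else default
```

The summary row would then use `resolve_version(cfg)` in place of the literal string.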

Collaborator Author

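[review] The version string 'a39421' is hardcoded here. Consider reading the version from the configuration, or falling back to the default '-', to improve maintainability.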

Comment on lines +29 to +37
    def parse_kwargs(kwargs_list: list[str] | None) -> dict[str, Any]:
        if not kwargs_list:
            return {}
        result = {}
        for kwarg in kwargs_list:
            if "=" in kwarg:
                key, value = kwarg.split("=", 1)
                result[key] = value
        return result
Contributor

medium

The function parse_kwargs is defined but never called anywhere in the code. Consider removing it to keep the code clean.

Collaborator Author

[review] The function parse_kwargs is defined but never called anywhere in the code. Consider removing it to keep the code clean.

Comment on lines +81 to +85
    def _set_api_key(self):
        api_key = self.cfg["models"][0].get("api_key")
        if api_key is None:
            api_key = DEFAULT_FAKE_API_KEY
        os.environ["OPENAI_API_KEY"] = api_key
Contributor

medium

_set_api_key uses self.cfg["models"][0]. Because the base class BaseTask already defines self.model_cfg = cfg["models"][0], consider using self.model_cfg directly for readability.

Suggested change
-    def _set_api_key(self):
-        api_key = self.cfg["models"][0].get("api_key")
-        if api_key is None:
-            api_key = DEFAULT_FAKE_API_KEY
-        os.environ["OPENAI_API_KEY"] = api_key
+    def _set_api_key(self):
+        api_key = self.model_cfg.get("api_key")
+        if api_key is None:
+            api_key = DEFAULT_FAKE_API_KEY
+        os.environ["OPENAI_API_KEY"] = api_key

Collaborator Author

[review] _set_api_key uses self.cfg["models"][0]. Because the base class BaseTask already defines self.model_cfg = cfg["models"][0], consider using self.model_cfg directly for readability.

Comment on lines +88 to +96
    self.out_dir = osp.join(
        self.work_dir, self.output_subdir, self.cfg["models"][0]["abbr"]
    )
    mkdir_or_exist(osp.join(self.out_dir, self.cfg["datasets"][0][0]["abbr"]))
    self.out_detail_dir = osp.join(
        self.out_dir,
        self.cfg["datasets"][0][0]["abbr"],
    )
    mkdir_or_exist(Path(self.out_detail_dir))
Contributor

medium

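_prepare_out_dir uses deeply nested indexing: self.cfg["models"][0] and self.cfg["datasets"][0][0]. Because the base class BaseTask already initializes self.model_cfg and self.dataset_cfgs in __init__, consider using self.model_cfg and self.dataset_cfgs[0] directly, which makes the code shorter and less error-prone.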

Suggested change
-    self.out_dir = osp.join(
-        self.work_dir, self.output_subdir, self.cfg["models"][0]["abbr"]
-    )
-    mkdir_or_exist(osp.join(self.out_dir, self.cfg["datasets"][0][0]["abbr"]))
-    self.out_detail_dir = osp.join(
-        self.out_dir,
-        self.cfg["datasets"][0][0]["abbr"],
-    )
-    mkdir_or_exist(Path(self.out_detail_dir))
+    self.out_dir = osp.join(
+        self.work_dir, self.output_subdir, self.model_cfg["abbr"]
+    )
+    mkdir_or_exist(osp.join(self.out_dir, self.dataset_cfgs[0]["abbr"]))
+    self.out_detail_dir = osp.join(
+        self.out_dir,
+        self.dataset_cfgs[0]["abbr"],
+    )
+    mkdir_or_exist(Path(self.out_detail_dir))

Collaborator Author

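[review] _prepare_out_dir uses deeply nested indexing: self.cfg["models"][0] and self.cfg["datasets"][0][0]. Because the base class BaseTask already initializes self.model_cfg and self.dataset_cfgs in __init__, consider using self.model_cfg and self.dataset_cfgs[0] directly, which makes the code shorter and less error-prone.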

    if args.get("verifier_env"):
        config.verifier.env.update(parse_env_vars(args["verifier_env"]))

    reuse_timestamp = None
Contributor

medium

The variable reuse_timestamp is assigned None but never used afterwards. Consider removing it.

Collaborator Author

[review] The variable reuse_timestamp is assigned None but never used afterwards. Consider removing it.

Comment on lines +261 to +280
    def monitor_progress():
        nonlocal completed
        while True:
            if self.job and self.job.job_dir:
                trial_count = len(list(self.job.job_dir.glob("trial_*")))
                if trial_count > completed:
                    pbar.update(trial_count - completed)
                    completed = trial_count
                    if self.task_state_manager:
                        self.task_state_manager.update_task_state({
                            "finish_count": completed,
                        })
            time.sleep(0.5)
            if completed >= total_tasks:
                pbar.update(total_tasks - pbar.n)
                pbar.close()
                break

    monitor_thread = threading.Thread(target=monitor_progress, daemon=True)
    monitor_thread.start()
Contributor

medium

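In _run_with_tqdm, the monitor_progress thread exits only when completed >= total_tasks. If the job ends early because of an exception, or the number of trials finally produced is smaller than total_tasks, the thread cannot exit on its own, so the main thread blocks unnecessarily for 5 seconds at monitor_thread.join(timeout=5).
Consider introducing a threading.Event (for example, stop_event) to control thread exit, and set it in the finally block so that the monitor thread can exit immediately and safely.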
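
The stop-event pattern described above might look like the sketch below. It is a generic illustration rather than the HarborTask code: `run_with_monitor`, `work`, and `poll` are hypothetical names, with `poll` standing in for the progress-bar and task-state update.

```python
import threading


def run_with_monitor(work, poll, interval: float = 0.5):
    """Run work() while a background thread calls poll() until work finishes."""
    stop_event = threading.Event()

    def monitor():
        # wait() returns True as soon as the event is set, so the loop ends
        # promptly instead of sleeping out a full interval.
        while not stop_event.wait(interval):
            poll()
        poll()  # final refresh after the work has ended

    thread = threading.Thread(target=monitor, daemon=True)
    thread.start()
    try:
        return work()
    finally:
        # Runs on success and on exceptions alike, so the monitor always stops.
        stop_event.set()
        thread.join()
```

Because the event is set in the finally block, the join returns quickly even when the job raises or produces fewer trials than expected.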

Collaborator Author

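[reply] In _run_with_tqdm, the monitor_progress thread exits only when completed >= total_tasks. If the job ends early because of an exception, or the number of trials finally produced is smaller than total_tasks, the thread cannot exit on its own, so the main thread blocks unnecessarily for 5 seconds at monitor_thread.join(timeout=5).
Consider introducing a threading.Event (for example, stop_event) to control thread exit, and set it in the finally block so that the monitor thread can exit immediately and safely.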

@SJTUyh SJTUyh had a problem deploying to smoke-test-approval May 30, 2026 02:13 — with GitHub Actions Error
@SJTUyh SJTUyh temporarily deployed to smoke-test-approval May 30, 2026 02:17 — with GitHub Actions Inactive
@SJTUyh SJTUyh had a problem deploying to smoke-test-approval May 30, 2026 06:08 — with GitHub Actions Failure
@SJTUyh SJTUyh temporarily deployed to smoke-test-approval May 30, 2026 06:20 — with GitHub Actions Inactive
@SJTUyh SJTUyh temporarily deployed to smoke-test-approval May 30, 2026 06:39 — with GitHub Actions Inactive