Any plans to support Claw-Eval benchmark?

Hi team! First of all, thank you so much for EcoClaw — the idea of routing to the cheapest capable model based on real benchmark data is brilliant, and the cost savings are very impressive. Your research work on LLM routing (Avengers, AvengersPro, LLMRouterBench, etc.) is also incredibly valuable to the community. Great job! 🙌

I noticed that EcoClaw currently uses PinchBench data for model selection and scoring. I'm curious whether there are any plans to support [Claw-Eval](https://github.com/claw-eval/claw-eval) as an additional benchmark source?

Claw-Eval is an end-to-end benchmark specifically designed for AI agents acting as personal assistants, with 104 tasks, 15 mock enterprise services, Docker sandboxes, and deterministic grading. It could be a great complement to PinchBench for evaluating model capabilities in agentic scenarios.

Specifically, I'm wondering:
1. Would it be feasible to incorporate Claw-Eval scores into EcoClaw's routing decisions (e.g., as an additional signal alongside PinchBench)?
2. Or has there been any testing of EcoClaw's routing accuracy against the Claw-Eval benchmark?
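
To make the first question concrete, here is a rough sketch of what I have in mind. I don't know how EcoClaw represents models or scores internally, so every name here (`ModelStats`, `blended_score`, `cheapest_capable`, `claw_weight`, the score and price fields) is hypothetical, and the simple weighted average is only one possible way to combine the two signals:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelStats:
    # All fields are hypothetical; they are not EcoClaw's actual schema.
    name: str
    cost: float                        # assumed price per unit of usage
    pinchbench: float                  # assumed normalized score in [0, 1]
    claw_eval: Optional[float] = None  # assumed normalized score, may be missing


def blended_score(m: ModelStats, claw_weight: float = 0.5) -> float:
    """Weighted average of the two scores; falls back to PinchBench alone
    when no Claw-Eval score is available for the model."""
    if m.claw_eval is None:
        return m.pinchbench
    return (1 - claw_weight) * m.pinchbench + claw_weight * m.claw_eval


def cheapest_capable(models: List[ModelStats], min_score: float,
                     claw_weight: float = 0.5) -> Optional[ModelStats]:
    """Return the lowest-cost model whose blended score meets min_score,
    or None if no model qualifies."""
    capable = [m for m in models if blended_score(m, claw_weight) >= min_score]
    return min(capable, key=lambda m: m.cost, default=None)
```

With `claw_weight=0` this reduces to routing on PinchBench alone, so the extra signal could be opt-in.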

Thanks again for your amazing work, and looking forward to hearing your thoughts!
