Any plans to support Claw-Eval benchmark? #1

Description

@passionate11

Hi team! First of all, thank you so much for EcoClaw — the idea of routing to the cheapest capable model based on real benchmark data is brilliant, and the cost savings are very impressive. Your research work on LLM routing (Avengers, AvengersPro, LLMRouterBench, etc.) is also incredibly valuable to the community. Great job! 🙌

I noticed that EcoClaw currently uses PinchBench data for model selection and scoring. I'm curious whether there are any plans to support Claw-Eval as an additional benchmark source?

Claw-Eval is an end-to-end benchmark specifically designed for AI agents acting as personal assistants, with 104 tasks, 15 mock enterprise services, Docker sandboxes, and deterministic grading. It could be a great complement to PinchBench for evaluating model capabilities in agentic scenarios.

Specifically, I'm wondering:

  1. Would it be feasible to incorporate Claw-Eval scores into EcoClaw's routing decisions (e.g., as an additional signal alongside PinchBench)?
  2. Alternatively, has EcoClaw's routing accuracy been tested against the Claw-Eval benchmark?
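
To make option 1 concrete, here is a rough sketch of the kind of blending I have in mind. Everything in it is hypothetical: `ModelStats`, `blended_score`, `cheapest_capable`, and the `claw_weight` parameter are names I made up for illustration, not EcoClaw APIs, and the scores and costs are placeholders rather than real PinchBench or Claw-Eval results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    """Per-model benchmark scores and cost (all values are placeholders)."""
    name: str
    pinchbench: float  # assumed normalized to [0, 1]
    claw_eval: float   # assumed normalized to [0, 1]
    cost: float        # hypothetical relative cost per task


def blended_score(model: ModelStats, claw_weight: float = 0.5) -> float:
    """Weighted average of the two benchmark scores.

    claw_weight=0.0 uses PinchBench alone; 1.0 uses Claw-Eval alone.
    """
    return (1.0 - claw_weight) * model.pinchbench + claw_weight * model.claw_eval


def cheapest_capable(models, threshold: float, claw_weight: float = 0.5) -> ModelStats:
    """Return the cheapest model whose blended score meets the threshold.

    If no model qualifies, fall back to the highest-scoring model.
    """
    capable = [m for m in models if blended_score(m, claw_weight) >= threshold]
    if not capable:
        return max(models, key=lambda m: blended_score(m, claw_weight))
    return min(capable, key=lambda m: m.cost)


# Placeholder candidates: "model-a" looks strong on one benchmark only.
candidates = [
    ModelStats("model-a", pinchbench=0.9, claw_eval=0.5, cost=1.0),
    ModelStats("model-b", pinchbench=0.8, claw_eval=0.8, cost=3.0),
    ModelStats("model-c", pinchbench=0.95, claw_eval=0.9, cost=10.0),
]

pinch_only = cheapest_capable(candidates, threshold=0.75, claw_weight=0.0)
blended = cheapest_capable(candidates, threshold=0.75, claw_weight=0.5)
```

With these placeholder numbers, routing on PinchBench alone would pick `model-a`, while an equal blend would pick `model-b`, because `model-a` no longer clears the threshold once the agentic score is taken into account. A simple weighted average is just one possible way to combine the signals; the real design would be up to you.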

Thanks again for your amazing work, and looking forward to hearing your thoughts!
