feat(luce-bench): multi-turn agent_recorded redesign + LLM-judge grading by easel · Pull Request #333 · Luce-Org/lucebox-hub

easel · 2026-06-03T21:29:44Z

This PR covers the multi-turn agent_recorded area. It replaces the single-shot recorded harness with a multi-turn replay; adds the grading/ subpackage with llm_judge, the multi_turn_cases fixture, the agent_recorded test, and the extract-agentic-fixture test; and touches normalize/regrade/cli to support the new turn-level metrics. It also carries the two qwen3.6 and two gemma4 coding-agent-loop sweep writeups that consume this area.
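For readers unfamiliar with the idea, here is a minimal sketch of what a multi-turn replay with per-turn grading can look like. It is not the code in this PR: `Turn`, `TurnResult`, `replay_case`, `mean_score`, and the `model` and `judge` callables are all hypothetical names chosen for illustration, and the judge is a plain function standing in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text); hypothetical message shape


@dataclass
class Turn:
    prompt: str     # recorded input for this turn
    reference: str  # recorded reference reply for this turn


@dataclass
class TurnResult:
    index: int
    reply: str
    score: float


def replay_case(
    turns: List[Turn],
    model: Callable[[List[Message]], str],
    judge: Callable[[str, str], float],
) -> List[TurnResult]:
    """Replay recorded turns in order, carrying the conversation history
    forward, and grade each reply against its recorded reference."""
    history: List[Message] = []
    results: List[TurnResult] = []
    for i, turn in enumerate(turns):
        history.append(("user", turn.prompt))
        reply = model(list(history))  # pass a copy so the model cannot mutate history
        history.append(("assistant", reply))
        results.append(TurnResult(i, reply, judge(turn.reference, reply)))
    return results


def mean_score(results: List[TurnResult]) -> float:
    """Turn-level metric: average judge score across turns (0.0 if empty)."""
    return sum(r.score for r in results) / len(results) if results else 0.0


if __name__ == "__main__":
    # Stand-ins: an "uppercase echo" model and an exact-match judge.
    demo_turns = [Turn("a", "A"), Turn("b", "X")]
    demo_model = lambda history: history[-1][1].upper()
    demo_judge = lambda ref, reply: 1.0 if ref == reply else 0.0
    demo = replay_case(demo_turns, demo_model, demo_judge)
    print([r.score for r in demo], mean_score(demo))
```

The point of the sketch is the contrast with a single-shot harness: each turn sees the accumulated history, and grading yields one score per turn rather than one per case.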

Part of the PR #285 split

This PR is part of splitting the omnibus PR #285 into tightly-scoped reviewable PRs.

Dependencies

lucebench-harness: this PR edits files introduced there, so lucebench-harness must land first.

Risk

medium

Replaces the single-shot recorded harness with a multi-turn replay; adds the grading/ subpackage with llm_judge, the multi_turn_cases fixture, the agent_recorded test, and the extract-agentic-fixture test; and touches normalize/regrade/cli to support the new turn-level metrics. Also carries the two qwen3.6 and two gemma4 coding-agent-loop sweep writeups that consume this area. Depends on: lucebench-harness (edits files introduced there). Part of the split of PR Luce-Org#285 into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

easel · 2026-06-03T22:27:22Z

Merged into PR #337 (lucebench-harness branch) via cherry-pick a4dafc4a — the harness package + multi-turn redesign are now a single PR. Original commit content preserved verbatim.

🤖 Generated with Claude Code

easel closed this Jun 3, 2026

easel deleted the feat/lucebench-multi-turn branch June 3, 2026 22:27

easel mentioned this pull request Jun 3, 2026

feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge #337 (Open, 4 tasks)

easel wants to merge 1 commit into Luce-Org:main from easel:feat/lucebench-multi-turn
