feat(luce-bench): multi-turn agent_recorded redesign + LLM-judge grading #333

Closed
easel wants to merge 1 commit into Luce-Org:main from easel:feat/lucebench-multi-turn

easel commented Jun 3, 2026

Collaborator

Multi-turn agent_recorded area. This PR replaces the single-shot recorded harness with a multi-turn replay; adds the grading/ subpackage with the llm_judge, the multi_turn_cases fixture, and the agent_recorded and extract-agentic-fixture tests; and touches normalize/regrade/cli to support the new turn-level metrics. It also carries the two qwen3.6 and two gemma4 coding-agent-loop sweep writeups that consume this area.

Part of the PR #285 split

This PR is part of splitting the omnibus PR #285 into tightly-scoped reviewable PRs.

Dependencies

  • lucebench-harness — this PR edits files introduced there and must land first.

Risk

medium

Replaces the single-shot recorded harness with a multi-turn replay,
adds the grading/ subpackage with the llm_judge, multi_turn_cases
fixture, agent_recorded test + extract-agentic-fixture test, and
touches normalize/regrade/cli to support new turn-level metrics.

Also carries the two qwen3.6 and two gemma4 coding-agent-loop sweep
writeups that consume this area.

Depends on: lucebench-harness (edits files introduced there).

Part of splitting PR Luce-Org#285 into tightly-scoped PRs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

easel commented Jun 3, 2026

Collaborator Author

Merged into PR #337 (lucebench-harness branch) via cherry-pick a4dafc4a — the harness package + multi-turn redesign are now a single PR. Original commit content preserved verbatim.

🤖 Generated with Claude Code

easel closed this Jun 3, 2026
easel deleted the feat/lucebench-multi-turn branch June 3, 2026 22:27