Fix KL approximation and evaluation prompt-response alignment by Tianqi-Xuuu · Pull Request #7 · llmsystem/llmsys_hw7

Tianqi-Xuuu · 2026-03-18T05:45:07Z

This PR fixes the KL approximation direction and corrects prompt-response alignment in evaluation so all sampled responses are scored.

For KL, the previous implementation applied the approximation in the wrong direction, so the computed value did not correspond to the intended KL(policy || ref). This change computes the estimate with the current policy as the distribution being measured against the reference policy.
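To illustrate why the direction matters, here is a minimal sketch, not the code from this PR. It assumes tokens are sampled from the current policy and compares two common per-sample estimators of KL(policy || ref) on a toy three-token distribution. The function names (`k1`, `k3`, `expected_under_policy`) and the toy probabilities are hypothetical; the homework's actual estimator may differ.

```python
import math

def kl_exact(p, q):
    """Exact KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def k1(logp_policy, logp_ref):
    """Plain log-ratio estimator; unbiased for KL(policy || ref) under policy samples."""
    return logp_policy - logp_ref

def k3(logp_policy, logp_ref):
    """Non-negative estimator (r - 1) - log r with r = ref / policy.

    Also unbiased for KL(policy || ref) when samples come from the policy.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - 1.0 - log_ratio

def expected_under_policy(estimator, policy, ref):
    """Expectation of a per-sample estimator when tokens are drawn from the policy."""
    return sum(p * estimator(math.log(p), math.log(r)) for p, r in zip(policy, ref))

# Toy distributions (assumed values, for illustration only).
policy = [0.7, 0.2, 0.1]
ref = [0.4, 0.4, 0.2]

target = kl_exact(policy, ref)
correct = expected_under_policy(k3, policy, ref)
# Swapping the arguments reverses the direction of the approximation.
swapped = expected_under_policy(lambda lp, lr: k3(lr, lp), policy, ref)
```

With the arguments in the intended order, the expectation of `k3` equals the exact KL(policy || ref). With the arguments swapped, the same formula still returns a non-negative number, but it no longer equals that quantity, which is the kind of silent mismatch described above.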

For evaluation, evaluate_policy generates multiple responses per prompt, but the old code paired the responses with the original (non-duplicated) prompt list. As a result, only a subset of the generated responses was scored. This change uses the duplicated prompt list so that every sampled response is included in reward evaluation.
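The alignment problem can be reproduced in a few lines. This is a sketch under stated assumptions, not the repository's `evaluate_policy`: `generate` is a hypothetical stand-in for sampling, and it assumes the samples for each prompt are returned consecutively (all samples for prompt 0, then prompt 1, and so on). If the real generation code orders its outputs differently, the duplication must follow that order instead.

```python
def generate(prompts, num_samples):
    """Hypothetical stand-in for sampling: num_samples responses per prompt, grouped by prompt."""
    return [f"{p} -> sample {k}" for p in prompts for k in range(num_samples)]

def pair_misaligned(prompts, responses):
    """Old behaviour: zip stops at the shorter list, so most responses are dropped."""
    return list(zip(prompts, responses))

def pair_aligned(prompts, responses, num_samples):
    """Repeat each prompt num_samples times so every response has its prompt."""
    expanded = [p for p in prompts for _ in range(num_samples)]
    if len(expanded) != len(responses):
        raise ValueError("expected num_samples responses per prompt")
    return list(zip(expanded, responses))

prompts = ["prompt A", "prompt B"]
num_samples = 3
responses = generate(prompts, num_samples)

dropped = pair_misaligned(prompts, responses)          # only len(prompts) pairs
scored = pair_aligned(prompts, responses, num_samples)  # one pair per response
```

Note that the misaligned version does not just drop responses: under the grouping assumed here, it also pairs the second prompt with a response that was sampled for the first one, so the scores that are computed can be attributed to the wrong prompt.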

Fix KL approximation and evaluation prompt-response alignment

d3453c6

Tianqi-Xuuu force-pushed the main branch from eec6423 to d3453c6 Compare March 18, 2026 05:53

Tianqi-Xuuu wants to merge 1 commit into llmsystem:main from Tianqi-Xuuu:main