
Fix KL approximation and evaluation prompt-response alignment#7

Open
Tianqi-Xuuu wants to merge 1 commit into llmsystem:main from Tianqi-Xuuu:main

Conversation

@Tianqi-Xuuu

This PR fixes the KL approximation direction and corrects prompt-response alignment in evaluation so all sampled responses are scored.

For KL, the previous implementation applied the approximation in the wrong direction, so the computed value did not match the intended quantity KL(policy || ref). This change makes the KL estimator consistent with samples drawn from the current policy being compared against the reference policy.
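For illustration, here is a minimal sketch of the direction issue, assuming the common low-variance "k3" estimator of KL(policy || ref) for samples drawn from the policy (function and argument names are hypothetical, not taken from this repo):

```python
import math

def kl_k3(policy_logprob: float, ref_logprob: float) -> float:
    """Per-sample estimate of KL(policy || ref) via the k3 estimator.

    The sample is assumed to come from the current policy, so the
    log-ratio must be log ref - log policy. Flipping the two terms
    estimates KL(ref || policy) instead, which is the kind of
    direction bug this PR describes.
    """
    log_ratio = ref_logprob - policy_logprob
    # k3 = exp(x) - x - 1 is non-negative for all x, so the
    # estimate never goes below zero.
    return math.exp(log_ratio) - log_ratio - 1
```

Note that swapping the arguments changes which KL is being estimated, even though both versions look plausible in isolation.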

For evaluation, evaluate_policy generates multiple responses per prompt, but the old code paired the flat response list with the original prompt list, so only a subset of the generated responses was actually scored. This change zips against the duplicated prompt list so every sampled response is included in reward evaluation.
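The alignment fix can be sketched as follows, assuming responses come back as a flat list in prompt-major order (names such as `reward_fn` and `num_samples_per_prompt` are illustrative, not the repo's actual identifiers):

```python
def score_responses(prompts, responses, reward_fn, num_samples_per_prompt):
    """Pair every sampled response with its originating prompt.

    `responses` is assumed flat, with num_samples_per_prompt entries
    per prompt in prompt-major order. Zipping against the original
    prompt list would silently truncate to len(prompts) pairs,
    dropping the rest of the samples from reward evaluation.
    """
    # Duplicate each prompt once per sampled response.
    repeated_prompts = [p for p in prompts for _ in range(num_samples_per_prompt)]
    assert len(repeated_prompts) == len(responses)
    return [reward_fn(p, r) for p, r in zip(repeated_prompts, responses)]
```

With two prompts and two samples each, this scores four prompt-response pairs instead of the two that the unduplicated zip would produce.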

