-
Notifications
You must be signed in to change notification settings - Fork 745
Open
Description
The following code in PPOAgent.compute_advantages ignores value predictions for final observations in the trajectory and instead passes one-before-last values to the generalized_advantage_estimation function twice:
# Arg value_preds was appended with final next_step value. Make tensors
# next_value_preds by stripping first and last elements respectively.
value_preds = value_preds[:, :-1]
if self._use_gae:
advantages = value_ops.generalized_advantage_estimation(
values=value_preds,
final_value=value_preds[:, -1],
rewards=rewards,
discounts=discounts,
td_lambda=self._lambda,
time_major=False,
)Instead, final_value should be extracted before value_preds are stripped, e.g.:
final_value_preds = value_preds[:, -1]
value_preds = value_preds[:, :-1]
if self._use_gae:
advantages = value_ops.generalized_advantage_estimation(
values=value_preds,
final_value=final_value_preds,
rewards=rewards,
discounts=discounts,
td_lambda=self._lambda,
time_major=False,
)Also, the comment about next_value_preds doesn't match the code so it could be improved.
Metadata
Metadata
Assignees
Labels
No labels