Incorrect calculation of generalized advantage estimates in PPO #953

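The following code in `PPOAgent.compute_advantages` ignores the value prediction for the final observation of each trajectory. Because `value_preds` is stripped before `final_value` is read, the second-to-last value is passed to the `generalized_advantage_estimation` function twice: once as the last element of `values` and once as `final_value`.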

```python
    # Arg value_preds was appended with final next_step value. Make tensors
    #   next_value_preds by stripping first and last elements respectively.
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=value_preds[:, -1],
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )
```

Instead, `final_value` should be extracted before `value_preds` are stripped, e.g.:

```python
    final_value_preds = value_preds[:, -1]
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=final_value_preds,
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )
```

Also, the comment mentions `next_value_preds`, a tensor the code never creates, so it should be updated to match what the code actually does.
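
To illustrate why the slicing order matters, here is a minimal single-trajectory sketch in plain Python. The `gae` helper is a hypothetical, simplified reference recursion written for this example; it is not the `value_ops.generalized_advantage_estimation` implementation, and the numbers are made up. With `lam=0.0` each advantage reduces to the one-step TD error, which makes the effect of the wrong bootstrap value easy to see.

```python
def gae(rewards, values, final_value, discounts, lam):
    """Simplified GAE for one trajectory (hypothetical helper, not tf_agents)."""
    # Value of the next state at each step; the last step bootstraps from final_value.
    next_values = values[1:] + [final_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + discounts[t] * next_values[t] - values[t]
        running = delta + discounts[t] * lam * running
        advantages[t] = running
    return advantages


# value_preds has T + 1 entries; the last one is the final next-step value.
value_preds = [1.0, 2.0, 3.0, 10.0]
rewards = [0.0, 0.0, 0.0]
discounts = [1.0, 1.0, 1.0]

# Buggy order: strip first, then index -1, which picks 3.0 instead of 10.0.
stripped = value_preds[:-1]
buggy = gae(rewards, stripped, stripped[-1], discounts, lam=0.0)

# Fixed order: read the final value before stripping.
final_value = value_preds[-1]
fixed = gae(rewards, value_preds[:-1], final_value, discounts, lam=0.0)

print(buggy)  # [1.0, 1.0, 0.0]
print(fixed)  # [1.0, 1.0, 7.0]
```

In this toy case only the last advantage differs, but with `lam > 0` the error would propagate backwards to earlier steps through the recursion.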

