Handling of truncated trajectories in AlphaZero training example #1306

### Problem Description

In examples/alphazero/train.py, we compute `value_mask` as follows:

https://github.com/sotetsuk/pgx/blob/87278d2d6e677fd87248c457207b59cfa42e578d/examples/alphazero/train.py#L179

The purpose is to avoid updating the critic network on incomplete trajectories, as is evident from the masking of the value loss:

https://github.com/sotetsuk/pgx/blob/87278d2d6e677fd87248c457207b59cfa42e578d/examples/alphazero/train.py#L211

Now, critic and actor networks share a torso of residual blocks [as defined in network.py](https://github.com/sotetsuk/pgx/blob/87278d2d6e677fd87248c457207b59cfa42e578d/examples/alphazero/network.py#L66), and while we mask value losses, we don't mask policy losses for samples from incomplete trajectories:

https://github.com/sotetsuk/pgx/blob/87278d2d6e677fd87248c457207b59cfa42e578d/examples/alphazero/train.py#L207

Therefore, samples from incomplete trajectories still influence both the policy and the value network outputs: the unmasked policy loss sends gradients through the shared torso, which also feeds the value head. This seems to defeat the purpose of defining `value_mask`.
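To make the masking concrete, here is a minimal sketch of how such a mask could be derived. It assumes the rollout stores a boolean `terminated` array of shape `[T, B]` (time, batch) and that a step counts as part of a complete trajectory if some termination occurs at or after it within the rollout. The function name `compute_value_mask` and the toy data are hypothetical, and this is my reading of the linked line rather than a verbatim copy of it.

```python
import jax.numpy as jnp


def compute_value_mask(terminated):
    """Return a [T, B] bool mask of steps whose episode finishes in the rollout.

    terminated: bool array [T, B], True where a step ended an episode.
    (Assumed layout; not taken verbatim from train.py.)
    """
    # Count terminations from each step to the end of the rollout
    # (reverse cumulative sum along time), then test for at least one.
    return jnp.cumsum(terminated[::-1, :], axis=0)[::-1, :] >= 1


# Toy rollout: T=3 steps, B=2 environments.
# Env 0 terminates at t=1, so t=2 starts an episode that is truncated.
# Env 1 never terminates inside the rollout.
terminated = jnp.array([
    [False, False],
    [True, False],
    [False, False],
])
value_mask = compute_value_mask(terminated)
```

In this toy case the last step of env 0 and every step of env 1 are masked out of the value loss, yet all six samples would still contribute to an unmasked policy loss.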

---------------------------------


### Possible Solutions

1. Mask out samples from truncated trajectories in the policy loss as well.
2. Bootstrap the value target for truncated trajectories.
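Both options could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `train.py`: the function names `masked_losses` and `bootstrapped_value_targets`, the argument layout, and the normalisation by the number of unmasked samples are all hypothetical. For option 2, `bootstrap_value` is assumed to be the critic's estimate for the state after the last rollout step, and `discount` is left as a parameter (in a two-player zero-sum game with alternating perspectives one would presumably pass `-1.0`).

```python
import jax
import jax.numpy as jnp


def masked_losses(logits, policy_tgt, value, value_tgt, mask):
    """Option 1: apply the same mask to both the policy and the value loss."""
    mask = mask.astype(jnp.float32)
    # Average over unmasked samples only; guard against an all-zero mask.
    denom = jnp.maximum(mask.sum(), 1.0)
    # Softmax cross-entropy between policy targets and predicted logits.
    ce = -jnp.sum(policy_tgt * jax.nn.log_softmax(logits, axis=-1), axis=-1)
    policy_loss = jnp.sum(ce * mask) / denom
    value_loss = jnp.sum(0.5 * (value - value_tgt) ** 2 * mask) / denom
    return policy_loss, value_loss


def bootstrapped_value_targets(rewards, terminated, bootstrap_value, discount):
    """Option 2: backward returns seeded with a critic estimate.

    rewards, terminated: [T, B]; bootstrap_value: [B].
    A termination cuts the recursion, so finished episodes keep their
    exact return and only truncated tails depend on bootstrap_value.
    """
    def step(next_return, inputs):
        reward, done = inputs
        ret = reward + discount * jnp.where(done, 0.0, next_return)
        return ret, ret

    _, targets = jax.lax.scan(
        step, bootstrap_value, (rewards, terminated), reverse=True
    )
    return targets


# Toy usage: one environment, terminal reward at t=1, truncated tail at t=2.
rewards = jnp.array([[0.0], [1.0], [0.0]])
terminated = jnp.array([[False], [True], [False]])
targets = bootstrapped_value_targets(
    rewards, terminated, bootstrap_value=jnp.array([0.5]), discount=-1.0
)

# Toy usage of option 1: the second sample is masked out of both losses.
policy_loss, value_loss = masked_losses(
    logits=jnp.zeros((2, 2)),
    policy_tgt=jnp.array([[1.0, 0.0], [1.0, 0.0]]),
    value=jnp.array([1.0, 5.0]),
    value_tgt=jnp.array([0.0, 0.0]),
    mask=jnp.array([True, False]),
)
```

Option 1 discards the truncated samples entirely, while option 2 keeps them at the cost of a biased value target that depends on the critic's current estimate.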

I am not sure whether the original AlphaZero papers use one of these or some other solution.
