Hi team, thanks for the great work—seeing that dLLM is RL-able is super exciting!
I noticed in your GRPO loss equation (screenshot below) that the loss is normalized by the response length |o|:
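For reference, the standard GRPO objective with this per-response normalization (following the DeepSeekMath formulation; the equation in your screenshot may differ in details) looks roughly like:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\,\hat{A}_{i,t}\Big)\right]$$

where $r_{i,t}(\theta)$ is the per-token importance ratio and $\hat{A}_{i,t}$ is the group-normalized advantage.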

As discussed in sail-sg/understand-r1-zero, dividing by |o| can introduce a length bias during optimization. Given that dLLM operates with a fixed number of diffusion steps T (more like fixed-horizon RL with EOS as the absorbing state), could you clarify whether your actual implementation normalizes by |o| or by T? A sketch of the two options follows below.
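To make the distinction concrete, here is a minimal PyTorch-style sketch of the two aggregation choices (the function name, arguments, and default `T` are hypothetical illustrations, not taken from your code):

```python
import torch

def aggregate_grpo_loss(per_token_loss: torch.Tensor,
                        response_mask: torch.Tensor,
                        normalize_by: str = "length",
                        T: int = 256) -> torch.Tensor:
    """Aggregate a (batch, seq_len) per-token GRPO surrogate loss.

    per_token_loss: clipped surrogate loss per token (already advantage-weighted).
    response_mask:  1.0 for generated (non-pad) tokens, 0.0 elsewhere.
    normalize_by:   "length" divides each response by its own |o_i|;
                    "horizon" divides by the fixed horizon T.
    """
    masked = per_token_loss * response_mask
    if normalize_by == "length":
        # 1/|o_i| per sample: tokens in longer responses get smaller
        # per-token weight -- the length bias discussed in understand-r1-zero.
        lengths = response_mask.sum(dim=-1).clamp(min=1.0)
        per_sample = masked.sum(dim=-1) / lengths
    else:
        # 1/T: every token carries the same weight regardless of |o_i|,
        # matching a fixed-horizon view where EOS absorbs the remaining steps.
        per_sample = masked.sum(dim=-1) / T
    return per_sample.mean()
```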
Also, do you expect this normalization choice to affect dLLM's behavior (e.g., output length) similarly to what we observed in Section 3.2 of that paper?
Thanks again for the awesome work!