
About Diffu-GRPO #1

@lkevinzc

Hi team, thanks for the great work—seeing dLLM is RL-able is super exciting!

I noticed in your GRPO loss equation (screenshot below) that the loss is normalized by the response length |o|:

[Image: screenshot of the GRPO loss equation, normalized by the response length |o|]

As we discussed in sail-sg/understand-r1-zero, dividing by |o| can introduce a length bias during optimization. Given that the dLLM operates with a fixed number of diffusion steps T (more like fixed-horizon RL, with EOS as the absorbing state), could you clarify whether your actual implementation normalizes by |o| or by T?
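For concreteness, the distinction between the two normalizations can be sketched as follows (all names and values are illustrative, not taken from the Diffu-GRPO codebase):

```python
import numpy as np

# Hypothetical per-token loss terms for two sampled responses of
# different lengths (values are made up for illustration).
per_token_loss = [
    np.array([0.5, 0.2, 0.1]),                      # short response, |o| = 3
    np.array([0.5, 0.2, 0.1, 0.4, 0.3, 0.2]),       # long response,  |o| = 6
]

# Per-response normalization: divide each response's summed loss by its
# own length |o|. Tokens in shorter responses get a larger effective
# weight, which is the source of the length bias.
loss_len_norm = np.mean([t.sum() / len(t) for t in per_token_loss])

# Fixed-constant normalization: divide by the same horizon T (e.g., the
# fixed diffusion step count) for every response, so each token carries
# the same weight regardless of the response's length.
T = 8
loss_T_norm = np.mean([t.sum() / T for t in per_token_loss])
```

Under the per-|o| scheme above, the short response contributes 0.8/3 versus the long response's 1.7/6, so equal-magnitude per-token terms are up-weighted in shorter outputs; under the per-T scheme, both are divided by the same constant.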

Also, do you expect this normalization choice to affect dLLM's behavior (e.g., output length) in a similar way to what we observed in Section 3.2 of this paper?

Thanks again for the awesome work!
