Hi team, thanks for the great work—seeing that dLLM is RL-able is super exciting!
I noticed in your GRPO loss equation (screenshot below) that the loss is normalized by the response length |o|:
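For reference, the standard GRPO objective with this per-response normalization (following the DeepSeekMath formulation; the equation in your screenshot may differ in details) looks roughly like:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\,\hat{A}_{i,t}\Big)\right]$$

where $r_{i,t}(\theta)$ is the per-token importance ratio and $\hat{A}_{i,t}$ is the group-normalized advantage.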

As discussed in sail-sg/understand-r1-zero, dividing by |o| can introduce a length bias during optimization. Given that dLLM operates with a fixed number of diffusion steps T (more like fixed-horizon RL with EOS as the absorbing state), could you clarify whether your actual implementation normalizes by |o| or by T? A sketch of the two options follows below.
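To make the distinction concrete, here is a minimal PyTorch-style sketch of the two aggregation choices (the function name, arguments, and default `T` are hypothetical illustrations, not taken from your code):

```python
import torch

def aggregate_grpo_loss(per_token_loss: torch.Tensor,
                        response_mask: torch.Tensor,
                        normalize_by: str = "length",
                        T: int = 256) -> torch.Tensor:
    """Aggregate a (batch, seq_len) per-token GRPO surrogate loss.

    per_token_loss: clipped surrogate loss per token (already advantage-weighted).
    response_mask:  1.0 for generated (non-pad) tokens, 0.0 elsewhere.
    normalize_by:   "length" divides each response by its own |o_i|;
                    "horizon" divides by the fixed horizon T.
    """
    masked = per_token_loss * response_mask
    if normalize_by == "length":
        # 1/|o_i| per sample: tokens in longer responses get smaller
        # per-token weight -- the length bias discussed in understand-r1-zero.
        lengths = response_mask.sum(dim=-1).clamp(min=1.0)
        per_sample = masked.sum(dim=-1) / lengths
    else:
        # 1/T: every token carries the same weight regardless of |o_i|,
        # matching a fixed-horizon view where EOS absorbs the remaining steps.
        per_sample = masked.sum(dim=-1) / T
    return per_sample.mean()
```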
Also, do you expect this normalization choice to affect dLLM's behavior (e.g., output length) similarly to what we observed in Section 3.2 of that paper?
Thanks again for the awesome work!