Dear authors, I'm new to DPO and was drawn to the impressive results in this paper, so I tried training with the open-sourced code. However, I found that the L2 norm of the gradient is very large. Is that normal in diffusion DPO training? I look forward to your reply, and thank you for your help.
