Thank you for this excellent work on Cheers! For SFT, I prepared my own training data. However, after multiple rounds of data cleaning and comparative experiments, I found that with the 1:1 ratio of understanding to generation data mentioned in the paper and a learning rate of 2e-6, the loss and grad_norm fluctuated wildly and showed almost no sign of convergence. As the number of training steps increased, the understanding branch of Cheers tended to generate the response ".... ...". I then changed the ratio of understanding to generation data to 8:2, which alleviated the issue significantly, though "......." responses still occurred occasionally. I would like to ask the authors two questions: did you encounter this behavior during the SFT phase, and does Cheers' SFT require millions of samples to converge stably?
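To make the ratio setting concrete, here is a minimal sketch of how an 8:2 mix of understanding and generation samples could be drawn. It is only an illustration and not the Cheers training code: `mixed_stream`, `und_ratio`, and the toy sample lists are hypothetical names, and sampling is per example rather than per batch.

```python
import random
from itertools import islice


def mixed_stream(understanding, generation, und_ratio=0.8, seed=0):
    """Yield (task, sample) pairs forever.

    Each draw comes from the understanding set with probability
    `und_ratio` and from the generation set otherwise, so
    und_ratio=0.8 approximates an 8:2 mix and 0.5 a 1:1 mix.
    """
    rng = random.Random(seed)  # local RNG keeps the mix reproducible
    while True:
        if rng.random() < und_ratio:
            yield "understanding", rng.choice(understanding)
        else:
            yield "generation", rng.choice(generation)


if __name__ == "__main__":
    # Toy placeholders standing in for real SFT samples (assumption).
    und_data = ["und_sample_0", "und_sample_1"]
    gen_data = ["gen_sample_0", "gen_sample_1"]
    for task, sample in islice(mixed_stream(und_data, gen_data), 5):
        print(task, sample)
```

Because each example is drawn independently, the realized ratio only approaches 8:2 over many steps; individual batches can deviate from it.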