Regarding the SFT Stability of Cheers #3

Thank you for this excellent work on Cheers! For SFT, I assembled my own training data. After several rounds of data cleaning and comparative experiments, I found that training with the 1:1 ratio of understanding to generation data mentioned in the paper, at a learning rate of 2e-6, makes the loss and grad_norm fluctuate wildly, with almost no sign of convergence. As the number of training steps increases, the understanding branch of Cheers tends to produce responses like ".... ...".

I then changed the ratio of understanding to generation data to 8:2. This alleviated the issue significantly, although the "......." responses still occur occasionally.

I would like to ask the authors two questions: did you encounter this behavior during the SFT phase, and does Cheers' SFT data need to contain millions of samples to converge stably?
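To make the setup concrete, here is a minimal sketch of the kind of ratio-based mixing and degenerate-response check described above. It is not the Cheers data pipeline: `mix_by_ratio`, `is_degenerate`, and the pool names are hypothetical, and it assumes each sample is drawn independently with probability proportional to the ratio.

```python
import random
from collections import Counter


def mix_by_ratio(understanding, generation, ratio=(8, 2), num_samples=1000, seed=0):
    """Build a mixed sample stream from two pools.

    Each draw picks the understanding pool with probability
    ratio[0] / (ratio[0] + ratio[1]), otherwise the generation pool.
    (Hypothetical helper; not part of the Cheers codebase.)
    """
    rng = random.Random(seed)
    p_understanding = ratio[0] / (ratio[0] + ratio[1])
    mixed = []
    for _ in range(num_samples):
        if rng.random() < p_understanding:
            mixed.append(("understanding", rng.choice(understanding)))
        else:
            mixed.append(("generation", rng.choice(generation)))
    return mixed


def is_degenerate(response):
    """Return True if a response consists only of dots and whitespace."""
    stripped = "".join(response.split())
    return len(stripped) > 0 and set(stripped) <= {".", "\u2026"}


# Placeholder pools standing in for real SFT samples.
und_pool = [f"und_{i}" for i in range(100)]
gen_pool = [f"gen_{i}" for i in range(100)]

stream = mix_by_ratio(und_pool, gen_pool, ratio=(8, 2), num_samples=10000)
counts = Counter(task for task, _ in stream)
```

Counting how often `is_degenerate` fires on a fixed evaluation set at each checkpoint would give a rough measure of how the dot-only responses change with the mixing ratio.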


