Regarding the SFT Stability of Cheers #3

@tour-xray

Thank you for the excellent work on Cheers! For SFT, I prepared my own training data. After several rounds of data cleaning and comparative experiments, I found that when training with the 1:1 ratio of understanding to generation data mentioned in the paper and a learning rate of 2e-6, the loss and grad_norm fluctuate wildly and show almost no sign of convergence. As the number of training steps increases, the understanding branch of Cheers tends to generate the response ".... ...". I then changed the ratio of understanding to generation data to 8:2, which alleviated the issue significantly, although "......." responses still occur occasionally. Did the authors encounter this behavior during the SFT phase, and does Cheers' SFT data need to be on the order of millions of samples to converge stably?
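
For anyone trying to reproduce this, here is a minimal sketch of the two pieces described above: drawing samples at an 8:2 understanding-to-generation ratio, and flagging the dots-only responses. It is plain Python and does not use the Cheers codebase; `mix_by_ratio`, `is_degenerate`, and the toy sample lists are hypothetical names made up for illustration, and the real data loader may mix data differently.

```python
import random
from itertools import islice


def mix_by_ratio(understanding, generation, weights=(8, 2), seed=0):
    """Yield (source, sample) pairs forever, picking the source by `weights`.

    Hypothetical helper: the weights (8, 2) mirror the 8:2 ratio in this
    issue; (1, 1) would correspond to the 1:1 ratio from the paper.
    """
    pools = {"understanding": list(understanding), "generation": list(generation)}
    if not all(pools.values()):
        raise ValueError("both datasets must be non-empty")
    rng = random.Random(seed)
    names = list(pools)
    while True:
        # Pick which dataset to draw from, then a random sample from it.
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, rng.choice(pools[name])


def is_degenerate(text, min_len=3):
    """Return True if `text` consists only of dots (ignoring whitespace)."""
    stripped = "".join(text.split())
    return len(stripped) >= min_len and set(stripped) <= set(".…")


if __name__ == "__main__":
    und = [f"und-{i}" for i in range(100)]  # toy stand-ins for real samples
    gen = [f"gen-{i}" for i in range(100)]
    batch = list(islice(mix_by_ratio(und, gen), 10))
    print([source for source, _ in batch])
    print(is_degenerate(".... ..."), is_degenerate("A cat on a sofa."))
```

Counting `is_degenerate` hits on a fixed set of understanding prompts at each checkpoint would give a rough way to see at which step the collapse begins under each ratio.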
