Confirmation over settings on clevr_math_sft and its RL version #48

@serser

Hi Tan, I am trying to reproduce your results with Qwen2-VL-2B. Your paper (https://arxiv.org/pdf/2503.20752) says:

Training Paradigms and Baselines. To assess the performance and generalization of different
training strategies, we compare: (1) SFT-based methods—ANS-SFT, which fine-tunes on answer
generation, and CoT-SFT, which uses supervised learning with CoT reasoning; and (2) RL-based
methods—Reason-RFT-Zero, which applies RL without reasoning activation stage, and ReasonRFT,
which uses limited CoT data for reasoning activation before RL training.

I've converted the full clevr_math_sft data (1.5K) into the specified CoT format and trained the SFT model (stage 1). For RL (stage 2), which data should I use? Should I convert the CoT SFT data into an RL version for RL training, or should I split the SFT data into two parts, one for SFT and one for RL (and if so, in what proportion)? And for the RL data format, should I follow Reason-RFT-Zero?
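
To make the two options concrete, here is a minimal Python sketch of what I have in mind. The field names (`image`, `problem`, `cot`, `answer`) and both helper functions are placeholders of mine, not the repo's actual schema or scripts, and `sft_fraction` is exactly the number I am asking about.

```python
import random


def split_sft_rl(records, sft_fraction, seed=0):
    """Shuffle records and split them into disjoint SFT and RL subsets.

    sft_fraction is a placeholder; the right proportion is the open question.
    """
    if not 0.0 <= sft_fraction <= 1.0:
        raise ValueError("sft_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for a given seed
    cut = int(len(shuffled) * sft_fraction)
    return shuffled[:cut], shuffled[cut:]


def cot_to_rl(record):
    """Drop the CoT text and keep only the question and final answer.

    Assumes (hypothetically) that the RL stage needs just the image,
    the problem text, and the ground-truth answer.
    """
    return {
        "image": record["image"],
        "problem": record["problem"],
        "answer": record["answer"],
    }
```

Option A would apply `cot_to_rl` to every CoT SFT sample and reuse the full set for RL; option B would call `split_sft_rl` first and convert only the RL half.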
