Hi Tan, I am trying to reproduce your results using Qwen2-VL-2B. Your paper (https://arxiv.org/pdf/2503.20752) says:
Training Paradigms and Baselines. To assess the performance and generalization of different
training strategies, we compare: (1) SFT-based methods—ANS-SFT, which fine-tunes on answer
generation, and CoT-SFT, which uses supervised learning with CoT reasoning; and (2) RL-based
methods—Reason-RFT-Zero, which applies RL without reasoning activation stage, and Reason-RFT,
which uses limited CoT data for reasoning activation before RL training.
I've converted the full clever-math-sft data (1.5K) into the specified CoT format and trained the SFT model (stage 1). Which data should I use for RL (stage 2)? Should I convert the CoT SFT data into an RL version for RL training, or split the SFT data into two parts for SFT and RL respectively (and if so, in what proportion)? As for the data format, should I follow Reason-RFT-Zero?