Issue with SFT Data and Abnormal Tool Call Behavior During RL Training #52

Thank you so much for open-sourcing rStar2! While trying to reproduce the results, I ran into a few questions:

1. Where can I find the non-reasoning SFT data?
2. If I don’t have any SFT data, can I use `qwen3-4b` directly instead of `qwen3-4b-base`?
3. I tried running RL directly with `qwen3-4b`, but after about 50 training steps the number of tool calls suddenly increases: the model keeps repeating the same tool call, as in the following example:

----
 .... Therefore, the smallest positive integer n is when k=1: n = 41*1 - 18 = 23. Wait, is that correct? Let me check.

Let me test n=23. Then 9*23 -2 = 207 -2 = 205. 7*23 +3 = 161 +3 = 164. Now, gcd(205, 164). Let's compute that. 205 divided by 164 is 1 with remainder 41. Then 164 divided by 41 is 4 with remainder 0. So gcd is 41. Which is greater than 1. So yes, n=23 works. But is there a smaller n?

Wait, maybe I made a mistake in the steps. Let's check for smaller n. Let's see, since n+18 must be a multiple of 41, the smallest n is 41 -18 = 23. But perhaps there's a smaller n where n+18 is a multiple of 41? Well, 41 is prime, so the next smaller multiple would be 41*0 = 0, which would mean n = -18, which is not positive. So 23 is indeed the smallest positive integer. Therefore, the answer is 23.

But wait, let me check n=23. Yes, as shown, the gcd is 41. So the answer is 23. So the smallest positive integer is 23.
</reason>
<tool_call>
{"name": "python_code_with_standard_io", "arguments": {"code": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n\nfor n in range(1, 100):\n    a = 9 * n - 2\n    b = 7 * n + 3\n    if gcd(a, b) > 1:\n        print(n)\n        break", "input": ""}}
</tool_call>
<tool_call>
{"name": "python_code_with_standard_io", "arguments": {"code": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n\nfor n in range(1, 100):\n    a = 9 * n - 2\n    b = 7 * n + 3\n    if gcd(a, b) > 1:\n        print(n)\n        break", "input": ""}}
</tool_call>
</think>

<tool_call>
{"name": "python_code_with_standard_io", "arguments": {"code": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n\nfor n in range(1, 100):\n    a = 9 * n - 2\n    b = 7 * n + 3\n    if gcd(a, b) > 1:\n        print(n)\n        break", "input": ""}}
</tool_call>user
<tool_response>
Tool call success
stdout: 23

execution time: 0.01s
</tool_response>
assistant
<reason> The Python code executed successfully and returned the result 23. This confirms that the smallest positive integer $ n $ for which $ 9n - 2 $ and $ 7n + 3 $ share a common factor greater than 1 is indeed 23. </reason> <answer> \boxed{23} </answer>

---
Could this behavior be caused by using `qwen3-4b` directly without SFT?
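
To make the repetition measurable across training steps, a minimal sketch like the following could count the `<tool_call>` blocks in a decoded rollout and report how many are exact repeats. The function name `tool_call_stats` and the assumption that each rollout is available as a plain decoded string are illustrative only; they are not part of the rStar2 codebase.

```python
import re
from collections import Counter

# Matches one <tool_call>...</tool_call> block; DOTALL lets the JSON payload span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def tool_call_stats(rollout: str) -> dict:
    """Count tool-call blocks in a decoded rollout and how many are exact repeats.

    Hypothetical helper: assumes the rollout is a plain string that uses the
    same <tool_call> tags as the example above.
    """
    calls = TOOL_CALL_RE.findall(rollout)
    counts = Counter(calls)
    # Every occurrence beyond the first of an identical payload is a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return {"total": len(calls), "unique": len(counts), "duplicates": duplicates}


if __name__ == "__main__":
    payload = '{"name": "python_code_with_standard_io", "arguments": {"code": "print(1)", "input": ""}}'
    block = f"<tool_call>\n{payload}\n</tool_call>"
    # Three identical calls, mirroring the pattern in the example rollout.
    sample = f"</reason>\n{block}\n{block}\n</think>\n\n{block}"
    print(tool_call_stats(sample))
```

Logging the `duplicates` value per rollout would show whether the jump around step 50 comes from repeated identical calls or from genuinely different ones.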



