Post-DeepSeek-R1

Resources and research after DeepSeek-R1, around test-time computing, resurgence of RL, and new LLM learning/application paradigms.

This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

-- From DeepSeek-R1

-- From Mark Chen, OpenAI Chief Research Officer


DeepSeek-R1 Reproduction ("popular" and fast ones)

SimpleRL-reason (HKUST)

  • Rule-based reward (no MCTS and reward models); see the toy sketch below
  • Uses PPO rather than GRPO
  • Trains small models (7B) on limited data (8K examples)
  • Starting from Qwen2.5-Math-7B (base model), performs RL on it directly, achieving surprisingly strong results

Training dynamics of Qwen2.5-SimpleRL-Zero (from the SimpleRL project), starting from Qwen2.5-Math-7B without SFT or reward models.
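A rule-based reward of this kind typically combines a format check with an exact-match answer check, with no learned reward model. Below is a toy sketch; the tag conventions and reward values are assumptions for illustration, not the exact scheme used by SimpleRL or DeepSeek-R1.

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: format check + exact-match answer check."""
    reward = 0.0
    # Format reward: completion should contain a <think>...</think> block.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1
    # Accuracy reward: the boxed answer must exactly match the reference.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward
```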

DeepScaleR (Berkeley)

  • Aimed to democratize reinforcement learning (RL) for LLMs and reproduce DeepSeek R1 and OpenAI O1/O3 at scale
  • Iteratively scales DeepSeek's GRPO algorithm from 8K→16K→24K context length for thinking
  • Trained on top of DeepSeek-R1-Distill-Qwen-1.5B (Joe: so the initial model is already capable of deep thinking; it would be better if we could start from base models)
  • Heavily based on a modified fork of veRL, an open-source RLHF library
  • Good insight and training recipe: error cases initially have longer CoTs, so the thinking context length is gradually extended during training (Joe: a sort of curriculum learning for RL); see the schedule sketch below

Figure 1: DeepScaleR 1.5B model's Pass@1 accuracy on AIME2024 as RL training progresses. At steps 1040 and 1520, the context length is extended to 16K and 24K. For more details, see the DeepScaleR blog post.
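A minimal sketch of the context-length curriculum described above, mirroring the 8K→16K→24K schedule in the figure caption; the helper below is hypothetical, since in practice the lengths would be set in the veRL training config rather than computed on the fly.

```python
def max_response_length(global_step: int) -> int:
    """Illustrative context-length curriculum for thinking tokens (8K -> 16K -> 24K)."""
    if global_step < 1040:
        return 8 * 1024
    elif global_step < 1520:
        return 16 * 1024
    return 24 * 1024
```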

Open R1 (Hugging Face)

  • Fully open reproduction of DeepSeek-R1
  • Blog post
  • A reproduction of DeepSeek-R1-Zero in countdown and multiplication tasks
  • Through RL, the 3B base LM develops self-verification and search abilities all on its own
  • Fails to learn reasoning with Qwen2.5-0.5B base
  • Works with Qwen/Qwen2.5-3B-Instruct model
  • Experiment run based on veRL
  • A minimal single notebook that tries to reproduce the DeepSeek-R1 "reasoning" results on a single task (the Countdown Game)
  • Uses GRPO and QLoRA, also with the TRL library (see the sketch after this list)
  • Starting with the Qwen/Qwen2.5-3B-Instruct model (models larger than 1.5B suggested) (Joe: yes, we need the starting model to have certain capabilities)
  • Good learning material with code
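For a flavor of what such a notebook does, here is a minimal GRPO fine-tuning sketch with TRL and LoRA, assuming a recent TRL release that ships GRPOTrainer (argument names can differ across versions); the reward function and dataset are toy placeholders, not the notebook's actual code.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def correctness_reward(completions, target, **kwargs):
    # Toy verifiable reward: 1.0 if the reference answer appears in the completion.
    # Real Countdown setups parse the generated equation and check that it evaluates correctly.
    return [1.0 if t in c else 0.0 for c, t in zip(completions, target)]

train_dataset = Dataset.from_list([
    {"prompt": "Using the numbers 3, 5, 7, create an equation that equals 38. Think step by step.",
     "target": "38"},
])  # placeholder; Countdown prompts are normally generated programmatically

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-countdown",
                    num_generations=8,              # samples per prompt (the GRPO "group")
                    max_completion_length=512),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```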

There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study

  • Aha moment (such as self-reflection patterns) may already exist in the base model.
  • Superficial Self-Reflection (SSR) already appears in base models' responses, where self-reflections do not necessarily lead to correct final answers.
  • A closer look at R1-Zero-like RL training finds that the increasing response length is not due to the emergence of self-reflection, but a consequence of RL optimizing well-designed rule-based reward functions.
  • An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
  • Uses PPO (instead of GRPO; some discussions)
  • One GPU with GRPO (worth trying under resource constraints)
  • Experience the "aha moment" for free on Colab (seems easy to play with)

Online Materials, Discussions

Other RL Trained Models

(2025 Mar) QwQ-32B: Embracing the Power of Reinforcement Learning

R1-like RL Reproduction for More Scenarios

Tools

  • RL libraries:

    • veRL (seems most popular as of Mar 2025). Check this list of R1 followup works
    • TRL
    • Inference: vLLM seems a must to speed up inference
  • Starting models: Qwen2.5 (base, instruct, R1-distilled, math) seems the most popular choice (as of Mar 2025) (why? some empirical answers); both 3B and 7B models have been made to work; 0.5B is a bit weaker but can also learn

  • RL algorithms: GRPO, PPO (some dispute on whether GRPO is a must, here and here)

  • GPU resources: see the other reproductions, and discussions, e.g. here

    • One GPU with GRPO on Colab
  • AgentsMeetRL

    • A summary of open-source repos for training LLM Agents with RL

LLM + RL with/for X

RL + LLM applied to agents

  • Using PPO instead of GRPO

RL + LLM applied to tool calling

  • Explored different reward design for tool selection

RL + LLM applied to synthetic logic puzzles with controllable complexity and straightforward answer verification

RL + LLM applied to coding

  • Train with GRPO using verifiable rewards from sandbox execution

RL + LLM applied to coding

RL + LLM applied to multimodality (such as VLMs)

RL + LLM applied to multimodality

  • For the specific task of emotion recognition, with visual and audio signals (videos)
  • Learning with a 0.5B model

RL + LLM applied to multimodality

  • Audio LLM, fine-tuned with GRPO

RL + LLM applied to multimodality

  • Spatial reasoning with 3D augmented input parsed from images

RL + LLM applied to retrieval (interleaved with generation/reasoning)

  • Tested on NQ dataset, retrieving from Wikipedia

RL + LLM applied to retrieval (RAG)

  • Trained with HotpotQA data

RL + LLM applied to retrieval

  • Tested on literature mining, publication search and trial search tasks

Literature

Here is a collection of papers on different topics and flavors. It is not (and cannot be) exhaustive, but papers are grouped by theme to give a sense of the different types of research and problems in the space.

Joe: I mark the year and month for papers, due to the extremely fast pace of this exploding research domain

Test-time Scaling

(2024 Aug) Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

-> Test-time scaling for math

  • Includes search strategies such as Best-of-N, beam search, and beam search with lookahead
  • Involves process reward model (PRM) and revision models

(2024 Nov) Deliberative Alignment: Reasoning Enables Safer Language Models

-> Test-time scaling for safety

(2025 Jan) s1: Simple test-time scaling

-> Test-time scaling for reasoning

  • Collected 1K datapoints from diverse datasets along with their reasoning traces (from the Google Gemini Flash Thinking API), followed by a pipeline of quality control and filtering
  • Fine-tuned Qwen2.5-32B-Instruct on the 1K datapoints; training takes just 26 minutes on 16 NVIDIA H100 GPUs
  • Controls test-time compute in the sequential generation setting (as opposed to parallel approaches like search or best-of-N). Reasoning length is controlled by inserting the tokens "Final Answer:" and "Wait" (see the sketch below)
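The sequential control can be sketched as a simple generation loop that suppresses stopping and appends "Wait" to buy more thinking, then forces the answer; the delimiters and model below are illustrative stand-ins, not s1's exact setup.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # stand-in model

def solve_with_budget(question: str, num_waits: int = 2) -> str:
    text = question + "\n"
    for _ in range(num_waits):
        out = llm.generate([text], SamplingParams(max_tokens=2048,
                                                  stop=["Final Answer:"]))
        text += out[0].outputs[0].text
        text += "\nWait"        # force the model to re-examine its reasoning
    text += "\nFinal Answer:"   # cut thinking short and request the answer
    out = llm.generate([text], SamplingParams(max_tokens=64))
    return text + out[0].outputs[0].text
```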

(2025 Feb) S∗: Test Time Scaling for Code Generation

-> Test-time scaling for coding

(2025 Feb) Teaching Language Models to Critique via Reinforcement Learning

-> Test-time scaling for coding

Note

Joe: If we think about the test-time computing promoted by OpenAI o1, DeepMind's AlphaCode in 2022 already used test-time scaling, doing a lot of sampling and selection to boost performance on competitive coding.

(2025 Feb) Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

-> Test-time scaling with multiple agents (LLMs) for verification

(2025 Mar) Chain-of-Retrieval Augmented Generation

-> Test-time scaling for RAG

  • Designs ways to scale up inference computation for RAG, such as decomposing the question into modular sub-questions and retrieving iteratively (see the loop sketch below)
  • Joe: this is a recurring theme of current research on test-time scaling for X: design ways to increase inference computation, whether long CoT, search, verification, etc.
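A schematic retrieve-then-reason loop in this spirit: more hops mean more inference compute. The ask_llm and retrieve helpers are hypothetical placeholders, not the paper's implementation.

```python
def ask_llm(prompt: str) -> str: ...                     # hypothetical LLM call
def retrieve(query: str, k: int = 5) -> list[str]: ...   # hypothetical retriever

def chain_of_retrieval(question: str, max_hops: int = 4) -> str:
    notes: list[str] = []
    for _ in range(max_hops):
        # Ask the model for the next sub-question given what has been retrieved so far.
        sub_q = ask_llm(f"Question: {question}\nNotes: {notes}\n"
                        "What should we look up next? Reply DONE if ready to answer.")
        if sub_q.strip() == "DONE":
            break
        notes.extend(retrieve(sub_q))        # each extra hop spends more test-time compute
    return ask_llm(f"Question: {question}\nNotes: {notes}\nAnswer:")
```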

(2025 Mar) Remasking Discrete Diffusion Models with Inference-Time Scaling

-> Test-time scaling for discrete diffusion models for texts

(2025, Apr) T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

-> Test-time scaling for small language models (SLMs) with tool integration and external verification

Scaling Laws (all kinds of)

Scaling Laws

(2024 Feb) Scaling Laws for Downstream Task Performance in Machine Translation

-> Scaling behavior in a transfer learning setting

(2025 Feb) Distillation Scaling Laws

-> Scaling behavior for knowledge distillation

(2025 Feb) Distributional Scaling Laws for Emergent Capabilities

-> Emerging capabilities across multiple training runs with different random seeds

  • Training experiments with Qwen2.5-0.5B and Qwen2.5-1.5B

Process Reward (after o1)

(2025 Feb) Process Reinforcement through Implicit Rewards

Multimodal, Image Generation

(2025 Jan) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

(2025 Mar) ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

(2025 Apr) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

  • Spatial reasoning from vision inputs, augmented with parsed 3D structures

(2025 May) Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

RL for Different Ways of Generation

(2025 Feb) Self-rewarding correction for mathematical reasoning

-> Self-correction trained with RL during generation

(2025 Mar) Reinforcement Learning for Long-Horizon Interactive LLM Agents

-> RL (LOOP, a data- and memory-efficient variant of proximal policy optimization) for long-horizon interactive agents (AppWorld)

(2025, Sep) RLP: Reinforcement as a Pretraining Objective

-> Train with RL to let the model think before generating every token

  • This is related to the earlier work below, which presents the same idea but without explicitly using RL

(2024, Mar) Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

  • Train the model to generate a rationale/thinking trace before each token generation

Improve Long CoT for Reasoning

(2025 Mar) START: Self-taught Reasoner with Tools

-> Integrate tool usage with reasoning, with controlled hint insertion and rejection sampling for training

  • Tool usage (writing Python code) inside reasoning
  • Enhance tool usage by injecting hint sequences in CoT during training, such as "Wait", "Maybe I can use Python" at various places based on heuristics
  • Interleave Python code + executor with reasoning
  • Rejection sampling fine-tuning (RFT)
  • Joe: this uses rejection sampling (which you could call RL, per the Llama 2 paper), and the paper is not well polished (e.g. small things like in-text citation formats)
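Rejection sampling fine-tuning itself is simple to sketch: sample several traces per problem, keep only those whose final answer verifies, then run ordinary SFT on the kept traces. The helpers below are hypothetical placeholders, not START's pipeline.

```python
def sample_traces(problem: str, n: int = 8) -> list[str]: ...   # hypothetical sampler
def final_answer(trace: str) -> str: ...                        # hypothetical answer extractor

def build_rft_dataset(problems: list[dict]) -> list[dict]:
    kept = []
    for ex in problems:                                   # ex = {"question": ..., "answer": ...}
        for trace in sample_traces(ex["question"]):
            if final_answer(trace) == ex["answer"]:       # keep only verified traces
                kept.append({"prompt": ex["question"], "completion": trace})
    return kept   # fine-tune with standard SFT on this filtered set
```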

(2025 Feb) LIMO: Less is More for Reasoning

  • 817 curated training samples
  • Fine-tune Qwen2.5-32B-Instruct with SFT

(2025 May) SEAL: Steerable Reasoning Calibration of Large Language Models for Free

  • Categorize the reasoning steps into three behaviors: Execution thoughts, Reflecting thoughts, and Transition thoughts
  • Analysis shows that incorrect reasoning often results in much longer generations, with more reflecting and transition steps
  • Extract hidden states corresponding to the different behavioral step types, and construct steering vectors to control the type of reasoning steps (a generic sketch follows this list)
  • Achieves more effective and efficient reasoning with inference-time steering
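A generic steering-vector sketch in this spirit (not SEAL's exact recipe): take the mean hidden state over one step type minus the mean over another at a chosen layer, then add a scaled copy of that direction to the layer output at inference.

```python
import torch

def steering_vector(exec_hiddens: torch.Tensor, reflect_hiddens: torch.Tensor) -> torch.Tensor:
    # Inputs: (num_tokens, hidden_dim) hidden states collected offline for each step type.
    return exec_hiddens.mean(dim=0) - reflect_hiddens.mean(dim=0)

def add_steering_hook(layer: torch.nn.Module, vec: torch.Tensor, alpha: float = 4.0):
    """Register a forward hook that shifts the layer output along the steering direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)
```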

Understanding R1 and RL + LLMs, Tricks to Train RL

(2025 Jan) Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

-> Tricks to scale up RL training to make it work

  • Encourage sample diversity through oversampling
  • Auxiliary loss on entropy
  • Penalize undesired behaviors

(2025 Feb) Demystifying Long Chain-of-Thought Reasoning in LLMs

-> Analyzing the learning dynamics of emergent reasoning with LLM + RL, across factors such as SFT initialization, length reward design, etc.

(2025, Mar) SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

-> Reproduction of RL training on a diverse set of base models besides the Qwen models used in DeepSeek-R1. Also works for small language models (SLMs).

(2025 Mar) Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

-> Analyzing the behaviors of emergent reasoning from LLM + RL, across base models and training data

  • Why does Qwen work better than Llama? Qwen already exhibits certain reasoning behaviors before training
  • Priming Llama to begin RL training with data exhibiting complex reasoning behaviors helps, even when the final answer is not correct
  • Joe: somehow I don't really get the name of cognitive behaviors (and the whole title); maybe I'm naive

(2025 Mar) Understanding R1-Zero-Like Training: A Critical Perspective

-> Analyzing base models and RL

(2025, Mar) The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

-> Analyzing the role of prefixes of reasoning trajectories; could also work for self-improvements

  • Found low diversity in the first few generated tokens (which makes sense, as the sequence is still short and the number of possible trajectories grows exponentially with length)
  • Only sample a short prefix and fine-tune the model on it, without using labels.

(2025 Jun) Thought Anchors: Which LLM Reasoning Steps Matter? -> Analysis of reasoning sentences

  • Break down the reasoning chain into single sentences, and check their causal relations and importance to other sentences and to the final answer
  • Summarized a sentence taxonomy for reasoning sentences (Table 1 in Appendix A)
  • Visualizations are available on a good demo page: https://www.thought-anchors.com/

(2025 Apr) Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning

  • Mutual information is computed between hidden states (continuous vectors) at token step t and ground truth answer
  • Mutual information (MI) is not computed by estimating a distribution in the Shannon entropy format, but estimated by the Hilbert–Schmidt Independence Criterion (HSIC) with Gaussian kernels
  • MI is computed by first sampling hidden state vectors, then computing HSIC between the two matrices (collections of the continuous vectors)
  • Token step t vectors are then mapped to tokens (e.g. by projection to the vocabulary) for concrete analysis
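For reference, a minimal HSIC estimator with Gaussian kernels, the kind of estimator used here in place of a Shannon-style MI estimate (bandwidth and normalization choices below are generic, not necessarily the paper's).

```python
import numpy as np

def gaussian_kernel(X: np.ndarray, sigma: float) -> np.ndarray:
    sq = np.sum(X ** 2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2 * X @ X.T                  # pairwise squared distances
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    # X: (n, d_x) hidden states at a token step; Y: (n, d_y) answer representations.
    n = X.shape[0]
    K, L = gaussian_kernel(X, sigma), gaussian_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)
```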

(2025 Feb) Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology

-> Not necessarily long CoT, but built a topological graph to explain reasoning patterns.

  • The structured representation of reasoning could be applied elsewhere, e.g. to very long reasoning processes.

(2025 Sept) Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic -> Steering/task vectors for reasoning

  • Two models with identical initialization, one trained with SFT and one with GRPO
  • Extract a task vector as the difference between the two sets of parameters, and use it to control reasoning behaviors
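Task-vector arithmetic of this kind is a few lines of PyTorch: subtract the SFT parameters from the GRPO parameters and add a scaled copy of the difference to a target model of the same architecture. Model names below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM

sft = AutoModelForCausalLM.from_pretrained("org/model-sft", torch_dtype=torch.bfloat16)
grpo = AutoModelForCausalLM.from_pretrained("org/model-grpo", torch_dtype=torch.bfloat16)
target = AutoModelForCausalLM.from_pretrained("org/model-sft", torch_dtype=torch.bfloat16)

alpha = 1.0  # scaling coefficient for the reasoning vector
with torch.no_grad():
    for p_t, p_sft, p_grpo in zip(target.parameters(), sft.parameters(), grpo.parameters()):
        p_t.add_(alpha * (p_grpo - p_sft))   # add the "reasoning direction" to the target model
```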

(2025 Oct) First Try Matters: Revisiting the Role of Reflection in Reasoning Models -> challenges the assumption that reflection in model reasoning actually performs "reflection"

  • Focused on reflective behaviours of model reasoning
  • Found that most reflective behaviors do not actually alter the model's reasoning, but merely confirm it
  • Fine-tuning on more reflective behaviors mostly enhances first-answer correctness

(2025 Sept) RL's Razor: Why Online Reinforcement Learning Forgets Less

  • RL training incurs less forgetting than SFT

(2025 Apr) Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

  • Challenges whether RL incentivizes genuinely new capabilities in the model, vs. merely capitalizing on capabilities already present in the base model
  • Many works ask similar questions and reach similar conclusions

(2025, Apr) Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

  • Pretraining of smaller LMs + RL to understand RL effects

(2025, Jun) Spurious Rewards: Rethinking Training Signals in RLVR

  • Tested spurious rewards such as ground truth, majority voting, format, random, etc., which can still improve the model's reasoning. Interesting study for understanding how RL helps models learn.

(2025 Nov) Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs

  • Similar research topic as above

Data

Thought Anchors: https://www.thought-anchors.com/

Open Thoughts: https://github.com/open-thoughts/open-thoughts

(2025, May) [NeurIPS] REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

(2025, Dec) Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Training Recipe

(2025 Nov) JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

  • A simple RL training recipe for scaling up 1.5B LLM training

RL Algorithms

(2025, Jun) TreeRPO: Tree Relative Policy Optimization

  • Sampling generates a tree-structured trajectory, and rewards are collected for every node
  • Improves sampling efficiency, and hence training efficiency

(2025 June) Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning

  • Measures how much intermediate reasoning steps lead to the same final answer, as a "consistency" metric summarizing the reasoning trajectory
  • Also measures how often the final answer changes suddenly at later reasoning steps, as a "volatility" metric
  • Observes clear separation of these two metrics between trajectories leading to correct vs. incorrect final answers
  • Includes these trajectory statistics in the reward, plus a "curiosity" reward that encourages diversity; also borrows the grouping idea from GRPO -> no external reward is needed, since during training the reward depends only on the sampled trajectories and their final answers (a rough sketch follows)
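A rough sketch of the two trajectory statistics, assuming the running answer after each reasoning step has already been extracted; the exact step splitting and weighting in the paper may differ.

```python
def consistency_and_volatility(step_answers: list[str]) -> tuple[float, float]:
    """step_answers: the model's current answer after each reasoning step."""
    final = step_answers[-1]
    # Consistency: fraction of intermediate steps whose answer already matches the final one.
    consistency = sum(a == final for a in step_answers[:-1]) / max(len(step_answers) - 1, 1)
    # Volatility: fraction of answer flips in the second half of the trajectory.
    late = step_answers[len(step_answers) // 2:]
    volatility = sum(a != b for a, b in zip(late, late[1:])) / max(len(late) - 1, 1)
    return consistency, volatility

print(consistency_and_volatility(["12", "15", "15", "15", "15"]))  # (0.75, 0.0): consistent, stable
```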

(2025, July) STeCa: Step-level Trajectory Calibration for LLM Agent Learning

Efficiency

(2024 Dec) Compressed Chain of Thought: Efficient Reasoning through Dense Representations

  • Reasoning with continuous tokens

(2024, Dec) Token-Budget-Aware LLM Reasoning

  • Estimate a token budget, then apply SFT and DPO on data with the optimal token budget set in the instruction

(2025, Jan) Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization

  • Budget aware thinking

(2025 Feb) TokenSkip: Controllable Chain-of-Thought Compression in LLMs

  • Filter out some "unimportant" CoT tokens based on heuristics, generate compressed CoT trajectories, and then fine-tune on the reduced trajectories
  • Joe: similar in flavor to context compression / token deletion, like LLMLingua

(2025 Mar) Chain of Draft: Thinking Faster by Writing Less

-> Joe: this does not use RL; it is just a simple way of prompting that limits reasoning step lengths with instructions in the prompt. I think we could similarly train an LLM with RL to enforce this, and/or use it as a reward, to improve efficiency during the reasoning process
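The prompting idea can be reproduced with a single instruction that caps the length of each reasoning step; the wording below is a paraphrase for illustration, not the paper's exact prompt.

```python
# Chain-of-Draft-style system prompt (paraphrased, not the paper's exact wording)
COD_SYSTEM_PROMPT = (
    "Think step by step, but keep only a minimal draft for each thinking step, "
    "at most five words per step. Then give the final answer after '####'."
)
```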

-> Joe: (a few days later) found out the following paper does that exactly lol

(2025 Mar) L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

-> Joe: LLM + RL to encourage shorter reasoning. The approach conditions on special length-control symbols in the prompt, which in turn define an additional reward term (a toy sketch follows the list below)

  • Training starts from the base model DeepScaleR-1.5B-Preview (using the same hyperparameters for GRPO)
  • Training data also from DeepScaleR-Preview-Dataset, 40K question-answer pairs drawn from AIME, AMC, Omni-Math and STILL
  • Training context length restricted to 4K, and testing restricted to 8K
  • Fine-tuned for 700 steps and further 120 steps for two different length reward formulations
  • Again using VeRL framework
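A toy version of such a length-controlled reward (illustrative, not the paper's exact formulation): correctness minus a penalty on the deviation from the token budget requested in the prompt.

```python
def length_controlled_reward(correct: bool, num_tokens: int,
                             target_tokens: int, alpha: float = 0.0003) -> float:
    # Reward = correctness - alpha * |actual length - requested length|
    return float(correct) - alpha * abs(num_tokens - target_tokens)

# A correct answer that overshoots a 1000-token budget by 500 tokens
print(length_controlled_reward(True, 1500, 1000))  # 0.85
```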

(2025 Feb) Demystifying Long Chain-of-Thought Reasoning in LLMs

-> Joe: see Section 4.2 for the length control with reward design. Strategy is similar to the paper above.

(2025, Feb) [EMNLP 2025] LightThinker: Thinking Step-by-Step Compression

-> Joe: Compress thinking steps into a smaller set of special tokens. Train with a special attention mask; at inference, the KV cache is reduced based on the mask structure.

  • Not using RL. Merging rules are based on heuristics.

(2025 Mar) The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

(2025, Apr) ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning

-> Reward shaping to reduce the number of reasoning tokens within each GRPO batch

(2025 Apr) Z1: Efficient Test-time Scaling with Code

-> Reducing reasoning token length through SFT on QwQ-32B-preview model generated data

  • Dataset of 107K examples; SFT of Qwen-2.5-Coder-7B-Instruct with bfloat16, FSDP, and a global batch size of 128 for 2 epochs on 8 NVIDIA A100-80G GPUs
  • Simple reasoning dataset analysis of trigram frequency in Section 2.1 and Appendix A.2
  • The biggest difference is removing <think>...</think> delimiters?
  • Joe: Not quite sure about the "Shifted Thinking Window" name

(2025 Apr) Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

-> Probe whether the intermediate reasoning step hidden states can predict the correctness of the final answer

  • The probe can be used for early exit from long reasoning (see the sketch below)
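A minimal sketch of such a probe: a logistic-regression classifier on intermediate-step hidden states predicting whether the final answer will be correct, which can then gate early exit. Feature extraction details here are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: (num_traces, hidden_dim) hidden states taken at an intermediate reasoning step;
# y: 1 if that trace's final answer turned out to be correct. Stand-in data below.
X_train = np.random.randn(256, 1024)
y_train = np.random.randint(0, 2, size=256)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def should_exit_early(hidden_state: np.ndarray, threshold: float = 0.9) -> bool:
    """Stop reasoning once the probe is confident the answer is already right."""
    p_correct = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return p_correct >= threshold
```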

(2025 Apr) ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning

-> Added reasoning length limit as a reward for RL

(2025 Apr) Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

-> Survey

(2025 Apr) Learning Adaptive Parallel Reasoning with Language Models

-> Changing the generation process to combine parallel and sequential search during generation

  • Similar to one of my earlier ideas of optimizing the generation process, trainable directly with RL for efficiency
  • But focuses on the Countdown task only, and trains a model from scratch for small-scale experiments

(2025 May) Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

-> Reducing reasoning trajectory with different length reward shapes

(2025 May) SEAL: Steerable Reasoning Calibration of Large Language Models for Free

  • Categorize the reasoning steps into three behaviors: Execution thoughts, Reflecting thoughts, and Transition thoughts
  • Analysis shows that incorrect reasoning often results in much longer generations, with more reflecting and transition steps
  • Extract hidden states corresponding to the different behavioral step types, and construct steering vectors to control the type of reasoning steps
  • Achieves more effective and efficient reasoning with inference-time steering, by roughly controlling the number of reflection steps, etc.

(2025 May) AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

(2025, Apr) Acting Less is Reasoning More! Teaching Model to Act Efficiently

-> Reward shaping to reduce the number of tool calls and their cost

(2025 June) Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning

-> Again adds a length-related penalty to the RL reward, but adjusted to the difficulty of each question, measured by the pass rate over K samples

  • The length reward formulation is a bit less straightforward
  • Doesn't show superior performance compared to previous baselines with a simple length reward, such as L1-Max

(2025 June) Token-Efficient RL for LLM Reasoning

-> Reduce resource usage when training with GRPO and LoRA

  • Restrict the tokens that contribute to the loss
  • Estimate token-level advantages, and use replay for resampling

(2025 July) RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

  • RL for long-horizon reasoning with agents

(2025, Jun) AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control

  • Similar length penalty added to the reward (several of these papers differ only in the details)

(2025 Aug) Efficient Inference for Large Reasoning Models: A Survey

-> Survey

(2025, Aug) Deep Think with Confidence

(2025, Oct, Webpage) Efficient LLM Reasoning Papers

-> List of recent relevant papers from Arxiv

(2025, Oct) The Markovian Thinker: Architecture-Agnostic Linear Scaling of Reasoning

-> structured compression of thinking chunks

(2025, Dec) Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
