
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently.
However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks.
Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification.
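
For readers unfamiliar with the idea, here is a minimal, generic sketch of what "exploring multiple reasoning paths concurrently" can look like at inference time: sample several independent rollouts and aggregate their answers by majority vote (self-consistency style). This is only an illustration, not the Parallel-R1 training framework from the paper; `sample_reasoning_path` is a hypothetical stand-in for one model rollout.

```python
# Generic sketch of inference-time parallel thinking: sample several reasoning
# paths concurrently, then aggregate by majority vote. NOT the Parallel-R1
# training procedure; just the high-level idea the abstract refers to.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import random

def sample_reasoning_path(question: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled chain of thought ending in an answer.
    A real implementation would call an LLM with temperature > 0."""
    rng = random.Random(seed)
    return rng.choice(["72", "72", "68"])  # placeholder final answers

def parallel_think(question: str, n_paths: int = 16) -> str:
    """Explore n_paths reasoning paths concurrently and majority-vote the answer."""
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        answers = list(pool.map(lambda s: sample_reasoning_path(question, s),
                                range(n_paths)))
    return Counter(answers).most_common(1)[0][0]

print(parallel_think("What is 8 * 9?"))
```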
| Method | Parallel (%) | AIME25 Mean@16 | AIME25 Pass@16 | MATH |
|---|---|---|---|---|
| Qwen3-4B-Base | 0 | 1.3 | 10.2 | 13.9 |
| Parallel-SFT-Seen | 95.6 | 8 | 29.8 | 76.6 |
| Parallel-SFT-Unseen | 95.6 | 5.2 | 20.9 | 71.5 |
| GRPO (DAPO) | 0 | 14.8 | 32.4 | 83.5 |
| + RL on GSM8K | 0 | 13.3 | 26.3 | 82.6 |
| Parallel-R1-Seen | 27.3 | 19.2 | 38.9 | 86.7 |
| Parallel-R1-Unseen (S1) | 13.6 | 17.7 | 37.8 | 82.6 |
| Parallel-R1-Unseen (S2) | 63 | 19 | 42.2 | 84.5 |
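
For reference, the AIME25 columns use two standard sampling metrics: Mean@16 (average accuracy over 16 sampled answers per problem) and Pass@16 (whether at least one of the 16 samples is correct). Here's a minimal sketch of how such metrics are typically computed, using hypothetical answers:

```python
# Sketch of the two AIME25 metrics in the table, assuming 16 sampled answers
# per problem and a known ground-truth answer (hypothetical data below).

def mean_at_k(samples: list[str], truth: str) -> float:
    """Mean@k: fraction of the k sampled answers that match the ground truth."""
    return sum(s == truth for s in samples) / len(samples)

def pass_at_k(samples: list[str], truth: str) -> float:
    """Pass@k: 1.0 if at least one of the k sampled answers is correct, else 0.0."""
    return float(any(s == truth for s in samples))

samples = ["72"] * 3 + ["68"] * 13  # 16 hypothetical answers to one problem
print(mean_at_k(samples, "72"))  # 0.1875
print(pass_at_k(samples, "72"))  # 1.0
```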

Funny that we had a discussion about this today under @siggy47's post about Bee brains - and currently this is the #1 community paper on HF.
This must of course mean that SN influences HF <smirkmoji>.