Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks.
Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification.
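To make the notion of parallel thinking concrete, here is a minimal Python sketch of the inference-time idea only: several reasoning paths are sampled concurrently for the same problem and then reconciled, in this sketch by a simple majority vote over extracted answers. The `generate_path` and `extract_answer` helpers and the voting rule are illustrative assumptions, not the Parallel-R1 method itself, whose parallel behavior is learned through RL training rather than a fixed aggregation scheme.

```python
# Sketch: sample several reasoning paths for one problem concurrently,
# then reconcile them with a majority vote over the final answers.
# `generate_path` is a hypothetical stand-in for an LLM sampling call.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def generate_path(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled reasoning path.

    In practice this would call an LLM with temperature > 0 so that each
    seed yields a different chain of thought ending in "ANSWER: <value>".
    """
    canned = ["... so x = 42. ANSWER: 42",
              "... checking again, x = 42. ANSWER: 42",
              "... a slip gives x = 41. ANSWER: 41"]
    return canned[seed % len(canned)]


def extract_answer(path: str) -> str:
    """Pull the final answer token out of a reasoning path."""
    return path.rsplit("ANSWER:", 1)[-1].strip()


def parallel_think(prompt: str, num_paths: int = 8) -> str:
    """Explore several reasoning paths concurrently, then verify by vote."""
    with ThreadPoolExecutor(max_workers=num_paths) as pool:
        paths = list(pool.map(lambda s: generate_path(prompt, s),
                              range(num_paths)))
    answers = [extract_answer(p) for p in paths]
    # Multi-perspective verification reduced to its simplest form:
    # keep the answer that the most paths agree on.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(parallel_think("If 3x + 6 = 132, what is x?"))
```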
| Method | Parallel (%) | AIME25 Mean@16 | AIME25 Pass@16 | MATH |
|---|---|---|---|---|
| Qwen3-4B-Base | 0 | 1.3 | 10.2 | 13.9 |
| Parallel-SFT-Seen | 95.6 | 8 | 29.8 | 76.6 |
| Parallel-SFT-Unseen | 95.6 | 5.2 | 20.9 | 71.5 |
| GRPO (DAPO) | 0 | 14.8 | 32.4 | 83.5 |
| + RL on GSM8K | 0 | 13.3 | 26.3 | 82.6 |
| Parallel-R1-Seen | 27.3 | 19.2 | 38.9 | 86.7 |
| Parallel-R1-Unseen (S1) | 13.6 | 17.7 | 37.8 | 82.6 |
| Parallel-R1-Unseen (S2) | 63 | 19 | 42.2 | 84.5 |
Funny that we had a discussion about this under @siggy47's post about bee brains today - and it is currently the #1 community paper on HF.
This must of course mean that SN influences HF 😏