
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions.
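The advantage attribution part is the piece I find easiest to picture in code. Below is a toy illustration of what step-level attribution over branched rollouts could look like: GRPO-style group normalization of rewards, with tokens on a shared prefix receiving the average advantage of every trajectory that shares them. All names here are mine, and this is only my reading of the abstract, not the paper's estimator.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style group-normalized advantages: (r - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

def attribute_advantages(trajectories, rewards):
    """Per-token advantages for trajectories that may share a prefix.

    Tokens on a shared prefix (because several rollouts were branched from the
    same partial trajectory) get the mean advantage of all trajectories that
    share them; branch-specific tokens keep their own trajectory's advantage.
    """
    advs = grpo_advantages(rewards)
    per_token = []
    for traj in trajectories:
        token_advs = []
        for pos in range(len(traj)):
            sharers = [j for j, other in enumerate(trajectories)
                       if other[:pos + 1] == traj[:pos + 1]]
            token_advs.append(sum(advs[j] for j in sharers) / len(sharers))
        per_token.append(token_advs)
    return per_token

# Two rollouts branched after the same search step, plus one independent rollout:
trajs = [["plan", "search", "answer-A"],
         ["plan", "search", "answer-B"],
         ["plan", "guess", "answer-C"]]
print(attribute_advantages(trajs, rewards=[1.0, 0.0, 0.0]))
```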

What it does: (overview figure from the paper)
How it performs vs GRPO: (benchmark results figure from the paper)

Try it yourself

Insights

High entropy fluctuations occur after LLMs interact with external tool environments, indicating that tool calls introduce significant uncertainty into the model's reasoning direction.
This is why MCP is not performing as well as it ought to, and why the whole "agentic" workflow is risky: the chance of hallucination grows as context builds up (and with MCP, it builds up fast).
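That entropy spike is easy to probe yourself. Here's a minimal sketch, assuming a Hugging Face causal LM; the model choice, the `<think>`/`<tool_response>` tags, and the toy strings are my own stand-ins:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # any causal LM works; swap in a smaller one to run this quickly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

@torch.no_grad()
def next_token_entropy(context: str) -> float:
    """Shannon entropy of the model's next-token distribution given `context`."""
    ids = tok(context, return_tensors="pt").input_ids
    logits = model(input_ids=ids).logits[0, -1].float()  # distribution for the next token
    probs = torch.softmax(logits, dim=-1)
    return float(-(probs * torch.log(probs + 1e-9)).sum())

reasoning = "<think>I need France's 2023 GDP, let me search for it.</think>"
tool_result = "<tool_response>France GDP (2023): about 3 trillion USD</tool_response>"

print("entropy before tool result:", next_token_entropy(reasoning))
print("entropy after tool result: ", next_token_entropy(reasoning + tool_result))
```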
ARPO proposes an entropy-based adaptive rollout mechanism. The core idea is to adaptively perform additional branch sampling during high-entropy tool interaction rounds to supplement exploration for uncertain reasoning steps.
How I understand this: the adaptive mechanism lets things get fuzzier at first, then re-evaluates and sharpens them afterwards, like a fan-out -> distill pattern. I've personally largely fallen back from MCP to programmed re-prompting.¹ Not needing that could save compute time, which is great, because time is money, even if you run your LLM locally.
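To make that fan-out -> distill reading concrete, here is a toy, self-contained sketch of the branching policy as I understand it. The thresholds, budget, and fake "model" are all my own placeholders, not the ARPO implementation:

```python
import random

ENTROPY_JUMP = 0.3     # assumed: how much entropy must rise to trigger a fan-out
EXTRA_BRANCHES = 2     # assumed: extra branches sampled at an uncertain step
ROLLOUT_BUDGET = 16    # cap on the number of partial trajectories
MAX_STEPS = 4

def fake_step(history):
    """Stand-in for one reasoning step that may call a tool.

    Returns the step's text and the model's token entropy right after it;
    a real implementation would run the LLM and measure entropy as in the
    snippet above. Tool steps are made noticeably more uncertain on purpose.
    """
    used_tool = random.random() < 0.5
    entropy = random.uniform(1.5, 2.5) if used_tool else random.uniform(0.5, 1.0)
    return f"step(tool={used_tool})", entropy

def adaptive_rollout(prompt, base_entropy=1.0):
    """Extend trajectories step by step, branching where entropy spikes."""
    frontier = [[prompt]]
    for _ in range(MAX_STEPS):
        next_frontier = []
        for traj in frontier:
            text, entropy = fake_step(traj)
            traj = traj + [text]
            spiked = entropy - base_entropy > ENTROPY_JUMP
            under_budget = len(next_frontier) + EXTRA_BRANCHES < ROLLOUT_BUDGET
            copies = 1 + EXTRA_BRANCHES if (spiked and under_budget) else 1
            # Fan out: duplicate the partial trajectory so each copy can
            # continue differently from this uncertain point onward.
            next_frontier.extend(list(traj) for _ in range(copies))
        frontier = next_frontier
    return frontier

rollouts = adaptive_rollout("Question: who discovered penicillin?")
print(f"sampled {len(rollouts)} trajectories; more branches where entropy spiked")
```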
I'll attempt to run some experiments and see whether Qwen3-8B-ARPO-DeepSearch can beat Salesforce's xLAM2-8b at tool calling.

Footnotes

  1. I basically try not to use MCP or RAG when I can avoid it, and instead do patterned prompt re-composition: I keep the tool output separate and construct a new, clean prompt from it, the way RAG orchestration did in the ancient techniques from a year ago, precisely to prevent the "information overflow" that tool calls cause in the middle of a reasoning process.
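For the curious, a minimal sketch of that re-composition pattern, assuming an OpenAI-compatible chat endpoint (a local server like llama.cpp or vLLM exposes the same interface); the helper names and prompts are mine:

```python
from openai import OpenAI

client = OpenAI()  # set base_url to your local OpenAI-compatible server if you self-host
MODEL = "gpt-4o-mini"  # or whatever model your endpoint serves

def ask(prompt: str) -> str:
    """One clean, single-turn completion."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_with_tool(question: str, raw_tool_output: str) -> str:
    """Keep raw tool output out of the reasoning chain: distill it first,
    then ask the actual question from a freshly composed prompt."""
    digest = ask(
        f"Extract only the facts relevant to this question, as short bullet points.\n"
        f"Question: {question}\n\nTool output:\n{raw_tool_output}"
    )
    return ask(
        f"Question: {question}\n\nRelevant facts:\n{digest}\n\nAnswer concisely."
    )
```

The point of the second call is that it never sees the noisy tool dump, only the distilled facts, so the context stays small and the reasoning prompt stays clean.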