High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

(a) In CoTs, only a minority of tokens exhibit high entropy and act as "forks" in reasoning paths, while majority tokens are low-entropy. (b) RLVR using policy gradients of forking tokens delivers significant performance gains that scale with model size. With a 20k maximum response length, our 32B model sets new SoTA scores (63.5 on AIME'24 and 56.7 on AIME'25) for RLVR on base models under 600B. Extending the maximum response length to 29k further boosts the AIME'24 score to 68.1.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance.

Takeaways

Entropy patterns in CoTs. In CoTs, the majority of tokens are generated with low entropy, while only a small subset exhibits high entropy. These high-entropy minority tokens often act as "forks" in the reasoning process, guiding the model toward diverse reasoning paths. Maintaining high entropy at these critical forking tokens is beneficial for reasoning performance.

Evolution of entropy patterns in CoTs during RLVR. During RLVR training, the reasoning model largely preserves the base model's entropy patterns, showing only gradual and minor changes. RLVR primarily adjusts the entropy of high-entropy tokens, while the entropy of low-entropy tokens fluctuates only within a narrow range.

High-entropy minority tokens drive nearly all reasoning performance gains during RLVR, whereas low-entropy majority tokens contribute little or may even hinder performance. One possible explanation is that, prior to performance convergence, a subset (~20% in our experiments) of high-entropy tokens facilitates exploration, while low-entropy tokens offer minimal benefit or may even impede it.

More discussions and insights. Based on the insights above, we further discuss (i) high-entropy minority tokens as a potential reason why supervised fine-tuning (SFT) memorizes but RL generalizes, (ii) how prior knowledge and readability requirements shape the different entropy patterns seen in LLM CoTs compared to traditional RL trajectories, and (iii) the advantage of clip-higher over entropy bonus for RLVR.

Entropy Patterns in CoTs

High-entropy tokens are the minority, while low-entropy tokens constitute the majority.
Typically, only a minority of tokens are generated with high entropy, while a majority of tokens are outputted with low entropy.
Highest-entropy tokens serve to bridge the logic, while lowest-entropy tokens completes it.
Tokens with the highest entropy typically serve to bridge the logical connection between two consecutive parts of reasoning, while tokens with the lowest entropy tend to complete the current part of a sentence or finish constructing a word. Other tokens combine these two functions to varying degrees.

Entropy patterns in the chain of thoughts of LLMs. (a) Token entropy distribution. The Y-axis frequency is on a log scale. A minority of tokens exhibit high entropy, while the majority have low entropy, often approaching zero. (b) & (c) Word clouds of the top 100 tokens with the highest and lowest average entropy, respectively, selected from the set of frequently occurring tokens. A larger font size indicates a higher average token entropy. Tokens with the highest average entropy typically function as "forks" to determine reasoning directions, whereas tokens with the lowest average entropy tend to execute reasoning steps along the established path.

High-entropy tokens as "forks" in CoTs.
High-entropy tokens benefit from being assigned a relatively higher temperature compared to other tokens. Given that high-entropy tokens naturally exhibit higher entropy than other tokens, this further supports the need for them to operate at an even higher entropy level. This observation indicates their role as "forks," where high entropy enables them to branch into diverse reasoning directions. Therefore, we also refer to these high-entropy tokens as forking tokens.

Average scores of AIME 2024 and AIME 2025. Red curve: varying the decoding temperature of high-entropy tokens while keeping the decoding temperature of low-entropy tokens fixed at 1. Blue curve: adjusting the decoding temperature of low-entropy tokens while maintaining the decoding temperature of high-entropy tokens at 1.

RLVR Preserves and Reinforces Base Model Entropy Patterns

RLVR primarily preserves the existing entropy patterns of the base models.
As shown in the following table, although overlap with the base model gradually decreases and overlap with the final RLVR model increases, the base model's overlap still remains above 86% at convergence (step 1360), suggesting that RLVR largely retains the base model's entropy patterns regarding which tokens exhibit high or low uncertainty.

The progression of the overlap ratio in the positions of the top 20% high-entropy tokens, comparing the base model (i.e., step 0) with the model after RLVR training (i.e., step 1360).

Compared w/	Step 0	Step 16	Step 112	Step 160	Step 480	Step 800	Step 864	Step 840	Step 1280	Step 1360
Base Model	100%	98.92%	98.70%	93.04%	93.02%	93.03%	87.45%	87.22%	87.09%	86.67%
RLVR Model	86.67%	86.71%	86.83%	90.64%	90.65%	90.64%	96.61%	97.07%	97.34%	100%

RLVR predominantly alters the entropy of high-entropy tokens, whereas the entropy of low-entropy tokens remains comparatively stable with minimal variations.
We compute the average entropy change after RLVR for each 5% entropy percentile range of the base model in the following figure. It is observed that tokens with higher initial entropy in the base model tend to exhibit larger increases in entropy after RLVR. This observation could also further reinforce that RLVR primarily preserves the entropy patterns of the base model.
Average entropy change after RLVR within each 5% entropy percentile range of the base model. x% percentile means that x% of the tokens in the dataset have entropy values less than or equal to this value. It is worth noting that the Y-axis is presented on a log scale. Tokens with higher initial entropy tend to experience greater entropy increases after RLVR.

High-Entropy Minority Tokens Drive Effective RLVR

DAPO

discarding the policy gradients of the bottom 80% low-entropy tokens, focusing instead on training the model using only the policy gradients of the top 20% high-entropy tokens (i.e., forking tokens).

Retaining only the top 20% of high-entropy tokens does not negatively impact performance on the 8B model and even leads to significant improvements on the 14B and 32B models, highlighting a strong scaling trend as model size increases.
Comparison between vanilla DAPO using all tokens and DAPO using only the top 20% high-entropy tokens (i.e. forking tokens) in policy gradient loss, evaluated on the Qwen3-32B, Qwen3-14B and Qwen3-8B base models. "Acc@16" and "Len@16" denotes the average accuracy and response length over 16 evaluations per benchmark, respectively.
Especially, with a 20k maximum response length, our 32B model sets new SoTA scores (63.5 on AIME'24 and 56.7 on AIME'25) for RLVR on base models under 600B.
Curves of AIME'24 scores and response lengths with a maximum response length of 20k, trained from the Qwen3-32B base model. Dropping the bottom 80% low-entropy tokens stabilizes training and improves the AIME'24 score by 7.73.
Extending the maximum response length to 29k further boosts the AIME'24 score to 68.1.
By extending the maximum response length from 20k to 29k and continuing training from the SoTA 32B model shown in the figure above, the AIME'24 scores improve further from 63.54 to 68.12, alongside a notable increase in response length.

Discussions

High-entropy minority tokens (i.e., forking tokens) could play a key role in explaining why RL generalizes while SFT memorizes.
Unlike traditional RL, LLM reasoning integrates prior knowledge and must produce readable output. Consequently, LLM CoTs contain a mix of low-entropy majority tokens and high-entropy minority tokens, whereas traditional RL can assume uniform action entropy throughout a trajectory.
In RLVR, entropy bonus may be suboptimal, as it increases the entropy of low-entropy majority tokens. In contrast, clip-higher effectively promotes entropy in high-entropy minority tokens.

BibTeX

@misc{wang20258020rulehighentropyminority,
      title={Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning}, 
      author={Shenzhi Wang and Le Yu and Chang Gao and Chujie Zheng and Shixuan Liu and Rui Lu and Kai Dang and Xionghui Chen and Jianxin Yang and Zhenru Zhang and Yuqiong Liu and An Yang and Andrew Zhao and Yang Yue and Shiji Song and Bowen Yu and Gao Huang and Junyang Lin},
      year={2025},
      eprint={2506.01939},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01939}, 
}