(a) In CoTs, only a minority of tokens exhibit high entropy and act as "forks" in reasoning paths, while the majority of tokens are low-entropy. (b) RLVR that restricts policy-gradient updates to these forking tokens delivers significant performance gains that scale with model size. With a 20k maximum response length, our 32B model sets new SoTA scores (63.5 on AIME'24 and 56.7 on AIME'25) for RLVR on base models under 600B parameters. Extending the maximum response length to 29k further boosts the AIME'24 score to 68.1.
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), yet its mechanisms are not well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance.
Entropy patterns in CoTs. In CoTs, the majority of tokens are generated with low entropy, while only a small subset exhibits high entropy. These high-entropy minority tokens often act as "forks" in the reasoning process, guiding the model toward diverse reasoning paths. Maintaining high entropy at these critical forking tokens is beneficial for reasoning performance.
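To make this concrete, here is a minimal sketch (an assumed setup, not the authors' released pipeline) that measures the per-token generation entropy H_t = -sum_v p_t(v) log p_t(v) of a sampled CoT with a Hugging Face causal LM; the model name, prompt, and length budget are illustrative placeholders.

```python
# Minimal sketch (assumed setup, not the authors' released code):
# measure the entropy of the next-token distribution at every generated position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B-Base"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("Solve: 12 * 7 + 5 = ?", return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=1.0,
                         return_dict_in_generate=True, output_scores=True)

# gen.scores is a tuple with one [batch, vocab] logit tensor per generated token.
per_step_entropy = []
for logits in gen.scores:
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    per_step_entropy.append(-(log_probs.exp() * log_probs).sum(dim=-1))  # [batch]
entropies = torch.stack(per_step_entropy, dim=1)  # [batch, num_generated_tokens]

# Most positions sit near zero entropy; only a small minority are high-entropy.
print("median entropy:", entropies.median().item())
print("80th-percentile entropy:", torch.quantile(entropies.flatten(), 0.8).item())
print("max entropy:", entropies.max().item())
```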
Evolution of entropy patterns in CoTs during RLVR. During RLVR training, the reasoning model largely preserves the base model's entropy patterns, showing only gradual and minor changes. RLVR primarily adjusts the entropy of high-entropy tokens, while the entropy of low-entropy tokens fluctuates only within a narrow range.
High-entropy minority tokens drive nearly all reasoning performance gains during RLVR, whereas low-entropy majority tokens contribute little or may even hinder performance. One possible explanation is that, prior to performance convergence, a subset (~20% in our experiments) of high-entropy tokens facilitates exploration, while low-entropy tokens offer minimal benefit or may even impede it.
More discussions and insights. Based on the insights above, we further discuss (i) high-entropy minority tokens as a potential reason why supervised fine-tuning (SFT) memorizes but RL generalizes, (ii) how prior knowledge and readability requirements shape the different entropy patterns seen in LLM CoTs compared to traditional RL trajectories, and (iii) the advantage of clip-higher over entropy bonus for RLVR.
Entropy patterns in the chain of thoughts of LLMs. (a) Token entropy distribution. The Y-axis frequency is on a log scale. A minority of tokens exhibit high entropy, while the majority have low entropy, often approaching zero. (b) & (c) Word clouds of the top 100 tokens with the highest and lowest average entropy, respectively, selected from the set of frequently occurring tokens. A larger font size indicates a higher average token entropy. Tokens with the highest average entropy typically function as "forks" to determine reasoning directions, whereas tokens with the lowest average entropy tend to execute reasoning steps along the established path.
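The word clouds can be reproduced from per-position entropies with a simple aggregation; the sketch below is an assumption about that post-processing (not the authors' exact script), averaging entropy per token id over many CoTs and keeping only frequently occurring tokens.

```python
# Minimal sketch (assumed post-processing): average entropy per token type,
# restricted to frequently occurring tokens, as used for the word clouds.
from collections import defaultdict

def average_entropy_per_token(token_ids, entropies, min_count=100):
    """token_ids and entropies are flat, position-aligned sequences pooled over many CoTs."""
    total, count = defaultdict(float), defaultdict(int)
    for tok, h in zip(token_ids, entropies):
        total[tok] += h
        count[tok] += 1
    # Keep only tokens that occur often enough for a stable average.
    return {tok: total[tok] / count[tok] for tok in total if count[tok] >= min_count}

# Sorting this dictionary by value and decoding the token ids yields the
# highest- and lowest-entropy tokens shown in panels (b) and (c).
```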
Average scores on AIME 2024 and AIME 2025. Red curve: varying the decoding temperature of high-entropy tokens while keeping that of low-entropy tokens fixed at 1. Blue curve: varying the decoding temperature of low-entropy tokens while keeping that of high-entropy tokens fixed at 1.
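One way to run this ablation is to choose the temperature per position from the entropy of the temperature-1 distribution; the sketch below is a hedged illustration (the entropy threshold and the lack of KV caching are simplifications, not the paper's implementation).

```python
# Minimal sketch (assumed decoding loop, no KV cache): sample with one temperature
# for high-entropy positions and another for low-entropy positions.
import torch

@torch.no_grad()
def sample_with_split_temperature(model, input_ids, max_new_tokens=512,
                                  entropy_threshold=1.0, t_high=1.0, t_low=1.0,
                                  eos_token_id=None):
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :].float()          # [batch, vocab]

        # Classify the position using the entropy of the temperature-1 distribution.
        log_probs = torch.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)        # [batch]
        temp = torch.where(entropy > entropy_threshold,
                           torch.full_like(entropy, t_high),
                           torch.full_like(entropy, t_low))

        probs = torch.softmax(logits / temp.unsqueeze(-1), dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)         # [batch, 1]
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break
    return input_ids
```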
Overlap ratio of high-entropy (forking) token positions between the checkpoint at each RLVR training step and the base model (first row) or the fully trained RLVR model (second row).

| Compared w/ | Step 0 | Step 16 | Step 112 | Step 160 | Step 480 | Step 800 | Step 864 | Step 840 | Step 1280 | Step 1360 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model | 100% | 98.92% | 98.70% | 93.04% | 93.02% | 93.03% | 87.45% | 87.22% | 87.09% | 86.67% |
| RLVR Model | 86.67% | 86.71% | 86.83% | 90.64% | 90.65% | 90.64% | 96.61% | 97.07% | 97.34% | 100% |
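The percentages above can be reproduced by scoring the same responses with two checkpoints and comparing where their highest-entropy positions fall; the following sketch is an assumption about that analysis, with the 20% cutoff as an illustrative choice.

```python
# Minimal sketch (assumed analysis code): overlap ratio of the highest-entropy
# token positions between two checkpoints, computed on the same (teacher-forced)
# responses so that positions line up one-to-one.
import torch

def forking_position_overlap(entropies_a, entropies_b, rho=0.2):
    """entropies_a, entropies_b: 1-D tensors of per-position entropies, aligned."""
    k = max(1, int(rho * entropies_a.numel()))
    top_a = set(torch.topk(entropies_a, k).indices.tolist())
    top_b = set(torch.topk(entropies_b, k).indices.tolist())
    return len(top_a & top_b) / k  # 1.0 means identical forking positions
```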
Average entropy change after RLVR within each 5% entropy percentile range of the base model. The x% percentile means that x% of the tokens in the dataset have entropy values less than or equal to that value. Note that the Y-axis is on a log scale. Tokens with higher initial entropy tend to experience greater entropy increases after RLVR.
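The binning behind this plot can be sketched as follows (an assumed reimplementation; the bin count and tensor shapes are illustrative).

```python
# Minimal sketch (assumed analysis code): average entropy change after RLVR within
# each 5% percentile range of the base model's token entropy, on the same tokens.
import torch

def entropy_change_by_percentile(base_entropy, rlvr_entropy, num_bins=20):
    base = base_entropy.flatten().float()
    delta = rlvr_entropy.flatten().float() - base
    edges = torch.quantile(base, torch.linspace(0.0, 1.0, num_bins + 1))
    changes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (base >= lo) & (base <= hi)
        changes.append(delta[in_bin].mean().item())
    return changes  # one average entropy change per 5% percentile range
```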
Comparison between vanilla DAPO using all tokens and DAPO using only the top 20% high-entropy tokens (i.e., forking tokens) in the policy gradient loss, evaluated on the Qwen3-32B, Qwen3-14B, and Qwen3-8B base models. "Acc@16" and "Len@16" denote the average accuracy and average response length over 16 evaluations per benchmark, respectively.
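The "top 20% only" variant amounts to masking the token-level policy-gradient loss; the sketch below is a hedged reconstruction rather than the released training code, with DAPO-style asymmetric clipping bounds (0.2 / 0.28) and batch-wise thresholding as assumptions.

```python
# Minimal sketch (not the released training code): restrict the clipped
# policy-gradient loss to the top rho (e.g., 20%) highest-entropy response tokens.
import torch

def forking_token_pg_loss(log_probs, old_log_probs, advantages, entropies,
                          response_mask, rho=0.2, clip_low=0.2, clip_high=0.28):
    """All inputs are [batch, seq_len] tensors gathered from rollouts."""
    # Batch-wise entropy threshold computed over valid response tokens only.
    threshold = torch.quantile(entropies[response_mask.bool()].float(), 1.0 - rho)
    fork_mask = (entropies >= threshold).float() * response_mask

    # PPO/DAPO-style clipped surrogate with asymmetric ("clip-higher") bounds.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)

    # Gradients flow only through the high-entropy "forking" tokens.
    return (per_token_loss * fork_mask).sum() / fork_mask.sum().clamp(min=1.0)
```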
Curves of AIME'24 scores and response lengths with a maximum response length of 20k, trained from the Qwen3-32B base model. Dropping the bottom 80% low-entropy tokens stabilizes training and improves the AIME'24 score by 7.73.
By extending the maximum response length from 20k to 29k and continuing training from the SoTA 32B model shown in the figure above, the AIME'24 scores improve further from 63.54 to 68.12, alongside a notable increase in response length.
@misc{wang20258020rulehighentropyminority,
title={Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning},
author={Shenzhi Wang and Le Yu and Chang Gao and Chujie Zheng and Shixuan Liu and Rui Lu and Kai Dang and Xionghui Chen and Jianxin Yang and Zhenru Zhang and Yuqiong Liu and An Yang and Andrew Zhao and Yang Yue and Shiji Song and Bowen Yu and Gao Huang and Junyang Lin},
year={2025},
eprint={2506.01939},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01939},
}