Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward Paper • 2512.16912 • Published 14 days ago • 10
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators Paper • 2512.19682 • Published 10 days ago • 15
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models Paper • 2310.10505 • Published Oct 16, 2023 • 3
Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO Paper • 2505.11595 • Published May 16, 2025 • 1
A Survey on Large Language Models for Mathematical Reasoning Paper • 2506.08446 • Published Jun 10, 2025
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling Paper • 2508.17445 • Published Aug 24, 2025 • 80
Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving Paper • 2508.09099 • Published Aug 12, 2025
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment Paper • 2505.04113 • Published May 7, 2025
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation Paper • 2509.25849 • Published Sep 30, 2025 • 47
Scaling Latent Reasoning via Looped Language Models Paper • 2510.25741 • Published Oct 29, 2025 • 221