Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning Paper • 2604.05404 • Published 3 days ago • 38
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation Paper • 2604.02368 • Published 14 days ago • 8
Self-Execution Simulation Improves Coding Models Paper • 2604.03253 • Published about 1 month ago • 29