MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity Paper • 2511.03146 • Published Nov 5, 2025 • 7
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts Paper • 2511.04655 • Published Nov 6, 2025 • 7
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation Paper • 2511.03774 • Published Nov 5, 2025 • 12
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Paper • 2511.04570 • Published Nov 6, 2025 • 210
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents Paper • 2511.04307 • Published Nov 6, 2025 • 14
HoneyBee: Data Recipes for Vision-Language Reasoners Paper • 2510.12225 • Published Oct 14, 2025 • 10
The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation Paper • 2510.23393 • Published Oct 27, 2025 • 20
A Survey of Data Agents: Emerging Paradigm or Overstated Hype? Paper • 2510.23587 • Published Oct 27, 2025 • 65
Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation Paper • 2510.21583 • Published Oct 24, 2025 • 30
UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning Paper • 2510.20286 • Published Oct 23, 2025 • 23