CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions Paper • 2406.09923 • Published Jun 14, 2024 • 1
Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction Paper • 2501.17326 • Published Jan 28, 2025
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science Paper • 2510.12171 • Published Oct 14, 2025
From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs Paper • 2511.15137 • Published Nov 19, 2025
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning Paper • 2602.21534 • Published Feb 25 • 26
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness Paper • 2606.12882 • Published 18 days ago • 13