MMMU

non-profit

https://mmmu-benchmark.github.io/

AI & ML interests

Multimodal Model Evaluation

yuexiang96

authored 4 papers 21 days ago

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Paper • 2510.24702 • Published Oct 28, 2025 • 28

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Paper • 2510.25726 • Published Oct 29, 2025 • 45

Simulating Environments with Reasoning Models for Agent Training

Paper • 2511.01824 • Published Nov 3, 2025 • 2

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Paper • 2512.07783 • Published 30 days ago • 36

zhangysk

submitted a paper to Daily Papers 22 days ago

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Paper • 2512.12730 • Published 24 days ago • 43

wren93

authored a paper about 1 month ago

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Paper • 2512.02014 • Published Dec 1, 2025 • 71

zhangysk

authored a paper about 1 month ago

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 282

zhangysk

authored 5 papers about 2 months ago

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Paper • 2511.03146 • Published Nov 5, 2025 • 7

RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Paper • 2511.04285 • Published Nov 6, 2025 • 7

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Paper • 2511.07250 • Published Nov 10, 2025 • 17

DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Paper • 2511.10984 • Published Nov 14, 2025 • 4

Virtual Width Networks

Paper • 2511.11238 • Published Nov 14, 2025 • 37

drogozhang

authored a paper 2 months ago

Scaling Agent Learning via Experience Synthesis

Paper • 2511.03773 • Published Nov 5, 2025 • 81

zhangysk

authored 7 papers 2 months ago

IFEvalCode: Controlled Code Generation

Paper • 2507.22462 • Published Jul 30, 2025

VideoScore2: Think before You Score in Generative Video Evaluation

Paper • 2509.22799 • Published Sep 26, 2025 • 25

Towards Personalized Deep Research: Benchmarks and Evaluations

Paper • 2509.25106 • Published Sep 29, 2025 • 29

Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution

Paper • 2509.25301 • Published Sep 29, 2025 • 19

Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

Paper • 2509.25849 • Published Sep 30, 2025 • 47

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Paper • 2510.10689 • Published Oct 12, 2025 • 46

ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

Paper • 2510.11652 • Published Oct 13, 2025 • 29