Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
Abstract
UniMRG enhances unified multimodal models by training them to generate multiple visual representations, improving both understanding and generation by capturing complementary visual information.
Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
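To make the post-training objective concrete, here is a minimal sketch of what a combined understanding-plus-multi-representation-generation loss might look like. It assumes the pixel, depth, and segmentation targets are tokenized into discrete visual tokens and supervised with cross-entropy; the function name `unimrg_loss`, the equal task weights, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def unimrg_loss(answer_logits, answer_ids, rep_logits, rep_targets, weights=None):
    """Combined understanding + auxiliary multi-representation generation loss.

    answer_logits: (B, T, V) logits over the textual answer tokens.
    rep_logits:    dict name -> (B, N, V) logits over tokenized pixel /
                   depth / segmentation maps of the input image.
    NOTE: equal per-task weights are an assumption, not the paper's setting.
    """
    weights = weights or {"pixel": 1.0, "depth": 1.0, "segmentation": 1.0}
    # Standard visual-understanding objective (next-token prediction on answers).
    loss = F.cross_entropy(answer_logits.flatten(0, 1), answer_ids.flatten())
    # Auxiliary generation objectives: predict each intrinsic representation.
    for name, target in rep_targets.items():
        loss = loss + weights[name] * F.cross_entropy(
            rep_logits[name].flatten(0, 1), target.flatten()
        )
    return loss

# Toy usage with random tensors (B=2, answers of length 8, three 16-token maps).
B, T, N, V = 2, 8, 16, 1024
answer_logits = torch.randn(B, T, V)
answer_ids = torch.randint(0, V, (B, T))
names = ("pixel", "depth", "segmentation")
rep_logits = {n: torch.randn(B, N, V) for n in names}
rep_targets = {n: torch.randint(0, V, (B, N)) for n in names}
print(unimrg_loss(answer_logits, answer_ids, rep_logits, rep_targets))
```

Framing all three auxiliary targets as token prediction is one way to keep the method architecture-agnostic, consistent with the abstract's claim that UniMRG applies across diverse UMM architectures.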
Community
arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/generation-enhances-understanding-in-unified-multimodal-models-via-multi-representation-generation-1920-3c1c13a7
- Executive Summary
- Detailed Breakdown
- Practical Applications
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning (2025)
- UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision (2026)
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation (2025)
- COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence (2025)
- Exploring MLLM-Diffusion Information Transfer with MetaCanvas (2025)
- Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing (2026)
- Unified Personalized Understanding, Generating and Editing (2026)