AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
Abstract
Empirical analysis of visual token pruning methods in large vision-language models reveals that attention-based approaches work better for simple images while diversity-based methods excel with complex images, leading to improved hybrid strategies.
Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by long visual token sequences. While prior works focus primarily on either attention-based or diversity-based pruning, an in-depth analysis of each approach's characteristics and limitations remains largely unexplored. In this work, we conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, evaluation with the CHAIR hallucination metric reveals that the diversity they do retain is closely tied to a higher hallucination frequency than attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
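For readers who want a concrete handle on the two quantities the analysis is built on, the sketch below shows one standard way to compute them: effective rank as the exponential of the Shannon entropy of the normalized singular-value spectrum (Roy & Vetterli, 2007), and attention entropy as the Shannon entropy of the attention distribution over visual tokens. The `adaptive_keep` helper at the end is a hypothetical illustration of an image-aware switch between the two pruning styles, written to match the abstract's finding; it is not the paper's actual mechanism, and the 0.5 entropy threshold is an assumption.

```python
import torch

def effective_rank(features: torch.Tensor) -> float:
    """erank of a [num_tokens, dim] visual feature matrix: exp of the
    Shannon entropy of the normalized singular values. Higher values
    indicate more feature diversity across the token sequence."""
    s = torch.linalg.svdvals(features.float())
    p = s / s.sum()
    return (-(p * torch.log(p + 1e-12)).sum()).exp().item()

def attention_entropy(attn: torch.Tensor) -> float:
    """Shannon entropy of a softmax-normalized attention distribution
    over visual tokens. Low entropy = evidence concentrated on a few
    tokens ('simple' image); high entropy = distributed ('complex')."""
    p = attn / attn.sum()
    return -(p * torch.log(p + 1e-12)).sum().item()

def adaptive_keep(features: torch.Tensor, attn: torch.Tensor, keep: int):
    """Hypothetical image-aware blend (NOT the paper's exact method):
    on concentrated (low-entropy) images keep the top-attended tokens;
    on distributed (high-entropy) images fall back to a greedy
    diversity-based (farthest-point) selection in feature space."""
    max_ent = torch.log(torch.tensor(float(attn.numel()))).item()
    if attention_entropy(attn) / max_ent < 0.5:   # concentrated image
        return attn.topk(keep).indices            # attention-based pruning
    chosen = [int(attn.argmax())]                 # seed with top token
    dist = torch.cdist(features, features[chosen]).squeeze(1)
    for _ in range(keep - 1):
        nxt = int(dist.argmax())                  # farthest from chosen set
        chosen.append(nxt)
        dist = torch.minimum(
            dist, torch.cdist(features, features[[nxt]]).squeeze(1))
    return torch.tensor(chosen)

# Example: 576 visual tokens (LLaVA-style grid) with 1024-dim features.
feats = torch.randn(576, 1024)
attn = torch.softmax(torch.randn(576), dim=0)
print(f"erank={effective_rank(feats):.1f}  H(attn)={attention_entropy(attn):.2f}")
print("kept:", adaptive_keep(feats, attn, keep=64)[:8])
```

A real hybrid strategy would likely blend the attention and diversity scores continuously rather than hard-switching on a threshold; the hard switch is used here only to make the simple/complex dichotomy from the abstract explicit.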
Community
Accepted to ICLR 2026
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention (2026)
- EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models (2026)
- Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning (2026)
- IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs (2026)
- ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning (2026)
- SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass (2026)
- Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation (2026)