• Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs (arXiv:2403.12596)
• Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv:2404.13013)
• PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (arXiv:2404.16994)
• AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability (arXiv:2405.14129)
• Dense Connector for MLLMs (arXiv:2405.13800)
• Merlin: Empowering Multimodal LLMs with Foresight Minds (arXiv:2312.00589)
• LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding (arXiv:2407.15754)
• SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (arXiv:2407.15841)
• Efficient Inference of Vision Instruction-Following Models with Elastic Cache (arXiv:2407.18121)
• VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges (arXiv:2409.01071)
• LongVLM: Efficient Long Video Understanding via Large Language Models (arXiv:2404.03384)
• Visual Context Window Extension: A New Perspective for Long Video Understanding (arXiv:2409.20018)
• VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents (arXiv:2410.10594)
• VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding (arXiv:2501.13106)