-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 30 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 15 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
Collections
Discover the best community collections!
Collections including paper arxiv:2511.19575
-
HunyuanOCR Technical Report
Paper • 2511.19575 • Published • 23 -
DeepSeek-OCR 2: Visual Causal Flow
Paper • 2601.20552 • Published • 70 -
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Paper • 2602.01785 • Published • 96 -
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
Paper • 2603.22458 • Published • 136
-
LoRA+: Efficient Low Rank Adaptation of Large Models
Paper • 2402.12354 • Published • 7 -
The FinBen: An Holistic Financial Benchmark for Large Language Models
Paper • 2402.12659 • Published • 24 -
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Paper • 2402.13249 • Published • 15 -
TrustLLM: Trustworthiness in Large Language Models
Paper • 2401.05561 • Published • 69
-
tencent/HunyuanOCR
Image-Text-to-Text • 1.0B • Updated • 239k • 752 -
HunyuanOCR Technical Report
Paper • 2511.19575 • Published • 23 -
PaddlePaddle/PaddleOCR-VL
Image-Text-to-Text • 1.0B • Updated • 11k • 1.61k -
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper • 2510.14528 • Published • 127
-
PubTables-1M: Towards comprehensive table extraction from unstructured documents
Paper • 2110.00061 • Published • 3 -
Optimized Table Tokenization for Table Structure Recognition
Paper • 2305.03393 • Published • 1 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 163 -
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper • 2510.14528 • Published • 127
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 190 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 17 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 42
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 30 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 15 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
-
tencent/HunyuanOCR
Image-Text-to-Text • 1.0B • Updated • 239k • 752 -
HunyuanOCR Technical Report
Paper • 2511.19575 • Published • 23 -
PaddlePaddle/PaddleOCR-VL
Image-Text-to-Text • 1.0B • Updated • 11k • 1.61k -
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper • 2510.14528 • Published • 127
-
HunyuanOCR Technical Report
Paper • 2511.19575 • Published • 23 -
DeepSeek-OCR 2: Visual Causal Flow
Paper • 2601.20552 • Published • 70 -
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Paper • 2602.01785 • Published • 96 -
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
Paper • 2603.22458 • Published • 136
-
PubTables-1M: Towards comprehensive table extraction from unstructured documents
Paper • 2110.00061 • Published • 3 -
Optimized Table Tokenization for Table Structure Recognition
Paper • 2305.03393 • Published • 1 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 163 -
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper • 2510.14528 • Published • 127
-
LoRA+: Efficient Low Rank Adaptation of Large Models
Paper • 2402.12354 • Published • 7 -
The FinBen: An Holistic Financial Benchmark for Large Language Models
Paper • 2402.12659 • Published • 24 -
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Paper • 2402.13249 • Published • 15 -
TrustLLM: Trustworthiness in Large Language Models
Paper • 2401.05561 • Published • 69
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 190 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 17 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 42