Can we embed multiple images and text into a single embedding?

#38

by srinivasbilla - opened Jun 26, 2025

Discussion

srinivasbilla

Jun 26, 2025

Can we have like 5 images and 5 sentences in one embedding?

michael-guenther

Jina AI org Jun 26, 2025

Hey, if you want to encode 5 sentences into one embedding, you can just concatenate them. For images, the encode function does not support this, so you would need to write your own function that converts the images into a sequence of tokens you can pass to the model. Basically, you need to reimplement what the encode function [1] does, but pass multiple images (that should not be too complicated). If you want to encode both text and images into a single embedding, you can do it in a similar way. Note, however, that the model is only trained to encode a single image or pure text into one embedding representation, so I don't know whether multi-modal inputs or inputs with multiple images will produce good embeddings.

[1] https://huggingface.co/jinaai/jina-embeddings-v4/blob/main/modeling_jina_embeddings_v4.py#L487-L546
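For the text-only case, the concatenation step could look roughly like the sketch below. It only joins the sentences into one string and hands that string to an encoder. The names `join_sentences`, `embed_as_one`, and `encode_fn` are hypothetical helpers made up for illustration and are not part of the model's API; `encode_fn` stands in for whatever text-encoding call you use.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def join_sentences(sentences: Sequence[str], separator: str = " ") -> str:
    """Concatenate sentences into one string, skipping empty entries."""
    cleaned = [s.strip() for s in sentences if s and s.strip()]
    return separator.join(cleaned)


def embed_as_one(
    sentences: Sequence[str],
    encode_fn: Callable[[List[str]], List[Vector]],
) -> Vector:
    """Embed several sentences as a single vector.

    `encode_fn` is an assumption: any callable that maps a list of
    strings to a list of vectors (one vector per input string).
    """
    combined = join_sentences(sentences)
    # One input string in, so exactly one embedding comes back.
    return encode_fn([combined])[0]


if __name__ == "__main__":
    # Stand-in encoder for demonstration only; replace with a real model call.
    def fake_encode(texts: List[str]) -> List[Vector]:
        return [[float(len(t))] for t in texts]

    print(join_sentences(["A cat sits.", "A dog runs."]))
    print(embed_as_one(["A cat sits.", "A dog runs."], fake_encode))
```

Keep in mind that the combined string still has to fit within the model's maximum input length, otherwise it will be truncated.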

michael-guenther

Jina AI org Jun 26, 2025

We also plan to support encoding multiple images at the same time into multiple embeddings, i.e., late chunking for images, e.g., to preserve context between PDF pages of the same document by using the late chunking method [1]. But first we need to run some experiments to see how well this works.

[1] https://github.com/jina-ai/late-chunking
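To illustrate the idea behind late chunking: the whole input is encoded once, so every token embedding sees the full context, and only afterwards are the token embeddings mean-pooled per chunk. The sketch below shows just that pooling step in plain Python. The function name `late_chunk_pool` and the span format are assumptions made for illustration; for images, each span would presumably cover the tokens of one image or page, which is untested here.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool contextualized token embeddings per chunk.

    token_embeddings: one vector per token, produced by a single
        forward pass over the full input (assumed to be given).
    spans: (start, end) token indices per chunk, end exclusive.
    """
    pooled: List[Vector] = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span: ({start}, {end})")
        chunk = token_embeddings[start:end]
        dim = len(chunk[0])
        # Average each dimension over the tokens of this chunk.
        pooled.append(
            [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled


if __name__ == "__main__":
    # Toy example: 4 tokens with 2-dimensional embeddings, split into 2 chunks.
    tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
    print(late_chunk_pool(tokens, [(0, 2), (2, 4)]))
```

The result is one embedding per chunk rather than one embedding for the whole input, which is the difference from the single-embedding approach discussed above.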

srinivasbilla changed discussion status to closed Jun 27, 2025

srinivasbilla

Jun 27, 2025

Thank you for your reply! Makes sense. Super interesting work though! Thank you for sharing
