clip-japanese-base-v2

This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. It is an updated version of line-corporation/clip-japanese-base: the training data is increased to approximately 2B image–text pairs, and model distillation is applied to improve overall performance.
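
For reference, CLIP-style pre-training aligns image and text embeddings with a symmetric contrastive loss, and distillation adds a term that matches the student's image–text similarity distribution to a teacher's. The sketch below illustrates these two objectives in generic PyTorch; it is an illustration under stated assumptions, not this model's training code, and the card does not describe the exact distillation recipe used.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over L2-normalized embeddings of N matched image–text pairs.
    logits = image_emb @ text_emb.T / temperature              # (N, N) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def distillation_loss(student_logits, teacher_logits, tau=1.0):
    # Generic KL-based distillation between softened similarity distributions
    # (hypothetical form; the objective actually used for this model is not specified).
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2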

How to use

  1. Install packages
pip install pillow requests sentencepiece transformers torch timm
  2. Run
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base-v2'
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True loads the custom tokenizer, image processor, and model code shipped with the repository.
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download an example image and prepare the inputs.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)  # "dog", "cat", "elephant"

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    # Zero-shot classification: similarity between the image and each label, softmaxed into probabilities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]
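
The same objects can be reused for simple retrieval. The snippet below is a minimal text-to-image retrieval sketch: the gallery URLs and the Japanese query are placeholders (assumptions), and it assumes the processor output contains a pixel_values tensor, as in the example above.

import torch.nn.functional as F

# Placeholder gallery; replace with your own image URLs.
urls = [
    'https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260',
]
images = [Image.open(io.BytesIO(requests.get(u).content)) for u in urls]
pixel_values = torch.cat(
    [processor(img, return_tensors="pt")["pixel_values"] for img in images]
).to(device)

query = tokenizer(["草の上に座る犬"]).to(device)  # "a dog sitting on the grass"

with torch.no_grad():
    image_emb = F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)
    text_emb = F.normalize(model.get_text_features(**query), dim=-1)
    scores = text_emb @ image_emb.T      # cosine similarity of the query to each gallery image
    best = scores.argmax(dim=-1).item()

print("Best matching image index:", best)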

Model architecture

The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.
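
To verify the encoder split and parameter count locally, you can inspect the loaded model. The sketch below reuses the model object from the usage example and does not assume any specific attribute names in the remote code.

# Total parameter count (roughly 196M for this model; see the table below).
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e6:.0f}M")

# Top-level submodules reveal the image-encoder / text-encoder split.
for name, child in model.named_children():
    n = sum(p.numel() for p in child.parameters())
    print(f"{name}: {child.__class__.__name__} ({n / 1e6:.0f}M params)")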

Evaluation

Dataset

  • ImageNet-1k for image classification.
  • Recruit Datasets for image classification.
  • WAON for image classification.
  • STAIR Captions (Japanese captions for the MSCOCO 2014 validation set) for image-to-text (i2t) and text-to-image (t2i) retrieval. We report R@1, averaged over i2t and t2i retrieval, as sketched below.
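
For reference, this metric can be computed along the following lines from paired image and text embeddings. This is a generic sketch assuming one caption per image and L2-normalized embeddings, not the exact evaluation script.

import torch

def average_recall_at_1(image_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    # image_emb, text_emb: (N, D) L2-normalized embeddings of N aligned image–text pairs.
    sim = image_emb @ text_emb.T                        # (N, N) similarity matrix
    idx = torch.arange(len(sim), device=sim.device)
    i2t = (sim.argmax(dim=1) == idx).float().mean()     # image -> text R@1
    t2i = (sim.argmax(dim=0) == idx).float().mean()     # text -> image R@1
    return ((i2t + t2i) / 2).item()                     # reported value: their average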

Result

Model | Params | Avg. | ImageNet-1k (acc@1) | Recruit Datasets (acc@1) | WAON (acc@1) | STAIR Captions (R@1)
--- | --- | --- | --- | --- | --- | ---
clip-japanese-base-v2 | 196M | 0.708 | 0.666 | 0.913 | 0.975 | 0.277
clip-japanese-base | 196M | 0.673 | 0.580 | 0.884 | 0.934 | 0.293
llm-jp/waon-siglip2-base-path16-256 | 375M | 0.664 | 0.555 | 0.872 | 0.951 | 0.276
google/siglip2-base-patch16-224 | 375M | 0.517 | 0.579 | 0.802 | 0.871 | 0.126
google/siglip2-so400m-patch14-224 | 1135M | 0.642 | 0.643 | 0.837 | 0.925 | 0.163

Licenses

The Apache License, Version 2.0

Citation

@misc{clip-japanese-base-v2,
    title = {CLIP Japanese Base V2},
    author = {Shuntaro Okada, Shuhei Yokoo, Kei Mukaiyama, Peifei Zhu and Shuhei Nishimura},
    url = {https://huggingface.co/line-corporation/clip-japanese-base-v2},
}