# clip-japanese-base-v2
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. It is an updated version of line-corporation/clip-japanese-base that scales the training data to approximately 2B image–text pairs and applies model distillation to improve overall performance.
## How to use

- Install packages

```bash
pip install pillow requests sentencepiece transformers torch timm
```
- Run

```python
import io

import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base-v2'
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download an example image and preprocess it.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)

# Candidate labels: "dog", "cat", "elephant".
text = tokenizer(["犬", "猫", "象"]).to(device)

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]
```
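The same `get_image_features` / `get_text_features` calls can also be reused for text-to-image retrieval. The snippet below is a minimal, illustrative sketch that reuses the objects created above; the image URL list and the Japanese query are placeholders (not part of the official example), and it assumes the processor accepts a list of PIL images, as standard `transformers` image processors do.

```python
# Illustrative text-to-image retrieval, reusing model/processor/tokenizer/device
# from the snippet above. URLs and the query string are assumptions for demonstration.
urls = [
    'https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260',
    # add more image URLs here
]
images = [Image.open(io.BytesIO(requests.get(u).content)) for u in urls]
pixel_values = processor(images, return_tensors="pt").to(device)

query = tokenizer(["草原を走る犬"]).to(device)  # "a dog running in a meadow"

with torch.no_grad():
    image_features = model.get_image_features(**pixel_values)
    query_features = model.get_text_features(**query)

# Higher similarity = better match; print image indices from best to worst.
similarity = (image_features @ query_features.T).squeeze(-1)
print(similarity.argsort(descending=True).tolist())
```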
## Model architecture
The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.
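As a quick sanity check on the architecture, the sketch below (reusing the `model`, `tokenizer`, and `device` objects from the usage example above) prints the total parameter count and the dimensionality of the shared embedding space. Only `get_text_features` is part of the documented API; the rest is generic PyTorch.

```python
# Minimal inspection sketch, reusing model/tokenizer/device from the usage example.
n_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {n_params / 1e6:.0f}M")  # should be roughly 196M, per the table below

with torch.no_grad():
    text_features = model.get_text_features(**tokenizer(["犬"]).to(device))
print("embedding dimension:", text_features.shape[-1])
```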
## Evaluation

### Dataset
- ImageNet-1k for image classification.
- Recruit Datasets for image classification.
- WAON for image classification.
- STAIR Captions (MSCOCO 2014 validation set) for image-to-text (i2t) and text-to-image (t2i) retrieval. We report R@1, the average of i2t and t2i recall@1 (see the sketch below).
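For reference, the following sketch shows one way to compute that averaged R@1 from a similarity matrix between image and text embeddings. It assumes a simplified one-caption-per-image setup; the actual STAIR Captions protocol pairs each image with multiple captions, so this is illustrative rather than a reproduction of the evaluation.

```python
import torch

def recall_at_1(similarity: torch.Tensor) -> float:
    # similarity[i, j]: score between query i and candidate j, where the
    # correct candidate for query i is assumed to sit at index i.
    predictions = similarity.argmax(dim=-1)
    targets = torch.arange(similarity.size(0), device=similarity.device)
    return (predictions == targets).float().mean().item()

def averaged_r1(image_features: torch.Tensor, text_features: torch.Tensor) -> float:
    similarity = image_features @ text_features.T
    i2t = recall_at_1(similarity)    # image-to-text retrieval
    t2i = recall_at_1(similarity.T)  # text-to-image retrieval
    return (i2t + t2i) / 2
```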
### Result
| Model | Params | Avg. | ImageNet-1k (acc@1) | Recruit Datasets (acc@1) | WAON (acc@1) | STAIR Captions (R@1) |
|---|---|---|---|---|---|---|
| clip-japanese-base-v2 | 196M | 0.708 | 0.666 | 0.913 | 0.975 | 0.277 |
| clip-japanese-base | 196M | 0.673 | 0.580 | 0.884 | 0.934 | 0.293 |
| llm-jp/waon-siglip2-base-patch16-256 | 375M | 0.664 | 0.555 | 0.872 | 0.951 | 0.276 |
| google/siglip2-base-patch16-224 | 375M | 0.517 | 0.579 | 0.802 | 0.871 | 0.126 |
| google/siglip2-so400m-patch14-224 | 1135M | 0.642 | 0.643 | 0.837 | 0.925 | 0.163 |
## Licenses
The Apache License, Version 2.0
## Citation

```bibtex
@misc{clip-japanese-base-v2,
    title = {CLIP Japanese Base V2},
    author = {Shuntaro Okada and Shuhei Yokoo and Kei Mukaiyama and Peifei Zhu and Shuhei Nishimura},
    url = {https://huggingface.co/line-corporation/clip-japanese-base-v2},
}
```