SPRIGHT-T2I/spright_coco
Love ❤️ this CLIP?
Buy me a coffee on Ko-Fi ☕
3PscBrWYvrutXedLmvpcnQbE12Py8qLqMK
```python
import torch
import torch.nn.functional as F
from math import radians, cos, sin
from PIL import Image, ImageDraw
from hfmodel.modeling_clip import CLIPModel
from transformers import CLIPProcessor

model = CLIPModel.from_pretrained("zer0int/CLIP-Registers-Gated_MLP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("zer0int/CLIP-Registers-Gated_MLP-ViT-L-14")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

size = 224
im = Image.new("RGB", (size, size), (255, 255, 255))
draw = ImageDraw.Draw(im)

# --------- GPT-4.1's idea of a pineapple. Need an input image... ---------
# Body: yellow ellipse with a brown outline
body_bbox = [size*0.28, size*0.38, size*0.72, size*0.90]
draw.ellipse(body_bbox, fill=(254, 221, 72), outline=(180, 120, 0), width=5)

# "Eyes": small brown ellipses in a staggered grid
eye_color = (198, 134, 66)
for row in range(4):
    for col in range(3):
        ex = size*0.36 + col*size*0.09 + (row % 2)*size*0.045
        ey = size*0.50 + row*size*0.085
        ew, eh = size*0.035, size*0.025
        draw.ellipse([ex-ew, ey-eh, ex+ew, ey+eh], fill=eye_color, outline=None)

# Leaves: green triangles fanning out from the top of the body
leaf_color = (61, 179, 70)
leaf_base_x = size/2
leaf_base_y = size*0.38
for angle, length in [(-28, 65), (-12, 70), (0, 80), (12, 70), (28, 65)]:
    a = radians(angle)
    tip_x = leaf_base_x + length*sin(a)
    tip_y = leaf_base_y - length*cos(a)
    left = (leaf_base_x + 13*cos(a+1.5), leaf_base_y + 13*sin(a+1.5))
    right = (leaf_base_x + 13*cos(a-1.5), leaf_base_y + 13*sin(a-1.5))
    draw.polygon([left, (tip_x, tip_y), right], fill=leaf_color)
im.save("pineapple.png")
# ---------

image = Image.open("pineapple.png").convert("RGB")
texts = ["pine", "apple", "pineapple", "orange", "pear", "person", "cat", "dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)

image_embeds = F.normalize(outputs.image_embeds, dim=-1)
text_embeds = F.normalize(outputs.text_embeds, dim=-1)

cos_sim = (image_embeds @ text_embeds.T).squeeze(0)
for text, sim in zip(texts, cos_sim):
    print(f"Similarity with '{text}': {sim.item():.4f}")
```
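If you want pseudo-probabilities over the candidate texts rather than raw similarities, CLIP's objective scales the cosine similarities by a learned temperature before a softmax. A minimal sketch (the helper name and the default scale of 100, roughly CLIP's converged `logit_scale.exp()`, are my own assumptions):

```python
import torch
import torch.nn.functional as F

def texts_to_probs(cos_sim: torch.Tensor, logit_scale: float = 100.0) -> torch.Tensor:
    """Turn a vector of image-text cosine similarities into a probability
    distribution over the candidate texts by applying CLIP-style
    temperature scaling followed by a softmax."""
    return F.softmax(cos_sim * logit_scale, dim=-1)

# Toy example with made-up similarities for four candidate texts
sims = torch.tensor([0.21, 0.19, 0.31, 0.18])
probs = texts_to_probs(sims)  # heavily peaked on index 2
```

With a real model you would pass `model.logit_scale.exp().item()` instead of the hard-coded default.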
Attention Heatmap, pre-trained OpenAI CLIP ViT-L/14:
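For reference, a CLS-to-patch attention heatmap like the one shown can be extracted from the HF vision tower by running it with `output_attentions=True` and reshaping the CLS row onto the patch grid (16×16 for ViT-L/14 at 224 px; the helper name is mine, not from the repo):

```python
import torch

def cls_attention_heatmap(attn: torch.Tensor, grid: int = 16) -> torch.Tensor:
    """attn: (heads, seq, seq) attention weights from one ViT layer,
    where seq = 1 + grid*grid and the CLS token comes first.
    Returns a (grid, grid) map of head-averaged CLS-to-patch
    attention, min-max normalized to [0, 1]."""
    cls_to_patch = attn[:, 0, 1:].mean(0)   # average over heads, drop CLS column
    heat = cls_to_patch.reshape(grid, grid)
    heat = heat - heat.min()
    return heat / heat.clamp(min=1e-12).max()

# Dummy attention with ViT-L/14 shapes: 12+ heads, 257 tokens (1 CLS + 256 patches)
attn = torch.softmax(torch.randn(16, 257, 257), dim=-1)
heat = cls_attention_heatmap(attn)
```

With a real model the input would be e.g. `model.vision_model(pixel_values, output_attentions=True).attentions[-1][0]` for the last layer of the first batch item, upsampled to image size for overlay.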

Text-To-Image examples, Flux.1-dev, pure CLIP (no T5) guidance:
| Task / Dataset | Metric | ViT-L/14 OpenAI (Pre-trained) | X-GATED (ckpt20 xtreme) | X-GATED (ckpt12 balanced) | X-GATED (ckpt12 balanced, ablated) |
|---|---|---|---|---|---|
| VOC-2007 (Multilabel) | mAP | 0.7615 | 0.8140 | **0.8471** | 0.8247 |
| MSCOCO Retrieval | Image Recall@5 | 0.2194 | **0.3565** | 0.3532 | 0.3349 |
| | Text Recall@5 | 0.3034 | **0.5425** | 0.5278 | 0.5086 |
| Linear Probe CIFAR-10 | Acc@1 | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| | Acc@5 | 0.9966 | **0.9997** | **0.9997** | **0.9997** |
| | Mean Class Recall | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| MVT ImageNet/ObjectNet (Zero-Shot) | Accuracy | 0.8453 | 0.8686 | **0.8830** | 0.8815 |
| Linear Probe ILSVRC2012 | Top-1 | **69.86%** | 66.43% | 67.10% | 68.99% |
| | Top-5 | **92.70%** | 91.52% | 91.83% | 92.64% |
| Modality Gap Metrics | Euclidean Gap ↓ | 0.8276 | **0.4740** | 0.5395 | 0.7486 |
| | JSD ↓ | 0.5200 | 0.1601 | **0.1303** | 0.3310 |
| | Wasserstein Distance ↓ | 0.4084 | **0.1742** | 0.2102 | 0.3262 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | **0.4926** | 0.4794 | 0.3634 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0814 | 0.0758 | 0.0537 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.6657 | 0.6896 | 0.6896 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1671 | 0.1535 | 0.1535 |
Bolded values represent the best performance for each metric.
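The modality-gap rows can be reproduced in spirit from a batch of paired embeddings: the Euclidean gap is the distance between the image and text centroids, and the cosine-similarity stats are taken over matched image-text pairs. A minimal sketch (the function name is mine; exact evaluation details of the table may differ):

```python
import torch
import torch.nn.functional as F

def modality_gap(img: torch.Tensor, txt: torch.Tensor):
    """img, txt: (N, D) paired image/text embeddings.
    Returns (euclidean_gap, pair_cos_mean, pair_cos_std), where the
    gap is the distance between the L2-normalized modality centroids."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    gap = (img.mean(0) - txt.mean(0)).norm().item()
    pair_cos = (img * txt).sum(-1)   # cosine sim of each matched pair
    return gap, pair_cos.mean().item(), pair_cos.std().item()

# Sanity check: identical embeddings give zero gap and cos sim 1
x = F.normalize(torch.randn(8, 4), dim=-1)
gap, mu, sd = modality_gap(x, x)
```

In practice `img` and `txt` would be the normalized `image_embeds` / `text_embeds` from the snippet above, computed over a retrieval set such as MSCOCO.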