# BLIP – UI Elements Captioning
This model is a fine-tuned version of Salesforce/blip-image-captioning-base, adapted for captioning UI elements from macOS application screenshots.
It is part of the Screen2AX research project focused on improving accessibility using vision-based deep learning.
## Use Case
The model takes an image of a UI icon or element and generates a natural language description (e.g., "Settings icon", "Play button", "Search field").
This helps build assistive technologies such as screen readers by providing textual labels for unlabeled visual components.
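For quick experiments, the checkpoint can also be run through the high-level `pipeline` helper. Note that the `image-to-text` pipeline type was removed in Transformers v5, so this route requires `transformers<5.0.0`; on newer versions, load the model directly as shown in the example further down. A minimal sketch (the icon path is a placeholder):

```python
# Requires transformers < 5.0.0: the "image-to-text" pipeline type
# was removed in v5. Install with: pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="macpaw-research/blip-icon-captioning")

# Accepts a local path, URL, or PIL image (path here is a placeholder)
result = pipe("path/to/ui_icon.png")
print(result[0]["generated_text"])  # e.g. "Settings icon"
```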
## Model Architecture
- Base model: `Salesforce/blip-image-captioning-base`
- Architecture: BLIP (Bootstrapping Language-Image Pre-training)
- Task: `image-to-text` (see the loading sketch below)
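As referenced in the task list above, the checkpoint also loads through the generic Auto classes, which resolve to the BLIP-specific processor and model used in the example below:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

# The Auto classes resolve to the BLIP processor/model for this checkpoint
processor = AutoProcessor.from_pretrained("macpaw-research/blip-icon-captioning")
model = AutoModelForImageTextToText.from_pretrained("macpaw-research/blip-icon-captioning")
```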
## Example
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("macpaw-research/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("macpaw-research/blip-icon-captioning")

# Load the UI element crop; convert to RGB in case the icon has an alpha channel
image = Image.open("path/to/ui_icon.png").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)
# Example output: "Settings icon"
```
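When captioning many detected elements from a single screenshot, the processor accepts a list of images, so batched generation is straightforward. A minimal sketch; the file names and the `max_new_tokens` value are illustrative assumptions, not part of the original card:

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("macpaw-research/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("macpaw-research/blip-icon-captioning")
model.eval()

# Hypothetical crops of UI elements taken from one screenshot
paths = ["icon_0.png", "icon_1.png", "icon_2.png"]
images = [Image.open(p).convert("RGB") for p in paths]

# The image processor resizes all crops to a fixed size, so they batch cleanly
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)

captions = processor.batch_decode(output, skip_special_tokens=True)
for path, caption in zip(paths, captions):
    print(path, "->", caption)
```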
## License
This model is released under the MIT License.
## Related Projects

- Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation ([arXiv:2507.16704](https://arxiv.org/abs/2507.16704))
## Citation
If you use this model in your research, please cite the Screen2AX paper:
```bibtex
@misc{muryn2025screen2axvisionbasedapproachautomatic,
      title={Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation},
      author={Viktor Muryn and Marta Sumyk and Mariya Hirna and Sofiya Garkot and Maksym Shamrai},
      year={2025},
      eprint={2507.16704},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.16704},
}
```
## MacPaw Research

Learn more at https://research.macpaw.com