Update README.md
README.md
CHANGED
@@ -1,3 +1,155 @@
----
-license:
-
---
license: apache-2.0
library_name: transformers
base_model: OpenGVLab/InternVL2-4B
pipeline_tag: image-text-to-text
---

# OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

<div align="center">

[\[🏠Homepage\]](https://qiushisun.github.io/OS-Genesis-Home/) [\[💻Code\]](https://github.com/OS-Copilot/OS-Genesis) [\[📝Paper\]](https://arxiv.org/abs/2412.19723) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d) [\[🤗Data\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d)

</div>

## Overview
![os-genesis](https://cdn-uploads.huggingface.co/production/uploads/6064a0eeb1703ddba0d458b9/XvcAh92uvJQglmIu_L_nK.png)

We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.

## Quick Start
OS-Genesis-4B-AW is a mobile action model finetuned from [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B).

### OS-Genesis AW Family Models
The following table gives an overview of the OS-Genesis AW family models, which are used for evaluation on the AndroidWorld benchmark.

| Model Name | Base Model | Training Data | HF Link |
| :--------: | :--------: | :-----------: | :-----: |
| OS-Genesis-4B-AW | [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B) | [OS-Genesis-aw-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_aw_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-4B-AW) |
| OS-Genesis-7B-AW | [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | [OS-Genesis-aw-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_aw_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-7B-AW) |
| OS-Genesis-8B-AW | [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) | [OS-Genesis-aw-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_aw_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-8B-AW) |

### Inference Example
First, install the `transformers` library:

```
pip install transformers
```

For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

For evaluating the AndroidWorld Benchmark, please refer to the [**evaluation code**](https://github.com/OS-Copilot/OS-Genesis/tree/main/evaluation/android_world).

Inference code example:
```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OS-Copilot/OS-Genesis-4B-AW'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = "<image>\nYou are a GUI task expert, I will provide you with a high-level instruction, an action history, a screenshot with its corresponding accessibility tree.\n High-level instruction: {high_level_instruction}\n Action history: {action_history}\n Accessibility tree: {a11y_tree}\n Please generate the low-level thought and action for the next step."
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
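
The `question` string in the inference example keeps `{high_level_instruction}`, `{action_history}`, and `{a11y_tree}` as literal placeholders. Below is a minimal sketch of filling them before calling `model.chat`, assuming plain `str.format` substitution. The helper name `build_question`, the way the action history is joined, the `"None"` fallback for an empty history, and all sample values are hypothetical and not part of the OS-Genesis release; the evaluation code linked above is the reference for the exact format.

```python
# Template text copied from the inference example; only the placeholders get filled.
PROMPT_TEMPLATE = (
    "<image>\nYou are a GUI task expert, I will provide you with a high-level instruction, "
    "an action history, a screenshot with its corresponding accessibility tree.\n"
    " High-level instruction: {high_level_instruction}\n"
    " Action history: {action_history}\n"
    " Accessibility tree: {a11y_tree}\n"
    " Please generate the low-level thought and action for the next step."
)


def build_question(high_level_instruction, action_history, a11y_tree):
    """Fill the three placeholders of the prompt template (hypothetical helper)."""
    # Assumption: earlier actions are joined into one line, and an empty
    # history is rendered as "None"; the format used in training may differ.
    history_text = "; ".join(action_history) if action_history else "None"
    return PROMPT_TEMPLATE.format(
        high_level_instruction=high_level_instruction,
        action_history=history_text,
        a11y_tree=a11y_tree,
    )


question = build_question(
    high_level_instruction="Open the Settings app and turn on Wi-Fi.",  # sample value
    action_history=["open_app: Settings"],  # sample value
    a11y_tree="[0] TextView 'Network & internet' ...",  # sample value
)
print(question)
```

The resulting `question` can be passed to `model.chat` in place of the unfilled template. Because `str.format` only parses the template, braces inside the accessibility tree text are passed through unchanged.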
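
To see what `max_num` controls, the tile grid that `dynamic_preprocess` picks can be traced by hand without loading an image. The sketch below re-implements only the closest-aspect-ratio search and omits the area-based tie-break, so it is a simplification rather than the exact preprocessing code. The function name `pick_tile_grid` and the 1080×2400 screenshot size are illustrative assumptions.

```python
def pick_tile_grid(width, height, min_num=1, max_num=6):
    """Return the (columns, rows) grid whose aspect ratio is closest to the image's.

    Simplified from dynamic_preprocess: the area-based tie-break is omitted,
    so on ties the candidate with the fewest tiles wins.
    """
    aspect_ratio = width / height
    # All (columns, rows) grids with between min_num and max_num tiles,
    # ordered by tile count so that min() prefers smaller grids on ties.
    candidates = sorted(
        {(i, j)
         for i in range(1, max_num + 1)
         for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda grid: (grid[0] * grid[1], grid),
    )
    return min(candidates, key=lambda grid: abs(aspect_ratio - grid[0] / grid[1]))


# Assumed example: a portrait phone screenshot of 1080x2400 pixels.
cols, rows = pick_tile_grid(1080, 2400, max_num=6)
num_tiles = cols * rows
# With use_thumbnail=True, one thumbnail tile is appended when there is more than one tile.
num_inputs = num_tiles + (1 if num_tiles > 1 else 0)
print((cols, rows), num_inputs)  # prints (1, 2) 3
```

For this assumed screenshot the image is resized to 448×896, cut into two 448×448 tiles, and a thumbnail is appended, so `load_image(..., max_num=6)` would return a tensor of shape `(3, 3, 448, 448)`.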

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{sun2024osgenesis,
  title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
  author={Qiushi Sun and Kanzhi Cheng and Zichen Ding and Chuanyang Jin and Yian Wang and Fangzhi Xu and Zhenyu Wu and Chengyou Jia and Liheng Chen and Zhoumianze Liu and Ben Kao and Guohao Li and Junxian He and Yu Qiao and Zhiyong Wu},
  journal={arXiv preprint arXiv:2412.19723},
  year={2024}
}
```