---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---

# DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This is the **diffusers-compatible** version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format together with a self-contained pipeline script (`deepgen_pipeline.py`), so there is no need to clone the DeepGen repository.

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or even surpasses, state-of-the-art unified multimodal models that are 3× to 16× larger.

## Quick Start

### Installation

```bash
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```

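Because Flash Attention is an optional dependency, it can be useful to verify that the package actually imported before assuming the optimized attention kernels are in use. This is a small convenience check, not part of the released pipeline:

```python
# Check whether the optional flash-attn package is importable.
# If it is missing, the model falls back to a default attention implementation.
try:
    import flash_attn  # noqa: F401
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

print(f"Flash Attention available: {HAS_FLASH_ATTN}")
```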
### Load Pipeline

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24 GB)
# pipe.enable_model_cpu_offload()
```

### Text-to-Image

```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```

### Image Editing

```python
from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing; if `None`, performs text-to-image generation |
| `height` | 512 | Output image height (pixels) |
| `width` | 512 | Output image width (pixels) |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for classifier-free guidance |

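The defaults above can be collected into a small helper that assembles the keyword arguments for a pipeline call. `build_pipe_kwargs` is a hypothetical convenience function written for this card, not part of the released pipeline:

```python
def build_pipe_kwargs(prompt, image=None, height=512, width=512,
                      num_inference_steps=50, guidance_scale=4.0,
                      seed=None, negative_prompt=""):
    """Assemble pipeline call arguments, mirroring the defaults in the table above."""
    kwargs = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "negative_prompt": negative_prompt,
    }
    if image is not None:   # editing mode only when a source image is supplied
        kwargs["image"] = image
    if seed is not None:    # omit to let the pipeline draw a random seed
        kwargs["seed"] = seed
    return kwargs

# Usage: result = pipe(**build_pipe_kwargs("a red apple on a table", seed=42))
```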
## Memory Requirements

| Mode | VRAM |
|------|------|
| Full GPU | ~20 GB |
| CPU offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |

## Directory Structure

```
DeepGen-1.0-diffusers/
├── transformer/            # SD3 DiT weights (safetensors)
├── vae/                    # AutoencoderKL weights
├── connector/              # SCB Connector weights + config
├── scheduler/              # FlowMatchEulerDiscreteScheduler config
├── tokenizer/              # Qwen2.5-VL tokenizer
├── prompt_template.json    # Prompt formatting template
├── model_index.json        # Model metadata
└── deepgen_pipeline.py     # Self-contained pipeline script
```

> **Note:** The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.

## Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with, or even surpassing, much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.

| Component | Parameters | Description |
|-----------|-----------|-------------|
| VLM (Qwen2.5-VL-3B) | 3B | Visual language model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |

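The SCB data flow described above can be sketched at the level of tensor shapes: features from several VLM layers are stacked along the channel axis, then joined with learnable think tokens before entering the connector Transformer. The layer indices, hidden size, and think-token count below are illustrative assumptions, not the released configuration:

```python
import numpy as np

batch, seq_len, hidden = 2, 77, 2048        # assumed VLM hidden size
tap_layers = [8, 16, 24, 32]                # assumed VLM layers to tap
num_think_tokens = 16                       # assumed number of learnable think tokens

# Hierarchical features: one hidden-state tensor per tapped VLM layer.
layer_feats = [np.random.randn(batch, seq_len, hidden) for _ in tap_layers]

# "Stacked Channel Bridging": stack layer features along the channel axis...
stacked = np.concatenate(layer_feats, axis=-1)   # (batch, seq_len, len(tap_layers) * hidden)

# ...then prepend the think tokens along the sequence axis; the connector
# Transformer would map this sequence into the DiT conditioning space.
think = np.random.randn(batch, num_think_tokens, stacked.shape[-1])
connector_input = np.concatenate([think, stacked], axis=1)

print(connector_input.shape)   # (2, 93, 8192)
```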
## Benchmarks

### 1. General Image Generation

| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --------------------- | ----------- | --------- | ---------- | ------------- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | – |
| Qwen-Image | 7B + 20B | 0.87 | 88.32 | 78.81 |
| LongCat-Image | 7B + 6B | 0.87 | 86.80 | – |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | – | 84.78 | – |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 | 87.05 | 74.18 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 | 87.90 | 75.74 |

### 2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 | 4.35 |
| LongCat-Image-Edit | 7B + 6B | 7.60 | 4.50 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 | 4.14 |

### 3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 |
| Z-Image-Turbo | 4B + 6B | – | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 | 46.5 |

### 4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | – | 43.4 |
| BAGEL | 14B | 11.9 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 | 77.5 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 | 75.7 |

## Citation

```bibtex
@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}
```

## License

Apache 2.0