# Tina: Text-to-Model Generative AI (CIFAR-100, CNN)
Tina is a text-conditioned neural network diffusion model that generates personalized image classifiers from natural language prompts. Given a text description of the desired classification task (e.g., a list of class names), Tina directly outputs the full parameters of a lightweight CNN — no gradient-based training required at inference time.
This checkpoint is the Tina model trained on CIFAR-100, capable of generating personalized CNN classifiers (~5K parameters) for up to 10 classes from text prompts.
## Model Description
| Property | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT), GPT-2 style backbone |
| Text Encoder | CLIP ViT-B/32 (frozen) |
| Hidden Size | 2048 |
| Transformer Layers | 12 encoder layers + 12 decoder layers |
| Attention Heads | 16 |
| Diffusion Steps | 1000 (DDPM sampling) |
| Prediction Type | Signal prediction (x₀) |
| Generated Model | 2-layer CNN, ~5K parameters |
| Max Classification Classes | 10 |
| Training p-Models | 1000 personalized models |
| Training Dataset | CIFAR-100 (100 classes, 32×32 images) |
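The ~5K parameter budget of the generated model can be sanity-checked with a plausible layer configuration. The exact layer widths below are assumptions for illustration; the card only specifies a 2-layer CNN of roughly 5K parameters for 32×32 RGB inputs and up to 10 classes.

```python
# Parameter count for one plausible 2-layer CNN within Tina's ~5K budget.
# Layer widths are assumed, not taken from the Tina paper.

def conv2d_params(c_in, c_out, k):
    # weights (c_in * c_out * k * k) plus one bias per output channel
    return c_in * c_out * k * k + c_out

def linear_params(d_in, d_out):
    return d_in * d_out + d_out

layers = {
    "conv1 (3->16, 3x3)":  conv2d_params(3, 16, 3),
    "conv2 (16->32, 3x3)": conv2d_params(16, 32, 3),
    "head  (32->10)":      linear_params(32, 10),   # after global average pooling
}
total = sum(layers.values())
print(layers, total)   # total = 5418, i.e. ~5K parameters
```

Any configuration in this neighborhood would match the stated budget; the point is that a two-convolution network with a small linear head lands naturally around 5K parameters.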
## How It Works
Tina treats model generation as a conditional diffusion process: just as text-to-image diffusion models denoise random pixels into coherent images, Tina denoises random vectors into functional neural network parameters.
- Training: Tina is trained on (task description, personalized model) pairs. Each personalized model is a CNN fine-tuned on a specific 10-class subset of CIFAR-100.
- Inference: Given a text prompt listing the desired classes (e.g., ["apple", "bear", "bicycle", "bus", "castle", "clock", "cloud", "forest", "mountain", "train"]), Tina generates a complete CNN classifier through 1000 DDPM denoising steps, with no gradient-based training at any point.
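The denoising procedure above can be sketched as standard DDPM ancestral sampling with x₀ (signal) prediction, matching the prediction type in the table. This is a generic sketch: the linear beta schedule and the toy denoiser standing in for Tina's conditional DiT are assumptions, not Tina's actual implementation.

```python
import numpy as np

def ddpm_sample_x0(predict_x0, dim, steps=1000, seed=0):
    """DDPM ancestral sampling when the network predicts the clean signal x0.

    predict_x0(x_t, t) -> x0_hat stands in for Tina's text-conditioned DiT;
    here it can be any callable.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # assumed linear schedule
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)                # cumulative product: alpha-bar_t
    x = rng.standard_normal(dim)             # start from pure noise
    for t in reversed(range(steps)):
        x0_hat = predict_x0(x, t)
        abar_prev = abar[t - 1] if t > 0 else 1.0
        # posterior mean of q(x_{t-1} | x_t, x0), expressed in terms of x0_hat
        coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar[t])
        coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar[t])
        mean = coef_x0 * x0_hat + coef_xt * x
        if t > 0:
            var = betas[t] * (1.0 - abar_prev) / (1.0 - abar[t])
            x = mean + np.sqrt(var) * rng.standard_normal(dim)
        else:
            x = mean  # final step is deterministic
    return x

# Toy usage: a "denoiser" that always predicts the zero vector,
# so the sampled parameter vector contracts exactly to zero.
params = ddpm_sample_x0(lambda x, t: np.zeros_like(x), dim=8, steps=50)
```

In Tina's case, `predict_x0` would be the DiT conditioned on the CLIP embedding of the class-list prompt, and `dim` would be the flattened parameter count of the target CNN.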
Thanks to the vision-language alignment of CLIP, Tina also supports:
- Image prompts: Zero-shot and few-shot image-prompted generation
- Natural language descriptions: Using class descriptions instead of class names
- Unseen classes: Generalization to classes not seen during training
- Variable class counts: Any number of classes up to 10 via classification sequence padding
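The variable-class-count support can be sketched as padding the class list to the 10-slot maximum before encoding. The `<pad>` token and plain-list format below are illustrative assumptions, not Tina's actual prompt specification.

```python
# Hedged sketch of variable class counts via padding to a fixed 10 slots.
MAX_CLASSES = 10
PAD = "<pad>"  # assumed placeholder token

def pad_class_list(classes):
    """Pad a class-name list to MAX_CLASSES entries for a fixed-size prompt."""
    if len(classes) > MAX_CLASSES:
        raise ValueError(f"at most {MAX_CLASSES} classes supported")
    return classes + [PAD] * (MAX_CLASSES - len(classes))

prompt_classes = pad_class_list(["apple", "bear", "bicycle"])
```

The generated classifier would then simply ignore logits for padded slots at evaluation time.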
## Intended Use
- On-demand personalized classification: Quickly generate a lightweight classifier tailored to a user's specific needs without any training data or GPU-intensive fine-tuning.
- Edge AI deployment: The generated CNN (~5K params) is extremely lightweight, suitable for resource-constrained devices.
- Research on text-to-model generation: Exploring the paradigm of generating functional AI models from natural language.
## Performance

### Main Results on CIFAR-100 (10-class personalization)
| Method | In-Distribution Acc. (%) | Out-of-Distribution Acc. (%) |
|---|---|---|
| Generic Model | 28.72 | 29.88 |
| Classifier Selection | 64.83 | 64.15 |
| TAPER | 67.71 | 66.85 |
| Tina (this model) | 68.35 | 67.14 |
### Inference Efficiency
| Method | Time per model (CNN) |
|---|---|
| Pretrain + fine-tune | 94.35s |
| TAPER | 18.10s |
| Tina | 4.88s |
## Limitations
- This checkpoint generates CNN classifiers only (2-layer, ~5K parameters) for CIFAR-100 class subsets.
- Input images are expected to be 32×32 resolution.
- A single Tina checkpoint cannot generate models across different architectures or modalities simultaneously.
- Performance on entirely out-of-domain classes (beyond CIFAR-100 semantic scope) may degrade.