# MiniGPT-124M
A 124-million parameter GPT-style language model trained from scratch on a mixture of TinyStories and Wikitext-103, using pure PyTorch. Inspired by Andrej Karpathy's nanoGPT.
## Model Details

| Property | Value |
|---|---|
| Parameters | ~124M |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 256 tokens |
| Vocabulary size | 50,257 (GPT-2) |
| Tokenizer | tiktoken (`gpt2` encoding) |
| Activation | GELU |
| Architecture | Decoder-only Transformer |
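The ~124M figure follows directly from the table above. A quick back-of-envelope count, assuming GPT-2-style blocks with biases and a weight-tied output head (as in nanoGPT):

```python
# Back-of-envelope parameter count from the config in the table.
# Assumes GPT-2-style blocks with biases and a weight-tied output head.
vocab, ctx, d, layers = 50_257, 256, 768, 12

embeddings = vocab * d + ctx * d              # token + positional embeddings
attn = d * 3 * d + 3 * d + d * d + d          # fused QKV projection + output projection
mlp = d * 4 * d + 4 * d + 4 * d * d + d       # two linear layers, 4x expansion
norms = 2 * 2 * d                             # two LayerNorms (weight + bias) per block
per_block = attn + mlp + norms

total = embeddings + layers * per_block + 2 * d   # + final LayerNorm
print(f"{total:,}")  # prints 123,849,984 (~124M)
```

With the output head tied to the token embedding, it contributes no extra parameters, which is why the total lands just under 124M.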
## Training

| Property | Value |
|---|---|
| Dataset | TinyStories + Wikitext-103 (mixed) |
| Training steps | 8,000 |
| Batch size | 16 (× 4 gradient accumulation = effective 64) |
| Learning rate | 2e-4 with cosine decay + warmup |
| Optimizer | AdamW (weight decay 0.1) |
| Precision | Mixed precision (AMP) |
| Hardware | 2× NVIDIA T4 GPUs |
| Training time | ~8 hours |
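The warmup-plus-cosine schedule can be sketched as below. The warmup length (400 steps) and the minimum learning rate (2e-5) are illustrative assumptions; the values actually used are recorded in `config.json`.

```python
import math

# Sketch of linear warmup + cosine decay over the 8,000 training steps.
# WARMUP and MIN_LR are assumptions for illustration, not the repo's exact values.
MAX_LR, MIN_LR = 2e-4, 2e-5
WARMUP, MAX_STEPS = 400, 8_000

def lr_at(step: int) -> float:
    if step < WARMUP:
        # Linear warmup from ~0 up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP) / (MAX_STEPS - WARMUP)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```

The schedule peaks at 2e-4 right at the end of warmup and decays smoothly toward the floor by step 8,000.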
## Usage

### Install dependencies

```bash
pip install torch safetensors tiktoken huggingface_hub
```
### Load and generate

```python
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

import importlib.util

import torch
import tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "sopanm11/Mini-GPT-124M"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the weights and the model definition from the Hub.
model_path = hf_hub_download(repo_id=REPO_ID, filename="model.safetensors")
modeling_path = hf_hub_download(repo_id=REPO_ID, filename="modelling_mini_gpt.py")

# Import the architecture file dynamically.
spec = importlib.util.spec_from_file_location("modelling_mini_gpt", modeling_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
GPT, GPTConfig = module.GPT, module.GPTConfig

state_dict = load_file(model_path)
# The token embedding is tied to the output head, so the checkpoint may store
# only one of the two matrices; restore the tied copy if it is missing.
if "token_emb.weight" not in state_dict:
    state_dict["token_emb.weight"] = state_dict["output_head.weight"]
state_dict.pop("output_head.bias", None)

model = GPT(GPTConfig()).to(device)
model.load_state_dict(state_dict)
model.eval()

enc = tiktoken.get_encoding("gpt2")

def generate(prompt, max_tokens=200, temperature=0.8):
    input_ids = torch.tensor([enc.encode(prompt)], dtype=torch.long, device=device)
    output_ids = model.generate(input_ids, max_new_tokens=max_tokens, temperature=temperature)
    return enc.decode(output_ids[0].tolist())

print(generate("Once upon a time", max_tokens=200, temperature=0.8))
```
### Temperature guide

| Temperature | Effect |
|---|---|
| 0.2 – 0.5 | Conservative, repetitive, more coherent |
| 0.7 – 0.9 | Balanced creativity (recommended) |
| 1.0+ | Very creative, less coherent |
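Temperature works by dividing the logits before the softmax: values below 1 sharpen the next-token distribution toward the most likely token, while values above 1 flatten it. A minimal illustration with made-up logits:

```python
import torch

# Three hypothetical next-token logits; temperature rescales them before softmax.
logits = torch.tensor([2.0, 1.0, 0.5])

for t in (0.2, 0.8, 1.5):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"T={t}: {probs.tolist()}")
```

At T=0.2 nearly all probability mass lands on the top token (hence the repetitive output), while at T=1.5 the distribution is close to uniform.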
## Files in this repo

| File | Description |
|---|---|
| `model.safetensors` | Trained model weights |
| `modelling_mini_gpt.py` | Full model architecture (pure PyTorch) |
| `config.json` | Hyperparameters used during training |
| `tokenizer_config.json` | Tokenizer metadata (GPT-2) |
## Limitations

- Context window is limited to 256 tokens
- Trained on a relatively small dataset; not suitable for production use
- May generate incoherent or repetitive text on complex prompts
- No instruction tuning or RLHF; base language model only
## License

MIT