# MiniGPT-124M
A 124-million parameter GPT-style language model trained from scratch on a mixture of TinyStories and Wikitext-103, using pure PyTorch. Inspired by Andrej Karpathy's nanoGPT.
## Model Details

| Property | Value |
|---|---|
| Parameters | ~124M |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 256 tokens |
| Vocabulary size | 50,257 (GPT-2) |
| Tokenizer | tiktoken (`gpt2` encoding) |
| Activation | GELU |
| Architecture | Decoder-only Transformer |
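The ~124M figure follows directly from the table above. A quick back-of-envelope count, assuming GPT-2-style blocks with biases and a weight-tied output head (as in nanoGPT):

```python
# Back-of-envelope parameter count from the config in the table.
# Assumes GPT-2-style blocks with biases and a weight-tied output head.
vocab, ctx, d, layers = 50_257, 256, 768, 12

embeddings = vocab * d + ctx * d              # token + positional embeddings
attn = d * 3 * d + 3 * d + d * d + d          # fused QKV projection + output projection
mlp = d * 4 * d + 4 * d + 4 * d * d + d       # two linear layers, 4x expansion
norms = 2 * 2 * d                             # two LayerNorms (weight + bias) per block
per_block = attn + mlp + norms

total = embeddings + layers * per_block + 2 * d   # + final LayerNorm
print(f"{total:,}")  # prints 123,849,984 (~124M)
```

With the output head tied to the token embedding, it contributes no extra parameters, which is why the total lands just under 124M.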
## Training

| Property | Value |
|---|---|
| Dataset | TinyStories + Wikitext-103 (mixed) |
| Training steps | 8,000 |
| Batch size | 16 (× 4 gradient accumulation = effective 64) |
| Learning rate | 2e-4 with cosine decay + warmup |
| Optimizer | AdamW (weight decay 0.1) |
| Precision | Mixed precision (AMP) |
| Hardware | 2× NVIDIA T4 GPUs |
| Training time | ~8 hours |
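The warmup-plus-cosine schedule can be sketched as below. The warmup length (400 steps) and the minimum learning rate (2e-5) are illustrative assumptions; the values actually used are recorded in `config.json`.

```python
import math

# Sketch of linear warmup + cosine decay over the 8,000 training steps.
# WARMUP and MIN_LR are assumptions for illustration, not the repo's exact values.
MAX_LR, MIN_LR = 2e-4, 2e-5
WARMUP, MAX_STEPS = 400, 8_000

def lr_at(step: int) -> float:
    if step < WARMUP:
        # Linear warmup from ~0 up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP) / (MAX_STEPS - WARMUP)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```

The schedule peaks at 2e-4 right at the end of warmup and decays smoothly toward the floor by step 8,000.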
## Usage

### Install dependencies

```bash
pip install torch safetensors tiktoken huggingface_hub
```
### Load and generate

```python
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

import importlib.util

import torch
import tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "sopanm11/Mini-GPT-124M"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the weights and the model definition from the Hub.
model_path = hf_hub_download(repo_id=REPO_ID, filename="model.safetensors")
modeling_path = hf_hub_download(repo_id=REPO_ID, filename="modelling_mini_gpt.py")

# Import the architecture file dynamically.
spec = importlib.util.spec_from_file_location("modelling_mini_gpt", modeling_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
GPT, GPTConfig = module.GPT, module.GPTConfig

state_dict = load_file(model_path)
# The token embedding is tied to the output head, so the checkpoint may store
# only one of the two matrices; restore the tied copy if it is missing.
if "token_emb.weight" not in state_dict:
    state_dict["token_emb.weight"] = state_dict["output_head.weight"]
state_dict.pop("output_head.bias", None)

model = GPT(GPTConfig()).to(device)
model.load_state_dict(state_dict)
model.eval()

enc = tiktoken.get_encoding("gpt2")

def generate(prompt, max_tokens=200, temperature=0.8):
    input_ids = torch.tensor([enc.encode(prompt)], dtype=torch.long, device=device)
    output_ids = model.generate(input_ids, max_new_tokens=max_tokens, temperature=temperature)
    return enc.decode(output_ids[0].tolist())

print(generate("Once upon a time", max_tokens=200, temperature=0.8))
```
### Temperature guide

| Temperature | Effect |
|---|---|
| 0.2 – 0.5 | Conservative, repetitive, more coherent |
| 0.7 – 0.9 | Balanced creativity (recommended) |
| 1.0+ | Very creative, less coherent |
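Temperature works by dividing the logits before the softmax: values below 1 sharpen the next-token distribution toward the most likely token, while values above 1 flatten it. A minimal illustration with made-up logits:

```python
import torch

# Three hypothetical next-token logits; temperature rescales them before softmax.
logits = torch.tensor([2.0, 1.0, 0.5])

for t in (0.2, 0.8, 1.5):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"T={t}: {probs.tolist()}")
```

At T=0.2 nearly all probability mass lands on the top token (hence the repetitive output), while at T=1.5 the distribution is close to uniform.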
## Files in this repo

| File | Description |
|---|---|
| `model.safetensors` | Trained model weights |
| `modelling_mini_gpt.py` | Full model architecture (pure PyTorch) |
| `config.json` | Hyperparameters used during training |
| `tokenizer_config.json` | Tokenizer metadata (GPT-2) |
## Limitations

- Context window is limited to 256 tokens
- Trained on a relatively small dataset; not suitable for production use
- May generate incoherent or repetitive text on complex prompts
- No instruction tuning or RLHF; base language model only
## License

MIT