
GPT-X2-125M is the second-generation GPT-X model: 125M parameters, 75B training tokens, a custom 32K tokenizer, and 30 layers. It was trained from scratch with a 4-source progressive curriculum and AST-normalized code, and reaches near state-of-the-art performance on both natural language and structured reasoning benchmarks at the 125M scale on a fraction of the training data.

Results

Figure: Average Score vs Training Compute

GPT-X2-125M achieves competitive performance with leading models despite using significantly less data. Most notably, it matches SmolLM2-135M within ~1 point on aggregate while using ~27x fewer tokens and having ~8% fewer parameters.

Evaluated with an internal harness modeled on EleutherAI/lm-eval-harness; all benchmarks are zero-shot.

| Company | Model | HellaSwag | ARC (Average) | PIQA | LogicMark | Winogrande | ArithMark | Average | Training tokens |
|---|---|---|---|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 48.78% | 48.46% | 33.26% | 47.64% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 49.12% | 49.01% | 34.78% | 46.72% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 43.89% | 50.43% | 32.34% | 46.62% | 600B |
| Facebook | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 45.04% | 50.28% | 46.94% | 46.07% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 43.83% | 50.83% | 30.52% | 44.39% | 15B |
| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 42.04% | 53.10% | 31.16% | 44.33% | 1T |
| OpenAI | GPT-3 (125M) | 33.70% | 35.10% | 64.60% | NA | 52.00% | NA | NA | 300B |
| OpenAI | GPT-2 Medium (355M) | 39.40% | 34.80% | 66.30% | 44.90% | 50.40% | 34.80% | 43.94% | ~10B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | 44.52% | 48.54% | 32.80% | 42.01% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | 40.87% | 49.41% | 28.06% | 39.45% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 43.81% | 49.96% | 27.48% | 41.03% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 45.40% | 49.09% | 29.98% | 41.37% | 300B |

LogicMark and ArithMark are procedural benchmarks designed to evaluate structured reasoning and arithmetic generalization across increasing difficulty levels.


Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-X2-125M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```

What's New in v2

| Change | GPT-X (v1) | GPT-X2 (v2) | Why |
|---|---|---|---|
| Tokenizer | GPT-2 BPE (50K) | Custom 32K trained on FineWeb-Edu | ~9% better compression, frees params for layers |
| Depth | 27 layers | 30 layers | Saved embedding params reinvested into 3 extra layers |
| rope_theta | 10,000 | 100,000 | Better long-range attention (SmolLM2-proven) |
| Learning rate | 6e-4 | 1.5e-3 | SmolLM used 3e-3 at this scale; v1 was too conservative |
| LR decay | Cosine to 1/10th peak | WSD decay to 0 | SmolLM-style; tighter convergence during cooldown |
| Warmup | 1,000 steps | 2,000 steps | Higher LR needs longer warmup for stability |
| Data | 15B tokens, FineWeb-Edu only | 75B tokens, 4-source curriculum | 5x more data with progressive multi-source mixing |
| Training | Messy resume | Clean from-scratch | Controlled curriculum with planned distribution shifts |

Architecture

| Component | Details |
|---|---|
| Position encoding | RoPE (theta=100,000) |
| Normalization | RMSNorm (float32 upcast) |
| Feed-forward | SwiGLU (3-matrix gated MLP) |
| Attention | Grouped Query Attention -- 9Q / 3KV (3:1) |
| QK stability | QK-Norm (RMSNorm per head, before RoPE) |
| Bias | None (all layers bias-free) |
| Embedding | sqrt(d_model) scaling + weight tying |
| Auxiliary loss | z-loss (coefficient 1e-4 on logit magnitudes for the first 31B tokens, then 0) |
| Depth | 30 layers x 576 hidden |
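
The table above maps onto code fairly directly. Below is a minimal PyTorch sketch of one block; it is an illustration rather than the released modeling code, and the class names, pre-norm residual layout, RoPE helper, and shared per-head QK-Norm gains are assumptions chosen to be consistent with the parameter counts in the next section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm with float32 upcast, bias-free."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        x32 = x.float()
        x32 = x32 * torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + self.eps)
        return (x32 * self.weight.float()).type_as(x)

def rope(x, theta=100_000.0):
    """Rotary position embedding; x is (batch, seq, heads, head_dim)."""
    _, T, _, D = x.shape
    freqs = 1.0 / theta ** (torch.arange(0, D, 2, dtype=torch.float32) / D)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class SwiGLU(nn.Module):
    """3-matrix gated MLP: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim=576, hidden=1536):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class GQAttention(nn.Module):
    """Grouped Query Attention (9 Q / 3 KV heads) with QK-Norm applied before RoPE."""
    def __init__(self, dim=576, n_head=9, n_kv=3, head_dim=64):
        super().__init__()
        self.n_head, self.n_kv, self.head_dim = n_head, n_kv, head_dim
        self.wq = nn.Linear(dim, n_head * head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv * head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv * head_dim, bias=False)
        self.wo = nn.Linear(n_head * head_dim, dim, bias=False)
        self.q_norm = RMSNorm(head_dim)  # single 64-dim gain shared across heads (assumption)
        self.k_norm = RMSNorm(head_dim)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_head, self.head_dim)
        k = self.wk(x).view(B, T, self.n_kv, self.head_dim)
        v = self.wv(x).view(B, T, self.n_kv, self.head_dim)
        q, k = rope(self.q_norm(q)), rope(self.k_norm(k))  # QK-Norm, then RoPE
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        # repeat the 3 KV heads to match the 9 query heads (3:1 GQA)
        k = k.repeat_interleave(self.n_head // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_head // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))

class Block(nn.Module):
    """Pre-norm residual block (layout assumed): x + attn(norm(x)), then x + mlp(norm(x))."""
    def __init__(self, dim=576):
        super().__init__()
        self.attn_norm, self.mlp_norm = RMSNorm(dim), RMSNorm(dim)
        self.attn, self.mlp = GQAttention(dim), SwiGLU(dim)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        return x + self.mlp(self.mlp_norm(x))

block = Block()
print(sum(p.numel() for p in block.parameters()))  # 3,540,224 -- matches the per-block count below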

Config

```
vocab_size     = 32,768    (custom BPE trained on FineWeb-Edu)
n_layer        = 30
n_head         = 9         (query heads)
n_kv_heads     = 3         (key-value heads, 3:1 GQA)
n_embd         = 576
head_dim       = 64
intermediate   = 1,536     (SwiGLU, 2.67x ratio)
block_size     = 1,024
rope_theta     = 100,000
total params   = 125,081,664
```

Parameter Breakdown

| Component | Params |
|---|---|
| Token embeddings (32768 x 576) | 18,874,368 |
| Per block (x30): attention + SwiGLU + norms | 3,540,224 |
| 30 transformer blocks | 106,206,720 |
| Final RMSNorm | 576 |
| LM head (tied with embeddings) | 0 |
| Total | 125,081,664 |
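
These figures can be re-derived from the config in a few lines (a sanity check; treating the 128 QK-Norm parameters as one 64-dim gain each for Q and K per block is inferred from the totals rather than documented):

```python
vocab, d, layers, n_head, n_kv, head_dim, ffn = 32768, 576, 30, 9, 3, 64, 1536

embeddings = vocab * d                                   # 18,874,368 (tied with the LM head)
attention = d * n_head * head_dim + 2 * d * n_kv * head_dim + n_head * head_dim * d
qk_norm = 2 * head_dim                                   # RMSNorm gains for Q and K (assumed layout)
swiglu = 3 * d * ffn                                     # gate, up, and down projections
block_norms = 2 * d                                      # pre-attention and pre-MLP RMSNorm
per_block = attention + qk_norm + swiglu + block_norms   # 3,540,224

total = embeddings + layers * per_block + d              # + final RMSNorm
print(per_block, total)                                  # 3540224 125081664
```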

Training

Data

Four data sources are mixed via a progressive curriculum that introduces specialized data (math and code) only after the model has learned core language, improving reasoning without harming fluency:

| Source | Dataset | Purpose |
|---|---|---|
| FineWeb-Edu | HuggingFaceFW/fineweb-edu (sample-100BT) | Primary educational web text |
| DCLM | mlfoundations/dclm-baseline-1.0 | High-quality diverse web text |
| FineMath | HuggingFaceTB/finemath (finemath-4plus) | Mathematical reasoning (score >= 4) |
| NPset-Python | AxiomicLabs/NPset-python | AST-normalized Python code |
  • Tokens: 75B (143,051 steps x 524,288 tokens/step)
  • Tokenizer: Custom 32K BPE trained on 50GB of FineWeb-Edu (a training sketch follows this list)
  • Final val loss: 2.7525 (FineWeb-Edu held-out)
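
As a rough illustration of the tokenizer step, here is a hedged sketch using the Hugging Face `tokenizers` library; the byte-level pre-tokenizer, special token, and file names are assumptions, and only the 32,768 vocab size comes from the config above.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE (GPT-2 style) with a 32K vocab trained on FineWeb-Edu text
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_768,
    special_tokens=["<|endoftext|>"],  # assumed; not documented above
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# fineweb_edu_sample.txt is a placeholder for the ~50GB FineWeb-Edu text dump
tokenizer.train(files=["fineweb_edu_sample.txt"], trainer=trainer)
tokenizer.save("gptx2_tokenizer.json")
```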

Progressive Data Curriculum

Rather than a fixed mixture, data sources are introduced progressively to let the model build foundational language capabilities before adding specialized data:

| Phase | Token Range | FineWeb-Edu | DCLM | FineMath | Code |
|---|---|---|---|---|---|
| Early | 0 -- 18B | 58% | 40% | 1% | 1% |
| Ramp | 18B -- 20B | 58% -> 54% | 40% -> 36% | 1% -> 6% | 1% -> 4% |
| Hold | 20B -- 45B | 54% | 36% | 6% | 4% |
| Taper | 45B -- 58B | 54% -> 55% | 36% -> 38% | 6% -> 4.5% | 4% -> 2.5% |
| Hold | 58B -- 75B | 55% | 38% | 4.5% | 2.5% |

FineMath and code are drip-fed at 1% from the start so the model sees early examples, then ramped to peak during the stable LR phase for maximum learning, and gradually tapered back toward primary sources during the LR decay phase.
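
One way to implement such a schedule is piecewise-linear interpolation over tokens seen. The sketch below uses the breakpoints from the table above; the actual sampler's interpolation details and granularity are not documented here and are assumed.

```python
import numpy as np

# Breakpoints in billions of tokens and the mixture weights at each one,
# taken from the curriculum table (FineWeb-Edu, DCLM, FineMath, Code).
BREAKPOINTS = np.array([0, 18, 20, 45, 58, 75])
WEIGHTS = np.array([
    [0.58, 0.40, 0.010, 0.010],   # 0B   (Early)
    [0.58, 0.40, 0.010, 0.010],   # 18B  (start of Ramp)
    [0.54, 0.36, 0.060, 0.040],   # 20B  (Hold)
    [0.54, 0.36, 0.060, 0.040],   # 45B  (start of Taper)
    [0.55, 0.38, 0.045, 0.025],   # 58B  (final Hold)
    [0.55, 0.38, 0.045, 0.025],   # 75B  (end of training)
])

def mixture(tokens_seen_billions: float) -> np.ndarray:
    """Linearly interpolate the sampling weights for the current token count."""
    return np.array([
        np.interp(tokens_seen_billions, BREAKPOINTS, WEIGHTS[:, i])
        for i in range(WEIGHTS.shape[1])
    ])

print(mixture(19))   # halfway through the ramp: [0.56, 0.38, 0.035, 0.025]
print(mixture(30))   # hold phase: [0.54, 0.36, 0.06, 0.04]
```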

AST-Based Code Normalization

Raw Python code is converted to a compact pseudocode representation (TinyDSL) before tokenization using AST parsing. This achieves ~1.25x token compression compared to raw code, letting the model learn programming reasoning without dedicating tokenizer vocabulary to code-specific symbols and with less syntactic overhead per program.

The normalizer:

  • Strips comments, docstrings, whitespace, and syntactic noise
  • Replaces Python builtins with full English words (len -> length, str -> string, etc.)
  • Uses natural language keywords with spaces (end function, for else, list comprehension, etc.)
  • Expands comprehensions, decorators, exception handling, and all Python AST node types

Example:

```python
# Raw Python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Normalized TinyDSL
# function fibonacci n 
# begin
# if n <= 1
# begin
# return n
# end
# return call fibonacci n - 1 + call fibonacci n - 2 
# end function
```
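
For illustration, here is a minimal sketch of the idea using Python's built-in `ast` module. The released normalizer covers all AST node types, builtin renaming, decorators, and comprehensions; this toy version (the class name `TinyDSLSketch` is made up) handles only the constructs in the example above.

```python
import ast

def expr(node):
    """Render a small subset of expression nodes as TinyDSL text."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.BinOp):
        op = {ast.Add: "+", ast.Sub: "-"}[type(node.op)]
        return f"{expr(node.left)} {op} {expr(node.right)}"
    if isinstance(node, ast.Compare):
        op = {ast.Lt: "<", ast.LtE: "<="}[type(node.ops[0])]
        return f"{expr(node.left)} {op} {expr(node.comparators[0])}"
    if isinstance(node, ast.Call):
        args = " ".join(expr(a) for a in node.args)
        return f"call {node.func.id} {args}"
    raise NotImplementedError(type(node).__name__)

class TinyDSLSketch(ast.NodeVisitor):
    """Emit TinyDSL-style pseudocode for a tiny subset of Python statements."""
    def __init__(self):
        self.lines = []

    def visit_FunctionDef(self, node):
        args = " ".join(a.arg for a in node.args.args)
        self.lines.append(f"function {node.name} {args}")
        self.lines.append("begin")
        for stmt in node.body:
            self.visit(stmt)
        self.lines.append("end function")

    def visit_If(self, node):
        self.lines.append(f"if {expr(node.test)}")
        self.lines.append("begin")
        for stmt in node.body:
            self.visit(stmt)
        self.lines.append("end")

    def visit_Return(self, node):
        self.lines.append(f"return {expr(node.value)}")

source = '''
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
'''
normalizer = TinyDSLSketch()
normalizer.visit(ast.parse(source))
print("\n".join(normalizer.lines))   # reproduces the TinyDSL output shown above
```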

Optimization

  • Optimizer: AdamW (betas=0.9/0.95, weight_decay 0.1 for the first 31B tokens, then 0.01)
  • Learning rate: 1.5e-3 max, decays to 0
  • Schedule: WSD -- 2,000-step warmup, stable phase (80% of training), linear decay to 0 over the final 20% (see the sketch after this list)
  • Batch size: 524,288 tokens (micro_batch=8, seq_len=1024, 64 gradient-accumulation steps)
  • Precision: bfloat16 mixed precision
  • Gradient clipping: 1.0
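
A sketch of the WSD schedule as a function of optimizer step, using the 143,051-step total from the Data section; the exact phase-boundary arithmetic is an assumption.

```python
MAX_LR = 1.5e-3
WARMUP_STEPS = 2_000
TOTAL_STEPS = 143_051
DECAY_START = int(0.8 * TOTAL_STEPS)   # stable phase covers ~80% of training

def wsd_lr(step: int) -> float:
    """Warmup-Stable-Decay: linear warmup, flat peak, linear decay to 0."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step < DECAY_START:
        return MAX_LR
    remaining = TOTAL_STEPS - step
    return MAX_LR * remaining / (TOTAL_STEPS - DECAY_START)

print(wsd_lr(0), wsd_lr(50_000), wsd_lr(143_050))  # ~7.5e-7, 1.5e-3, ~5e-8
```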

Hardware

  • 1x RTX 3080 Ti
  • Training time: ~500 hours

Design Decisions

  • 30 layers x 576 hidden -- 3 more layers than v1, made possible by smaller 32K vocab. Depth is the primary driver of quality at 125M scale (SmolLM, MobileLLM).
  • Custom 32K tokenizer -- Trained on FineWeb-Edu for ~9% better compression than GPT-2 BPE. Fewer vocab entries = smaller embedding table = more params for transformer layers.
  • rope_theta=100K -- Matches SmolLM2-135M. Better extrapolation and long-range dependency modeling.
  • 1.5e-3 learning rate -- GPT-X v1 used 6e-4, which was too conservative. SmolLM showed 3e-3 works at 135M scale with a ~1M-token batch, so 1.5e-3 was used here at the 524,288-token batch (halving the LR along with the batch size).
  • GQA 3:1 -- 9 query heads, 3 KV heads. Saves attention parameters, which are reinvested into SwiGLU capacity.
  • QK-Norm -- RMSNorm on Q and K before RoPE.
  • z-loss -- Prevents logit-magnitude drift (PaLM, T5); applied early in training (the first 31B tokens), then disabled. See the sketch after this list.
  • Progressive curriculum -- FineMath and NPset-Python introduced at 1% early, ramped to peak during stable LR, tapered during decay. Lets the model build language foundations first, then learn specialized reasoning and logic patterns.
  • AST code normalization (NPset-Python) -- 1.25x token compression via Python AST to TinyDSL conversion. Strips syntactic noise and normalizes identifiers so the model learns programming structure rather than memorizing variable names.
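
The z-loss mentioned above is the standard PaLM-style auxiliary term and can be sketched in a few lines; the reduction over batch and sequence positions is an assumption.

```python
import torch

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Penalize the squared log-partition function so logit magnitudes stay controlled."""
    # logits: (batch, seq_len, vocab); log Z = logsumexp over the vocabulary
    log_z = torch.logsumexp(logits.float(), dim=-1)
    return coeff * log_z.pow(2).mean()

# Added to the cross-entropy loss for the first 31B tokens, after which the coefficient is 0.
```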

Limitations

  • Hardware: Total training took roughly 500 hours for 75B tokens; as a result, ablations were not feasible, and VRAM capped the context window at 1,024 tokens.
  • Small model: 125M parameters limits reasoning and factual recall
  • Educational data only: Primarily trained on educational datasets; not representative of general web text
  • Not instruction-tuned: Base model only, not aligned for chat
  • English only
  • 1024 context window

Citation

```bibtex
@misc{gptx2_2025,
  title={GPT-X2: Data-Efficient Language Modeling at 125M Scale},
  author={Axiomic Labs},
  year={2025},
  howpublished={\url{https://huggingface.co/AxiomicLabs/GPT-X2-125M}},
  note={Trained on 75B tokens with a progressive curriculum and custom tokenizer}
}
```