Chess BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.

Model Details

  • Tokenizer Type: BPE (Byte Pair Encoding)
  • Vocabulary Size: 256
  • Training Data: angeluriot/chess_games
  • Training Split: train[0:1000]
  • Move Format: Custom notation with Unicode chess pieces (e.g., w.β™˜g1β™˜f3..)

Move Format Description

The tokenizer is trained on a custom chess move notation:

| Component | Description | Example |
|---|---|---|
| Player prefix | w. (white) or b. (black) | w. |
| Piece + Source | Unicode piece + source square | β™˜g1 |
| Piece + Destination | Unicode piece + destination square | β™˜f3 |
| Flags | .. (quiet move), .x. (capture), ..+ (check), ..# (checkmate) | .. |

Examples

| Move | Meaning |
|---|---|
| w.β™˜g1β™˜f3.. | White knight from g1 to f3 |
| b.β™Ÿc7β™Ÿc5.. | Black pawn from c7 to c5 |
| b.β™Ÿc5β™Ÿd4.x. | Black pawn captures on d4 |
| w.β™”e1β™”g1β™–h1β™–f1.. | White kingside castle |
| b.β™›d7β™›d5..+ | Black queen to d5 with check |
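
The examples above can be parsed with a short regular expression. The sketch below is purely illustrative and not part of the tokenizer; it assumes the flag suffixes listed in the format table (.., .x., ..+, ..#) and allows the extra piece/square pairs that castling produces.

import re

# Illustrative parser for the custom notation (not part of the tokenizer).
# Piece symbols occupy Unicode U+2654 (β™”) through U+265F (β™Ÿ).
MOVE_RE = re.compile(
    r"^(?P<player>[wb])\."                               # player prefix: w. or b.
    r"(?P<squares>(?:[\u2654-\u265F][a-h][1-8]){2,4})"   # piece+source, piece+destination (four pairs when castling)
    r"(?P<flags>\.x\.|\.\.[+#]?)$"                       # .. quiet, .x. capture, ..+ check, ..# checkmate
)

for move in ["w.β™˜g1β™˜f3..", "b.β™Ÿc5β™Ÿd4.x.", "w.β™”e1β™”g1β™–h1β™–f1..", "b.β™›d7β™›d5..+"]:
    m = MOVE_RE.match(move)
    print(move, "->", m.groupdict() if m else "no match")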

Chess Piece Symbols

| White | Black | Piece |
|---|---|---|
| β™” | β™š | King |
| β™• | β™› | Queen |
| β™– | β™œ | Rook |
| β™— | ♝ | Bishop |
| β™˜ | β™ž | Knight |
| β™™ | β™Ÿ | Pawn |
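
For programmatic use, the symbols above (Unicode U+2654 through U+265F) can be kept in a small lookup table. This is an illustrative snippet derived from the table, not something shipped with the tokenizer.

# Mapping from piece symbol to (color, piece), taken directly from the table above.
PIECES = {
    "β™”": ("white", "king"),   "β™š": ("black", "king"),
    "β™•": ("white", "queen"),  "β™›": ("black", "queen"),
    "β™–": ("white", "rook"),   "β™œ": ("black", "rook"),
    "β™—": ("white", "bishop"), "♝": ("black", "bishop"),
    "β™˜": ("white", "knight"), "β™ž": ("black", "knight"),
    "β™™": ("white", "pawn"),   "β™Ÿ": ("black", "pawn"),
}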

Usage

Installation

pip install rustbpe huggingface_hub

Loading and Using the Tokenizer

import json
from huggingface_hub import hf_hub_download

# Download tokenizer files
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

# Load vocabulary
with open(vocab_path, 'r') as f:
    vocab = json.load(f)

with open(config_path, 'r') as f:
    config = json.load(f)

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")

Using with rustbpe (for encoding)

import rustbpe

# Note: the rustbpe tokenizer itself must be retrained or rebuilt from the saved merges
# before it can encode text; see the training script for details.

Training Your Own

from bpess.main import train_chess_tokenizer, push_to_hub

# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key='moves_custom'
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom"
    }
)

Training Details

  • Library: rustbpe by Andrej Karpathy
  • Algorithm: Byte Pair Encoding with GPT-4 style regex pre-tokenization
  • Source Dataset: ~14M chess games from angeluriot/chess_games
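
To inspect the data behind these numbers, the same slice can be loaded with the datasets library. The snippet assumes the dataset's default configuration exposes the custom-notation moves under a moves_custom column, matching the moves_key used in the training example above.

from datasets import load_dataset

# Load the slice listed under Model Details (train[0:1000]).
ds = load_dataset("angeluriot/chess_games", split="train[0:1000]")
print(ds[0]["moves_custom"])  # assumed column name, matching moves_key above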

Intended Use

This tokenizer is designed for:

  • Training language models on chess games
  • Chess move prediction tasks
  • Game analysis and embedding generation

License

MIT License
