Chess BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.

Model Details

  • Tokenizer Type: BPE (Byte Pair Encoding)
  • Vocabulary Size: 256
  • Training Data: angeluriot/chess_games
  • Training Split: train[0:1000]
  • Move Format: Custom notation with Unicode chess pieces (e.g., w.β™˜g1β™˜f3..)

Move Format Description

The tokenizer is trained on a custom chess move notation:

| Component | Description | Example |
|---|---|---|
| Player prefix | w. (white) or b. (black) | w. |
| Piece + Source | Unicode piece + source square | β™˜g1 |
| Piece + Destination | Unicode piece + destination square | β™˜f3 |
| Flags | .. (quiet move), .x. (capture), ..+ (check), ..# (checkmate) | .. |

Examples

| Move | Meaning |
|---|---|
| w.β™˜g1β™˜f3.. | White knight from g1 to f3 |
| b.β™Ÿc7β™Ÿc5.. | Black pawn from c7 to c5 |
| b.β™Ÿc5β™Ÿd4.x. | Black pawn captures on d4 |
| w.β™”e1β™”g1β™–h1β™–f1.. | White kingside castle |
| b.β™›d7β™›d5..+ | Black queen to d5 with check |
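
The examples above can be parsed with a short regular expression. The sketch below is purely illustrative and not part of the tokenizer; it assumes the flag suffixes listed in the format table (.., .x., ..+, ..#) and allows the extra piece/square pairs that castling produces.

import re

# Illustrative parser for the custom notation (not part of the tokenizer).
# Piece symbols occupy Unicode U+2654 (β™”) through U+265F (β™Ÿ).
MOVE_RE = re.compile(
    r"^(?P<player>[wb])\."                               # player prefix: w. or b.
    r"(?P<squares>(?:[\u2654-\u265F][a-h][1-8]){2,4})"   # piece+source, piece+destination (four pairs when castling)
    r"(?P<flags>\.x\.|\.\.[+#]?)$"                       # .. quiet, .x. capture, ..+ check, ..# checkmate
)

for move in ["w.β™˜g1β™˜f3..", "b.β™Ÿc5β™Ÿd4.x.", "w.β™”e1β™”g1β™–h1β™–f1..", "b.β™›d7β™›d5..+"]:
    m = MOVE_RE.match(move)
    print(move, "->", m.groupdict() if m else "no match")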

Chess Piece Symbols

| White | Black | Piece |
|---|---|---|
| β™” | β™š | King |
| β™• | β™› | Queen |
| β™– | β™œ | Rook |
| β™— | ♝ | Bishop |
| β™˜ | β™ž | Knight |
| β™™ | β™Ÿ | Pawn |
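
For programmatic use, the symbols above (Unicode U+2654 through U+265F) can be kept in a small lookup table. This is an illustrative snippet derived from the table, not something shipped with the tokenizer.

# Mapping from piece symbol to (color, piece), taken directly from the table above.
PIECES = {
    "β™”": ("white", "king"),   "β™š": ("black", "king"),
    "β™•": ("white", "queen"),  "β™›": ("black", "queen"),
    "β™–": ("white", "rook"),   "β™œ": ("black", "rook"),
    "β™—": ("white", "bishop"), "♝": ("black", "bishop"),
    "β™˜": ("white", "knight"), "β™ž": ("black", "knight"),
    "β™™": ("white", "pawn"),   "β™Ÿ": ("black", "pawn"),
}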

Usage

Installation

pip install rustbpe huggingface_hub

Loading and Using the Tokenizer

import json
from huggingface_hub import hf_hub_download

# Download tokenizer files
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

# Load vocabulary
with open(vocab_path, 'r') as f:
    vocab = json.load(f)

with open(config_path, 'r') as f:
    config = json.load(f)

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")

Using with rustbpe (for encoding)

import rustbpe

# Note: the rustbpe tokenizer itself must be retrained or rebuilt from the saved merges
# before it can encode text; see the training script for details.

Training Your Own

from bpess.main import train_chess_tokenizer, push_to_hub

# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key='moves_custom'
)

# Push to HuggingFace
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom"
    }
)

Training Details

  • Library: rustbpe by Andrej Karpathy
  • Algorithm: Byte Pair Encoding with GPT-4 style regex pre-tokenization
  • Source Dataset: ~14M chess games from angeluriot/chess_games
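
To inspect the data behind these numbers, the same slice can be loaded with the datasets library. The snippet assumes the dataset's default configuration exposes the custom-notation moves under a moves_custom column, matching the moves_key used in the training example above.

from datasets import load_dataset

# Load the slice listed under Model Details (train[0:1000]).
ds = load_dataset("angeluriot/chess_games", split="train[0:1000]")
print(ds[0]["moves_custom"])  # assumed column name, matching moves_key above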

Intended Use

This tokenizer is designed for:

  • Training language models on chess games
  • Chess move prediction tasks
  • Game analysis and embedding generation

License

MIT License
