# Chess BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.
## Model Details

- **Tokenizer Type:** BPE (Byte Pair Encoding)
- **Vocabulary Size:** 256
- **Training Data:** angeluriot/chess_games
- **Training Split:** `train[0:1000]`
- **Move Format:** Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)
## Move Format Description

The tokenizer is trained on a custom chess move notation:

| Component | Description | Example |
|---|---|---|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + source square | `♘g1` |
| Piece + Destination | Unicode piece + destination square | `♘f3` |
| Flags | `.x.` (capture), `..+` (check), `..#` (checkmate), `..` (quiet move) | `..` |
## Examples

| Move | Meaning |
|---|---|
| `w.♘g1♘f3..` | White knight from g1 to f3 |
| `b.♟c7♟c5..` | Black pawn from c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |
## Chess Piece Symbols

| White | Black | Piece |
|---|---|---|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |
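The notation is regular enough to split with a single regular expression: a player prefix, one or more piece-plus-square pairs, then a flag suffix. A minimal parsing sketch (the `parse_move` helper is hypothetical, not part of the tokenizer, and assumes the format exactly as described above):

```python
import re

# Hypothetical helper, not shipped with the tokenizer: split one move in the
# custom notation into its components.
MOVE_RE = re.compile(
    r"^(?P<player>[wb])\."                         # w. or b. prefix
    r"(?P<steps>(?:[\u2654-\u265F][a-h][1-8])+)"   # piece + square pairs
    r"(?P<flags>\.x\.|\.\.[+#]?)$"                 # .x. / ..+ / ..# / .. (quiet)
)

def parse_move(move: str) -> dict:
    """Parse a move like 'w.♘g1♘f3..' into player, steps, and flags."""
    m = MOVE_RE.match(move)
    if m is None:
        raise ValueError(f"unrecognized move: {move!r}")
    steps = re.findall(r"([\u2654-\u265F])([a-h][1-8])", m.group("steps"))
    return {
        "player": m.group("player"),
        "steps": steps,  # list of (piece, square) pairs
        "flags": m.group("flags"),
    }
```

Castling (`w.♔e1♔g1♖h1♖f1..`) simply yields four piece/square steps instead of two, so no special case is needed.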
## Usage

### Installation

```bash
pip install rustbpe huggingface_hub
```
### Loading and Using the Tokenizer

```python
import json

from huggingface_hub import hf_hub_download

# Download tokenizer files
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

# Load vocabulary and config
with open(vocab_path, "r") as f:
    vocab = json.load(f)
with open(config_path, "r") as f:
    config = json.load(f)

print(f"Vocab size: {len(vocab)}")
print(f"Pattern: {config['pattern']}")
```
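With the vocabulary loaded, decoding token ids back to move text only requires the inverse mapping. A minimal sketch, assuming `vocab.json` maps token strings to integer ids (`make_decoder` is a hypothetical helper, not part of rustbpe):

```python
def make_decoder(vocab: dict):
    """Build an id -> token lookup from a {token: id} vocabulary and
    return a decode function that joins token strings back into text."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}

    def decode(ids):
        return "".join(id_to_token[i] for i in ids)

    return decode
```

Usage with the `vocab` dict loaded above: `decode = make_decoder(vocab)`, then `decode(ids)` on a model's output ids.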
### Using with rustbpe (for encoding)

```python
import rustbpe

# Note: the rustbpe tokenizer needs to be retrained or loaded from merges
# before it can encode; see the training script for details.
```
### Training Your Own

```python
from bpess.main import train_chess_tokenizer, push_to_hub

# Train
tokenizer = train_chess_tokenizer(
    vocab_size=4096,
    dataset_fraction="train",
    moves_key="moves_custom",
)

# Push to the Hugging Face Hub
push_to_hub(
    tokenizer=tokenizer,
    repo_id="your-username/chess-bpe-tokenizer",
    config={
        "vocab_size": 4096,
        "dataset_fraction": "train",
        "moves_key": "moves_custom",
    },
)
```
## Training Details

- **Library:** rustbpe by Andrej Karpathy
- **Algorithm:** Byte Pair Encoding with GPT-4-style regex pre-tokenization
- **Source Dataset:** ~14M chess games from angeluriot/chess_games
## Intended Use

This tokenizer is designed for:

- Training language models on chess games
- Chess move prediction tasks
- Game analysis and embedding generation
## License

MIT License