Til-mini-1B (base)

Til-mini-1B is a 956M-parameter multilingual base language model with first-class Kazakh support that also covers Russian, English, code and math. It was trained from scratch on the full 47-billion-token Til-Corpus.

This is a base (non-instruct) model — it completes text; it does not follow chat instructions. Instruct and grammar-correction (GEC) fine-tunes are released separately under the TilQazyna organization.

Model details

Architecture: DeepSeek-V3-style dense decoder with MLA (Multi-head Latent Attention)
Parameters: 956.3M (tied input/output embeddings)
Hidden size / layers: 1792 / 24
Attention: 16 heads; MLA with q_lora_rank 384, kv_lora_rank 192, qk_rope 32, qk_nope 64, v_head 64
FFN intermediate size: 4864 (SwiGLU)
Context length: 2048 tokens
Position encoding: RoPE, θ = 100 000
Vocabulary: 131 072 (Til-Tokenizer-128k)
Precision: bf16
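As a sanity check, the stated parameter count can be approximately reproduced from the dimensions above. This is only a sketch: the projection layout (low-rank query and key/value projections, an output projection, a three-matrix SwiGLU block, bias-free RMSNorm gains) is assumed from the public DeepSeek-V3 design and is not confirmed by this card.

```python
# Rough parameter count from the model-details table.
# ASSUMPTION: DeepSeek-V3-style projection layout, no biases, RMSNorm gains only.
hidden, layers, heads = 1792, 24, 16
q_lora, kv_lora = 384, 192
qk_rope, qk_nope, v_head = 32, 64, 64
ffn, vocab = 4864, 131_072

qk_head = qk_nope + qk_rope  # per-head query/key width

attn = (
    hidden * q_lora                          # query down-projection
    + q_lora * heads * qk_head               # query up-projection
    + hidden * (kv_lora + qk_rope)           # KV down-projection plus shared RoPE key
    + kv_lora * heads * (qk_nope + v_head)   # KV up-projection
    + heads * v_head * hidden                # output projection
)
mlp = 3 * hidden * ffn                       # SwiGLU: gate, up and down matrices
norms = 2 * hidden + q_lora + kv_lora        # per-layer norm gains (assumed)

per_layer = attn + mlp + norms
embedding = vocab * hidden                   # counted once because embeddings are tied
total = layers * per_layer + embedding + hidden  # plus the final norm

print(f"attention per layer: {attn:,}")
print(f"MLP per layer:       {mlp:,}")
print(f"total:               {total / 1e6:.1f}M")
```

Under these assumptions the total comes out at about 956.3M, with the tied embedding matrix accounting for roughly a quarter of all parameters.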

MLA compresses the KV cache via low-rank latent projections, which reduces memory use at inference time. Combined with a weight footprint of ≈0.5 GB at 4-bit quantization, this makes the model suitable for mobile-class hardware.
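To give a sense of scale for the KV-cache compression, the arithmetic below compares the per-token cache of MLA with an uncompressed multi-head baseline using the same head sizes. It assumes the runtime caches only the compressed latent plus the shared RoPE key, as in the original MLA design; some implementations cache the expanded keys and values instead, in which case the saving does not apply.

```python
# KV-cache size at full context, bf16 cache.
# ASSUMPTION: the runtime stores the MLA latent, not the expanded K/V tensors.
layers, heads, context = 24, 16, 2048
kv_lora, qk_rope, qk_nope, v_head = 192, 32, 64, 64
bytes_per_value = 2  # bf16

# MLA: one compressed KV latent plus one shared RoPE key per token per layer.
mla_per_token = layers * (kv_lora + qk_rope)
# Baseline: full keys and values for every head.
mha_per_token = layers * heads * ((qk_nope + qk_rope) + v_head)

mla_bytes = mla_per_token * context * bytes_per_value
mha_bytes = mha_per_token * context * bytes_per_value

print(f"MLA cache:      {mla_bytes / 2**20:.1f} MiB")
print(f"baseline cache: {mha_bytes / 2**20:.1f} MiB")
print(f"reduction:      {mha_bytes / mla_bytes:.1f}x")
```

For a single 2048-token sequence this works out to about 21 MiB versus 240 MiB, roughly an 11x reduction.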

Tokenizer

TilQazyna/Til-Tokenizer-128k is a BPE tokenizer with a 131 072-token vocabulary, trained with a focus on Kazakh morphology (≈1 token per Kazakh word on average) while remaining efficient for Russian, English, code and math. Special token ids: pad=0, <s>=1, </s>=2, <|im_start|>=6, <|im_end|>=7.
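The tokens-per-word figure can be checked on your own text with a small helper like the one below. It is a sketch rather than the authors' measurement procedure: `tokens_per_word` is a hypothetical name, and splitting on whitespace is an assumed definition of a word.

```python
from typing import Callable, Iterable, List


def tokens_per_word(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    return n_tokens / n_words if n_words else 0.0


# With the real tokenizer (requires network access and acceptance of the gate):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("TilQazyna/Til-Tokenizer-128k")
#   ratio = tokens_per_word(lambda s: tok.encode(s, add_special_tokens=False), texts)
```

Passing the encoder as a callable keeps the helper independent of any particular tokenizer library, so the same function can compare several tokenizers on one text sample.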

Training data

One full epoch over Til-Corpus — 47.0B tokens, ~71M documents:

Domain    Tokens   Share
English   11.9B    25%
Code       9.9B    21%
Kazakh     9.7B    21%
Math       9.0B    19%
Russian    6.6B    14%

Documents are tokenized, concatenated with </s> separators and packed into fixed 2048-token sequences. Batches are fully shuffled across domains.
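The packing step can be sketched as follows. This is an illustration, not the training code: `pack_documents` is a hypothetical helper, and dropping the trailing partial sequence is an assumption, since the card does not say how the remainder is handled.

```python
from typing import Iterable, Iterator, List

EOS_ID = 2  # id of </s> in Til-Tokenizer-128k


def pack_documents(docs: Iterable[List[int]], seq_len: int = 2048,
                   eos_id: int = EOS_ID) -> Iterator[List[int]]:
    """Concatenate tokenized documents with an </s> separator and yield
    fixed-length sequences. Documents may span sequence boundaries."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # ASSUMPTION: whatever is left in the buffer (shorter than seq_len) is dropped.


# Tiny example with seq_len=4 so the behaviour is easy to follow.
packed = list(pack_documents([[10, 11, 12], [13, 14], [15]], seq_len=4))
print(packed)
```

Packing avoids padding entirely, so every position in a batch contributes to the loss.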

Training procedure

Steps: 89 690 (1 epoch)
Global batch: 256 sequences × 2048 tokens = 0.52M tokens/step
Optimizer: AdamW, lr 6e-4, weight decay 0.1, grad clip 1.0
LR schedule: WSD (warmup-stable-decay): 1000 warmup steps, constant plateau, linear decay over the final 30% of steps
Precision: bf16
Hardware: 8×H200, DDP, 35.5 h
Tokens / parameter: ≈49 (47.0B / 956.3M; deliberately overtrained for deployment quality)
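The schedule and token budget above can be written out in a few lines. This is a sketch under stated assumptions: the card does not give the final learning rate, so decay to zero is assumed here, as is a linear warmup; `wsd_lr` is a hypothetical helper name.

```python
def wsd_lr(step: int, total_steps: int = 89_690, peak_lr: float = 6e-4,
           warmup_steps: int = 1_000, decay_frac: float = 0.30,
           final_lr: float = 0.0) -> float:
    """Warmup-stable-decay learning rate for a given optimizer step."""
    decay_start = total_steps - round(total_steps * decay_frac)
    if step < warmup_steps:
        # ASSUMPTION: linear warmup from near zero to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr  # stable plateau
    # Linear decay over the final decay_frac of training.
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return peak_lr + (final_lr - peak_lr) * progress


# Token budget implied by the batch size and step count.
tokens_per_step = 256 * 2048
total_tokens = 89_690 * tokens_per_step

print(f"tokens/step:  {tokens_per_step:,}")
print(f"total tokens: {total_tokens / 1e9:.2f}B")
for s in (0, 999, 50_000, 62_783, 76_000, 89_690):
    print(s, f"{wsd_lr(s):.2e}")
```

With these numbers the decay phase starts at step 62 783, and the step count times the batch size gives about 47.02B tokens, which matches one epoch over the corpus.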

Evaluation

Bits-per-byte (BPB) on a frozen held-out set, 5 domains. BPB normalizes by UTF-8 bytes of the scored text, so the number is independent of the tokenizer:

Domain          BPB ↓
Kazakh (kk)     0.4645
Code            0.4389
Russian (ru)    0.5079
Math            0.7715
English (en)    0.9208
Macro           0.6207
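For reference, BPB can be computed from a summed negative log-likelihood as sketched below, and the macro figure is the unweighted mean of the five domain scores. The helper name `bits_per_byte` is hypothetical, and the sketch assumes the loss is summed over tokens in nats (natural log), which is the usual convention for cross-entropy in deep-learning frameworks.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) over `text`
    into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


# Macro average of the reported per-domain scores.
bpb = {"kk": 0.4645, "code": 0.4389, "ru": 0.5079, "math": 0.7715, "en": 0.9208}
macro = sum(bpb.values()) / len(bpb)
print(f"macro BPB: {macro:.4f}")
```

Because the denominator counts bytes rather than tokens, a tokenizer that splits text more finely gets no advantage. Note that Cyrillic letters take two bytes each in UTF-8, so per-byte scores are not directly comparable across scripts.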

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-mini-1B"
tok = AutoTokenizer.from_pretrained(repo)
# Load in bf16, the training precision (older transformers releases spell this argument torch_dtype).
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")

# Base model: pass a text prefix to continue, not an instruction.
enc = tok("Абай Құнанбайұлы — қазақ халқының", return_tensors="pt").to(model.device)
out = model.generate(enc.input_ids, attention_mask=enc.attention_mask,
                     max_new_tokens=80, do_sample=True,
                     temperature=0.7, top_p=0.9, repetition_penalty=1.1,
                     pad_token_id=0)  # pad=0 in Til-Tokenizer-128k
print(tok.decode(out[0], skip_special_tokens=True))

Sample completions (temperature 0.7, base model, no SFT):

Қазақстан Республикасының астанасы - Астана қаласы.

Абай Құнанбайұлы — қазақ халқының ұлы ақыны, ағартушы, қазақтың жазба әдебиетінің және әдеби тілінің негізін қалаушы, философ, композитор.

Intended use & limitations

  • Intended: research on Kazakh/multilingual NLP; foundation for fine-tunes (instruct, GEC, domain adaptation); on-device text completion after quantization.
  • Base model: completes text, does not answer questions or follow instructions.
  • Factuality: like all sub-1B models, it hallucinates facts and numbers; do not use raw outputs as a source of truth.
  • Reasoning/code: surface form is fluent; logical and arithmetic correctness is not guaranteed.
  • Context window is 2048 tokens.
  • No safety alignment has been applied.
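Because of the 2048-token context window, long prompts need to be trimmed before generation. A minimal sketch, assuming the most recent tokens matter most for a continuation; `fit_to_context` is a hypothetical helper, not part of any library.

```python
from typing import List


def fit_to_context(prompt_ids: List[int], max_new_tokens: int = 80,
                   context_len: int = 2048) -> List[int]:
    """Keep only the last tokens of the prompt so that the prompt plus the
    generated continuation fits inside the context window."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_len")
    return prompt_ids[-budget:]


long_prompt = list(range(3000))      # stand-in for real token ids
trimmed = fit_to_context(long_prompt)
print(len(trimmed), trimmed[0])
```

Left-truncation keeps the text immediately before the continuation point, which is usually what a base model needs; other strategies (for example keeping a fixed header plus the tail) may suit some applications better.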

License

Apache 2.0. Access is gated (manual approval) for usage tracking.
