Til-mini-1B (base)
Til-mini-1B is a 956M-parameter multilingual base language model with first-class Kazakh support that also covers Russian, English, code and math. It was trained from scratch on the full 47-billion-token Til-Corpus.
This is a base (non-instruct) model — it completes text; it does not follow chat instructions. Instruct and grammar-correction (GEC) fine-tunes are released separately under the TilQazyna organization.
Model details
| Property | Value |
|---|---|
| Architecture | DeepSeek-V3-style dense decoder with MLA (Multi-head Latent Attention) |
| Parameters | 956.3M (tied input/output embeddings) |
| Hidden size / layers | 1792 / 24 |
| Attention | 16 heads, MLA: q_lora_rank 384, kv_lora_rank 192, qk_rope 32, qk_nope 64, v_head 64 |
| FFN intermediate | 4864 (SwiGLU) |
| Context length | 2048 |
| Position encoding | RoPE, θ = 100 000 |
| Vocab | 131 072 (Til-Tokenizer-128k) |
| Precision | bf16 |
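As a sanity check, the parameter count can be re-derived from the table. The sketch below assumes the layer layout of the public DeepSeek-V3 reference design (low-rank query and key/value projections with an RMSNorm on each latent, bias-free linear layers, a three-matrix SwiGLU FFN, two RMSNorms per layer plus a final one). These layout details are assumptions and are not stated in this card.

```python
# Rough parameter count from the "Model details" table.
# Assumed layout (not confirmed by this card): DeepSeek-V3-style MLA,
# no biases, RMSNorm weights only, tied input/output embeddings.
hidden, layers, heads = 1792, 24, 16
q_lora, kv_lora = 384, 192
qk_rope, qk_nope, v_head = 32, 64, 64
ffn, vocab = 4864, 131_072

attn = (
    hidden * q_lora                          # query down-projection
    + q_lora                                 # RMSNorm on the query latent
    + q_lora * heads * (qk_nope + qk_rope)   # query up-projection
    + hidden * (kv_lora + qk_rope)           # KV down-projection + shared RoPE key
    + kv_lora                                # RMSNorm on the KV latent
    + kv_lora * heads * (qk_nope + v_head)   # key/value up-projection
    + heads * v_head * hidden                # output projection
)
mlp = 3 * hidden * ffn                       # gate, up and down matrices (SwiGLU)
norms = 2 * hidden                           # pre-attention and pre-FFN RMSNorm

per_layer = attn + mlp + norms
embedding = vocab * hidden                   # counted once because it is tied
total = embedding + layers * per_layer + hidden  # + final RMSNorm

print(f"{total:,} parameters = {total / 1e6:.1f}M")
```

Under these assumptions the total is 956,337,408, which agrees with the 956.3M in the table. The tied embedding matrix accounts for about 235M of that.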
MLA compresses the KV cache via low-rank latent projections, which reduces memory use at inference time. With 4-bit weight quantization the weights take ≈0.5 GB, small enough for mobile-class hardware.
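To make the memory claim concrete, here is a back-of-the-envelope comparison using the dimensions from the table. It assumes the runtime caches only the compressed KV latent plus the shared RoPE key for each token, as in the original MLA design; some inference stacks cache the expanded keys and values instead, in which case the saving does not apply. The baseline is a hypothetical conventional multi-head cache with the same head sizes.

```python
# Per-token KV-cache size: MLA latent cache vs. a conventional multi-head cache.
# Assumption: the runtime stores the compressed latent + shared RoPE key only.
layers, heads = 24, 16
kv_lora, qk_rope, qk_nope, v_head = 192, 32, 64, 64
context, bytes_per_value = 2048, 2                   # full context, bf16

mla_values = kv_lora + qk_rope                       # per token, per layer
mha_values = heads * (qk_nope + qk_rope + v_head)    # full keys + values


def cache_mib(values_per_token: int) -> float:
    """Cache size in MiB for one sequence at full context length."""
    return values_per_token * bytes_per_value * layers * context / 2**20


mla_mib = cache_mib(mla_values)
mha_mib = cache_mib(mha_values)
weights_gb_4bit = 956.3e6 * 4 / 8 / 1e9              # 4 bits per parameter

print(f"MLA cache: {mla_mib:.0f} MiB, multi-head cache: {mha_mib:.0f} MiB")
print(f"4-bit weights: {weights_gb_4bit:.2f} GB")
```

With these numbers the latent cache is 21 MiB per 2048-token sequence against 240 MiB for the multi-head baseline, and the 4-bit weights come to about 0.48 GB (before any quantization overhead such as scales).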
Tokenizer
The tokenizer is TilQazyna/Til-Tokenizer-128k: a BPE vocabulary of
131 072 tokens trained with a focus on Kazakh morphology
(≈1 token per Kazakh word on average) while remaining efficient for Russian,
English, code and math. Special tokens: pad=0, <s>=1, </s>=2,
<|im_start|>=6, <|im_end|>=7.
Training data
One full epoch over Til-Corpus — 47.0B tokens, ~71M documents:
| Domain | Tokens | Share |
|---|---|---|
| English | 11.9B | 25% |
| Code | 9.9B | 21% |
| Kazakh | 9.7B | 21% |
| Math | 9.0B | 19% |
| Russian | 6.6B | 14% |
Documents are tokenized, concatenated with </s> separators and packed into
fixed 2048-token sequences. Batches are fully shuffled across domains.
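The packing step can be sketched as follows. This is an illustration, not the project's data pipeline: `pack_documents` is a hypothetical helper, the separator ID 2 is taken from the tokenizer section, and dropping the trailing partial sequence is an assumption. Cross-domain shuffling is left out.

```python
from typing import Iterable, Iterator, List


def pack_documents(
    docs: Iterable[List[int]], eos_id: int = 2, seq_len: int = 2048
) -> Iterator[List[int]]:
    """Concatenate tokenized documents with EOS separators and cut the
    stream into fixed-length sequences. A document may span two sequences."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)            # </s> marks the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Assumption: the final partial sequence is dropped rather than padded.


# Tiny example with seq_len=4 so the behaviour is easy to follow.
packed = list(pack_documents([[10, 11, 12], [13, 14], [15]], seq_len=4))
print(packed)
```

In the toy example the second sequence starts in one document and ends in the next, which is the normal consequence of packing without padding.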
Training procedure
| Setting | Value |
|---|---|
| Steps | 89 690 (1 epoch) |
| Global batch | 256 sequences × 2048 = 0.52M tokens/step |
| Optimizer | AdamW, lr 6e-4, weight decay 0.1, grad clip 1.0 |
| LR schedule | WSD (warmup 1000 → stable → linear decay over final 30%) |
| Precision | bf16 |
| Hardware | 8×H200, DDP, 35.5 h |
| Tokens / parameter | ≈49 (47.0B tokens / 956.3M parameters; deliberately overtrained for deployment quality) |
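The WSD (warmup, stable, decay) schedule in the table can be written as a small function. This is a sketch under stated assumptions: warmup is taken to be linear from zero and the decay is taken to end at zero, neither of which the card specifies; `wsd_lr` and its parameter names are illustrative.

```python
def wsd_lr(
    step: int,
    total_steps: int = 89_690,
    peak_lr: float = 6e-4,
    warmup_steps: int = 1_000,
    decay_fraction: float = 0.30,
    final_lr: float = 0.0,       # assumption: the card does not give the end value
) -> float:
    """Learning rate at `step` for a warmup -> stable -> linear-decay schedule."""
    decay_start = total_steps - round(total_steps * decay_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup (assumed)
    if step < decay_start:
        return peak_lr                                # stable phase
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return peak_lr + (final_lr - peak_lr) * progress  # linear decay


tokens_per_step = 256 * 2048
for s in (0, 500, 1_000, 50_000, 62_783, 80_000, 89_690):
    print(f"step {s:>6}: lr = {wsd_lr(s):.2e}")
print(f"{tokens_per_step:,} tokens/step, {89_690 * tokens_per_step / 1e9:.1f}B tokens total")
```

With the table's values the decay phase starts at step 62 783, and 89 690 steps of 524 288 tokens each add up to the 47.0B tokens of one epoch.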
Evaluation
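Bits-per-byte (BPB) on a frozen held-out set covering 5 domains; lower is better. BPB normalizes by the UTF-8 byte count of the scored text, so values are comparable across models with different tokenizers. Macro is the unweighted mean of the five domains: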
| Domain | BPB ↓ |
|---|---|
| Kazakh (kk) | 0.4645 |
| Code | 0.4389 |
| Russian (ru) | 0.5079 |
| Math | 0.7715 |
| English (en) | 0.9208 |
| Macro | 0.6207 |
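For reference, BPB can be computed from token-level negative log-likelihoods as sketched below. The function name and the assumption that losses are in nats (the unit PyTorch's cross-entropy returns) are illustrative; the card does not describe the evaluation code itself.

```python
import math
from typing import Iterable


def bits_per_byte(token_nll_nats: Iterable[float], text: str) -> float:
    """Total negative log-likelihood, converted from nats to bits and
    divided by the UTF-8 byte length of the scored text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("cannot score empty text")
    return sum(token_nll_nats) / (math.log(2) * n_bytes)


# The macro score in the table is the plain mean of the five domain scores.
domain_bpb = {"kk": 0.4645, "code": 0.4389, "ru": 0.5079, "math": 0.7715, "en": 0.9208}
macro = sum(domain_bpb.values()) / len(domain_bpb)
print(f"macro BPB = {macro:.4f}")
```

Because the denominator counts bytes rather than tokens, a tokenizer that splits the same text into more or fewer tokens changes only how the total loss is distributed, not the normalization.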
Usage
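import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repo is gated: request access and authenticate with Hugging Face first.
repo = "TilQazyna/Til-mini-1B"
tok = AutoTokenizer.from_pretrained(repo)
# Load in bf16, the training precision (older transformers releases name this argument torch_dtype).
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")

# Kazakh prompt: "Abai Qunanbaiuly is the Kazakh people's ..."
ids = tok("Абай Құнанбайұлы — қазақ халқының", return_tensors="pt").input_ids.to(model.device)
# Sample a continuation; pad_token_id=0 matches the tokenizer's pad token.
out = model.generate(ids, max_new_tokens=80, do_sample=True,
                     temperature=0.7, top_p=0.9, repetition_penalty=1.1,
                     pad_token_id=0)
print(tok.decode(out[0], skip_special_tokens=True))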
Sample completions (temperature 0.7, base model, no SFT):
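Қазақстан Республикасының астанасы - Астана қаласы.
(English: "The capital of the Republic of Kazakhstan is the city of Astana.")
Абай Құнанбайұлы — қазақ халқының ұлы ақыны, ағартушы, қазақтың жазба әдебиетінің және әдеби тілінің негізін қалаушы, философ, композитор.
(English: "Abai Qunanbaiuly is the great poet of the Kazakh people, an enlightener, the founder of Kazakh written literature and of the literary language, a philosopher and a composer.")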
Intended use & limitations
- Intended: research on Kazakh/multilingual NLP; foundation for fine-tunes (instruct, GEC, domain adaptation); on-device text completion after quantization.
- Base model: completes text, does not answer questions or follow instructions.
- Factuality: like all sub-1B models, it hallucinates facts and numbers; do not use raw outputs as a source of truth.
- Reasoning/code: surface form is fluent; logical and arithmetic correctness is not guaranteed.
- Context window is 2048 tokens.
- No safety alignment has been applied.
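Because of the 2048-token window, the prompt and the generated tokens have to share one budget. A minimal sketch of a guard for this, assuming that keeping the most recent tokens is acceptable for plain text completion (`fit_prompt` is a hypothetical helper, not part of any library):

```python
from typing import List


def fit_prompt(
    prompt_ids: List[int], max_new_tokens: int, context_len: int = 2048
) -> List[int]:
    """Keep only the most recent prompt tokens so that
    len(prompt) + max_new_tokens <= context_len."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Left truncation: drops the oldest tokens (including any leading <s>).
    return prompt_ids[-budget:]


long_prompt = list(range(3000))
trimmed = fit_prompt(long_prompt, max_new_tokens=80)
print(len(trimmed), trimmed[0])
```

Short prompts pass through unchanged; only prompts that would overflow the window are cut.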
License
Apache 2.0. Access is gated (manual approval) for usage tracking.