argilla/databricks-dolly-15k-curated-multilingual
This repository hosts the first German nanochat model. It was fine-tuned (mid-training phase) on various German SFT datasets.
💬 A demo space of the model can be found here.
The chat model was fine-tuned on the following datasets:
More information can be found in the corresponding German nanochat repository.
We use lm_eval to measure and compare the model's performance against other language models in the same parameter range (note: this list is not exhaustive):
| Model | arc_de (acc) | arc_de (acc_norm) | hellaswag_de (acc) | hellaswag_de (acc_norm) | m_mmlu_de (acc) | truthfulqa_de_mc1 (acc) | truthfulqa_de_mc2 (acc) |
|---|---|---|---|---|---|---|---|
| nanochat German v1 | 0.2241 | 0.2626 | 0.3203 | 0.3581 | 0.2285 | 0.2500 | 0.4184 |
| LLäMmlein-120M | 0.1942 | 0.2301 | 0.2945 | 0.3178 | 0.2285 | 0.2310 | 0.4055 |
| LLäMmlein-1B | 0.2515 | 0.2960 | 0.3703 | 0.4490 | 0.2317 | 0.2322 | 0.3617 |
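
For a quick overall impression, the per-task scores can be collapsed into a single number per model. The snippet below is a minimal sketch: the values are copied from the table above, while the unweighted mean and the names `scores` and `mean_score` are illustrative assumptions and not part of the original evaluation.

```python
from statistics import mean

# Scores copied from the table above, in column order:
# arc_de (acc, acc_norm), hellaswag_de (acc, acc_norm), m_mmlu_de (acc),
# truthfulqa_de_mc1 (acc), truthfulqa_de_mc2 (acc)
scores = {
    "nanochat German v1": [0.2241, 0.2626, 0.3203, 0.3581, 0.2285, 0.2500, 0.4184],
    "LLäMmlein-120M": [0.1942, 0.2301, 0.2945, 0.3178, 0.2285, 0.2310, 0.4055],
    "LLäMmlein-1B": [0.2515, 0.2960, 0.3703, 0.4490, 0.2317, 0.2322, 0.3617],
}


def mean_score(model_name: str) -> float:
    """Unweighted mean over all seven reported metrics (illustrative aggregate only)."""
    return round(mean(scores[model_name]), 4)


for name in scores:
    print(f"{name}: {mean_score(name):.4f}")
```

Because this average mixes `acc` and `acc_norm` columns, it should be read only as a rough summary; the per-task columns remain the authoritative comparison.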
The following command was used to obtain the evaluation results for our model:
```bash
lm_eval --model hf \
  --model_args pretrained="stefan-it/nanochat-german-v1" \
  --tasks "arc_de,hellaswag_de,m_mmlu_de,truthfulqa_de_mc1,truthfulqa_de_mc2" \
  --device cuda:0 \
  --batch_size auto \
  --trust_remote_code \
  --log_samples \
  --output_path ./nanochat-german-v1
```
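
With `--output_path` set, lm_eval writes its results as JSON. The sketch below shows one way to pull the metrics out of such a file. It assumes the layout used by recent lm-evaluation-harness releases (a top-level `results` mapping with metric keys such as `acc,none`); the helper name `extract_metrics` and the sample dictionary are hypothetical, and the sample values are simply taken from the table above.

```python
import json


def extract_metrics(results: dict) -> dict:
    """Flatten {"results": {task: {"acc,none": x, ...}}} into {(task, metric): x}."""
    flat = {}
    for task, metrics in results.get("results", {}).items():
        for key, value in metrics.items():
            # Metric keys are assumed to look like "<metric>,<filter>".
            metric, _, _filter = key.partition(",")
            # Skip non-numeric entries (e.g. "alias") and standard errors.
            if isinstance(value, (int, float)) and not metric.endswith("_stderr"):
                flat[(task, metric)] = value
    return flat


# Hypothetical sample mimicking the assumed file layout.
sample = json.loads(
    '{"results": {"arc_de": {"alias": "arc_de", '
    '"acc,none": 0.2241, "acc_norm,none": 0.2626}}}'
)

for (task, metric), value in extract_metrics(sample).items():
    print(f"{task} {metric}: {value}")
```

For real results, load the JSON file that lm_eval wrote under the output path with `json.load` and pass the resulting dictionary to `extract_metrics`.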
To generate some text, please make sure that you are using this specific HF branch.
Then the following code can be used:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "stefan-it/nanochat-german-v1"
revision = "main"
max_new_tokens = 64

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=False, dtype=torch.bfloat16, revision=revision).to(device)
model.eval()

conversation = [
    {"role": "user", "content": "Was ist die Hauptstadt von Bayern?"},
]

inputs = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
    )

# Decode only the generated tokens (excluding the input prompt)
generated_tokens = outputs[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
```
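
The slicing step at the end works because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones. The following sketch illustrates the same idea with plain Python lists; the token IDs and the helper name `strip_prompt` are made up for illustration and do not come from the real tokenizer.

```python
def strip_prompt(output_ids: list, prompt_length: int) -> list:
    """Return only the tokens that follow the first `prompt_length` prompt tokens."""
    return output_ids[prompt_length:]


# Hypothetical token IDs, for illustration only.
prompt_ids = [101, 7, 42]
# A generate()-style output: the prompt followed by two new tokens.
output_ids = prompt_ids + [9, 13]

new_tokens = strip_prompt(output_ids, len(prompt_ids))
print(new_tokens)
```

Without this step, decoding the full output would repeat the chat-formatted prompt in front of the model's answer.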
The model is licensed under the permissive Apache 2.0 license.
Base model
stefan-it/nanochat-german-base