mapudungun-nllb-600M-es-arn-joint-bpe

Fine-tuned NLLB-200 distilled 600M for Spanish→Mapudungun translation using the Joint-5K BPE tokenization condition.

Joint-5K: a single BPE model learned jointly on Mapudungun and Spanish text with 5K merge operations (Duan et al. 2020).
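To make "joint BPE with N merge operations" concrete, here is a minimal sketch of learning BPE merges over the concatenation of both languages' text. It is not the pipeline used for this model: the function names (`learn_bpe`, `merge_pair`, `segment`) and the toy sentences are assumptions for illustration only, and the merge count is tiny where the Joint-5K condition uses 5K.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from an iterable of sentences."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply learned merges, in order, to split a word into subword units."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy text, made up for illustration (not the training data).
spanish_lines = ["la casa grande", "la casa"]
mapudungun_lines = ["mapu mapuche", "mapudungun"]

# "Joint" means one merge table is learned over both languages together.
joint_merges = learn_bpe(spanish_lines + mapudungun_lines, num_merges=10)
pieces = segment("mapuche", joint_merges)
```

Because the merge table is shared, subword units that recur across Spanish and Mapudungun text are segmented the same way on both sides.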

Part of the paper: Bringing Mapudungun into the Modern MT Ecosystem: Morphology-Aware Tokenization for NLLB-200 Fine-Tuning (AmericasNLP 2026 @ ACL).

Usage

from transformers import pipeline

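# Load the fine-tuned checkpoint as a translation pipeline.
pipe = pipeline(
    "translation",
    model="byumatrixlab/mapudungun-nllb-600M-es-arn-joint-bpe",
    src_lang="spa_Latn",  # source: Spanish
    tgt_lang="arn_Latn",  # target: Mapudungun
)
# Replace the placeholder with Spanish input text.
print(pipe("your text here", max_length=256))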

Citation

@inproceedings{thompson2026mapudungun,
  title     = {Bringing {Mapudungun} into the Modern {MT} Ecosystem: Morphology-Aware Tokenization for {NLLB}-200 Fine-Tuning},
  author    = {Thompson, Isaac},
  booktitle = {Proceedings of the 5th Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2026)},
  year      = {2026},
}


Model size: 0.6B params (Safetensors). Tensor type: F32.
