Norwegian
Hi BSC-LT team,
Great initiative! The language list on the model page includes Norwegian Nynorsk (nn) but not Norwegian Bokmål (nb). Is this a tagging mishap, or does the model really support nn but not nb?
Thanks again for making this model!
Hi @exoplanet! Thanks for the question and for taking a look at the model.
The Norwegian training data we use (from FineWeb2) contains text in both Norwegian written standards. Since Bokmål is the more common standard, we listed it under the generic name Norwegian (no), while we listed Norwegian Nynorsk explicitly as nn.
So both variants are present in the training data. In terms of scale, we used 6,798,808,558 tokens for Norwegian Bokmål and 214,056,022 tokens for Norwegian Nynorsk, as detailed in our paper: https://arxiv.org/abs/2602.21379
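To put those two figures in proportion, here is a minimal Python sketch that computes each variant's share of the combined Norwegian token count. The dictionary keys `nb` and `nn` and the helper name `shares` are chosen only for illustration; the token counts are the ones quoted above. Nynorsk comes out at roughly 3% of the Norwegian tokens.

```python
# Token counts as quoted in this reply; the "nb"/"nn" keys are illustrative labels.
TOKENS = {"nb": 6_798_808_558, "nn": 214_056_022}


def shares(counts):
    """Return each entry's fraction of the summed token count."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


total = sum(TOKENS.values())
norwegian_shares = shares(TOKENS)

print(f"total Norwegian tokens: {total:,}")
for tag, share in norwegian_shares.items():
    print(f"{tag}: {share:.2%}")
```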
Thanks again for the interest in the model. Please feel free to reach out if you have any other questions.