What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
Abstract
MultiTempBench evaluates the multilingual temporal reasoning capabilities of LLMs across different calendar systems and languages, revealing tokenisation quality as a key bottleneck in low-resource settings.
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
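To make the fragmentation idea concrete, here is a minimal sketch of a date fragmentation measure. Note the assumptions: the paper's mDFR is calibrated with human severity ratings, whereas this proxy simply divides the number of tokens a tokenizer emits for a date string by its number of semantic fields (Year/Month/Day = 3); `toy_tokenize` is a hypothetical greedy longest-match segmenter standing in for a real subword tokenizer.

```python
# Illustrative sketch only. The paper's mDFR is human-calibrated; this is a
# simplified proxy: tokens emitted for a date string / semantic fields (3).
# toy_tokenize is a hypothetical stand-in for a real subword tokenizer.

def toy_tokenize(text, vocab):
    """Greedy longest-match segmentation over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

def fragmentation_ratio(date_string, vocab, n_fields=3):
    """Tokens per semantic date field; 1.0 means one token per field."""
    return len(toy_tokenize(date_string, vocab)) / n_fields

# A vocabulary whose merges cover common date chunks...
rich_vocab = {"2024", "-", "03", "15"}
# ...versus one that only knows single characters (digit-level splitting).
poor_vocab = set("0123456789-")

print(fragmentation_ratio("2024-03-15", rich_vocab))  # 5 tokens / 3 fields
print(fragmentation_ratio("2024-03-15", poor_vocab))  # 10 tokens / 3 fields
```

The abstract's finding can be read through this lens: high-resource tokenizers tend to behave like `rich_vocab` on familiar Gregorian formats, while low-resource scripts and rarer calendars fall toward the `poor_vocab` regime, where Year/Month/Day boundaries no longer align with token boundaries.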
Community
The following similar papers were recommended by the Semantic Scholar API:
- Is continuous CoT better suited for multi-lingual reasoning? (2026)
- The Effect of Scripts and Formats on LLM Numeracy (2026)
- UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop (2026)
- Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings? (2026)
- Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads (2026)
- Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language (2026)
- BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models (2026)