What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
Abstract
MultiTempBench evaluates the multilingual temporal reasoning capabilities of LLMs across different calendar systems and languages, revealing tokenisation quality as a key bottleneck in low-resource settings.
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
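To make the fragmentation idea concrete, here is a minimal sketch of a date fragmentation measure. Note the assumptions: the paper's mDFR is calibrated with human severity ratings, whereas this proxy simply divides the number of tokens a tokenizer emits for a date string by its number of semantic fields (Year/Month/Day = 3); `toy_tokenize` is a hypothetical greedy longest-match segmenter standing in for a real subword tokenizer.

```python
# Illustrative sketch only. The paper's mDFR is human-calibrated; this is a
# simplified proxy: tokens emitted for a date string / semantic fields (3).
# toy_tokenize is a hypothetical stand-in for a real subword tokenizer.

def toy_tokenize(text, vocab):
    """Greedy longest-match segmentation over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

def fragmentation_ratio(date_string, vocab, n_fields=3):
    """Tokens per semantic date field; 1.0 means one token per field."""
    return len(toy_tokenize(date_string, vocab)) / n_fields

# A vocabulary whose merges cover common date chunks...
rich_vocab = {"2024", "-", "03", "15"}
# ...versus one that only knows single characters (digit-level splitting).
poor_vocab = set("0123456789-")

print(fragmentation_ratio("2024-03-15", rich_vocab))  # 5 tokens / 3 fields
print(fragmentation_ratio("2024-03-15", poor_vocab))  # 10 tokens / 3 fields
```

The abstract's finding can be read through this lens: high-resource tokenizers tend to behave like `rich_vocab` on familiar Gregorian formats, while low-resource scripts and rarer calendars fall toward the `poor_vocab` regime, where Year/Month/Day boundaries no longer align with token boundaries.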
Community
The following similar papers were recommended by the Semantic Scholar API:
- Is continuous CoT better suited for multi-lingual reasoning? (2026)
- The Effect of Scripts and Formats on LLM Numeracy (2026)
- UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop (2026)
- Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings? (2026)
- Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads (2026)
- Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language (2026)
- BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models (2026)