Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Abstract
Large language models exhibit post-conventional moral reasoning patterns inconsistent with human developmental trajectories, showing systematic logical incoherence and rhetorical sophistication without underlying moral reasoning development.
Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibit moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.
Community
The paper shows that LLMs systematically produce post-conventional moral rhetoric (Kohlberg Stages 5–6) independent of scale, prompting, or context—revealing “moral ventriloquism,” where models mimic advanced moral reasoning without underlying coherent reasoning processes.
➡️ 𝐊𝐞𝐲 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬 𝐨𝐟 𝐌𝐨𝐫𝐚𝐥 𝐕𝐞𝐧𝐭𝐫𝐢𝐥𝐨𝐪𝐮𝐢𝐬𝐦 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬:
🧪 𝑲𝒐𝒉𝒍𝒃𝒆𝒓𝒈-𝑩𝒂𝒔𝒆𝒅 𝑫𝒊𝒂𝒈𝒏𝒐𝒔𝒕𝒊𝒄 𝑬𝒗𝒂𝒍𝒖𝒂𝒕𝒊𝒐𝒏:
Introduces a large-scale evaluation pipeline using Kohlberg’s moral development stages as a distributional diagnostic, scoring 600+ responses from 13 LLMs across 6 dilemmas and 3 prompting regimes via an LLM-as-judge system (multi-judge validated). Unlike prior work, it benchmarks developmental structure rather than surface correctness, enabling detection of stage distribution inversion vs. human norms .
🧩 𝑴𝒐𝒓𝒂𝒍 𝑽𝒆𝒏𝒕𝒓𝒊𝒍𝒐𝒒𝒖𝒊𝒔𝒎 & 𝑫𝒊𝒔𝒕𝒓𝒊𝒃𝒖𝒕𝒊𝒐𝒏𝒂𝒍 𝑰𝒏𝒗𝒆𝒓𝒔𝒊𝒐𝒏:
Finds that 86% of responses fall in Stages 5–6 (vs. ~20% in humans), with near-zero Stage 1–3 presence—an inversion of human developmental distributions (Table on p.6). Models exhibit cross-dilemma rigidity (ICC > 0.90) and negligible sensitivity to prompting (p = 0.15), indicating outputs are governed by a fixed rhetorical prior rather than contextual reasoning .
🧠 𝑴𝒐𝒓𝒂𝒍 𝑫𝒆𝒄𝒐𝒖𝒑𝒍𝒊𝒏𝒈 & 𝑨𝒍𝒊𝒈𝒏𝒎𝒆𝒏𝒕-𝑰𝒏𝒅𝒖𝒄𝒆𝒅 𝑹𝒉𝒆𝒕𝒐𝒓𝒊𝒄:
Identifies a novel failure mode—action–reasoning decoupling—where models produce high-stage justifications but choose lower-stage actions (logical inconsistency). Factorial analysis shows scale has limited effect (<1 stage range), while RLHF drives a shared “moral vocabulary manifold,” implying alignment training installs rhetorical patterns independent of decision processes .
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability (2026)
- The Fragility Of Moral Judgment In Large Language Models (2026)
- Unsupervised Elicitation of Moral Values from Language Models (2026)
- Literary Narrative as Moral Probe : A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior (2026)
- Visual Distraction Undermines Moral Reasoning in Vision-Language Models (2026)
- Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models (2026)
- Reasoning Traces Shape Outputs but Models Won't Say So (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2603.21854 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper