Model Card: sanskrit-qwen-grpo-v2

Published at: archijaiswal07/Qwen_Finetuned · Environment: Adityahars/Sanskrit-env · Framework: HuggingFace TRL · GRPO · License: Apache 2.0


Model Overview

sanskrit-qwen-grpo-v2 is a LoRA-adapted fine-tune of Qwen2.5-1.5B-Instruct, trained via Group Relative Policy Optimization (GRPO) on SanskritEnv, the first reinforcement learning environment for structured linguistic ambiguity resolution in Sanskrit manuscripts.

The model resolves six layers of classical Sanskrit interpretation that block automated translation of India's manuscript corpus: lexical disambiguation, sandhi resolution, samāsa classification, referential coherence tracking, evidence-driven manuscript restoration, and full cross-phase compositional consistency.

This is a v2 checkpoint, continuing from Adityahars/sanskrit-qwen-grpo (v1) with an additional 675 global GRPO steps across all six linguistic task types.


Intended Use

Primary use cases

  • Sanskrit manuscript translation assistance: resolving lexical, phonological, morphological, and discourse-level ambiguity in classical texts (Ayurvedic, astronomical, philosophical, narrative).
  • Structured NLP research: benchmarking on multi-layer linguistic decision tasks with deterministic graders.
  • RL post-training research: reference implementation for GRPO on a humanistic domain with fully verifiable rewards and no LLM judge.
  • Upstream disambiguation module for the Gyan Bharatam Mission and national manuscript digitization pipelines.

Out-of-scope

  • Modern Hindi, Pali, or Prakrit (trained on classical Sanskrit only).
  • Open-ended translation generation (decision/classification model, not a generative translator).
  • General-purpose instruction following unrelated to Sanskrit linguistic analysis.

Training Details

Base Model

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Starting checkpoint | Adityahars/sanskrit-qwen-grpo (v1) |
| Architecture | Transformer decoder (causal LM) |
| Parameters | 1.5B |
| Adapter type | LoRA |

Hyperparameters

| Setting | Value |
|---|---|
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Group size | 8 |
| Epochs | 3 |
| Steps per epoch | 225 |
| Total global steps | 675 |
| Optimizer | AdamW |
| LR schedule | Cosine decay |
| Peak learning rate | 4.0 × 10⁻⁶ |
| Final learning rate | 5.9 × 10⁻⁹ |
| Per-device batch size | 4 |
| Gradient accumulation | 4 |
| Hardware | HuggingFace Jobs · A100-large |
| Training duration | ~6 hours |

Training Data

Episodes are generated dynamically at training time using seed variation over 150 unique hand-annotated base episodes per task (900 total). With EPISODES_PER_TASK=1500, the trainer samples 1500 (prompt, seed) pairs per task.
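The episode counts above can be pictured with a small sketch. It is only an illustration: the card does not say how seeds are paired with base episodes, so the round-robin pairing and the names `TASKS`, `BASE_EPISODES_PER_TASK`, and `sample_pairs` are assumptions, and a base-episode index stands in for the prompt.

```python
import random

# Task IDs as listed in the SanskritEnv task table of this card.
TASKS = [
    "glossary_anchoring",
    "sandhi_resolution",
    "samasa_classification",
    "referential_coherence",
    "manuscript_restoration",
    "full_manuscript_session",
]
BASE_EPISODES_PER_TASK = 150   # hand-annotated base episodes per task
EPISODES_PER_TASK = 1500       # value of EPISODES_PER_TASK quoted in the card


def sample_pairs(rng):
    """Return EPISODES_PER_TASK (base_episode_index, seed) pairs for one task.

    Assumption: base episodes are visited round-robin and each visit gets a
    fresh random seed. The real trainer may pair them differently.
    """
    pairs = []
    for i in range(EPISODES_PER_TASK):
        base_idx = i % BASE_EPISODES_PER_TASK
        seed = rng.randrange(2**31)
        pairs.append((base_idx, seed))
    return pairs


rng = random.Random(0)
dataset = {task: sample_pairs(rng) for task in TASKS}
total_pairs = sum(len(pairs) for pairs in dataset.values())
print(total_pairs)  # 6 tasks x 1500 pairs = 9000
```

Under this pairing each base episode is seen with ten different seeds per task, which is where the extra coverage over the 900 annotated episodes would come from.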

| Task | Base Episodes | Domains |
|---|---|---|
| Glossary Anchoring | 150 | Ayurveda, Astronomy, Philosophy |
| Sandhi Resolution | 150 | Philosophy, Ayurveda, Narrative |
| Samāsa Classification | 150 | Philosophy, Narrative, Ayurveda, Astronomy |
| Referential Coherence | 150 | Narrative, Philosophy |
| Manuscript Restoration | 150 | Ayurveda, Philosophy, Narrative, Astronomy |
| Full Manuscript Session | 150 | All domains |

Source passages: Bhagavad Gita, Charaka Samhita, Mahabharata, Ramayana, Kalidasa. All texts are public domain (composed before 1928).


Environment & Reward Design

Task Architecture (SanskritEnv)

| Task ID | Type | Steps/Episode | Core Challenge |
|---|---|---|---|
| glossary_anchoring | Single-step MCQ | 1 | Domain-specific term disambiguation |
| sandhi_resolution | Single-step MCQ | 1 | Phonological compound splitting |
| samasa_classification | Single-step MCQ | 1 | Grammatical compound type identification |
| referential_coherence | Multi-step MCQ | 4–7 | Cross-verse pronoun tracking |
| manuscript_restoration | Tool-use POMDP | Variable | Evidence gathering + deterministic commit |
| full_manuscript_session | Long-horizon chain | Multi-phase | All skills + cross-phase consistency |

Reward Functions

Tasks 1–4: Shaped MCQ reward (zero floor preserves GRPO group advantage variance):

| Outcome | Raw | Shaped |
|---|---|---|
| Full credit | 1.00 | 0.95 |
| Partial credit | 0.40 | 0.50 |
| Adjacent sandhi | 0.25 | 0.25 |
| Wrong | 0.00 | 0.00 |

Shaping: raw ∈ [0.40, 1.00] → shaped = 0.50 + (raw − 0.40) × (0.45 / 0.60). Raw scores below 0.40 pass through unchanged, as the table shows. Wrong answers are a hard zero (never soft-floored) to maximize within-group reward std for GRPO.
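The shaping rule can be checked against the table with a few lines of Python. This is a minimal sketch rather than the environment's actual code: the function name `shape_mcq_reward` is hypothetical, and the pass-through for raw scores below 0.40 is inferred from the "Adjacent sandhi" and "Wrong" rows.

```python
def shape_mcq_reward(raw: float) -> float:
    """Map a raw grader score to the shaped reward for Tasks 1-4."""
    if raw >= 0.40:
        # Linear map of [0.40, 1.00] onto [0.50, 0.95].
        return 0.50 + (raw - 0.40) * (0.45 / 0.60)
    # Assumed from the table: scores below 0.40 are left unchanged,
    # so a wrong answer stays at exactly 0.0.
    return raw


for raw in (1.00, 0.40, 0.25, 0.00):
    print(f"raw={raw:.2f} -> shaped={shape_mcq_reward(raw):.2f}")
```

The loop reproduces the four rows of the table: 0.95, 0.50, 0.25, and 0.00.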

Task 5: POMDP tool reward + terminal commit:

per-step:  tool_reward = relevance_bonus + workflow_bonus − redundancy_penalty
terminal:  terminal_reward = r_correctness × M_evidence − P_budget
           M_evidence = 0.60 + 0.40 × (relevant_tools_used / tools_needed)
           P_budget   = 0.10 × max(0, steps_used − ideal_steps) / tool_budget

| Condition | Reward |
|---|---|
| PRIMARY tool (first use) | +0.08 |
| SECOND tool (PRIMARY already called) | +0.05 |
| Workflow pair bonus (lexicon_lookup → commentary_fetch, etc.) | +0.03 to +0.05 |
| Redundant or irrelevant call | −0.05 |
| Wrong commit | 0.00 (regardless of evidence) |
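A worked instance of the terminal reward may make the terms easier to follow. The sketch below assumes that `tools_needed` and `tool_budget` are positive and that a wrong commit short-circuits to 0.00 before the budget penalty is applied, as the last table row suggests; the function names are illustrative, not the environment's API.

```python
def evidence_multiplier(relevant_tools_used: int, tools_needed: int) -> float:
    """M_evidence: 0.60 with no relevant evidence, 1.00 with all of it."""
    return 0.60 + 0.40 * (relevant_tools_used / tools_needed)


def budget_penalty(steps_used: int, ideal_steps: int, tool_budget: int) -> float:
    """P_budget: grows only once steps_used exceeds ideal_steps."""
    return 0.10 * max(0, steps_used - ideal_steps) / tool_budget


def terminal_reward(r_correctness, relevant_tools_used, tools_needed,
                    steps_used, ideal_steps, tool_budget):
    # Assumption: a wrong commit scores 0.00 regardless of evidence,
    # so the budget penalty cannot push it below zero.
    if r_correctness == 0.0:
        return 0.0
    m_evidence = evidence_multiplier(relevant_tools_used, tools_needed)
    p_budget = budget_penalty(steps_used, ideal_steps, tool_budget)
    return r_correctness * m_evidence - p_budget


# Correct commit, 1 of 2 needed tools consulted, one step over the ideal,
# with a budget of 5 tool calls: 1.0 * 0.80 - 0.02.
example = terminal_reward(1.0, 1, 2, steps_used=4, ideal_steps=3, tool_budget=5)
print(round(example, 2))  # 0.78
```

With full evidence and no extra steps the same correct commit would score 1.00, so the multiplier rewards gathering evidence before committing rather than guessing early.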

Task 6: Cross-phase consistency:

session_score = mean(phase_rewards) − 0.05 × contradictions + 0.05 × (zero_contradictions)

Here zero_contradictions is 1 when the session contains no contradictions and 0 otherwise.
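The session score is simple enough to express directly. The following sketch assumes `zero_contradictions` is a 0/1 indicator that the contradiction count is zero; the function name and the example phase rewards are made up for illustration.

```python
from statistics import mean


def session_score(phase_rewards, contradictions: int) -> float:
    """Mean phase reward, minus 0.05 per contradiction, plus a 0.05 clean bonus."""
    clean_bonus = 0.05 if contradictions == 0 else 0.0
    return mean(phase_rewards) - 0.05 * contradictions + clean_bonus


phases = [0.95, 0.50, 0.95]  # hypothetical per-phase rewards, mean 0.80
print(round(session_score(phases, contradictions=0), 2))  # 0.85
print(round(session_score(phases, contradictions=2), 2))  # 0.7
```

Because the bonus and the first penalty are both 0.05, going from zero contradictions to one costs 0.10 in total, which makes the first contradiction the most expensive one.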

Grader Design

All six graders are fully deterministic: no LLM judge, no BLEU/ROUGE. Scoring uses exact string match against pre-annotated answer tables in the data JSON. Identical seeds always produce identical scores across runs, models, and hardware.
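A grader of this kind reduces to a dictionary lookup. The sketch below is a guess at the general shape only: the JSON layout, the episode ID `ep_001`, and the option labels are hypothetical and are not taken from the SanskritEnv data files.

```python
import json

# Hypothetical answer table: each episode maps an exact answer string to a
# raw score. Any answer not listed scores 0.0.
ANSWER_TABLE_JSON = """
{
  "ep_001": {"B": 1.00, "C": 0.40}
}
"""
ANSWER_TABLE = json.loads(ANSWER_TABLE_JSON)


def grade(episode_id: str, answer: str) -> float:
    """Exact string match against the annotated table; no fuzzy matching."""
    return ANSWER_TABLE[episode_id].get(answer, 0.0)


print(grade("ep_001", "B"))  # 1.0
print(grade("ep_001", "C"))  # 0.4
print(grade("ep_001", "b"))  # 0.0, because matching is case-sensitive
```

Since the function has no randomness and no model call, the same answer to the same episode always yields the same score, which is the property the paragraph above relies on.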


Training Results

Reward Trajectory

| Phase | Steps | Mean Reward | Std | Entropy |
|---|---|---|---|---|
| Early (5–50) | 10 | 0.452 | 0.34 | 0.91 |
| Mid-epoch 1 (50–225) | 35 | 0.541 | 0.32 | 0.94 |
| Epoch 2 (225–450) | 45 | 0.561 | 0.30 | 0.95 |
| Epoch 3 (450–675) | 45 | 0.572 | 0.30 | 0.96 |

Starting reward (step 5): 0.475 · Peak (step 545): 0.733 · Final mean: 0.576 · Relative lift: +27% (final mean 0.576 against the early-phase mean of 0.452)

GRPO Health Metrics

| Metric | Range | Status |
|---|---|---|
| Group reward std | 0.21 – 0.39 | ✅ Healthy advantage signal |
| frac_reward_zero_std | 0.0 throughout | ✅ No group collapse at any step |
| Policy entropy | 0.84 – 1.07 | ✅ No greedy-mode collapse |
| clipped_ratio | 0.99375 – 1.00 | ✅ Minimal clipping |

frac_reward_zero_std == 0.0 for all 675 steps means no group ever received identical rewards, so the advantage A_i = (r_i − μ) / (σ + ε) never collapsed to zero for a whole group. The hard zero floor on wrong answers is meant to push group std toward the 0.35–0.45 range targeted for productive GRPO training; the observed range was 0.21–0.39.
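The effect of a zero-std group on the advantage can be seen in a short sketch. It follows the formula quoted above; the use of population standard deviation, the value of `eps`, and the example reward groups are assumptions, and the actual trainer may normalize slightly differently.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """A_i = (r_i - mu) / (sigma + eps), computed within one GRPO group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of 8 completions with mixed shaped rewards (hypothetical values).
mixed = [0.95, 0.0, 0.5, 0.0, 0.95, 0.0, 0.25, 0.5]
# A collapsed group: every completion earned the same reward.
collapsed = [0.5] * 8

mixed_adv = group_advantages(mixed)
collapsed_adv = group_advantages(collapsed)

print([round(a, 2) for a in mixed_adv])
print(collapsed_adv)  # all zeros: this group contributes no gradient signal
```

In the collapsed group every advantage is exactly zero, so those eight completions teach the policy nothing; frac_reward_zero_std measures how often that happens.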

Per-Task Evaluation (Pre vs Post)

Evaluated on the live SanskritEnv API · 25 episodes per task for Tasks 1–4 · 10 episodes per task for Tasks 5–6:

| Task | Pre-train | Post-train | Notes |
|---|---|---|---|
| Glossary Anchoring | 0.444 | ↑ | Largest MCQ lift |
| Sandhi Resolution | 0.630 | ↑ | Strong baseline, further improved |
| Samāsa Classification | 0.386 | ↑ | Significant lift |
| Referential Coherence | 0.278 | ↑ | Hardest task, lowest baseline |
| Manuscript Restoration | 0.573 | ↑ | Mean tools used: 1.4, mean steps: 2.4 |
| Full Manuscript Session | 0.824 | ↑ stable | High baseline maintained |

Model Lineage

Qwen/Qwen2.5-1.5B (base)
  └─ Qwen/Qwen2.5-1.5B-Instruct (instruction-tuned)
       └─ Adityahars/sanskrit-qwen-grpo (v1 · LoRA · GRPO on SanskritEnv)
            └─ archijaiswal07/Qwen_Finetuned (v2 · this model · +675 steps)

Inference

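from transformers import pipeline

# Load the fine-tuned checkpoint as a text-generation pipeline.
# The repository is published as a LoRA adapter, so the `peft` package is
# likely needed alongside `transformers` for the weights to load.
generator = pipeline(
    "text-generation",
    model="archijaiswal07/Qwen_Finetuned",
    device="cuda",
)

# Chat-style input: a list of messages with "role" and "content" keys.
prompt = "The term 'agni' appears in this Ayurvedic passage. Which meaning is correct in this domain context?"
output = generator(
    [{"role": "user", "content": prompt}],
    max_new_tokens=128,
    return_full_text=False,  # return only the model's reply, not the prompt
)[0]
print(output["generated_text"])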

Against the live environment:

export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=archijaiswal07/Qwen_Finetuned
python inference.py

Limitations

  • Domain specificity: Classical Sanskrit only. Modern registers, Pali, and Prakrits are untested.
  • Classification, not generation: Selects from candidates; not a Sanskrit-to-English translator.
  • Scale: At 1.5B parameters, long-range tracking tasks (Referential Coherence, Full Session) are near the model's working memory limit.
  • Dataset scope: 150 annotated base episodes per task. Seed variation expands training coverage but linguistic diversity is bounded by annotation budget.
  • Deterministic graders: Cannot reward paraphrastic or partially correct free-form interpretations a human scholar might accept.

Ethical Considerations

All Sanskrit source texts are public domain. Annotations, graders, and environment code are original to this project. The model is not intended to replace human Sanskrit scholars; its role is as an upstream disambiguation tool that routes ambiguous passages for expert review. Philological claims should be verified against canonical commentaries before use in academic or government digitization workflows.


Citation

@misc{sanskritenv2026,
  title   = {SanskritEnv: A Reinforcement Learning Environment for Sanskrit Manuscript Interpretation},
  author  = {Meta\_Mesh},
  year    = {2026},
  url     = {https://huggingface.co/spaces/Adityahars/Sanskrit-env},
  note    = {Fine-tuned model: archijaiswal07/Qwen_Finetuned}
}

Acknowledgements

Meta × HuggingFace OpenEnv · Gyan Bharatam Mission · Monier-Williams Sanskrit Dictionary · eGangotri · Murugesh et al. (2019) A Survey of Sanskrit NLP
