Model Card: sanskrit-qwen-grpo-v2
Published at: archijaiswal07/Qwen_Finetuned · Environment: Adityahars/Sanskrit-env · Framework: HuggingFace TRL · GRPO · License: Apache 2.0
Model Overview
sanskrit-qwen-grpo-v2 is a LoRA-adapted fine-tune of Qwen2.5-1.5B-Instruct, trained via Group Relative Policy Optimization (GRPO) on SanskritEnv, the first reinforcement learning environment for structured linguistic ambiguity resolution in Sanskrit manuscripts.
The model resolves six layers of classical Sanskrit interpretation that block automated translation of India's manuscript corpus: lexical disambiguation, sandhi resolution, samāsa classification, referential coherence tracking, evidence-driven manuscript restoration, and full cross-phase compositional consistency.
This is a v2 checkpoint, continuing from Adityahars/sanskrit-qwen-grpo (v1) with an additional 675 global GRPO steps across all six linguistic task types.
Intended Use
Primary use cases
- Sanskrit manuscript translation assistance: resolving lexical, phonological, morphological, and discourse-level ambiguity in classical texts (Ayurvedic, astronomical, philosophical, narrative).
- Structured NLP research: benchmarking on multi-layer linguistic decision tasks with deterministic graders.
- RL post-training research: reference implementation for GRPO on a humanistic domain with fully verifiable rewards and no LLM judge.
- Upstream disambiguation module for the Gyan Bharatam Mission and national manuscript digitization pipelines.
Out-of-scope
- Modern Hindi, Pali, or Prakrit (trained on classical Sanskrit only).
- Open-ended translation generation (decision/classification model, not a generative translator).
- General-purpose instruction following unrelated to Sanskrit linguistic analysis.
Training Details
Base Model
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Starting checkpoint | Adityahars/sanskrit-qwen-grpo (v1) |
| Architecture | Transformer decoder (causal LM) |
| Parameters | 1.5B |
| Adapter type | LoRA |
Hyperparameters
| Setting | Value |
|---|---|
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Group size | 8 |
| Epochs | 3 |
| Steps per epoch | 225 |
| Total global steps | 675 |
| Optimizer | AdamW |
| LR schedule | Cosine decay |
| Peak learning rate | 4.0 × 10⁻⁶ |
| Final learning rate | 5.9 × 10⁻⁹ |
| Per-device batch size | 4 |
| Gradient accumulation | 4 |
| Hardware | HuggingFace Jobs · A100-large |
| Training duration | ~6 hours |
Training Data
Episodes are generated dynamically at training time using seed variation over 150 unique hand-annotated base episodes per task (900 total). With EPISODES_PER_TASK=1500, the trainer samples 1500 (prompt, seed) pairs per task.
| Task | Base Episodes | Domains |
|---|---|---|
| Glossary Anchoring | 150 | Ayurveda, Astronomy, Philosophy |
| Sandhi Resolution | 150 | Philosophy, Ayurveda, Narrative |
| Samāsa Classification | 150 | Philosophy, Narrative, Ayurveda, Astronomy |
| Referential Coherence | 150 | Narrative, Philosophy |
| Manuscript Restoration | 150 | Ayurveda, Philosophy, Narrative, Astronomy |
| Full Manuscript Session | 150 | All domains |
Source passages: Bhagavad Gita, Charaka Samhita, Mahabharata, Ramayana, Kalidasa. All texts are public domain (composed before 1928).
Environment & Reward Design
Task Architecture (SanskritEnv)
| Task ID | Type | Steps/Episode | Core Challenge |
|---|---|---|---|
| glossary_anchoring | Single-step MCQ | 1 | Domain-specific term disambiguation |
| sandhi_resolution | Single-step MCQ | 1 | Phonological compound splitting |
| samasa_classification | Single-step MCQ | 1 | Grammatical compound type identification |
| referential_coherence | Multi-step MCQ | 4–7 | Cross-verse pronoun tracking |
| manuscript_restoration | Tool-use POMDP | Variable | Evidence gathering + deterministic commit |
| full_manuscript_session | Long-horizon chain | Multi-phase | All skills + cross-phase consistency |
Reward Functions
Tasks 1–4: Shaped MCQ reward (zero floor preserves GRPO group advantage variance):
| Outcome | Raw | Shaped |
|---|---|---|
| Full credit | 1.00 | 0.95 |
| Partial credit | 0.40 | 0.50 |
| Adjacent sandhi | 0.25 | 0.25 |
| Wrong | 0.00 | 0.00 |
Shaping: for raw ∈ [0.40, 1.00], shaped = 0.50 + (raw − 0.40) × (0.45 / 0.60), which maps 0.40 to 0.50 and 1.00 to 0.95. Wrong answers get a hard zero (never soft-floored) to maximize within-group reward std for GRPO.
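The shaping rule can be sketched as a small Python function. This is a minimal illustration, not the environment's actual code: the function name `shape_mcq_reward` is hypothetical, and passing scores below 0.40 through unchanged is an assumption inferred from the table (0.25 stays 0.25, 0.00 stays 0.00).

```python
def shape_mcq_reward(raw: float) -> float:
    """Map a raw grader score to a shaped reward (hypothetical helper)."""
    if raw >= 0.40:
        # Linear rescale of the band [0.40, 1.00] onto [0.50, 0.95].
        return 0.50 + (raw - 0.40) * (0.45 / 0.60)
    # Assumption: scores below the partial-credit band pass through
    # unchanged, so a wrong answer stays at a hard zero.
    return raw


for raw in (1.00, 0.40, 0.25, 0.00):
    print(f"raw={raw:.2f} -> shaped={shape_mcq_reward(raw):.2f}")
```

The four printed rows should line up with the Raw and Shaped columns of the table above.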
Task 5: POMDP tool reward + terminal commit:
per-step: tool_reward = relevance_bonus + workflow_bonus − redundancy_penalty
terminal: terminal_reward = r_correctness × M_evidence − P_budget
M_evidence = 0.60 + 0.40 × (relevant_tools_used / tools_needed)
P_budget = 0.10 × max(0, steps_used − ideal_steps) / tool_budget
| Condition | Reward |
|---|---|
| PRIMARY tool (first use) | +0.08 |
| SECOND tool (PRIMARY already called) | +0.05 |
| Workflow pair bonus (lexicon_lookup → commentary_fetch, etc.) | +0.03 to +0.05 |
| Redundant or irrelevant call | −0.05 |
| Wrong commit | 0.00 (regardless of evidence) |
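As a rough sketch of how the terminal part of the Task 5 reward composes, the Python below follows the formulas above term by term. The function names and the example numbers are hypothetical. Returning exactly 0.0 for a wrong commit, without applying the budget penalty, is an assumption taken from the last table row, and `tools_needed` and `tool_budget` are assumed to be at least 1.

```python
def evidence_multiplier(relevant_tools_used: int, tools_needed: int) -> float:
    # M_evidence = 0.60 + 0.40 * (relevant_tools_used / tools_needed)
    return 0.60 + 0.40 * (relevant_tools_used / tools_needed)


def budget_penalty(steps_used: int, ideal_steps: int, tool_budget: int) -> float:
    # P_budget = 0.10 * max(0, steps_used - ideal_steps) / tool_budget
    return 0.10 * max(0, steps_used - ideal_steps) / tool_budget


def terminal_reward(r_correctness: float, relevant_tools_used: int,
                    tools_needed: int, steps_used: int,
                    ideal_steps: int, tool_budget: int) -> float:
    if r_correctness == 0.0:
        # Assumption from the table: a wrong commit scores 0.00
        # regardless of how much evidence was gathered.
        return 0.0
    m = evidence_multiplier(relevant_tools_used, tools_needed)
    p = budget_penalty(steps_used, ideal_steps, tool_budget)
    return r_correctness * m - p


# Illustrative numbers only: correct commit, both needed tools used,
# one step over the ideal with a budget of five tool calls.
print(round(terminal_reward(1.0, 2, 2, 4, 3, 5), 4))
```

With these illustrative inputs the evidence multiplier is 1.0 and the budget penalty is 0.02, so gathering full evidence while overshooting the ideal step count costs only a small fraction of the reward.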
Task 6: Cross-phase consistency:
session_score = mean(phase_rewards) − 0.05 × contradictions + 0.05 × (zero_contradictions)
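A minimal Python sketch of this score follows. It assumes that `zero_contradictions` is an indicator equal to 1 when the session has no contradictions and 0 otherwise; the formula does not state this explicitly, so treat it as one reading. The function name and example rewards are hypothetical.

```python
from statistics import mean


def session_score(phase_rewards: list[float], contradictions: int) -> float:
    # Assumed reading: the bonus is an indicator, paid only when the
    # session contains no cross-phase contradictions at all.
    zero_contradictions = 1 if contradictions == 0 else 0
    return (mean(phase_rewards)
            - 0.05 * contradictions
            + 0.05 * zero_contradictions)


phases = [0.95, 0.50, 0.95]  # illustrative per-phase rewards
print(round(session_score(phases, contradictions=0), 4))
print(round(session_score(phases, contradictions=2), 4))
```

Under this reading, the gap between a clean session and one with a single contradiction is 0.10: the lost bonus plus the first penalty.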
Grader Design
All six graders are fully deterministic: no LLM judge, no BLEU/ROUGE. Scoring uses exact string match against pre-annotated answer tables in the data JSON. Identical seeds always produce identical scores across runs, models, and hardware.
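To illustrate what exact-match grading against an answer table looks like, here is a small Python sketch. The JSON layout, the field names `episode_id` and `answer_table`, and the function `grade` are all hypothetical; the real SanskritEnv data schema is not shown in this card.

```python
import json

# Hypothetical episode record: each candidate answer maps to a raw score.
EPISODE_JSON = """
{
  "episode_id": "demo-001",
  "answer_table": {"A": 1.00, "B": 0.40, "C": 0.00, "D": 0.00}
}
"""


def grade(episode: dict, answer: str) -> float:
    # Exact string match: no normalization, no fuzzy matching, no model call.
    # Any answer missing from the table scores zero.
    return episode["answer_table"].get(answer, 0.0)


episode = json.loads(EPISODE_JSON)
for candidate in ("A", "B", "a", "E"):
    print(candidate, grade(episode, candidate))
```

Because the lookup is a pure function of the episode record and the answer string, repeated calls with the same inputs always return the same score.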
Training Results
Reward Trajectory
| Phase | Logged steps | Mean Reward | Std | Entropy |
|---|---|---|---|---|
| Early (5–50) | 10 | 0.452 | 0.34 | 0.91 |
| Mid-epoch 1 (50–225) | 35 | 0.541 | 0.32 | 0.94 |
| Epoch 2 (225–450) | 45 | 0.561 | 0.30 | 0.95 |
| Epoch 3 (450–675) | 45 | 0.572 | 0.30 | 0.96 |
Starting reward (step 5): 0.475 · Peak (step 545): 0.733 · Final mean: 0.576 · Relative lift: +27% (final mean over the early-phase mean of 0.452)
GRPO Health Metrics
| Metric | Range | Status |
|---|---|---|
| Group reward std | 0.21–0.39 | Healthy advantage signal |
| frac_reward_zero_std | 0.0 throughout | No group collapse at any step |
| Policy entropy | 0.84–1.07 | No greedy-mode collapse |
| clipped_ratio | 0.99375–1.00 | Minimal clipping |
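frac_reward_zero_std == 0.0 for all 675 steps means every group had non-zero reward spread, so the advantage A_i = (r_i − μ) / (σ + ε) carried signal at every step. The hard zero floor on wrong answers is what keeps group std high: the measured range of 0.21–0.39 reaches into, but sits mostly below, the 0.35–0.45 band this card treats as the target for productive GRPO training.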
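The group-relative advantage can be sketched in a few lines of Python. This is an illustration of the formula only, not TRL's implementation: it uses the population standard deviation and an arbitrary ε of 1e-4, both of which are assumptions, and the reward values are made up from the shaped MCQ levels.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    # A_i = (r_i - mu) / (sigma + eps), computed within one sampled group.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a library may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of 8 completions with mixed outcomes (illustrative values).
mixed = [0.95, 0.95, 0.50, 0.50, 0.25, 0.00, 0.00, 0.00]
# A collapsed group: every completion earns the same reward.
collapsed = [0.50] * 8

print("mixed std:", round(pstdev(mixed), 3))
print("mixed advantages:", [round(a, 2) for a in group_advantages(mixed)])
print("collapsed advantages:", group_advantages(collapsed))
```

In the collapsed group every advantage is zero, so those completions contribute no gradient signal; this is the failure mode that frac_reward_zero_std tracks.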
Per-Task Evaluation (Pre vs Post)
Evaluated on the live SanskritEnv API · 25 episodes per task for Tasks 1–4 · 10 episodes per task for Tasks 5–6:
| Task | Pre-train | Post-train | Notes |
|---|---|---|---|
| Glossary Anchoring | 0.444 | ↑ | Largest MCQ lift |
| Sandhi Resolution | 0.630 | ↑ | Strong baseline, further improved |
| Samāsa Classification | 0.386 | ↑ | Significant lift |
| Referential Coherence | 0.278 | ↑ | Hardest task, lowest baseline |
| Manuscript Restoration | 0.573 | ↑ | Mean tools used: 1.4, mean steps: 2.4 |
| Full Manuscript Session | 0.824 | → stable | High baseline maintained |
Model Lineage
Qwen/Qwen2.5-1.5B (base)
└─ Qwen/Qwen2.5-1.5B-Instruct (instruction-tuned)
   └─ Adityahars/sanskrit-qwen-grpo (v1 · LoRA · GRPO on SanskritEnv)
      └─ archijaiswal07/Qwen_Finetuned (v2 · this model · +675 steps)
Inference
from transformers import pipeline

# Load the fine-tuned checkpoint as a text-generation pipeline on the GPU.
generator = pipeline(
    "text-generation",
    model="archijaiswal07/Qwen_Finetuned",
    device="cuda",
)
prompt = "The term 'agni' appears in this Ayurvedic passage. Which meaning is correct in this domain context?"
# Chat-style input: a list of role/content messages.
output = generator(
    [{"role": "user", "content": prompt}],
    max_new_tokens=128,
    return_full_text=False,  # return only the newly generated reply
)[0]
print(output["generated_text"])
Against the live environment:
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=archijaiswal07/Qwen_Finetuned
python inference.py
Limitations
- Domain specificity: Classical Sanskrit only. Modern registers, Pali, and Prakrits are untested.
- Classification, not generation: Selects from candidates; not a Sanskrit-to-English translator.
- Scale: At 1.5B parameters, long-range tracking tasks (Referential Coherence, Full Session) are near the model's working memory limit.
- Dataset scope: 150 annotated base episodes per task. Seed variation expands training coverage but linguistic diversity is bounded by annotation budget.
- Deterministic graders: Cannot reward paraphrastic or partially correct free-form interpretations a human scholar might accept.
Ethical Considerations
All Sanskrit source texts are public domain. Annotations, graders, and environment code are original to this project. The model is not intended to replace human Sanskrit scholars; its role is as an upstream disambiguation tool that routes ambiguous passages for expert review. Philological claims should be verified against canonical commentaries before use in academic or government digitization workflows.
Citation
@misc{sanskritenv2026,
title = {SanskritEnv: A Reinforcement Learning Environment for Sanskrit Manuscript Interpretation},
author = {Meta\_Mesh},
year = {2026},
url = {https://huggingface.co/spaces/Adityahars/Sanskrit-env},
note = {Fine-tuned model: archijaiswal07/Qwen_Finetuned}
}
Acknowledgements
Meta × HuggingFace OpenEnv · Gyan Bharatam Mission · Monier-Williams Sanskrit Dictionary · eGangotri · Murugesh et al. (2019) A Survey of Sanskrit NLP