Model Card — `sanskrit-qwen-grpo-v2`

Published at: archijaiswal07/Qwen_Finetuned · Environment: Adityahars/Sanskrit-env · Framework: HuggingFace TRL · GRPO · License: Apache 2.0

Model Overview

sanskrit-qwen-grpo-v2 is a LoRA-adapted fine-tune of Qwen2.5-1.5B-Instruct, trained via Group Relative Policy Optimization (GRPO) on SanskritEnv — the first reinforcement learning environment for structured linguistic ambiguity resolution in Sanskrit manuscripts.

The model resolves six layers of classical Sanskrit interpretation that block automated translation of India's manuscript corpus: lexical disambiguation, sandhi resolution, samāsa classification, referential coherence tracking, evidence-driven manuscript restoration, and full cross-phase compositional consistency.

This is a v2 checkpoint, continuing from Adityahars/sanskrit-qwen-grpo (v1) with an additional 675 global GRPO steps across all six linguistic task types.

Intended Use

Primary use cases

Sanskrit manuscript translation assistance: resolving lexical, phonological, morphological, and discourse-level ambiguity in classical texts (Ayurvedic, astronomical, philosophical, narrative).
Structured NLP research: benchmarking on multi-layer linguistic decision tasks with deterministic graders.
RL post-training research: reference implementation for GRPO on a humanistic domain with fully verifiable rewards and no LLM judge.
Upstream disambiguation module for the Gyan Bharatam Mission and national manuscript digitization pipelines.

Out-of-scope

Modern Hindi, Pali, or Prakrit (trained on classical Sanskrit only).
Open-ended translation generation (decision/classification model, not a generative translator).
General-purpose instruction following unrelated to Sanskrit linguistic analysis.

Training Details

Base Model

Property	Value
Base model	`Qwen/Qwen2.5-1.5B-Instruct`
Starting checkpoint	`Adityahars/sanskrit-qwen-grpo` (v1)
Architecture	Transformer decoder (causal LM)
Parameters	1.5B
Adapter type	LoRA

Hyperparameters

Setting	Value
Algorithm	GRPO (Group Relative Policy Optimization)
Group size	8
Epochs	3
Steps per epoch	225
Total global steps	675
Optimizer	AdamW
LR schedule	Cosine decay
Peak learning rate	4.0 × 10⁻⁶
Final learning rate	5.9 × 10⁻⁹
Per-device batch size	4
Gradient accumulation	4
Hardware	HuggingFace Jobs · A100-large
Training duration	~6 hours

Training Data

Episodes are generated dynamically at training time using seed variation over 150 unique hand-annotated base episodes per task (900 total). With EPISODES_PER_TASK=1500, the trainer samples 1500 (prompt, seed) pairs per task.

Task	Base Episodes	Domains
Glossary Anchoring	150	Ayurveda, Astronomy, Philosophy
Sandhi Resolution	150	Philosophy, Ayurveda, Narrative
Samāsa Classification	150	Philosophy, Narrative, Ayurveda, Astronomy
Referential Coherence	150	Narrative, Philosophy
Manuscript Restoration	150	Ayurveda, Philosophy, Narrative, Astronomy
Full Manuscript Session	150	All domains
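
The seed-variation sampling described above can be sketched as follows. This is a minimal illustration, not the project's trainer code: the helper name `sample_training_pairs`, the per-task RNG seeding, and uniform sampling with replacement are all assumptions. Only the counts (150 base episodes, 1500 pairs per task) and the task IDs come from this card.

```python
import random

TASK_IDS = [
    "glossary_anchoring",
    "sandhi_resolution",
    "samasa_classification",
    "referential_coherence",
    "manuscript_restoration",
    "full_manuscript_session",
]
BASE_EPISODES_PER_TASK = 150   # hand-annotated base episodes (from the card)
EPISODES_PER_TASK = 1500       # sampled (prompt, seed) pairs per task (from the card)


def sample_training_pairs(task_id, run_seed=0):
    """Return EPISODES_PER_TASK (base_episode_index, episode_seed) pairs.

    Hypothetical helper: uniform sampling with replacement is an assumption.
    """
    # Seeding with a string is deterministic across runs and machines.
    rng = random.Random(f"{task_id}:{run_seed}")
    return [
        (rng.randrange(BASE_EPISODES_PER_TASK), rng.randrange(2**31))
        for _ in range(EPISODES_PER_TASK)
    ]


pairs = {task: sample_training_pairs(task) for task in TASK_IDS}
print(sum(len(p) for p in pairs.values()))  # 9000 pairs across the six tasks
```

Because 1500 pairs are drawn from 150 base episodes, each base episode is expected to appear about ten times per task under different seeds.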

Source passages: Bhagavad Gita, Charaka Samhita, Mahabharata, Ramayana, Kalidasa. All texts are public domain (composed before 1928).

Environment & Reward Design

Task Architecture (SanskritEnv)

Task ID	Type	Steps/Episode	Core Challenge
`glossary_anchoring`	Single-step MCQ	1	Domain-specific term disambiguation
`sandhi_resolution`	Single-step MCQ	1	Phonological compound splitting
`samasa_classification`	Single-step MCQ	1	Grammatical compound type identification
`referential_coherence`	Multi-step MCQ	4–7	Cross-verse pronoun tracking
`manuscript_restoration`	Tool-use POMDP	Variable	Evidence gathering + deterministic commit
`full_manuscript_session`	Long-horizon chain	Multi-phase	All skills + cross-phase consistency

Reward Functions

Tasks 1–4 — Shaped MCQ reward (zero floor preserves GRPO group advantage variance):

Outcome	Raw	Shaped
Full credit	1.00	0.95
Partial credit	0.40	0.50
Adjacent sandhi	0.25	0.25
Wrong	0.00	0.00

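Shaping: for raw ∈ [0.40, 1.00], shaped = 0.50 + (raw − 0.40) × (0.45 / 0.60), which maps 0.40 to 0.50 and 1.00 to 0.95. The adjacent-sandhi score of 0.25 lies below this interval and passes through unchanged. Wrong answers are a hard zero (never soft-floored) to maximize within-group reward std, the quantity GRPO divides by when normalizing advantages.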
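
The shaping rule can be written as a small function. This is a sketch under stated assumptions: the name `shape_mcq_reward` is illustrative, and the pass-through for raw scores below 0.40 is inferred from the table rows for adjacent sandhi (0.25 → 0.25) and wrong answers (0.00 → 0.00).

```python
def shape_mcq_reward(raw: float) -> float:
    """Map a raw grader score to the shaped reward used for Tasks 1-4."""
    if not 0.0 <= raw <= 1.0:
        raise ValueError(f"raw score out of range: {raw}")
    if raw >= 0.40:
        # Linear rescale of [0.40, 1.00] onto [0.50, 0.95].
        return 0.50 + (raw - 0.40) * (0.45 / 0.60)
    # Assumption: scores below 0.40 (0.25 adjacent sandhi, 0.00 wrong) pass through.
    return raw


for raw in (1.00, 0.40, 0.25, 0.00):
    print(f"{raw:.2f} -> {shape_mcq_reward(raw):.2f}")
```

The four printed pairs reproduce the Raw and Shaped columns of the table.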

Task 5 — POMDP tool reward + terminal commit:

per-step:  tool_reward = relevance_bonus + workflow_bonus − redundancy_penalty
terminal:  terminal_reward = r_correctness × M_evidence − P_budget
           M_evidence = 0.60 + 0.40 × (relevant_tools_used / tools_needed)
           P_budget   = 0.10 × max(0, steps_used − ideal_steps) / tool_budget

Condition	Reward
PRIMARY tool (first use)	+0.08
SECOND tool (PRIMARY already called)	+0.05
Workflow pair bonus (`lexicon_lookup → commentary_fetch`, etc.)	+0.03 – +0.05
Redundant or irrelevant call	−0.05
Wrong commit	0.00 (regardless of evidence)
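
A minimal sketch of the terminal commit reward, following the formulas above. The function name and argument names are illustrative. The formula alone would give a wrong commit a score of −P_budget, so returning exactly 0.00 for a wrong commit is an assumption made here to match the table row. `tools_needed` and `tool_budget` are assumed to be positive.

```python
def terminal_reward(r_correctness, relevant_tools_used, tools_needed,
                    steps_used, ideal_steps, tool_budget):
    """Terminal reward for Task 5: r_correctness * M_evidence - P_budget."""
    if r_correctness <= 0.0:
        # Assumption: a wrong commit scores 0.00 regardless of evidence or budget.
        return 0.0
    # Evidence multiplier: 0.60 with no relevant tools, 1.00 with all of them.
    m_evidence = 0.60 + 0.40 * (relevant_tools_used / tools_needed)
    # Budget penalty: applies only to steps beyond the ideal step count.
    p_budget = 0.10 * max(0, steps_used - ideal_steps) / tool_budget
    return r_correctness * m_evidence - p_budget


# Correct commit, all evidence gathered, no extra steps.
print(f"{terminal_reward(1.0, 2, 2, 3, 3, 6):.3f}")   # 1.000
# Correct commit, half the evidence, two steps over ideal with a budget of 8.
print(f"{terminal_reward(1.0, 1, 2, 5, 3, 8):.3f}")   # 0.775
# Wrong commit.
print(f"{terminal_reward(0.0, 2, 2, 3, 3, 6):.3f}")   # 0.000
```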

Task 6 — Cross-phase consistency:

session_score = mean(phase_rewards) − 0.05 × contradictions + 0.05 × (zero_contradictions)
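
The session score can be sketched directly from the formula. The function name mirrors the formula, and `zero_contradictions` is read here as a 0/1 indicator, which is an interpretation rather than something the card states explicitly.

```python
from statistics import fmean


def session_score(phase_rewards, contradictions):
    """Mean phase reward, minus 0.05 per contradiction, plus 0.05 if there are none."""
    # Assumption: zero_contradictions is an indicator (1 when contradictions == 0).
    bonus = 0.05 if contradictions == 0 else 0.0
    return fmean(phase_rewards) - 0.05 * contradictions + bonus


print(f"{session_score([0.95, 0.50, 0.95], 0):.2f}")  # 0.85
print(f"{session_score([0.95, 0.50, 0.95], 2):.2f}")  # 0.70
```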

Grader Design

All six graders are fully deterministic — no LLM judge, no BLEU/ROUGE. Scoring uses exact string match against pre-annotated answer tables in the data JSON. Identical seeds always produce identical scores across runs, models, and hardware.
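
A deterministic exact-match grader of this kind might look like the sketch below. The table layout, episode ID, and option labels are hypothetical placeholders, since the actual schema of the data JSON is not shown in this card.

```python
# Hypothetical answer table: episode ID -> {exact answer string: raw score}.
ANSWER_TABLE = {
    "example_episode": {"B": 1.00, "C": 0.40},
}


def grade(episode_id: str, answer: str) -> float:
    """Return the raw score for an answer using exact string match only."""
    # Any string not listed for the episode, including case variants, scores 0.0.
    return ANSWER_TABLE[episode_id].get(answer, 0.0)


print(grade("example_episode", "B"))  # 1.0
print(grade("example_episode", "C"))  # 0.4
print(grade("example_episode", "b"))  # 0.0
```

Because scoring is a pure table lookup with no sampling or model call, the same answer always receives the same score.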

Training Results

Reward Trajectory

Phase (step range)	Logged points	Mean Reward	Std	Entropy
Early (5–50)	10	0.452	0.34	0.91
Mid-epoch 1 (50–225)	35	0.541	0.32	0.94
Epoch 2 (225–450)	45	0.561	0.30	0.95
Epoch 3 (450–675)	45	0.572	0.30	0.96

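Starting reward (step 5): 0.475 · Peak (step 545): 0.733 · Final mean: 0.576 · Relative lift: +27% (final mean over the early-phase mean of 0.452; measured against the step-5 reward of 0.475 the lift is +21%)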

GRPO Health Metrics

Metric	Range	Status
Group reward std	0.21 – 0.39	✅ Healthy advantage signal
`frac_reward_zero_std`	0.0 throughout	✅ No group collapse at any step
Policy entropy	0.84 – 1.07	✅ No greedy-mode collapse
`clipped_ratio`	0.99375 – 1.00	✅ Minimal clipping

frac_reward_zero_std == 0.0 for all 675 steps means no group ever received identical rewards, so the advantage A_i = (r_i − μ) / (σ + ε) never collapsed to zero for a whole group. The hard zero floor on wrong answers helps keep group std in the observed 0.21–0.39 range, which gives GRPO a usable advantage signal.
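
The group-relative advantage and the zero-std fraction can be illustrated with plain Python. This is a sketch, not TRL's implementation: the value of ε, the use of population standard deviation, and the example reward groups are assumptions chosen for illustration.

```python
from statistics import fmean, pstdev

EPS = 1e-4  # assumed value of epsilon


def group_advantages(rewards, eps=EPS):
    """A_i = (r_i - mu) / (sigma + eps), computed within one group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


def frac_reward_zero_std(groups):
    """Fraction of groups whose rewards are all identical (std == 0)."""
    return sum(pstdev(g) == 0.0 for g in groups) / len(groups)


mixed = [0.95, 0.0, 0.0, 0.5, 0.95, 0.0, 0.25, 0.95]  # illustrative group of 8
collapsed = [0.0] * 8                                  # every completion wrong

print([round(a, 2) for a in group_advantages(mixed)])
print(group_advantages(collapsed))               # all zeros: no learning signal
print(frac_reward_zero_std([mixed, collapsed]))  # 0.5
```

A collapsed group contributes no gradient because every advantage in it is zero, which is why a zero-std fraction of 0.0 is a useful health indicator.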

Per-Task Evaluation (Pre vs Post)

Evaluated on the live SanskritEnv API · 25 episodes each for Tasks 1–4 · 10 episodes each for Tasks 5–6:

Task	Pre-train	Post-train (direction only)	Notes
Glossary Anchoring	0.444	↑	Largest MCQ lift
Sandhi Resolution	0.630	↑	Strong baseline, further improved
Samāsa Classification	0.386	↑	Significant lift
Referential Coherence	0.278	↑	Hardest task, lowest baseline
Manuscript Restoration	0.573	↑	Mean tools used: 1.4, mean steps: 2.4
Full Manuscript Session	0.824	↑ stable	High baseline maintained

Model Lineage

Qwen/Qwen2.5-1.5B (base)
  └─ Qwen/Qwen2.5-1.5B-Instruct (instruction-tuned)
       └─ Adityahars/sanskrit-qwen-grpo (v1 · LoRA · GRPO on SanskritEnv)
            └─ archijaiswal07/Qwen_Finetuned (v2 · this model · +675 steps)

Inference

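from transformers import pipeline

# Load the checkpoint as a chat-style text-generation pipeline.
# Assumes a CUDA GPU is available; pass device="cpu" otherwise.
# If the repository ships only LoRA adapter weights, the `peft` package
# must also be installed so the adapter can be applied to the base model.
generator = pipeline(
    "text-generation",
    model="archijaiswal07/Qwen_Finetuned",
    device="cuda",
)

prompt = "The term 'agni' appears in this Ayurvedic passage. Which meaning is correct in this domain context?"

# Passing a list of chat messages applies the model's chat template.
# return_full_text=False returns only the newly generated reply.
output = generator(
    [{"role": "user", "content": prompt}],
    max_new_tokens=128,
    return_full_text=False,
)[0]
print(output["generated_text"])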

Against the live environment:

export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=archijaiswal07/Qwen_Finetuned
python inference.py

Limitations

Domain specificity: Classical Sanskrit only. Modern registers, Pali, and Prakrits are untested.
Classification, not generation: Selects from candidates; not a Sanskrit-to-English translator.
Scale: At 1.5B parameters, long-range tracking tasks (Referential Coherence, Full Session) are near the model's working memory limit.
Dataset scope: 150 annotated base episodes per task. Seed variation expands training coverage but linguistic diversity is bounded by annotation budget.
Deterministic graders: Cannot reward paraphrastic or partially correct free-form interpretations a human scholar might accept.

Ethical Considerations

All Sanskrit source texts are public domain. Annotations, graders, and environment code are original to this project. The model is not intended to replace human Sanskrit scholars — its role is as an upstream disambiguation tool that routes ambiguous passages for expert review. Philological claims should be verified against canonical commentaries before use in academic or government digitization workflows.

Citation

@misc{sanskritenv2026,
  title   = {SanskritEnv: A Reinforcement Learning Environment for Sanskrit Manuscript Interpretation},
  author  = {Meta\_Mesh},
  year    = {2026},
  url     = {https://huggingface.co/spaces/Adityahars/Sanskrit-env},
  note    = {Fine-tuned model: archijaiswal07/Qwen_Finetuned}
}

Acknowledgements

Meta × HuggingFace OpenEnv · Gyan Bharatam Mission · Monier-Williams Sanskrit Dictionary · eGangotri · Murugesh et al. (2019) A Survey of Sanskrit NLP
