# GPRM-4B
Join our community. Read the GPRM technical report. API access available upon request.
[GitHub] [Technical Report]
## Introduction
GPRM (Global Perspective Process Reward Model) is a next-generation process reward model designed to overcome the "local context" limitations of traditional PRMs. While previous models judge each step in isolation, GPRM introduces a Global Perspective, significantly improving error localization and reasoning verification in long-chain tasks.
Previous PRMs often suffer from two major flaws: they ignore historical evaluations and lack visibility into how a step affects future reasoning.
GPRM addresses these via:
- History-Aware Evaluation: Explicitly conditions on previous steps and their associated judgments.
- Future-Informed Reasoning: Incorporates a look-ahead perspective to validate steps against subsequent derivations.
- 4-D Diagnostic Framework: Structured evaluation across Look-back (consistency), Look-ahead (plausibility), Self-check (validity), and Goal alignment.
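As a sketch of how a global-perspective input could be assembled from these ingredients (the bracketed field labels follow the format noted later in this card; the exact delimiters and the function itself are illustrative assumptions, not the official template):

```python
def build_gprm_input(question, history, current_step, future_context=None):
    """Assemble a history-aware, future-informed evaluation prompt.

    history: list of (step_text, judgment) pairs from earlier evaluations,
    so the model can condition on its own prior judgments (Look-back).
    future_context: optional later derivations for Look-ahead validation.
    """
    parts = [f"[Question] {question}"]
    for i, (step, judgment) in enumerate(history, start=1):
        parts.append(f"[History] Step_{i}: {step} | Judgment_{i}: {judgment}")
    parts.append(f"[Current Step] {current_step}")
    if future_context:
        parts.append(f"[Future Context] {future_context}")
    return "\n".join(parts)
```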
## Benchmarks
### PRMBench (Overall Score)
| Model | Simplicity | Soundness | Sensitivity | Overall |
|---|---|---|---|---|
| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
| o1-mini | 64.6 | 72.1 | 75.5 | 68.8 |
| Gemini-2.0-flash-exp | 58.1 | 66.0 | 75.4 | 66.9 |
| Qwen2.5-Math-PRM-7B | 52.1 | 71.0 | 75.5 | 65.5 |
| R-PRM-7B-DPO | 55.2 | 71.2 | 76.6 | 66.8 |
| GenPRM-7B | 56.1 | 71.8 | 77.0 | 67.4 |
| Skywork-PRM-7B | 59.6 | 68.5 | 73.3 | 65.1 |
| GPRM-4B-SFT | 65.0 | 75.2 | 78.8 | 72.9 |
| GPRM-4B-GRPO | 65.8 | 76.2 | 79.3 | 73.9 |
| GPRM-14B-GRPO | 67.2 | 77.6 | 80.2 | 74.6 |
### ProcessBench (Avg. F1 Score)
| Model | GSM8K | MATH | OlympiadBench | OmniMath | Avg. F1 |
|---|---|---|---|---|---|
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
| Qwen2.5-Math-PRM-7B | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
| R-PRM-7B-DPO | 80.7 | 76.9 | 63.8 | 60.1 | 70.4 |
| GenPRM-7B | 73.7 | 77.9 | 71.8 | 73.8 | 74.1 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| GPRM-4B-SFT | 73.1 | 76.2 | 69.4 | 70.5 | 72.3 |
| GPRM-4B-GRPO | 73.1 | 77.5 | 71.5 | 75.1 | 74.3 |
| GPRM-14B-GRPO | 74.7 | 79.3 | 73.9 | 75.3 | 75.8 |
### Agent Error Bench (Accuracy %)
| Model | ALFWorld (S/S+M) | WebShop (S/S+M) | GAIA (S/S+M) | Average (S/S+M) |
|---|---|---|---|---|
| Direct Prompting (GPT-4.1) | 28.0 / 14.0 | 30.0 / 6.0 | 26.0 / 10.0 | 28.0 / 10.0 |
| AgentDebug | 35.0 / 28.0 | 42.0 / 22.0 | 58.0 / 44.0 | 45.0 / 31.3 |
| GPRM-4B | 38.0 / 30.0 | 44.0 / 24.0 | 60.0 / 46.0 | 47.0 / 33.0 |
| GPRM-14B | 46.0 / 37.0 | 51.0 / 29.0 | 67.0 / 51.0 | 54.0 / 39.0 |
### PPE - Verifiable Correctness Subset (Mean Score)
| Reward Model | MMLU-Pro | MATH | GPQA | MBPP-Plus | IFEval | Mean |
|---|---|---|---|---|---|---|
| Claude 3.5 (ArenaHard) | 0.81 | 0.86 | 0.63 | 0.54 | 0.58 | 0.68 |
| Athene-RM-70B | 0.77 | 0.79 | 0.59 | 0.68 | 0.62 | 0.69 |
| GPT-4o-mini (ArenaHard) | 0.71 | 0.81 | 0.57 | 0.54 | 0.56 | 0.63 |
| Llama-3.1-70B (ArenaHard) | 0.73 | 0.73 | 0.56 | 0.58 | 0.56 | 0.63 |
| GPRM-4B | 0.67 | 0.71 | 0.58 | 0.65 | 0.59 | 0.64 |
## Downstream Test-Time Search (Base: Qwen2.5-7B-Instruct)
### Best-of-8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|---|---|---|---|---|---|---|---|
| Reference: pass@1 | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| Reference: maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
| GPRM-4B | 20.0 | 63.0 | 82.6 | 48.5 | 40.5 | 45.0 | 50.1 |
| GPRM-14B | 20.0 | 64.2 | 83.1 | 50.3 | 42.6 | 45.8 | 51.0 |
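Best-of-8 reranking samples eight complete solutions from the base model, scores each with the PRM, and keeps the top-scoring candidate. A minimal sketch, with one assumption flagged: the min-over-steps aggregation below is a common convention for combining per-step PRM scores, not necessarily the exact aggregation used in these experiments.

```python
def solution_score(step_scores):
    """Aggregate per-step PRM scores into one solution-level score.
    min() is a common choice (assumed here): a reasoning chain is only
    as sound as its weakest step."""
    return min(step_scores)

def best_of_n(candidates):
    """candidates: list of (solution_text, per_step_prm_scores) tuples.
    Return the solution whose aggregated PRM score is highest."""
    return max(candidates, key=lambda c: solution_score(c[1]))[0]
```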
### Greedy Guided Search@8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|---|---|---|---|---|---|---|---|
| R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
| GPRM-4B | 23.3 | 85.0 | 80.0 | 48.0 | 45.0 | 48.8 | 55.0 |
| GPRM-14B | 23.3 | 87.5 | 85.0 | 45.0 | 39.5 | 50.0 | 55.0 |
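Greedy guided search applies the PRM during decoding rather than after it: at each depth the base model proposes candidate next steps, the PRM scores each in context, and only the highest-scoring step is kept. A sketch under assumed interfaces (`propose_steps` and `score_step` are hypothetical callables standing in for the base model and the PRM):

```python
def greedy_guided_search(question, propose_steps, score_step,
                         max_depth=16, width=8):
    """Build a solution one step at a time, keeping the single best
    PRM-scored candidate at each depth (greedy, i.e. beam width 1).

    propose_steps(question, partial) -> list of candidate next steps
    score_step(question, partial, step) -> scalar PRM score
    """
    partial = []
    for _ in range(max_depth):
        candidates = propose_steps(question, partial)[:width]
        if not candidates:  # proposer signals the chain is complete
            break
        best = max(candidates,
                   key=lambda s: score_step(question, partial, s))
        partial.append(best)
    return partial
```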
## Training Strategy
GPRM is trained with a two-stage progressive pipeline:
- Stage I (Structured SFT): learns 4-dimensional diagnostic reasoning via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency), with Qwen3-235B-Instruct serving as the annotation teacher.
- Stage II (GRPO Optimization): refines the evaluation policy under the complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
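GRPO replaces a learned value baseline with group statistics: each sampled response's reward is normalized against the mean and standard deviation of its own group. A minimal sketch of that advantage computation (the reward design and group size used in Stage II are not specified here):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one group of sampled responses:
    A_i = (r_i - mean(group)) / (std(group) + eps).
    Responses scoring above the group mean get positive advantage."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```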
## Serve GPRM Locally
The following open-source frameworks support local deployment of GPRM-4B:
- vLLM (v0.19.0+): see recipes
- SGLang (v0.5.10+): see cookbook
- Transformers (v4.45.0+): see docs
Quick start with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GPRM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="bfloat16")

# Input format: [Question] + [History: Step_i | Judgment_i] + [Current Step] + [Future Context (optional)]
prompt = "..."  # assemble per the input format above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```