GPRM-4B

👋 Join our community. 📖 Read the GPRM technical report. 📝 API access available upon request.

[GitHub] [Technical Report]

Introduction

GPRM (Global Perspective Process Reward Model) is a next-generation process reward model designed to overcome the "local context" limitations of traditional PRMs. While previous models judge each step in isolation, GPRM introduces a Global Perspective, significantly improving error localization and reasoning verification in long-chain tasks.

Previous PRMs often suffer from two major flaws: they ignore historical evaluations and lack visibility into how a step affects future reasoning.

GPRM addresses these via:

  • History-Aware Evaluation: Explicitly conditions on previous steps and their associated judgments.
  • Future-Informed Reasoning: Incorporates a look-ahead perspective to validate steps against subsequent derivations.
  • 4-D Diagnostic Framework: Structured evaluation across Look-back (consistency), Look-ahead (plausibility), Self-check (validity), and Goal alignment.
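
The four dimensions amount to a structured per-step verdict rather than a single scalar score. As a minimal sketch (the field names and pass/fail framing here are illustrative, not the model's actual output schema):

```python
from dataclasses import dataclass

@dataclass
class StepDiagnosis:
    """Illustrative container for a 4-D per-step verdict."""
    look_back: bool       # consistent with prior steps and their judgments?
    look_ahead: bool      # plausible given subsequent derivations?
    self_check: bool      # internally valid (the step's own math/logic)?
    goal_alignment: bool  # still progressing toward the question's goal?

    def is_correct(self) -> bool:
        # A step passes only if every dimension passes.
        return all((self.look_back, self.look_ahead,
                    self.self_check, self.goal_alignment))

step = StepDiagnosis(look_back=True, look_ahead=True,
                     self_check=False, goal_alignment=True)
print(step.is_correct())  # False: the step fails its self-check
```

Decomposing the verdict this way is what enables error localization: a step can be locally valid yet fail look-back consistency or drift from the goal.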

Benchmark

PRMBench (Overall Score)

| Model | Simplicity | Soundness | Sensitivity | Overall |
|---|---:|---:|---:|---:|
| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
| o1-mini | 64.6 | 72.1 | 75.5 | 68.8 |
| Gemini-2.0-flash-exp | 58.1 | 66.0 | 75.4 | 66.9 |
| Qwen2.5-Math-PRM-7B | 52.1 | 71.0 | 75.5 | 65.5 |
| R-PRM-7B-DPO | 55.2 | 71.2 | 76.6 | 66.8 |
| GenPRM-7B | 56.1 | 71.8 | 77.0 | 67.4 |
| Skywork-PRM-7B | 59.6 | 68.5 | 73.3 | 65.1 |
| GPRM-4B-SFT | 65.0 | 75.2 | 78.8 | 72.9 |
| GPRM-4B-GRPO | 65.8 | 76.2 | 79.3 | 73.9 |
| GPRM-14B-GRPO | 67.2 | 77.6 | 80.2 | 74.6 |

ProcessBench (Avg. F1 Score)

| Model | GSM8K | MATH | OlympiadBench | OmniMath | Avg. F1 |
|---|---:|---:|---:|---:|---:|
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
| Qwen2.5-Math-PRM-7B | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
| R-PRM-7B-DPO | 80.7 | 76.9 | 63.8 | 60.1 | 70.4 |
| GenPRM-7B | 73.7 | 77.9 | 71.8 | 73.8 | 74.1 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| GPRM-4B-SFT | 73.1 | 76.2 | 69.4 | 70.5 | 72.3 |
| GPRM-4B-GRPO | 73.1 | 77.5 | 71.5 | 75.1 | 74.3 |
| GPRM-14B-GRPO | 74.7 | 79.3 | 73.9 | 75.3 | 75.8 |

Agent Error Bench (Accuracy %)

| Model | ALFWorld (S / S+M) | WebShop (S / S+M) | GAIA (S / S+M) | Average (S / S+M) |
|---|---:|---:|---:|---:|
| Direct Prompting (GPT-4.1) | 28.0 / 14.0 | 30.0 / 6.0 | 26.0 / 10.0 | 28.0 / 10.0 |
| AgentDebug | 35.0 / 28.0 | 42.0 / 22.0 | 58.0 / 44.0 | 45.0 / 31.3 |
| GPRM-4B | 38.0 / 30.0 | 44.0 / 24.0 | 60.0 / 46.0 | 47.0 / 33.0 |
| GPRM-14B | 46.0 / 37.0 | 51.0 / 29.0 | 67.0 / 51.0 | 54.0 / 39.0 |

PPE - Verifiable Correctness Subset (Mean Score)

| Reward Model | MMLU-Pro | MATH | GPQA | MBPP-Plus | IFEval | Mean |
|---|---:|---:|---:|---:|---:|---:|
| Claude 3.5 (ArenaHard) | 0.81 | 0.86 | 0.63 | 0.54 | 0.58 | 0.68 |
| Athene-RM-70B | 0.77 | 0.79 | 0.59 | 0.68 | 0.62 | 0.69 |
| GPT-4o-mini (ArenaHard) | 0.71 | 0.81 | 0.57 | 0.54 | 0.56 | 0.63 |
| Llama-3.1-70B (ArenaHard) | 0.73 | 0.73 | 0.56 | 0.58 | 0.56 | 0.63 |
| GPRM-4B | 0.67 | 0.71 | 0.58 | 0.65 | 0.59 | 0.64 |

Downstream Test-Time Search (Base: Qwen2.5-7B-Instruct)

Best-of-8 (Accuracy %)

| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|---|---:|---:|---:|---:|---:|---:|---:|
| Reference: pass@1 | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| Reference: maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
| GPRM-4B | 20.0 | 63.0 | 82.6 | 48.5 | 40.5 | 45.0 | 50.1 |
| GPRM-14B | 20.0 | 64.2 | 83.1 | 50.3 | 42.6 | 45.8 | 51.0 |
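
Best-of-N selection reduces to scoring each sampled solution with the PRM and keeping the highest-scored one. A minimal sketch, where `prm_score` is a hypothetical stand-in for an actual GPRM call that aggregates per-step scores into a scalar:

```python
def best_of_n(question, candidates, prm_score):
    """Pick the candidate solution the PRM rates highest.

    `prm_score(question, solution)` is assumed to return a scalar,
    e.g. the minimum or product of per-step correctness scores.
    """
    return max(candidates, key=lambda sol: prm_score(question, sol))

# Toy usage with a fake scorer that prefers shorter solutions.
cands = ["a long wrong derivation", "short proof", "medium answer here"]
print(best_of_n("q", cands, lambda q, s: -len(s)))  # short proof
```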

Greedy Guided Search@8 (Accuracy %)

| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|---|---:|---:|---:|---:|---:|---:|---:|
| R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
| GPRM-4B | 23.3 | 85.0 | 80.0 | 48.0 | 45.0 | 48.8 | 55.0 |
| GPRM-14B | 23.3 | 87.5 | 85.0 | 45.0 | 39.5 | 50.0 | 55.0 |
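
Unlike best-of-N, guided search lets the PRM intervene at every step: the generator proposes several candidate next steps, the PRM scores each in context, and only the top-scored step is kept before generating again. A minimal greedy sketch, with `propose_steps` and `prm_step_score` as hypothetical stand-ins for the generator and the PRM:

```python
def greedy_guided_search(question, propose_steps, prm_step_score,
                         k=8, max_steps=10):
    """Build a solution step by step, always keeping the step
    the PRM rates highest among k sampled continuations."""
    history = []
    for _ in range(max_steps):
        candidates = propose_steps(question, history, k)  # k next-step drafts
        best = max(candidates,
                   key=lambda s: prm_step_score(question, history, s))
        history.append(best)
        if best.endswith("[EOS]"):  # generator signals completion
            break
    return history

# Toy demo: candidate steps are strings; the scorer prefers the
# highest index among each round's drafts.
steps = greedy_guided_search(
    "q",
    propose_steps=lambda q, h, k: [f"{len(h)}-{i}" for i in range(k)],
    prm_step_score=lambda q, h, s: int(s.split("-")[1]),
    k=3, max_steps=2)
print(steps)  # ['0-2', '1-2']
```

Because GPRM conditions each judgment on the accumulated history, the `history` argument passed to the scorer carries the previously accepted steps, matching its history-aware evaluation setup.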

Training Strategy

GPRM utilizes a two-stage progressive training pipeline:

  1. Stage I (Structured SFT): Learns 4-dimensional diagnostic reasoning via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency), with Qwen3-235B-Instruct serving as the teacher for annotation.
  2. Stage II (GRPO Optimization): Refines evaluation policy under complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
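
GRPO dispenses with a learned value model by normalizing each sampled rollout's reward against the mean and standard deviation of its group. A minimal sketch of that advantage computation (the training loop and KL terms are omitted):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: (r - mean(group)) / std(group)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Four sampled judgments for one prompt, rewarded 1 if correct else 0:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# mean 0.5, std 0.5 -> [1.0, -1.0, 1.0, -1.0]
```

Hard-mined samples matter here: if every rollout in a group gets the same reward, the advantages are all zero and the group contributes no gradient signal.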

Serve GPRM Locally

The following open-source frameworks support local deployment of GPRM-4B:

  • vLLM (v0.19.0+): see recipes
  • SGLang (v0.5.10+): see cookbook
  • Transformers (v4.45.0+): see docs
Quick start with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GPRM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="bfloat16"
)

# Input format: [Question] + [History: Step_i | Judgment_i] + [Current Step] + [Future Context (optional)]
```
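
That input format can be assembled as plain text along these lines; the bracketed delimiters below are illustrative assumptions, so consult the cookbook for the model's actual chat template:

```python
def build_gprm_prompt(question, history, current_step, future_context=None):
    """Assemble the [Question] + [History] + [Current Step]
    (+ [Future Context]) layout. Delimiters are illustrative."""
    parts = [f"[Question]\n{question}"]
    for i, (step, judgment) in enumerate(history, 1):
        parts.append(f"[Step {i}] {step}\n[Judgment {i}] {judgment}")
    parts.append(f"[Current Step]\n{current_step}")
    if future_context:
        parts.append(f"[Future Context]\n{future_context}")
    return "\n\n".join(parts)

prompt = build_gprm_prompt(
    "Compute 3 + 4 * 2.",
    history=[("Apply precedence: 4 * 2 = 8.", "correct")],
    current_step="Add: 3 + 8 = 11.",
)
print(prompt.splitlines()[0])  # [Question]
```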