# GPRM-4B
Join our community. Read the GPRM technical report. API access available upon request.
[GitHub] [Technical Report]
## Introduction
GPRM (Global Perspective Process Reward Model) is a next-generation process reward model designed to overcome the "local context" limitations of traditional PRMs. While previous models judge each step in isolation, GPRM introduces a Global Perspective, significantly improving error localization and reasoning verification in long-chain tasks.
Previous PRMs often suffer from two major flaws: they ignore historical evaluations and lack visibility into how a step affects future reasoning.
GPRM addresses these via:
- History-Aware Evaluation: Explicitly conditions on previous steps and their associated judgments.
- Future-Informed Reasoning: Incorporates a look-ahead perspective to validate steps against subsequent derivations.
- 4-D Diagnostic Framework: Structured evaluation across Look-back (consistency), Look-ahead (plausibility), Self-check (validity), and Goal alignment.
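As a sketch of how a global-perspective input could be assembled from these ingredients (the bracketed field labels follow the format noted later in this card; the exact delimiters and the function itself are illustrative assumptions, not the official template):

```python
def build_gprm_input(question, history, current_step, future_context=None):
    """Assemble a history-aware, future-informed evaluation prompt.

    history: list of (step_text, judgment) pairs from earlier evaluations,
    so the model can condition on its own prior judgments (Look-back).
    future_context: optional later derivations for Look-ahead validation.
    """
    parts = [f"[Question] {question}"]
    for i, (step, judgment) in enumerate(history, start=1):
        parts.append(f"[History] Step_{i}: {step} | Judgment_{i}: {judgment}")
    parts.append(f"[Current Step] {current_step}")
    if future_context:
        parts.append(f"[Future Context] {future_context}")
    return "\n".join(parts)
```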
## Benchmarks
### PRMBench (Overall Score)
| Model | Simplicity | Soundness | Sensitivity | Overall |
|---|---|---|---|---|
| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
| o1-mini | 64.6 | 72.1 | 75.5 | 68.8 |
| Gemini-2.0-flash-exp | 58.1 | 66.0 | 75.4 | 66.9 |
| Qwen2.5-Math-PRM-7B | 52.1 | 71.0 | 75.5 | 65.5 |
| R-PRM-7B-DPO | 55.2 | 71.2 | 76.6 | 66.8 |
| GenPRM-7B | 56.1 | 71.8 | 77.0 | 67.4 |
| Skywork-PRM-7B | 59.6 | 68.5 | 73.3 | 65.1 |
| GPRM-4B-SFT | 65.0 | 75.2 | 78.8 | 72.9 |
| GPRM-4B-GRPO | 65.8 | 76.2 | 79.3 | 73.9 |
| GPRM-14B-GRPO | 67.2 | 77.6 | 80.2 | 74.6 |
### ProcessBench (Avg. F1 Score)
| Model | GSM8K | MATH | OlympiadBench | OmniMath | Avg. F1 |
|---|---|---|---|---|---|
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
| Qwen2.5-Math-PRM-7B | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
| R-PRM-7B-DPO | 80.7 | 76.9 | 63.8 | 60.1 | 70.4 |
| GenPRM-7B | 73.7 | 77.9 | 71.8 | 73.8 | 74.1 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| GPRM-4B-SFT | 73.1 | 76.2 | 69.4 | 70.5 | 72.3 |
| GPRM-4B-GRPO | 73.1 | 77.5 | 71.5 | 75.1 | 74.3 |
| GPRM-14B-GRPO | 74.7 | 79.3 | 73.9 | 75.3 | 75.8 |
### Agent Error Bench (Accuracy %)
| Model | ALFWorld (S/S+M) | WebShop (S/S+M) | GAIA (S/S+M) | Average (S/S+M) |
|---|---|---|---|---|
| Direct Prompting (GPT-4.1) | 28.0 / 14.0 | 30.0 / 6.0 | 26.0 / 10.0 | 28.0 / 10.0 |
| AgentDebug | 35.0 / 28.0 | 42.0 / 22.0 | 58.0 / 44.0 | 45.0 / 31.3 |
| GPRM-4B | 38.0 / 30.0 | 44.0 / 24.0 | 60.0 / 46.0 | 47.0 / 33.0 |
| GPRM-14B | 46.0 / 37.0 | 51.0 / 29.0 | 67.0 / 51.0 | 54.0 / 39.0 |
### PPE - Verifiable Correctness Subset (Mean Score)
| Reward Model | MMLU-Pro | MATH | GPQA | MBPP-Plus | IFEval | Mean |
|---|---|---|---|---|---|---|
| Claude 3.5 (ArenaHard) | 0.81 | 0.86 | 0.63 | 0.54 | 0.58 | 0.68 |
| Athene-RM-70B | 0.77 | 0.79 | 0.59 | 0.68 | 0.62 | 0.69 |
| GPT-4o-mini (ArenaHard) | 0.71 | 0.81 | 0.57 | 0.54 | 0.56 | 0.63 |
| Llama-3.1-70B (ArenaHard) | 0.73 | 0.73 | 0.56 | 0.58 | 0.56 | 0.63 |
| GPRM-4B | 0.67 | 0.71 | 0.58 | 0.65 | 0.59 | 0.64 |
## Downstream Test-Time Search (Base: Qwen2.5-7B-Instruct)
### Best-of-8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|---|---|---|---|---|---|---|---|
| Reference: pass@1 | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| Reference: maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
| GPRM-4B | 20.0 | 63.0 | 82.6 | 48.5 | 40.5 | 45.0 | 50.1 |
| GPRM-14B | 20.0 | 64.2 | 83.1 | 50.3 | 42.6 | 45.8 | 51.0 |
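Best-of-8 reranking samples eight complete solutions from the base model, scores each with the PRM, and keeps the top-scoring candidate. A minimal sketch, with one assumption flagged: the min-over-steps aggregation below is a common convention for combining per-step PRM scores, not necessarily the exact aggregation used in these experiments.

```python
def solution_score(step_scores):
    """Aggregate per-step PRM scores into one solution-level score.
    min() is a common choice (assumed here): a reasoning chain is only
    as sound as its weakest step."""
    return min(step_scores)

def best_of_n(candidates):
    """candidates: list of (solution_text, per_step_prm_scores) tuples.
    Return the solution whose aggregated PRM score is highest."""
    return max(candidates, key=lambda c: solution_score(c[1]))[0]
```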
### Greedy Guided Search@8 (Accuracy %)
| PRM Guide | AIME24 | AMC23 | MATH | OlympiadBench | College Math | Minerva MATH | Avg. |
|---|---|---|---|---|---|---|---|
| R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
| GPRM-4B | 23.3 | 85.0 | 80.0 | 48.0 | 45.0 | 48.8 | 55.0 |
| GPRM-14B | 23.3 | 87.5 | 85.0 | 45.0 | 39.5 | 50.0 | 55.0 |
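Greedy guided search applies the PRM during decoding rather than after it: at each depth the base model proposes candidate next steps, the PRM scores each in context, and only the highest-scoring step is kept. A sketch under assumed interfaces (`propose_steps` and `score_step` are hypothetical callables standing in for the base model and the PRM):

```python
def greedy_guided_search(question, propose_steps, score_step,
                         max_depth=16, width=8):
    """Build a solution one step at a time, keeping the single best
    PRM-scored candidate at each depth (greedy, i.e. beam width 1).

    propose_steps(question, partial) -> list of candidate next steps
    score_step(question, partial, step) -> scalar PRM score
    """
    partial = []
    for _ in range(max_depth):
        candidates = propose_steps(question, partial)[:width]
        if not candidates:  # proposer signals the chain is complete
            break
        best = max(candidates,
                   key=lambda s: score_step(question, partial, s))
        partial.append(best)
    return partial
```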
## Training Strategy
GPRM is trained with a two-stage progressive pipeline:
- Stage I (Structured SFT): learns 4-dimensional diagnostic reasoning via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency), with Qwen3-235B-Instruct serving as the annotation teacher.
- Stage II (GRPO Optimization): refines the evaluation policy under the complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
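GRPO replaces a learned value baseline with group statistics: each sampled response's reward is normalized against the mean and standard deviation of its own group. A minimal sketch of that advantage computation (the reward design and group size used in Stage II are not specified here):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one group of sampled responses:
    A_i = (r_i - mean(group)) / (std(group) + eps).
    Responses scoring above the group mean get positive advantage."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```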
## Serve GPRM Locally
The following open-source frameworks support local deployment of GPRM-4B:
- vLLM (v0.19.0+): see recipes
- SGLang (v0.5.10+): see cookbook
- Transformers (v4.45.0+): see docs
Quick start with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GPRM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="bfloat16")

# Input format: [Question] + [History: Step_i | Judgment_i] + [Current Step] + [Future Context (optional)]
prompt = "..."  # assemble per the input format above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```