GRPO Algorithm
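GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that removes the need for a separate critic model, which saves the memory and compute a learned value function would require.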
Core idea
For each question $q$, GRPO samples a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ from the current policy $\pi_{\theta_{\text{old}}}$.
Each output is scored by reward functions, producing rewards $\{r_i\}_{i=1}^{G}$.
The advantage for each sample is computed by normalizing rewards within the group:
$$A_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}.$$
This relative normalization is the key: each sample is judged against peers from the same prompt, not by an external value baseline.
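The group normalization can be illustrated with a minimal sketch in plain Python. The function name `group_advantages` is hypothetical, and two details are assumptions not stated in the text: the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards does not divide by zero.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r_i - mean) / std.

    Assumptions for this sketch: population standard deviation,
    plus a small eps in the denominator for numerical safety.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 sampled outputs, binary rewards (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)
```

Outputs that score above the group mean get positive advantages and those below get negative ones. If every output in a group receives the same reward, all advantages are zero and that group contributes no learning signal.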
Objective
The policy is updated with:
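$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim D,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( r_i(\theta) A_i,\ \operatorname{clip}\left(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_i \right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right],$$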
where
$$r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}.$$
$\epsilon$ is the clipping coefficient, and $\beta$ controls KL regularization toward a reference policy $\pi_{\text{ref}}$.
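To make the clipped term concrete, here is a minimal sketch of the objective for a single group, written in plain Python. It assumes sequence-level log-probabilities are already available for each output under the new and old policies, and that the KL term has been estimated elsewhere and is passed in as a scalar `kl`. The function name `grpo_objective` and the default values of `epsilon` and `beta` are illustrative choices, not values taken from the text. Note that the probability ratio $r_i(\theta)$ is a different quantity from the reward $r_i$, despite the shared letter.

```python
import math


def grpo_objective(logp_new, logp_old, advantages, kl, epsilon=0.2, beta=0.04):
    """Clipped surrogate averaged over one group, minus the KL penalty.

    logp_new / logp_old: log pi(o_i | q) under the new and old policy.
    kl: a precomputed KL estimate (assumption: supplied by the caller).
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio r_i(theta)
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms) - beta * kl


# When the new policy equals the old one, every ratio is 1 and the
# surrogate reduces to the mean advantage.
value = grpo_objective([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0], kl=0.0)
print(value)
```

The `min` makes the clipping one-sided: with a positive advantage, the gain from raising the ratio is capped at $1+\epsilon$, while with a negative advantage a ratio above $1+\epsilon$ is not clipped, so the full penalty remains. In practice this quantity would be computed with an autodiff framework and maximized with respect to $\theta$.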