
GRPO Algorithm

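GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that removes the need for a separate critic (value) model. Instead of learning a value baseline, it derives the baseline from a group of outputs sampled for the same prompt.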

Core idea

For each question $q$, GRPO samples a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ from the current policy $\pi_{\theta_{\text{old}}}$.

Each output is scored by reward functions, producing rewards $\{r_i\}_{i=1}^{G}$.

The advantage for each sample is computed by normalizing rewards within the group:

$$A_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}$$

This relative normalization is the key: each sample is judged against peers from the same prompt, not by an external value baseline.
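The group normalization above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `group_advantages`, the use of the population standard deviation, and the small `eps` added to the denominator (to avoid division by zero when every reward in the group is identical) are all assumptions that the text does not specify.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Return A_i = (r_i - mean) / (std + eps) for one group of rewards.

    Assumptions (not stated in the text): population standard deviation,
    and a small eps so a group with identical rewards yields zeros.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

With these rewards the mean and standard deviation are both 0.5, so the two correct outputs receive advantages close to +1 and the two wrong ones close to -1. If every output in a group gets the same reward, all advantages are zero and the group contributes no learning signal.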

Objective

The policy is updated with:

$$J_{\text{GRPO}}(\theta)=\mathbb{E}_{q\sim D,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(r_i(\theta)A_i,\operatorname{clip}\!\left(r_i(\theta),1-\epsilon,1+\epsilon\right)A_i\right)- \beta D_{\mathrm{KL}}\!\left(\pi_{\theta}\|\pi_{\text{ref}}\right)\right]$$

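where $r_i(\theta)$ is the probability ratio between the updated and old policies (not to be confused with the scalar reward $r_i$ of output $o_i$):

$$r_i(\theta)=\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)}$$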

Here $\epsilon$ is the clipping coefficient, and $\beta$ controls KL regularization toward a reference policy $\pi_{\text{ref}}$.
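To make the clipping behavior concrete, here is a rough sketch that evaluates the bracketed term of the objective for a single group. Several things are assumptions for illustration only: the function name `grpo_objective`, the use of sequence-level log-probabilities as inputs, the default values `clip_eps=0.2` and `beta=0.04`, and passing the KL term in as a precomputed scalar `kl` (the text does not say how $D_{\mathrm{KL}}$ is estimated).

```python
import math


def grpo_objective(logp_new, logp_old, advantages, kl, clip_eps=0.2, beta=0.04):
    """Clipped surrogate averaged over one group, minus a KL penalty.

    logp_new / logp_old: log pi_theta(o_i | q) and log pi_theta_old(o_i | q).
    kl: a precomputed estimate of D_KL(pi_theta || pi_ref) (assumed input).
    """
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # r_i(theta)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the min keeps the more pessimistic of the two estimates.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms) - beta * kl


# A positive-advantage sample whose ratio exceeds 1 + eps is capped at 1.2.
capped = grpo_objective([0.5], [0.0], [1.0], kl=0.0)
# A negative-advantage sample with the same ratio is not capped.
uncapped = grpo_objective([0.5], [0.0], [-1.0], kl=0.0)
```

The asymmetry is the point of the `min`: once a good sample's ratio passes $1+\epsilon$ the objective stops rewarding further increases, while a bad sample whose probability has grown is still penalized in full. In training, this value would be maximized (or its negative minimized) with an autodiff framework rather than computed on plain floats.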