GRPO Algorithm

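GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that removes the need for a separate critic model, which saves the memory and compute that training a value network would otherwise require.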

Core idea

For each question $q$, GRPO samples a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ from the old policy $\pi_{\theta_{\text{old}}}$, the snapshot of the policy taken before the current update.

Each output is scored by reward functions, producing rewards $\{r_i\}_{i=1}^{G}$.

The advantage for each sample is computed by normalizing rewards within the group:

$A_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}$

This relative normalization is the key: each sample is judged against peers from the same prompt, not by an external value baseline.
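The group normalization above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `group_advantages` is hypothetical, the small `eps` in the denominator is an added assumption to avoid division by zero when every reward in the group is identical, and the population standard deviation (`ddof=0`) is assumed because the formula does not say which estimator is meant.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group (all outputs for the same prompt).

    `eps` is an assumption added for numerical safety; it is not part of
    the formula in the text.
    """
    r = np.asarray(rewards, dtype=np.float64)
    # Subtract the group mean and divide by the group standard deviation
    # (population std, ddof=0, assumed here).
    return (r - r.mean()) / (r.std() + eps)

# Example: four outputs for one prompt, two rewarded and two not.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Outputs that score above the group mean receive a positive advantage and those below receive a negative one, so the signal depends only on how a sample compares with its peers for the same prompt.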

Objective

The policy is updated with:

$J_{\text{GRPO}}(\theta)=\mathbb{E}_{q\sim D,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(r_i(\theta)A_i,\operatorname{clip}\!\left(r_i(\theta),1-\epsilon,1+\epsilon\right)A_i\right)- \beta D_{\mathrm{KL}}\!\left(\pi_{\theta}\|\pi_{\text{ref}}\right)\right]$

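where

$r_i(\theta)=\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)}$

is the probability ratio between the updated and old policies for output $o_i$ (not to be confused with the scalar reward $r_i$ used in the advantage), $\epsilon$ is the clipping coefficient, and $\beta$ controls the strength of KL regularization toward a reference policy $\pi_{\text{ref}}$.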
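To make the objective concrete, the sketch below evaluates it for a single group with NumPy. It assumes sequence-level log-probabilities $\log\pi(o_i\mid q)$ are already available for both policies and that a KL estimate is supplied by the caller, since the text does not specify how the KL term is computed. The function name `grpo_objective` and the default values of `clip_eps` and `beta` are illustrative assumptions, not values taken from the source.

```python
import numpy as np

def grpo_objective(logp_new, logp_old, advantages, kl, clip_eps=0.2, beta=0.04):
    """Clipped surrogate objective for one group, minus a KL penalty.

    logp_new, logp_old: log-probabilities of each output under the updated
        and old policies (sequence-level, assumed precomputed).
    kl: scalar estimate of D_KL(pi_theta || pi_ref), supplied by the caller.
    clip_eps, beta: illustrative defaults, not prescribed by the text.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    adv = np.asarray(advantages, dtype=np.float64)

    # Probability ratio r_i(theta), computed in log space for stability.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Element-wise minimum of the unclipped and clipped terms,
    # averaged over the G outputs in the group.
    surrogate = np.minimum(ratio * adv, clipped * adv).mean()
    return surrogate - beta * kl

# Example: the ratio is 2 for a single output with positive advantage,
# so the clipped term (1.2 * A) is the one that counts.
value = grpo_objective([np.log(2.0)], [0.0], [1.0], kl=0.0)
print(value)
```

The minimum makes clipping one-sided: with a positive advantage the gain from raising the ratio is capped at $1+\epsilon$, whereas with a negative advantage and a ratio above $1+\epsilon$ the unclipped, more negative term is kept, so the penalty is not softened.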