Policy Gradient Methods

  • REINFORCE
  • Vanilla PG

Best for: Direct policy optimization

Aliases: REINFORCE, Vanilla PG

How it works

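$$\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\Bigl[\sum_t\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\Bigr]$$

Parameterises the policy $\pi_\theta(a\mid s)$ directly and increases the expected return $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\bigl[\sum_t\gamma^t r_t\bigr]$ by stochastic gradient ascent on $\theta$. The score-function (likelihood-ratio) trick gives the REINFORCE gradient $\nabla_\theta J=\mathbb{E}_{\tau\sim\pi_\theta}\bigl[\sum_t\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\bigr]$, where $G_t=\sum_{k\ge 0}\gamma^k r_{t+k}$ is the Monte Carlo return from step $t$ and the expectation is over trajectories $\tau$ sampled from $\pi_\theta$. (Strictly, the gradient of the discounted objective carries an extra factor $\gamma^t$ on each term; it is commonly dropped in practice.) Subtracting a state-dependent baseline $b(s_t)$, typically a learned value estimate $V_w(s_t)$, turns each term into $\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,(G_t-b(s_t))$. Because $\mathbb{E}_{a\sim\pi_\theta}\bigl[\nabla_\theta\log\pi_\theta(a\mid s)\bigr]=0$ for every state, the baseline leaves the estimate unbiased while often reducing its variance substantially.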
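As a rough illustration of the estimator described above, the following sketch computes a single-episode REINFORCE gradient for a tabular softmax policy. The tabular parameterisation and the names `discounted_returns`, `softmax`, and `reinforce_gradient` are assumptions made for this example only; a real implementation would usually use a neural network and automatic differentiation.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k>=0} gamma^k r_{t+k} for every step t of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_gradient(theta, states, actions, rewards, gamma, baseline=None):
    """Single-episode estimate of sum_t grad log pi(a_t|s_t) * (G_t - b(s_t)).

    theta:    float array [n_states, n_actions] of softmax logits
              (tabular policy, an assumption of this sketch).
    baseline: optional array [n_states] holding b(s); None means b = 0.
    """
    grad = np.zeros_like(theta)
    returns = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, returns):
        probs = softmax(theta[s])
        advantage = g - (baseline[s] if baseline is not None else 0.0)
        # For softmax logits, grad log pi(a|s) w.r.t. theta[s] is one_hot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        grad[s] += grad_log_pi * advantage
    return grad

# Hypothetical two-step episode, followed by one gradient-ascent update.
theta = np.zeros((2, 2))
grad = reinforce_gradient(theta, states=[0, 1], actions=[1, 0],
                          rewards=[1.0, 2.0], gamma=0.5)
theta = theta + 0.1 * grad  # step size 0.1 is arbitrary
```

The update adds the gradient because the goal is to increase $J(\theta)$. Averaging `reinforce_gradient` over many sampled episodes gives a less noisy estimate of the same expectation.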

When to use

When the action space is continuous or the desired policy is stochastic, and you want to optimize expected return directly rather than through a learned value function.

Watch out

High variance; needs baselines and many rollouts; slow and unstable without variance reduction.
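To make the variance issue concrete, here is a small sketch on a two-armed bandit with a uniform softmax policy. The reward values and sample count are made-up numbers chosen for illustration, and `grad_samples` is a hypothetical helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.5])        # uniform softmax policy over two actions
rewards = np.array([10.0, 11.0])    # deterministic reward per action (made up)
value = probs @ rewards             # expected reward, used as the baseline b

def grad_samples(baseline, n=10_000):
    """Per-rollout estimates of dJ/d(logit of action 0)."""
    actions = rng.choice(2, size=n, p=probs)
    # For a softmax policy, d log pi(a) / d logit_0 = 1[a == 0] - pi(0).
    score = (actions == 0).astype(float) - probs[0]
    return score * (rewards[actions] - baseline)

plain = grad_samples(baseline=0.0)
with_baseline = grad_samples(baseline=value)

print("no baseline:   mean", plain.mean(), "variance", plain.var())
print("with baseline: mean", with_baseline.mean(), "variance", with_baseline.var())
```

Both estimators target the same true gradient, which is $-0.25$ for these numbers, but the per-sample spread differs sharply. In this symmetric toy setup the baseline happens to make every sample equal to the true gradient; in general a baseline only reduces the variance rather than removing it.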

Common fields

Robotics · control · games