Actor-Critic Methods
- A2C
- A3C
- SAC
- TD3
Best for: Stable deep RL Aliases: A2C, A3C, SAC, TD3
How it works
$$A_t^{\text{GAE}}=\sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l},\quad \delta_t=r_t+\gamma V_w(s_{t+1})-V_w(s_t)$$Combines a policy network (the actor $\pi_\theta$) with a value network (the critic $V_w$); the critic turns the noisy return $G_t$ into a low-variance advantage $A_t=r_t+\gamma V_w(s_{t+1})-V_w(s_t)$, the one-step TD error $\delta_t$. The actor follows $\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A_t$ while the critic regresses toward $r_t+\gamma V_w(s_{t+1})$. Generalised Advantage Estimation (GAE) blends Monte Carlo and TD($\lambda$) via $A_t^{\text{GAE}}=\sum_l(\gamma\lambda)^l\delta_{t+l}$; SAC adds an entropy bonus $-\alpha\log\pi_\theta$ to encourage exploration.
When to use
Modern deep RL where you want stable learning combining value and policy networks.
Watch out
Twin-critic and entropy tuning matter (SAC); sensitive to hyperparameters and reward shaping; sample efficiency varies.
Common fields
Robotics · autonomous systems