Chapter 08

Reinforcement Learning

Learn decision-making policies by maximizing reward.

Reinforcement learning trains agents to make sequences of decisions by interacting with an environment and maximizing cumulative reward. The families below range from value-based and policy-gradient methods to bandits for the explore-vs-exploit tradeoff.
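"Cumulative reward" is usually made concrete as a discounted return, where later rewards count for less than earlier ones. The snippet below is a minimal sketch of that calculation; the function name, the reward list, and the discount factor `gamma` are illustrative assumptions, not something this chapter defines.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards collected over one short episode.
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))
```

With `gamma=1.0` this reduces to a plain sum of rewards; smaller values make the agent favor rewards that arrive sooner.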

  • Use bandits for online decision optimization.
  • Use PPO as a default baseline for practical deep RL tasks, and SAC or TD3 for continuous control.
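To illustrate the bandit recommendation, here is a minimal epsilon-greedy sketch. It is one simple bandit strategy among several (UCB and Thompson sampling are common alternatives), and everything specific in it is an assumption for illustration: the Bernoulli reward model, the arm probabilities, the step count, and the function name.

```python
import random


def run_epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms with hypothetical success rates."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        # Simulated feedback: reward 1.0 with the arm's assumed success rate.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = run_epsilon_greedy([0.2, 0.5, 0.8])
most_pulled = max(range(len(counts)), key=lambda a: counts[a])
print(most_pulled, counts)
```

In an online setting such as ad selection, the simulated reward line would be replaced by real feedback (for example, a click), while the selection and update steps stay the same.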
# | Algorithm                          | Best for                    | Common fields
--+------------------------------------+-----------------------------+------------------------------------------------
1 | Q-Learning / Deep Q-Networks       | Discrete action problems    | Games, robotics simulation, optimization
2 | Policy Gradient Methods            | Direct policy optimization  | Robotics, control, games
3 | Actor-Critic Methods               | Stable deep RL              | Robotics, autonomous systems
4 | PPO: Proximal Policy Optimization  | Practical deep RL baseline  | Games, robotics, RLHF-style optimization
5 | A3C / A2C                          | Parallel RL training        | Simulation, games
6 | DDPG / TD3 / SAC                   | Continuous control          | Robotics, autonomous driving, industrial control
7 | Multi-Armed Bandits                | Exploration vs exploitation | Ads, recommendations, pricing, clinical trials
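As a concrete instance of the first entry in the table, the sketch below runs tabular Q-learning on a tiny discrete-action problem. The five-state chain environment, the hyperparameters, and all function names are hypothetical choices made for illustration; Deep Q-Networks replace the table with a neural network but use the same update target.

```python
import random

N_STATES = 5       # states 0..4; reaching state 4 ends the episode (toy environment)
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain: reward 1.0 only when the last state is reached."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)   # explore
            else:
                # Exploit, breaking ties at random so early episodes still move.
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next-state value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train_q_table()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy should choose "move right" in every non-terminal state, since that is the shortest path to the only reward in this toy environment.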