Bandit Algorithms

Best for: Online recommendation optimization

How it works

$$a_t=\arg\max_a\left[\hat{\mu}_a+\sqrt{\tfrac{2\ln t}{N_t(a)}}\right]$$

Frames recommendation as sequential decision-making that must balance exploiting the best-seen arm against exploring others, minimising cumulative regret over rounds. UCB chooses $a_t=\arg\max_a[\hat{\mu}_a+c\sqrt{\ln t/N_t(a)}]$, where the exploration bonus shrinks as the arm is sampled; Thompson sampling instead draws $\theta_a\sim\mathrm{Beta}(\alpha_a,\beta_a)$ per arm and plays $\arg\max_a\theta_a$, updating parameters from observed rewards. Both achieve $O(\log T)$ regret on suitable bandits and power online ad and content personalisation.

Common fields

Ads · personalization · experimentation