Recurrent Neural Networks / LSTM / GRU

LSTM
GRU
Seq2Seq

Best for: Sequential data, older NLP/time-series systems Aliases: LSTM, GRU, Seq2Seq

How it works

$$h_t=\tanh(W_h h_{t-1}+W_x x_t+b)$$

A vanilla RNN folds the sequence into a hidden state updated at each step, $h_t=\tanh(W_h h_{t-1}+W_x x_t+b)$, so $h_t$ summarises all past inputs and is read out by a decoder. This recurrence suffers from vanishing/exploding gradients on long sequences, which LSTMs and GRUs fix with learned gates: an LSTM keeps a cell state $c_t=f_t\odot c_{t-1}+i_t\odot \tilde c_t$ with $\tilde c_t=\tanh(W_c h_{t-1}+U_c x_t+b_c)$ and forget/input/output gates $f_t,i_t,o_t\in(0,1)$, so gradients flow through the additive cell path. All variants train by backpropagation-through-time.

When to use

Low-latency sequential modeling where memory/state is needed and a Transformer isn’t justified.

Watch out

Vanishing/exploding gradients on long sequences; sequential computation limits parallelism; largely superseded by Transformers.

Common fields

Speech · time series · sensor data · finance