Transformer Forecasting Models

Best for: Large-scale multivariate forecasting

How it works

$$\text{Attn}(Q,K,V)=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Forecasts with stacks of self-attention layers. Each query attends over the past tokens using the scaled dot-product attention above, so every past step receives a data-dependent weight regardless of its distance from the current step. Causal masking blocks attention to future steps, so a forecast never sees the values it is meant to predict. Positional encodings supply the time order, since the attention operator itself is permutation-equivariant. Variants such as Informer, Autoformer, and PatchTST add sparse attention, series decomposition, and patching, respectively, to scale to very long multivariate series.
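As a rough illustration of the mechanism, here is a minimal single-head sketch in NumPy. It is not taken from any of the models named here. The function names `sinusoidal_positions` and `causal_self_attention`, the random projection matrices, and the toy sizes are all assumptions made for the example, and sinusoidal encodings are only one common choice of positional encoding.

```python
import numpy as np

def sinusoidal_positions(T, d_model):
    """Sinusoidal positional encodings (one common choice; assumes even d_model)."""
    pos = np.arange(T)[:, None]               # (T, 1) time indices
    i = np.arange(d_model // 2)[None, :]      # (1, d_model / 2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal attention: softmax(Q K^T / sqrt(d_k)) V with future steps masked."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # (T, T) similarity of each query to each key
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # step t may not see steps > t
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy data: 6 time steps, 4 features per step (hypothetical sizes).
rng = np.random.default_rng(0)
T, d_model, d_k = 6, 4, 8
x = rng.normal(size=(T, d_model)) + sinusoidal_positions(T, d_model)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and is zero to the right of the diagonal, which is the causal mask at work. Adding the positional encodings to `x` before the projections is what lets the layer tell time steps apart.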

Common fields

Retail · logistics · cloud metrics