Transformer Forecasting Models
Best for: Large-scale multivariate forecasting
How it works
Forecasts with stacks of self-attention layers in which each query attends over all past tokens via $\text{Attn}(Q,K,V)=\text{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V$, so every past step receives a learned weight regardless of its distance from the current step. Causal masking keeps each position from attending to future steps, and positional encodings inject order information because the attention operator itself is permutation-equivariant. Variants such as Informer, Autoformer, and PatchTST add sparse attention, series decomposition, and patching, respectively, to scale to very long multivariate series.
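The mechanism described above can be sketched in a few lines of plain Python. This is an illustrative toy, not how any of the named models is implemented: the function names `causal_attention` and `sinusoidal_positions` are hypothetical, the learned projections that would normally produce $Q$, $K$, and $V$ are omitted (the embedded inputs are used for all three), and the sinusoidal encoding is just one common choice of positional encoding.

```python
import math


def sinusoidal_positions(T, d_model):
    """Fixed sinusoidal positional encodings of shape (T, d_model)."""
    pe = []
    for pos in range(T):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def causal_attention(Q, K, V):
    """Scaled dot-product attention where step t sees only steps 0..t."""
    T, d_k, d_v = len(Q), len(Q[0]), len(V[0])
    outputs, weights = [], []
    for t in range(T):
        # Causal mask: score only against the current and earlier steps.
        scores = [
            sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible steps.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Future steps get weight exactly zero.
        w = [e / total for e in exps] + [0.0] * (T - t - 1)
        weights.append(w)
        outputs.append([sum(w[s] * V[s][j] for s in range(T)) for j in range(d_v)])
    return outputs, weights


# Toy univariate series, broadcast across d_model dimensions (an assumption
# standing in for a learned input embedding) plus positional encodings.
series = [0.1, 0.4, 0.3, 0.8, 0.6]
d_model = 4
pe = sinusoidal_positions(len(series), d_model)
X = [[x + p for p in row] for x, row in zip(series, pe)]

# Q = K = V = X here; real models apply learned linear projections first.
out, w = causal_attention(X, X, X)
```

Each row of `w` holds the attention weights for one time step: entries for later steps are zero because of the mask, and the remaining entries sum to one. Without the positional encodings, permuting the visible past steps would leave each output row unchanged, which is why the order information has to be injected explicitly.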
Common fields
Retail · logistics · cloud metrics