Vision Transformers
- ViT
- Swin
- CLIP
Best for: Modern image classification and multimodal models Aliases: ViT, Swin, CLIP
How it works
$$\text{softmax}(QK^\top/\sqrt{d_k})\,V$$Splits the image into fixed patches, linearly embeds each patch, prepends a [CLS] token, and adds positional embeddings. Patches then flow through a transformer encoder where scaled dot-product self-attention $\text{softmax}(QK^\top/\sqrt{d_k})\,V$ lets every patch attend to every other patch, mixing global information per layer. The [CLS] output is classified; CLIP trains the same backbone contrastively against text. Lacking locality bias, ViTs need large-scale pretraining to outperform CNNs.
When to use
Large-scale image classification and multimodal vision-language tasks with abundant data and compute.
Watch out
Weak inductive bias — underperforms CNNs without large pretraining; patch size and data hunger are key.
Common fields
Research · large-scale vision · document AI