TF-IDF + Linear Models

  • Bag-of-Words SVM
  • TF-IDF Logistic

Best for: Fast classical NLP baseline Aliases: Bag-of-Words SVM, TF-IDF Logistic

How it works

$$\text{tfidf}(t,d)=\text{tf}(t,d)\cdot\log\frac{N}{\text{df}(t)}$$

Represents each document as a sparse bag-of-words vector weighted by TF-IDF, $\text{tfidf}(t,d)=\text{tf}(t,d)\cdot\log\frac{N}{\text{df}(t)}$, which up-weights terms frequent in a document but rare across the corpus of size $N$. A linear classifier — Logistic Regression $\sigma(\beta^\top x)$ or a linear SVM — is then trained on these vectors, giving $O(d)$ prediction and interpretable per-term coefficients. Despite ignoring word order it remains a strong, cheap, explainable baseline for text classification.

When to use

Cheap, fast, explainable text classification when data or compute is limited.

Watch out

Ignores word order and context; loses to transformers on semantic tasks; OOV handling matters.

Common fields

Spam · legal search · support-ticket routing