Probablistic Latent Semantic Indexing (1999)

#Latent Semantic Analysis

December 2, 2023

Latent Semantic Analysis approximates an original term-document matrix by singular value decomposition (SVD). The components consist of the frequencies with which each term occurred in each document. The Latent Semantic Analysis thresholds all but the largest $K$ singular values to zero. The left singular vectors are a $t \times K$ matrix, and the transpose of the right singular vectors are a $K\times d$ matrix where $t$ is the number of the terms and $d$ is the number of the documents.

Probabilistic Latent Semantic Indexing (PLSI) is a generative model that associates each term and document with an unobserved class variable $z \in \mathcal{Z} = \{z_1,\dots , z_K\}$. The likelihoods of the documents and terms are estimated by the Expectation Maximization (EM) algorithm, and the joint probability model $\textbf{P}$ can be written as $\textbf{P}=\hat{\textbf{U}}\hat{\Sigma}\hat{\textbf{V}}^t$ where $\hat{\textbf{U}}=(P(d_i|z_k))_{i, k}$, $\hat{\textbf{V}}=(P(w_j|z_k))_{j, k}$, $\hat{\Sigma}=\text{diag} (P(z_k))_k$. Compared to the Latent Semantic Analysis, the components of $\textbf{P}$ have a clear probabilistic meaning in terms of mixture component distributions.

The EM algorithm seeks to find posterior probabilities $P(z|d, w)$ that maximize the log-likelihood function $\mathcal{L}$: $$ \mathcal{L}=\sum_{d\in \mathcal{D}}\sum_{w\in \mathcal{W}}n(d, w)\log P(d, w) $$ where $n(d, w)$ denotes the term frequency in $d$, and $P(d, w)=\sum_{z\in \mathcal{Z}}P(z)P(w|z)P(d|z)$.

E step computes posterior probabilities $P(z|d, w)$ for the latent variables. $$ P(z|d, w)\frac{P(z)P(d|z)P(w|z)}{\sum_{z’}P(z’)P(d|z’)P(w|z’)} $$

In the M step, parameters are updated for the given posterior probabilities computed in the previous E step. $$ \begin{align*} P(w|z)&=\frac{\sum_dn(d,w)P(z|d, w)}{\sum_{d, w’}n(d, w’)P(z|d, w’)’}\\ P(d|z)&=\frac{\sum_wn(d, w)P(z|d, w)}{\sum_{d’, w}n(d’, w)P(z|d’, w)’}\\ P(z)&=\frac{1}{R}\sum_{d, w}n(d, w)P(z|d, w), R\equiv\sum_{d, w}n(d, w) \end{align*} $$