Probablistic Latent Semantic Indexing (1999)
December 2, 2023Latent Semantic Analysis approximates an original term-document matrix by singular value decomposition (SVD). The components consist of the frequencies with which each term occurred in each document. The Latent Semantic Analysis thresholds all but the largest \(K\) singular values to zero. The left singular vectors are a \(t \times K\) matrix, and the transpose of the right singular vectors are a \(K\times d\) matrix where \(t\) is the number of the terms and \(d\) is the number of the documents.
Probabilistic Latent Semantic Indexing (PLSI) is a generative model that associates each term and document with an unobserved class variable \(z \in \mathcal{Z} = \{z_1,\dots , z_K\}\). The likelihoods of the documents and terms are estimated by the Expectation Maximization (EM) algorithm, and the joint probability model \(\textbf{P}\) can be written as \(\textbf{P}=\hat{\textbf{U}}\hat{\Sigma}\hat{\textbf{V}}^t\) where \(\hat{\textbf{U}}=(P(d_i|z_k))_{i, k}\), \(\hat{\textbf{V}}=(P(w_j|z_k))_{j, k}\), \(\hat{\Sigma}=\text{diag} (P(z_k))_k\). Compared to the Latent Semantic Analysis, the components of \(\textbf{P}\) have a clear probabilistic meaning in terms of mixture component distributions.
The EM algorithm seeks to find posterior probabilities \(P(z|d, w)\) that maximize the log-likelihood function \(\mathcal{L}\): $$ \mathcal{L}=\sum_{d\in \mathcal{D}}\sum_{w\in \mathcal{W}}n(d, w)\log P(d, w) $$ where \(n(d, w)\) denotes the term frequency in \(d\), and \(P(d, w)=\sum_{z\in \mathcal{Z}}P(z)P(w|z)P(d|z)\).
E step computes posterior probabilities \(P(z|d, w)\) for the latent variables. $$ P(z|d, w)\frac{P(z)P(d|z)P(w|z)}{\sum_{z’}P(z’)P(d|z’)P(w|z’)} $$
In the M step, parameters are updated for the given posterior probabilities computed in the previous E step. $$ \begin{align*} P(w|z)&=\frac{\sum_dn(d,w)P(z|d, w)}{\sum_{d, w’}n(d, w’)P(z|d, w’)’}\\ P(d|z)&=\frac{\sum_wn(d, w)P(z|d, w)}{\sum_{d’, w}n(d’, w)P(z|d’, w)’}\\ P(z)&=\frac{1}{R}\sum_{d, w}n(d, w)P(z|d, w), R\equiv\sum_{d, w}n(d, w) \end{align*} $$