Wide and Deep Learning for Recommender Systems (2016)
December 16, 2023

Generalized linear models are widely used for large-scale regression and classification problems with sparse inputs because they are simple, scalable, and interpretable. One limitation of interaction or cross-product transformations in generalized linear models is that they do not generalize to query-item feature pairs that have not appeared in the training data. Compared with generalized linear models, deep neural networks can improve the diversity of the recommendations. However, it is difficult to learn effective low-dimensional dense embedding vectors when the underlying query-item interactions are sparse. Wide & Deep Learning for Recommender Systems jointly trains a generalized linear model and a feed-forward neural network (FFN) to combine their benefits.
A generalized linear model is in the form \(y=\textbf{w}^T\textbf{x}+b\) where \(y\) is the prediction, \(\textbf{x}=[x_1,x_2,\dots,x_d]\) is a vector of \(d\) features, \(\textbf{w}\) are the parameters, and \(b\) is the bias.
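The wide component above is just a dot product plus a bias. A minimal sketch with numpy, using illustrative values for \(\textbf{x}\), \(\textbf{w}\), and \(b\):

```python
import numpy as np

def linear_model(x, w, b):
    """Generalized linear model prediction: y = w^T x + b."""
    return w @ x + b

# Illustrative feature vector, weights, and bias (not from the paper).
x = np.array([1.0, 0.0, 2.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.1

y = linear_model(x, w, b)  # 0.5*1.0 + (-1.0)*0.0 + 0.25*2.0 + 0.1 = 1.1
```

In practice \(\textbf{x}\) is a sparse, mostly-binary vector of raw and transformed features, but the arithmetic is the same.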
The \(k\)-th interaction or cross-product transformation can be represented as: $$ \phi_k(\textbf{x})=\prod^d_{i=1}x^{c_{ki}}_i,\quad c_{ki}\in\{0, 1\} $$ where \(c_{ki}=1\) if the \(i\)-th feature is a part of the \(k\)-th interaction, and 0 otherwise.
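For binary features this product acts as a logical AND: \(\phi_k(\textbf{x})=1\) only when every feature selected by \(c_k\) is active. A small sketch with made-up feature and selector vectors:

```python
import numpy as np

def cross_product(x, c):
    """Cross-product transformation: phi_k(x) = prod_i x_i ** c_ki.

    For binary x, this is 1 only if every feature with c_ki = 1 is active
    (numpy evaluates 0**0 as 1, so unselected features drop out).
    """
    return np.prod(np.power(x, c))

x = np.array([1, 0, 1])       # binary (e.g. one-hot) features; illustrative
c_and = np.array([1, 0, 1])   # cross of features 0 and 2, both active
c_miss = np.array([1, 1, 0])  # cross involving the inactive feature 1

cross_product(x, c_and)   # 1: all selected features are active
cross_product(x, c_miss)  # 0: feature 1 is inactive
```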
In an FFN, the first layer, often referred to as an embedding layer, converts the high-dimensional sparse features into low-dimensional dense real-valued vectors. The embedding vectors are then fed into the hidden layers in the forward pass: $$ a^{(l+1)}=f(W^{(l)}a^{(l)}+b^{(l)}) $$ where \(l\) is the layer number and \(f\) is the activation function.
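The embedding lookup and one hidden-layer step can be sketched as follows; the vocabulary size, embedding dimension, hidden width, and random parameters are all illustrative stand-ins for trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper).
vocab_size, embed_dim, hidden_dim = 10, 4, 8

E = rng.normal(size=(vocab_size, embed_dim))  # embedding table
W = rng.normal(size=(hidden_dim, embed_dim))  # hidden-layer weights W^(l)
b = np.zeros(hidden_dim)                      # hidden-layer bias b^(l)

def relu(z):
    """ReLU activation, a common choice for f."""
    return np.maximum(z, 0.0)

# Embedding layer: a sparse categorical id becomes a dense vector.
a0 = E[3]                # lookup for categorical feature id 3

# One forward step: a^(l+1) = f(W^(l) a^(l) + b^(l)).
a1 = relu(W @ a0 + b)
```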
The proposed model combines a generalized linear model and an FFN by using a weighted sum of their output log odds that are fed into a logistic loss function: $$ P(Y=1|\textbf{x})=\sigma(\textbf{w}^T_{\textit{wide}}[\textbf{x}, \phi(\textbf{x})]+\textbf{w}^T_{\textit{deep}}a^{(l_f)}+b) $$
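Putting the pieces together, the joint prediction is a sigmoid over the sum of the wide and deep logits. A sketch with random parameters standing in for trained weights and made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs: raw + crossed binary features [x, phi(x)] for the
# wide side, and final hidden activations a^(l_f) for the deep side.
x_wide = rng.integers(0, 2, size=6).astype(float)
a_deep = rng.normal(size=8)

w_wide = rng.normal(size=6)   # wide weights (stand-ins for trained values)
w_deep = rng.normal(size=8)   # deep weights
b = 0.0

# P(Y=1|x) = sigma(w_wide^T [x, phi(x)] + w_deep^T a^(l_f) + b)
p = sigmoid(w_wide @ x_wide + w_deep @ a_deep + b)
```

Because the two logits are summed before the sigmoid, both components are trained jointly against the same logistic loss, rather than as an ensemble of separately trained models.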