Deep Neural Networks for Youtube Recommendations (2016)
September 23, 2025

Deep Neural Networks for YouTube Recommendations, published in 2016, describes YouTube’s large-scale recommendation system.
The system consists of two feed-forward neural networks: a candidate generation model and a ranking model, together containing approximately one billion parameters.
The candidate generation network suggests hundreds of videos using collaborative filtering. A user’s representation includes coarse features such as the IDs of watched videos and tokens from search queries. The ranking network predicts the expected watch time of the videos suggested by the candidate generation network.
The candidate generation model learns embeddings of users and videos through a process similar to matrix factorization. The model frames recommendation as an extreme multiclass classification problem: it classifies which video a user is expected to watch. Because the number of possible videos is too large to compute a full softmax, the output layer uses sampled softmax as its loss approximation.
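The classification framing can be sketched in NumPy. This is a toy illustration, not the paper's implementation: the corpus is shrunk to 1,000 videos, the embeddings are random, and `true_class` is a hypothetical label. It shows why a full softmax is cheap at toy scale but motivates sampling at corpus scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 1,000 videos with 16-dimensional embeddings (the real corpus
# has millions of videos; all sizes here are illustrative only).
num_videos, dim = 1000, 16
video_emb = rng.normal(size=(num_videos, dim))   # output-layer weights
user_emb = rng.normal(size=dim)                  # last hidden layer = user vector

# Full softmax over the corpus -- feasible at this toy scale, but too
# expensive for millions of classes, which is why training samples classes.
logits = video_emb @ user_emb
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# A sampled-softmax training step instead scores only the true class plus
# a small random sample of negative classes:
true_class = 42                                   # hypothetical watched video
negatives = rng.choice(num_videos, size=50, replace=False)
sampled = np.unique(np.append(negatives, true_class))
sampled_logits = video_emb[sampled] @ user_emb    # ~51 scores instead of 1000
```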
After training, video embeddings are stored in a nearest-neighbor index. At serving time, the system performs nearest-neighbor search to generate candidate recommendations.
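A minimal sketch of the serving step, using brute-force dot-product search over random embeddings; a production system would use an approximate nearest-neighbor index instead, and the sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16
index = rng.normal(size=(1000, dim))   # stored video embeddings (toy scale)
query = rng.normal(size=dim)           # user embedding computed at serving time

# Brute-force maximum-inner-product search; real systems replace this with
# an approximate nearest-neighbor index over millions of videos.
scores = index @ query
top_k = np.argsort(-scores)[:100]      # the "hundreds of candidates"
```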
Features for the candidate generation model include the user’s watch history. Since watch history is sequential, the sparse ID vectors of the sequence are averaged to produce a fixed-size input vector. Another feature is the age of each training example, which allows the model to prioritize newer videos over older ones and correct bias toward past data.
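The two features above can be sketched as follows. The embedding table, history, and age value are all hypothetical; the point is only the shapes: a variable-length history collapses to a fixed-size average, and example age is appended as a scalar.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, dim = 500, 8
id_emb = rng.normal(size=(vocab, dim))     # embedding table for video IDs

# A variable-length watch history becomes a fixed-size vector by averaging
# the embeddings of the watched video IDs.
watch_history = [3, 17, 17, 256]           # hypothetical sequence of video IDs
history_vec = id_emb[watch_history].mean(axis=0)   # shape (dim,)

# "Example age": the age of the training example, appended as a scalar
# feature so the model can capture freshness; at serving time it is set to
# zero (or slightly negative) to reflect the end of the training window.
example_age = 5.0
features = np.concatenate([history_vec, [example_age]])
```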
Sébastien Jean et al. proposed sampled softmax to efficiently train neural networks with very large output vocabularies.
TensorFlow implements this method as tf.nn.sampled_softmax_loss, also explained in What is Candidate Sampling.
In sampled softmax, the task is to predict the correct class from a subset of all possible classes. Let the set of all classes be \( L \), the sampling function be \( Q(y|x) \), and the sampled subset of \( L \) at the \( i \)-th iteration be \( S_i \). For an input \( x_i \):
$$ P(S_i = S \mid x_i) = \prod_{y \in S} Q(y|x_i) \prod_{y \in (L - S)} (1 - Q(y|x_i)). $$
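A small numeric check of this product, assuming each class is included in \( S_i \) independently with probability \( Q(y|x_i) \); the five classes and their \( Q \) values are made up for illustration.

```python
import numpy as np

# All classes L and a per-class inclusion probability Q(y|x_i): each class
# enters S_i independently, so P(S_i = S | x_i) is the product below.
L = np.arange(5)
Q = np.array([0.5, 0.2, 0.2, 0.1, 0.1])   # hypothetical Q(y|x_i) per class

S = {0, 2}                                 # one particular sampled subset
in_S = np.isin(L, list(S))
p_S = np.prod(np.where(in_S, Q, 1.0 - Q))
# p_S = 0.5 * 0.8 * 0.2 * 0.9 * 0.9 = 0.0648
```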
Let the correct class of \( x_i \) be \( t_i \). Training maximizes the probability of \( t_i \) within the candidate set \( C_i \), the sampled subset plus the correct class:
$$ C_i = S_i \cup \{t_i\}. $$
Applying Bayes’s rule step by step:
$$ \begin{align*} P(t_i = y \mid x_i, C_i) &= \frac{P(t_i = y, C_i \mid x_i)}{P(C_i \mid x_i)} \\ &= \frac{P(t_i = y \mid x_i) P(C_i \mid t_i = y, x_i)}{P(C_i \mid x_i)} \\ &= \frac{P(y \mid x_i)}{P(C_i \mid x_i)} P(C_i \mid t_i = y, x_i). \end{align*} $$
The probability of sampling \( C_i \) given that the correct class is \( y \) is the subset probability from before, with the factor for \( y \) itself divided out (since \( t_i = y \) is always included in \( C_i \)):

$$ P(C_i \mid t_i = y, x_i) = \frac{1}{Q(y|x_i)} \prod_{y' \in C_i} Q(y'|x_i) \prod_{y' \in (L - C_i)} (1 - Q(y'|x_i)). $$

Every factor except \( \frac{1}{Q(y|x_i)} \) is independent of \( y \); collecting those factors together with \( P(C_i \mid x_i) \) into a constant \( K(x_i, C_i) \) and substituting gives:

$$ P(t_i = y \mid x_i, C_i) = \frac{P(y|x_i)}{Q(y|x_i)} \cdot \frac{1}{K(x_i, C_i)}. $$
Taking logs gives:
$$ \log P(t_i = y \mid x_i, C_i) = \log P(y|x_i) - \log Q(y|x_i) + K'(x_i, C_i). $$

Since \( K'(x_i, C_i) \) is independent of \( y \), it is ignored during training. Thus, the model effectively learns to predict
$$ \log P(y|x_i) - \log Q(y|x_i). $$
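The \( \log Q \) correction can be sketched in NumPy. The logits and sampling probabilities below are illustrative, not real model outputs; the point is that subtracting \( \log Q(y|x_i) \) from each logit penalizes frequently sampled classes before the softmax.

```python
import numpy as np

rng = np.random.default_rng(3)

# Raw logits for the candidate set C_i and the sampling probabilities
# Q(y|x_i) of those candidates (both are made-up illustrative values).
logits = rng.normal(size=4)
Q = np.array([0.4, 0.3, 0.2, 0.1])

# Sampled softmax subtracts log Q(y|x_i) from each logit, so the model is
# effectively trained on log P(y|x_i) - log Q(y|x_i).
corrected = logits - np.log(Q)
probs = np.exp(corrected - corrected.max())
probs /= probs.sum()
```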
The weights of the output layer correspond to video embeddings, and the inputs are user embeddings. These are later stored in the nearest-neighbor index for serving.
The ranking model predicts the watch time of videos using weighted logistic regression. During training, positive examples (clicked videos) are weighted by their actual watch times, and negative examples receive unit weight. Let \( N \) be the total number of examples, \( k \) the number of positive examples, and \( T_i \) the watch time of the \( i \)-th positive example. The odds learned by the model are:
$$ \frac{\sum T_i}{N - k} $$
Rewriting, \( \frac{\sum T_i}{N - k} = \frac{\sum T_i}{N} \cdot \frac{1}{1 - k/N} = \frac{E[T]}{1 - P} \), where \( P = k/N \) is the click probability and \( E[T] \) the expected watch time over all impressions. Applying the approximation \( \frac{1}{1 - x} \approx 1 + x \) for small \( x \), the odds are approximately \( E[T](1 + P) \). Since \( P \) is small in YouTube’s case, this is effectively \( E[T] \), the expected watch time.
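A toy numeric check of this approximation, with made-up watch times and example counts:

```python
import numpy as np

# Positives are weighted by watch time T_i, negatives by 1 (illustrative).
T = np.array([30.0, 120.0, 60.0])   # watch times of the k = 3 positives
N = 100                              # total number of examples
k = len(T)

# Odds learned by weighted logistic regression:
odds = T.sum() / (N - k)             # sum of positive weights / negatives

# Rearranged form: E[T] / (1 - P) with P = k / N the click probability,
# and 1 / (1 - P) ~ 1 + P for small P, giving approximately E[T].
E_T = T.sum() / N
P = k / N
approx = E_T * (1 + P)
```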