OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER (2017)
August 1, 2025

Increasing a model’s number of parameters enhances its capacity to learn complex patterns, but it also raises computational costs. OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER introduces the sparsely-gated Mixture-of-Experts (MoE) layer, a method to scale model capacity efficiently: a gating network dynamically selects a sparse subset of feed-forward networks (called experts) for each input and skips computation for the others.
For a given input \(x\), let \(G(x)\) denote the output of the gating network and \(E_i(x)\) the output of the \(i\)-th expert. The MoE layer produces its output \(y\) as a weighted sum of the selected experts’ outputs: $$ y=\sum^n_{i=1}G(x)_iE_i(x) $$ Here, \(G(x)\) is a sparse vector of size \(n\): only \(k\) elements are non-zero (positive values that sum to 1), and the rest are zero. Since the unselected experts contribute nothing to \(y\), only \(k\) experts need to be evaluated, which significantly reduces computation.
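As a concrete illustration, the snippet below is a minimal PyTorch sketch of this weighted combination, assuming each expert is a small two-layer feed-forward network and that the sparse gate vector \(G(x)\) has already been computed (the gating network is defined next). The class and argument names are illustrative, not the paper’s implementation.

```python
# Minimal sketch of the sparse MoE combination y = sum_i G(x)_i * E_i(x),
# assuming `gates` is the sparse gate vector produced by the gating network.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        # n independent feed-forward "experts"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
        # x: (d_model,), gates: (n_experts,) with only k non-zero entries.
        y = torch.zeros_like(x)
        # Only the experts with a non-zero gate are evaluated.
        for i in gates.nonzero(as_tuple=True)[0]:
            y = y + gates[i] * self.experts[i](x)
        return y
```

Batched implementations dispatch each example only to its selected experts, so the cost per example scales with \(k\) rather than with the total number of experts \(n\).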
The gating network computes \(G(x)\) using two trainable weight matrices: \(W_g\), which determines expert selection, and \(W_{\text{noise}}\), which injects controlled noise to promote balanced usage of experts: $$ \begin{align*} G(x)&=\text{Softmax}(\text{KeepTopK}(H(x), k))\\ H(x)_i&=(x\cdot W_g)_i + \text{StandardNormal()}\cdot \text{Softplus}((x\cdot W_{\text{noise}})_i)\\ \text{KeepTopK}(v,k)_i&= \begin{cases} v_i&\text{if }v_i \text{ is in the top }k\text{ elements of }v\\ -\infty&\text{otherwise} \end{cases} \end{align*} $$
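A minimal sketch of this noisy top-k gating, assuming a single input vector \(x\) and treating \(W_g\) and \(W_{\text{noise}}\) as (d_model, n_experts) matrices; the function name and tensor shapes are assumptions, not the authors’ code.

```python
# Sketch of noisy top-k gating: compute H(x), apply KeepTopK, then Softmax.
import torch
import torch.nn.functional as F

def noisy_top_k_gates(x: torch.Tensor, W_g: torch.Tensor,
                      W_noise: torch.Tensor, k: int) -> torch.Tensor:
    clean_logits = x @ W_g                          # (x . W_g)
    noise_stddev = F.softplus(x @ W_noise)          # Softplus(x . W_noise)
    noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_stddev  # H(x)

    # KeepTopK: keep the top-k noisy logits, set every other entry to -inf.
    topk_vals, topk_idx = noisy_logits.topk(k)
    masked = torch.full_like(noisy_logits, float("-inf"))
    masked[topk_idx] = topk_vals

    # Softmax turns the -inf entries into exact zeros, giving the sparse G(x).
    return torch.softmax(masked, dim=-1)
```

The noise term is what makes the smooth load estimator described below well-defined, since it turns the selection of each expert into a probabilistic event.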
Left to itself, the gating network tends to favor a few experts, which leads to inefficient computation and underutilized model capacity. To encourage balanced expert usage, two auxiliary loss terms are added to the training objective: \(L_{\text{importance}}(X)\) and \(L_{\text{load}}(X)\).
The importance loss encourages equal contribution from all experts. Given a batch \(X\), the importance of each expert is the sum of its gate activations: $$ \begin{align*} \text{Importance}(X)&=\sum_{x\in X}G(x)\\ L_{\text{importance}}(X)&=w_{\text{importance}}\cdot \textit{CV}(\text{Importance}(X))^2 \end{align*} $$ where \(\text{CV}\) denotes the coefficient of variation (standard deviation divided by mean), and \(w_{\text{importance}}\) is a hand-tuned scaling factor.
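A short sketch of this loss, under the assumption that the gate vectors for a batch have been stacked into a (batch_size, n_experts) tensor; `cv_squared` and the other names are illustrative.

```python
# Sketch of the importance loss: squared coefficient of variation of the
# per-expert sums of gate values over a batch.
import torch

def cv_squared(t: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # (std / mean)^2, with a small eps to avoid division by zero.
    return t.var() / (t.mean() ** 2 + eps)

def importance_loss(gates: torch.Tensor, w_importance: float) -> torch.Tensor:
    # gates: (batch_size, n_experts); Importance(X) is the column-wise sum.
    importance = gates.sum(dim=0)
    return w_importance * cv_squared(importance)
```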
The load loss addresses imbalance in the number of examples assigned to each expert. Since this count is discrete and non-differentiable, a smooth estimator is used instead. For each expert \(i\), \(P(x, i)\) is the probability that \(G(x)_i\) is non-zero when the noise on element \(i\) is resampled while the already-sampled noise on the other elements is held fixed: $$ \begin{align*} P(x, i) &=\textit{Pr}\left((x\cdot W_g)_i + \text{StandardNormal}()\cdot \text{Softplus}((x\cdot W_{\text{noise}})_i) > \text{kth\_excluding}(H(x), k, i)\right)\\ &=\Phi\left(\frac{(x\cdot W_g)_i - \text{kth\_excluding}(H(x), k, i)}{\text{Softplus}((x\cdot W_{\text{noise}})_i)}\right) \end{align*} $$ where \(\Phi\) is the cumulative distribution function of the standard normal distribution, and \(\text{kth\_excluding}(v, k, i)\) denotes the \(k\)-th largest element of \(v\) excluding position \(i\).
The load for each expert across a batch \(X\) is then estimated as: $$ \text{Load}(X)_i = \sum_{x\in X} P(x,i) $$
The load loss is defined as: $$ L_{\text{load}}(X)=w_{\text{load}}\cdot\text{CV}(\text{Load}(X))^2 $$ where \(w_{\text{load}}\) is another hand-tuned scaling factor.
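A hedged sketch of the load loss, assuming access to the clean logits \(x\cdot W_g\), the noisy logits \(H(x)\), and the noise standard deviations \(\text{Softplus}(x\cdot W_{\text{noise}})\) for a batch, each of shape (batch_size, n_experts). The kth_excluding threshold is derived from the top \(k+1\) noisy logits; all names are illustrative.

```python
# Sketch of the load loss built on the smooth estimator P(x, i).
import torch

def cv_squared(t: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # (std / mean)^2, as in the importance loss sketch above.
    return t.var() / (t.mean() ** 2 + eps)

def load_loss(clean_logits: torch.Tensor, noisy_logits: torch.Tensor,
              noise_stddev: torch.Tensor, k: int, w_load: float) -> torch.Tensor:
    # All logit tensors: (batch_size, n_experts); requires n_experts > k.
    top_vals, _ = noisy_logits.topk(k + 1, dim=-1)
    kth = top_vals[:, k - 1:k]           # k-th largest noisy logit per example
    kth_plus_one = top_vals[:, k:k + 1]  # (k+1)-th largest noisy logit

    # kth_excluding(H(x), k, i): if H(x)_i is itself in the top k, excluding it
    # makes the (k+1)-th value the threshold; otherwise the k-th value remains.
    in_top_k = noisy_logits >= kth
    threshold = torch.where(in_top_k, kth_plus_one, kth)

    # P(x, i) = Phi(((x . W_g)_i - kth_excluding) / Softplus((x . W_noise)_i))
    p = torch.distributions.Normal(0.0, 1.0).cdf(
        (clean_logits - threshold) / noise_stddev)

    load = p.sum(dim=0)                  # Load(X)
    return w_load * cv_squared(load)
```

In training, both auxiliary terms are simply added to the task loss, so the gating parameters receive gradients that push expert usage toward balance.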
By combining sparse gating with these load-balancing loss terms, the MoE layer achieves massive model capacity while keeping computation efficient. This technique enables training models with billions of parameters without a proportional increase in per-example computational cost.