OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER (2017)
August 1, 2025

Increasing a model’s number of parameters enhances its capacity to learn complex patterns, but it also raises computational costs. OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER introduces the sparsely-gated Mixture-of-Experts (MoE) layer, a method to scale model capacity efficiently: a gating network dynamically selects a sparse subset of feed-forward networks (called experts) for each input and skips computation for the others.
For a given input \(x\), let \(G(x)\) denote the output of the gating network and \(E_i(x)\) the output of the \(i\)-th expert. The MoE layer produces its output \(y\) as a weighted sum of the selected experts’ outputs: $$ y=\sum^n_{i=1}G(x)_iE_i(x) $$ Here, \(G(x)\) is a sparse vector of size \(n\): only \(k\) elements are non-zero (positive values that sum to 1), and the rest are zero. Since the unselected experts contribute nothing to \(y\), only \(k\) experts need to be evaluated, which significantly reduces computation.
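As a concrete illustration, the snippet below is a minimal PyTorch sketch of this weighted combination, assuming each expert is a small two-layer feed-forward network and that the sparse gate vector \(G(x)\) has already been computed (the gating network is defined next). The class and argument names are illustrative, not the paper’s implementation.

```python
# Minimal sketch of the sparse MoE combination y = sum_i G(x)_i * E_i(x),
# assuming `gates` is the sparse gate vector produced by the gating network.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        # n independent feed-forward "experts"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
        # x: (d_model,), gates: (n_experts,) with only k non-zero entries.
        y = torch.zeros_like(x)
        # Only the experts with a non-zero gate are evaluated.
        for i in gates.nonzero(as_tuple=True)[0]:
            y = y + gates[i] * self.experts[i](x)
        return y
```

Batched implementations dispatch each example only to its selected experts, so the cost per example scales with \(k\) rather than with the total number of experts \(n\).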
The gating network computes \(G(x)\) using two trainable weight matrices: \(W_g\), which determines expert selection, and \(W_{\text{noise}}\), which injects controlled noise to promote balanced usage of experts: $$ \begin{align*} G(x)&=\text{Softmax}(\text{KeepTopK}(H(x), k))\\ H(x)_i&=(x\cdot W_g)_i + \text{StandardNormal()}\cdot \text{Softplus}((x\cdot W_{\text{noise}})_i)\\ \text{KeepTopK}(v,k)_i&= \begin{cases} v_i&\text{if }v_i \text{ is in the top }k\text{ elements of }v\\ -\infty&\text{otherwise} \end{cases} \end{align*} $$
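A minimal sketch of this noisy top-k gating, assuming a single input vector \(x\) and treating \(W_g\) and \(W_{\text{noise}}\) as (d_model, n_experts) matrices; the function name and tensor shapes are assumptions, not the authors’ code.

```python
# Sketch of noisy top-k gating: compute H(x), apply KeepTopK, then Softmax.
import torch
import torch.nn.functional as F

def noisy_top_k_gates(x: torch.Tensor, W_g: torch.Tensor,
                      W_noise: torch.Tensor, k: int) -> torch.Tensor:
    clean_logits = x @ W_g                          # (x . W_g)
    noise_stddev = F.softplus(x @ W_noise)          # Softplus(x . W_noise)
    noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_stddev  # H(x)

    # KeepTopK: keep the top-k noisy logits, set every other entry to -inf.
    topk_vals, topk_idx = noisy_logits.topk(k)
    masked = torch.full_like(noisy_logits, float("-inf"))
    masked[topk_idx] = topk_vals

    # Softmax turns the -inf entries into exact zeros, giving the sparse G(x).
    return torch.softmax(masked, dim=-1)
```

The noise term is what makes the smooth load estimator described below well-defined, since it turns the selection of each expert into a probabilistic event.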
Left to itself, the gating network tends to favor a few experts, which leads to inefficient computation and underutilized model capacity. To encourage balanced expert usage, two auxiliary loss terms are added to the training objective: \(L_{\text{importance}}(X)\) and \(L_{\text{load}}(X)\).
The importance loss encourages equal contribution from all experts. Given a batch \(X\), the importance of each expert is the sum of its gate activations: $$ \begin{align*} \text{Importance}(X)&=\sum_{x\in X}G(x)\\ L_{\text{importance}}(X)&=w_{\text{importance}}\cdot \textit{CV}(\text{Importance}(X))^2 \end{align*} $$ where \(\text{CV}\) denotes the coefficient of variation (standard deviation divided by mean), and \(w_{\text{importance}}\) is a hand-tuned scaling factor.
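A short sketch of this loss, under the assumption that the gate vectors for a batch have been stacked into a (batch_size, n_experts) tensor; `cv_squared` and the other names are illustrative.

```python
# Sketch of the importance loss: squared coefficient of variation of the
# per-expert sums of gate values over a batch.
import torch

def cv_squared(t: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # (std / mean)^2, with a small eps to avoid division by zero.
    return t.var() / (t.mean() ** 2 + eps)

def importance_loss(gates: torch.Tensor, w_importance: float) -> torch.Tensor:
    # gates: (batch_size, n_experts); Importance(X) is the column-wise sum.
    importance = gates.sum(dim=0)
    return w_importance * cv_squared(importance)
```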
The load loss addresses imbalance in the number of examples assigned to each expert. Since this count is discrete and non-differentiable, a smooth estimator is used instead. For each expert \(i\), \(P(x, i)\) is the probability that \(G(x)_i\) is non-zero when the noise on element \(i\) is resampled while the already-sampled noise on the other elements is held fixed: $$ \begin{align*} P(x, i) &=\textit{Pr}\left((x\cdot W_g)_i + \text{StandardNormal}()\cdot \text{Softplus}((x\cdot W_{\text{noise}})_i) > \text{kth\_excluding}(H(x), k, i)\right)\\ &=\Phi\left(\frac{(x\cdot W_g)_i - \text{kth\_excluding}(H(x), k, i)}{\text{Softplus}((x\cdot W_{\text{noise}})_i)}\right) \end{align*} $$ where \(\Phi\) is the cumulative distribution function of the standard normal distribution, and \(\text{kth\_excluding}(v, k, i)\) denotes the \(k\)-th largest element of \(v\) excluding position \(i\).
The load for each expert across a batch \(X\) is then estimated as: $$ \text{Load}(X)_i = \sum_{x\in X} P(x,i) $$
The load loss is defined as: $$ L_{\text{load}}(X)=w_{\text{load}}\cdot\text{CV}(\text{Load}(X))^2 $$ where \(w_{\text{load}}\) is another hand-tuned scaling factor.
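A hedged sketch of the load loss, assuming access to the clean logits \(x\cdot W_g\), the noisy logits \(H(x)\), and the noise standard deviations \(\text{Softplus}(x\cdot W_{\text{noise}})\) for a batch, each of shape (batch_size, n_experts). The kth_excluding threshold is derived from the top \(k+1\) noisy logits; all names are illustrative.

```python
# Sketch of the load loss built on the smooth estimator P(x, i).
import torch

def cv_squared(t: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # (std / mean)^2, as in the importance loss sketch above.
    return t.var() / (t.mean() ** 2 + eps)

def load_loss(clean_logits: torch.Tensor, noisy_logits: torch.Tensor,
              noise_stddev: torch.Tensor, k: int, w_load: float) -> torch.Tensor:
    # All logit tensors: (batch_size, n_experts); requires n_experts > k.
    top_vals, _ = noisy_logits.topk(k + 1, dim=-1)
    kth = top_vals[:, k - 1:k]           # k-th largest noisy logit per example
    kth_plus_one = top_vals[:, k:k + 1]  # (k+1)-th largest noisy logit

    # kth_excluding(H(x), k, i): if H(x)_i is itself in the top k, excluding it
    # makes the (k+1)-th value the threshold; otherwise the k-th value remains.
    in_top_k = noisy_logits >= kth
    threshold = torch.where(in_top_k, kth_plus_one, kth)

    # P(x, i) = Phi(((x . W_g)_i - kth_excluding) / Softplus((x . W_noise)_i))
    p = torch.distributions.Normal(0.0, 1.0).cdf(
        (clean_logits - threshold) / noise_stddev)

    load = p.sum(dim=0)                  # Load(X)
    return w_load * cv_squared(load)
```

In training, both auxiliary terms are simply added to the task loss, so the gating parameters receive gradients that push expert usage toward balance.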
By combining sparse gating with these load-balancing loss terms, the MoE layer achieves massive model capacity while keeping computation efficient. This technique enables training models with billions of parameters without a proportional increase in per-example computational cost.