Layer Normalization (2016)
June 17, 2023

Layer Normalization, introduced in the 2016 paper Layer Normalization, is a technique that normalizes the summed inputs to the neurons in a layer using the mean and standard deviation of those inputs computed within that layer.
Batch Normalization, by contrast, normalizes the summed inputs to a neuron using their mean and standard deviation computed over a mini-batch of training cases. While Batch Normalization can accelerate training, its normalization statistics depend on the mini-batch size, and it is difficult to apply to RNNs, where the statistics of the summed inputs change with the sequence length.
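To make the difference concrete, here is a minimal sketch (assuming NumPy; the batch and feature sizes are arbitrary illustrations) contrasting the axes over which the two methods compute their statistics:

```python
import numpy as np

x = np.random.randn(32, 64)  # a hypothetical mini-batch: 32 examples, 64 features

# Batch Normalization statistics: one mean/std per feature (shape (64,)),
# computed across the batch dimension -- hence the mini-batch dependence.
bn_mean, bn_std = x.mean(axis=0), x.std(axis=0)

# Layer Normalization statistics: one mean/std per example (shape (32,)),
# computed across the feature dimension -- independent of the batch size.
ln_mean, ln_std = x.mean(axis=1), x.std(axis=1)
```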
Let the bottom-up inputs to the \(l\)th hidden layer of a deep feed-forward neural network be \(h^l\), with \(w_i^l\) representing the incoming weights to the \(i\)th neuron and \(b^l_i\) the scalar bias parameter. The weighted sum of inputs \(a^l_i\) and the resulting activity are computed as: $$ a^l_i = {w^l_i}^{\top}h^l\ \ \ \ h^{l+1}_i = f(a^l_i + b^l_i) $$
In Layer Normalization, the neurons within the same layer share the same mean and standard deviation. Let \(H\) be the number of neurons in layer \(l\). The mean \(\mu^l\) and standard deviation \(\sigma^l\) are given as follows: $$ \mu^l=\frac{1}{H}\sum^H_{i=1}a^l_i\ \ \ \ \sigma^l=\sqrt{\frac{1}{H}\sum^H_{i=1}(a^l_i-\mu^l)^2} $$
Using a learnable gain parameter \(g^l_i\) that rescales each neuron's normalized input, Layer Normalization normalizes the summed inputs as follows: $$ \bar{a}^l_i=\frac{g^l_i}{\sigma^l}(a^l_i-\mu^l) $$
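The following is a minimal NumPy sketch of this computation for a single layer; the `eps` term is a standard numerical-stability guard that does not appear in the formulas above, and the weights and sizes are arbitrary illustrations:

```python
import numpy as np

def layer_norm(a, g, eps=1e-5):
    """Normalize the summed inputs a (shape (H,)) of one layer.

    g holds the per-neuron gains g_i; eps is a conventional guard against
    division by zero and is not part of the formulas above.
    """
    mu = a.mean()                             # mu^l, shared by all H neurons
    sigma = np.sqrt(((a - mu) ** 2).mean())   # sigma^l
    return g / (sigma + eps) * (a - mu)       # a-bar^l_i

# Illustrative sizes: a layer of H = 4 neurons fed by 3 inputs.
rng = np.random.default_rng(0)
h = rng.normal(size=3)            # h^l
W = rng.normal(size=(4, 3))       # rows are the weight vectors w^l_i
a = W @ h                         # summed inputs a^l_i
print(layer_norm(a, g=np.ones(4)))
```

The normalized outputs have zero mean across the layer, with each neuron's scale set by its gain \(g^l_i\).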
In RNNs, consider \(\boldsymbol{x}^t\) as the input at time \(t\), \(\boldsymbol{h}^{t-1}\) as the vector of previous hidden states, \(W_{hh}\) as the recurrent hidden-to-hidden weights, and \(W_{xh}\) as the bottom-up input-to-hidden weights. The summed inputs at time \(t\) are: $$ \boldsymbol{a}^t=W_{hh}\boldsymbol{h}^{t-1}+W_{xh}\boldsymbol{x}^t $$
Finally, the normalized hidden state \(\boldsymbol{h}^t\) is computed as follows, where \(\odot\) denotes element-wise multiplication and \(\boldsymbol{b}\) and \(\boldsymbol{g}\) are bias and gain parameters of the same dimension as \(\boldsymbol{h}^t\): $$ \boldsymbol{h}^t = f\left[\frac{\boldsymbol{g}}{\sigma^t}\odot (\boldsymbol{a}^t-\mu^t )+\boldsymbol{b}\right] \ \ \ \ \mu^t=\frac{1}{H}\sum^H_{i=1}a^t_i\ \ \ \ \sigma^t=\sqrt{\frac{1}{H}\sum^H_{i=1}(a^t_i-\mu^t)^2} $$
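Putting the pieces together, here is a minimal sketch of one layer-normalized RNN step under the equations above, assuming \(f = \tanh\) and arbitrary illustrative dimensions:

```python
import numpy as np

def ln_rnn_step(x_t, h_prev, W_xh, W_hh, g, b, eps=1e-5):
    """One layer-normalized RNN step, with f = tanh and eps added for
    numerical stability (an implementation detail, not in the equations)."""
    a_t = W_hh @ h_prev + W_xh @ x_t            # summed inputs a^t
    mu = a_t.mean()                             # mu^t over the H hidden units
    sigma = np.sqrt(((a_t - mu) ** 2).mean())   # sigma^t
    return np.tanh(g / (sigma + eps) * (a_t - mu) + b)

# Illustrative sizes: H = 5 hidden units, input dimension D = 3.
rng = np.random.default_rng(1)
H, D = 5, 3
W_xh, W_hh = rng.normal(size=(H, D)), rng.normal(size=(H, H))
h = np.zeros(H)
for x_t in rng.normal(size=(10, D)):            # unroll over 10 time steps
    h = ln_rnn_step(x_t, h, W_xh, W_hh, g=np.ones(H), b=np.zeros(H))
```

Because \(\mu^t\) and \(\sigma^t\) are recomputed from the current summed inputs at each time step, the normalization depends on neither the mini-batch size nor the sequence length.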