Role of normalization
By transforming each hidden layer's input to a distribution with mean 0 and variance 1, normalization ensures that a layer's input distribution does not jitter excessively as the distribution varies from one mini-batch to the next. The network parameters no longer need to adapt to differently distributed inputs, and the inputs to the activation function are kept away from the saturated regions at both ends.
 Accelerates network convergence and speeds up training
 Helps avoid overfitting. (BN couples the samples within a batch, so the model cannot fit any single sample's result in isolation, which reduces overfitting.)
Batch Normalization
 BN computes its statistics over a mini-batch. Let the pre-activation output of a layer be $Z \in \mathbb{R}^{batch\_size \times hidden\_size}$; for each neuron $i$, i.e. each column $Z_i \in \mathbb{R}^{batch\_size \times 1}$, the statistics are computed before the activation function. The statistics are the mean and variance of each output dimension over the batch, giving two vectors of dimension $hidden\_size$:
$$\mu_{Z_i} = \frac{1}{batch\_size} \sum\limits^{batch\_size}_{j=1} Z_{i,j} $$$$\sigma_{Z_i}=\sqrt{\frac{1}{batch\_size}\sum\limits_{j=1}^{batch\_size}(Z_{i,j} - \mu_{Z_i})^2} $$
 BN update
After BN obtains these two statistic vectors, it transforms the layer output $Z$, i.e. it maps the output to a reasonable distribution. The update has two steps: 1. Transform each dimension to a distribution with mean 0 and variance 1. 2. Apply a further translation and scaling; this is a linear transformation whose translation and scaling parameters are learnable. The activation function is applied after this two-step update.
$$ Z'_i = \frac {Z_i - \mu_{Z_i}}{\sigma_{Z_i}} $$$$ Z''_i = \gamma Z'_i + \beta $$
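The two-step update above can be sketched in NumPy. This is a minimal illustration, not a framework implementation: the function and variable names are my own, and a small `eps` is added to the denominator for numerical safety, as real implementations do.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    # Step 1: per-dimension statistics over the batch axis,
    # giving two vectors of length hidden_size.
    mu = Z.mean(axis=0)
    sigma = Z.std(axis=0)
    Z_norm = (Z - mu) / (sigma + eps)      # mean 0, variance 1 per dimension
    # Step 2: learnable translation and scaling.
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(64, 8))   # (batch_size, hidden_size)
out = batch_norm(Z, gamma=np.ones(8), beta=np.zeros(8))
# With gamma=1 and beta=0, every column of `out` has mean ~0 and std ~1.
```

With `gamma` and `beta` left learnable, the network can undo the normalization where that helps expressiveness.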

BN features:
 All statistics are computed per dimension (per neuron), and their calculation depends on batch_size
 BN implicitly assumes that, across samples, the feature at the same position (the same dimension) follows the same distribution.
 Because the statistics depend on batch_size, a batch that is too small yields statistics inconsistent with the whole dataset. The training batch_size should therefore not be too small, and samples should be shuffled as much as possible so that each batch's distribution stays close to the global one.
 BN uses different strategies in the train and infer phases. At inference there is often only one sample, so computing batch statistics is clearly infeasible. The $\mu$ and $\sigma$ parameters are therefore updated iteratively during training, as running estimates of the statistics of all training samples, and the saved values are used in the infer phase.
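A sketch of how this train/infer split is commonly handled. The class name is hypothetical; the running estimates here use an exponential moving average with a `momentum` hyperparameter, which is how frameworks such as PyTorch track them.

```python
import numpy as np

class BatchNormStats:
    """Track running mean/variance during training for use at inference."""
    def __init__(self, hidden_size, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mu = np.zeros(hidden_size)
        self.running_var = np.ones(hidden_size)

    def forward(self, Z, training):
        if training:
            mu, var = Z.mean(axis=0), Z.var(axis=0)
            # Exponential moving average toward the current batch statistics.
            self.running_mu += self.momentum * (mu - self.running_mu)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: the saved estimates are used, so even a single
            # sample can be normalized without batch statistics.
            mu, var = self.running_mu, self.running_var
        return (Z - mu) / np.sqrt(var + self.eps)

bn = BatchNormStats(hidden_size=4)
rng = np.random.default_rng(1)
for _ in range(200):                                     # training batches
    bn.forward(rng.normal(5.0, 2.0, size=(32, 4)), training=True)
out = bn.forward(np.full((1, 4), 5.0), training=False)   # one infer sample
```

After enough training batches, `running_mu` and `running_var` approximate the population statistics (here, mean 5 and variance 4), so the single inference sample is normalized consistently with training.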
Layer Normalization
 LN normalizes at the level of a single sample. For the pre-activation output $Z \in \mathbb{R}^{batch\_size \times hidden\_size}$, the parameters $\mu$ and $\sigma$ are computed for each sample $Z_j \in \mathbb{R}^{1 \times hidden\_size}$. The computation is similar to BN but runs in the other direction; in the end, $batch\_size$ samples yield $batch\_size$ sets of parameters.
 LN update: each dimension of each sample is normalized using that sample's own parameters, followed by the same scaling and translation; the translation and scaling parameters are learnable.
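The per-sample computation can be sketched as below (names are my own); note that the only change from the BN computation is the axis the statistics run over.

```python
import numpy as np

def layer_norm(Z, g, b, eps=1e-5):
    # Statistics per sample: mean/std over the hidden axis,
    # giving batch_size sets of parameters.
    mu = Z.mean(axis=-1, keepdims=True)     # shape (batch_size, 1)
    sigma = Z.std(axis=-1, keepdims=True)
    Z_norm = (Z - mu) / (sigma + eps)
    return g * Z_norm + b                   # learnable scale and shift

rng = np.random.default_rng(0)
Z = rng.normal(3.0, 2.0, size=(5, 16))      # (batch_size, hidden_size)
out = layer_norm(Z, g=np.ones(16), b=np.zeros(16))
# Each ROW of `out` now has mean ~0 and std ~1, for any batch size.
```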

LN features:
 All statistics are computed within a single sample, so they do not depend on batch_size, and there is no need to store $\mu$ and $\sigma$ for the infer stage. Train and infer behave consistently.
 LN implicitly assumes that all features within a sample are similar, so they can be normalized together across all of the sample's dimensions. This assumption clearly holds in NLP tasks: each feature of a sentence is obtained by processing a word, so the features are broadly alike. However, if the input concatenates heterogeneous features such as age, gender and height, computing the norm within the sample causes serious problems because each feature follows a different distribution.
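A small NumPy demonstration of that failure mode, on synthetic data with feature names chosen purely for illustration: when a large-scale feature such as income dominates the per-sample statistics, LN collapses the small-scale features toward a constant and their variation is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age    = rng.normal(40, 10, size=n)          # years
height = rng.normal(170, 8, size=n)          # centimeters
income = rng.normal(50_000, 10_000, size=n)  # dollars: a far larger scale
X = np.stack([age, height, income], axis=1)

# Per-sample LN over the concatenated, heterogeneous features.
mu = X.mean(axis=1, keepdims=True)
sigma = X.std(axis=1, keepdims=True)
X_ln = (X - mu) / sigma

# Income dominates mu and sigma, so normalized age is nearly the same
# value for every sample: the age signal is effectively erased.
print(X_ln[:, 0].std())  # close to 0, vs. 1.0 if age were normalized alone
```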
Contrast

BN is not suitable for variable-length NLP tasks.
 For variable-length samples, later positions have fewer valid samples, which is effectively a shrinking batch_size.
 In RNN-style recurrent networks, the longer the sequence position, the fewer valid samples remain, and the worse BN works.
 In the infer phase, a sample longer than anything seen during training has positions for which no stored parameters exist.

LN is more suitable for NLP tasks
 Variable-length samples pose no problem.
 The features of a single sample are generally similar. For example, a sentence's features come from its words, so after concatenation the distribution of each dimension is also consistent.
 In implementations such as TensorFlow and PyTorch, LN normalizes over the innermost vector. For an output of shape [batch_size, seq_len, hidden_size], LN computes its parameters over the innermost $hidden\_size$ neurons, so the calculation does not conflict with padding. BN is equivalent to normalizing a [batch_size * seq_len, hidden_size] matrix, in which case the padded positions are included in the mean and variance calculation.
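A NumPy sketch of that contrast, using hypothetical shapes and data; padding positions are zeroed out, as is common after masking.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 4, 6, 3
X = rng.normal(1.0, 0.5, size=(batch, seq_len, hidden))
lengths = np.array([6, 4, 3, 2])                        # valid tokens per sample
mask = np.arange(seq_len)[None, :] < lengths[:, None]   # True at real tokens
X = X * mask[..., None]                                 # zero out the padding

# BN view: flatten to (batch * seq_len, hidden). The all-zero padding rows
# are included and drag the per-dimension statistics toward zero.
flat = X.reshape(-1, hidden)
bn_mu = flat.mean(axis=0)                        # contaminated by padding
real_mu = flat[mask.reshape(-1)].mean(axis=0)    # statistics over real tokens

# LN view: each (hidden,)-vector is normalized on its own, so a real token's
# output never depends on how much padding the rest of the batch contains.
tok = X[0, 0]
ln_tok = (tok - tok.mean()) / tok.std()
```

Here the real tokens have mean around 1, so every entry of `bn_mu` is pulled below the padding-free `real_mu`, while the LN output for a real token is unaffected.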
 Both LN and BN apply a two-step transformation. The first step produces a distribution with $\mu = 0$ and $\sigma = 1$; its purpose is to move the neuron outputs to a more balanced distribution and stabilize the parameters. The second step scales and translates via two learnable parameters ($\gamma$ and $\beta$) in order to restore the expressive power of the original data.