The initialization step can be critical to the model's performance, and choosing the right method matters:
* Initializing the weights to zero makes every neuron in a layer compute the same output and receive the same (zero) gradient, so the symmetry is never broken and the network learns nothing.
* Initializing the weights to be too large causes the network to experience exploding gradients.
* Initializing the weights to be too small causes the network to experience vanishing gradients.
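The zero-initialization failure above can be seen directly: here is a minimal NumPy sketch (not any particular framework's API) of a two-layer network with all-zero weights, where every gradient comes out exactly zero, so no update ever changes the weights.

```python
import numpy as np

# Two-layer net, all weights initialized to zero.
x = np.array([[1.0, 2.0]])           # one input sample
W1 = np.zeros((2, 3))                # input -> hidden weights
W2 = np.zeros((3, 1))                # hidden -> output weights

# Forward pass
h = np.maximum(0.0, x @ W1)          # ReLU hidden layer -> all zeros
y = h @ W2                           # output -> zero

# Backward pass for loss 0.5 * (y - target)^2, target = 1
grad_y = y - 1.0
grad_W2 = h.T @ grad_y               # zero, because h is all zeros
grad_h = grad_y @ W2.T               # zero, because W2 is all zeros
grad_W1 = x.T @ (grad_h * (h > 0))   # zero as well

print(grad_W1, grad_W2)              # every gradient is exactly zero
```

Since both gradients are identically zero, gradient descent leaves the weights at zero forever, which is why all-zero initialization prevents learning.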
There is no single perfect initialization, but a few rules of thumb help:
* The mean of activations should be zero.
* The variance of activations should stay the same across every layer.
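These rules of thumb motivate scaling the initial weights by the layer's fan-in, as in Xavier/Glorot initialization. The sketch below (plain NumPy, linear layers only, for illustration) pushes a zero-mean batch through a stack of layers and compares the resulting activation variance for fan-in scaling versus weights that are too small or too large.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_variance(weight_std, n_layers=10, width=512):
    """Push a standard-normal batch through n_layers linear layers whose
    weights are drawn i.i.d. with the given std, and return the variance
    of the final activations."""
    x = rng.normal(0.0, 1.0, size=(256, width))
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(width, width))
        x = x @ W
    return x.var()

# Fan-in scaling, std = sqrt(1 / fan_in): each layer multiplies the
# variance by width * std^2 = 1, so the variance stays roughly constant.
print(forward_variance(np.sqrt(1.0 / 512)))

# Too small: the per-layer factor is well below 1, so the signal vanishes.
print(forward_variance(0.01))

# Too large: the per-layer factor is well above 1, so the signal explodes.
print(forward_variance(0.1))
```

With `std = sqrt(1 / fan_in)` the variance stays near 1 across all layers, while the other two choices shrink it toward zero or blow it up by orders of magnitude, which is exactly the vanishing/exploding behavior described above.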