Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes
Abstract
A fundamental property of deep learning normalization techniques, such as batch normalization, is that they make the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, so their gradient optimization dynamics can be represented as spherical optimization with a varying effective learning rate (ELR), which has been studied previously. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail through both a theoretical examination of a toy example and a thorough empirical analysis of real scale-invariant deep learning models. Each regime has its own distinctive features and strong parallels with previous research on training neural networks in general and scale-invariant ones in particular. Finally, we demonstrate how the discovered regimes are reflected in the conventional training of normalized networks.
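To make "training on the sphere with a fixed ELR" concrete, the following is a minimal sketch and not the paper's toy example or experimental setup. It assumes a hypothetical scale-invariant loss (a Rayleigh quotient with a fixed diagonal matrix `A`) and uses full, noise-free gradients, so it only illustrates the projected update itself and is not expected to reproduce the three regimes. The names `loss`, `grad`, `sphere_sgd_step`, and `run` are illustrative.

```python
import numpy as np


def loss(w, A):
    """Hypothetical scale-invariant loss: depends only on the direction w / ||w||."""
    u = w / np.linalg.norm(w)
    return float(u @ A @ u)


def grad(w, A):
    """Gradient of `loss` w.r.t. w; orthogonal to w and scaling as 1 / ||w||."""
    norm = np.linalg.norm(w)
    u = w / norm
    return 2.0 * (A @ u - (u @ A @ u) * u) / norm


def sphere_sgd_step(w, A, elr):
    """One gradient step with a fixed ELR, then projection back onto the unit sphere."""
    w_new = w - elr * grad(w, A)
    return w_new / np.linalg.norm(w_new)


def run(elr, steps=200, seed=0):
    """Run projected gradient descent on the unit sphere (assumed 3-D toy problem)."""
    rng = np.random.default_rng(seed)
    A = np.diag([1.0, 2.0, 5.0])  # assumption: arbitrary illustrative spectrum
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)        # start on the unit sphere
    losses = []
    for _ in range(steps):
        w = sphere_sgd_step(w, A, elr)
        losses.append(loss(w, A))
    return w, losses


if __name__ == "__main__":
    w, losses = run(elr=0.05)
    print("first loss:", losses[0], "last loss:", losses[-1])
```

Because the gradient of a scale-invariant function is orthogonal to the weights and inversely proportional to their norm, renormalizing after every step keeps the norm at 1 and the step size on the sphere is governed by `elr` alone. Studying the regimes described in the abstract would additionally require stochastic gradients and a sweep over `elr` values, which this sketch omits.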