Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction
Abstract
Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operations induce additional memory costs that, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.
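The core idea can be illustrated with a minimal NumPy sketch: during the forward pass, instead of retaining the full-precision input for the backward pass, store only a few-bit code indicating which piece of a piecewise-constant approximation of the activation derivative the input falls into. The sketch below uses the sigmoid activation and a 2-bit (4-level) approximation with hand-picked bin boundaries and level values; the paper computes these optimally via dynamic programming, so the boundaries, level values, and function names here are illustrative assumptions only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Hand-picked 2-bit quantization of the sigmoid derivative (4 bins).
# The paper derives optimal boundaries/levels by dynamic programming;
# these values are a crude heuristic for illustration.
BOUNDS = np.array([-2.0, 0.0, 2.0])            # 3 boundaries -> 4 bins
LEVELS = np.array([sigmoid_deriv(-3.0),        # representative derivative
                   sigmoid_deriv(-1.0),        # value per bin (midpoint-ish)
                   sigmoid_deriv(1.0),
                   sigmoid_deriv(3.0)])

def forward(x):
    y = sigmoid(x)
    # Retain only 2-bit codes (values 0..3) instead of the full input x.
    codes = np.digitize(x, BOUNDS).astype(np.uint8)
    return y, codes

def backward(grad_out, codes):
    # Piecewise-constant approximation of d(sigmoid)/dx, looked up by code.
    return grad_out * LEVELS[codes]

x = np.linspace(-4.0, 4.0, 9)
y, codes = forward(x)
grad_x = backward(np.ones_like(x), codes)
```

With 2 bits per element instead of 32, the retained state for this nonlinearity shrinks by roughly 16x, at the cost of a bounded approximation error in the gradient.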