Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction
Abstract
Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operations induce additional memory costs that, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.
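The core idea can be illustrated with a minimal NumPy sketch: during the forward pass, instead of retaining the full-precision input for the backward pass, store only a few-bit code indicating which piece of a piecewise-constant approximation of the activation derivative the input falls into. The sketch below uses the sigmoid activation and a 2-bit (4-level) approximation with hand-picked bin boundaries and level values; the paper computes these optimally via dynamic programming, so the boundaries, level values, and function names here are illustrative assumptions only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Hand-picked 2-bit quantization of the sigmoid derivative (4 bins).
# The paper derives optimal boundaries/levels by dynamic programming;
# these values are a crude heuristic for illustration.
BOUNDS = np.array([-2.0, 0.0, 2.0])            # 3 boundaries -> 4 bins
LEVELS = np.array([sigmoid_deriv(-3.0),        # representative derivative
                   sigmoid_deriv(-1.0),        # value per bin (midpoint-ish)
                   sigmoid_deriv(1.0),
                   sigmoid_deriv(3.0)])

def forward(x):
    y = sigmoid(x)
    # Retain only 2-bit codes (values 0..3) instead of the full input x.
    codes = np.digitize(x, BOUNDS).astype(np.uint8)
    return y, codes

def backward(grad_out, codes):
    # Piecewise-constant approximation of d(sigmoid)/dx, looked up by code.
    return grad_out * LEVELS[codes]

x = np.linspace(-4.0, 4.0, 9)
y, codes = forward(x)
grad_x = backward(np.ones_like(x), codes)
```

With 2 bits per element instead of 32, the retained state for this nonlinearity shrinks by roughly 16x, at the cost of a bounded approximation error in the gradient.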