Source
ICML
DATE OF PUBLICATION
07/29/2023
Authors
Ivan Oseledets, Denis Dimitrov, Alex Shonenkov, Georgii Novikov, Daniel Bershatsky, Julia Gusak

Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction

Abstract

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operations induce additional memory costs that, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute an optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. Drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.
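To make the idea concrete, here is a minimal PyTorch sketch (not the authors' implementation) of a GELU whose backward pass keeps only a per-element bucket code instead of the full input, and multiplies the upstream gradient by a constant derivative level per bucket. The breakpoints and levels below (BREAKPOINTS, BUCKET_DERIV) are illustrative assumptions; in the paper they would come from the optimal piecewise-constant fit of the derivative found by dynamic programming, and the codes would be packed into a few bits rather than stored as uint8.

```python
import torch

# Assumed 2-bit quantization of d/dx GELU(x): 3 breakpoints -> 4 buckets,
# each bucket mapped to one constant derivative value (placeholder numbers).
BREAKPOINTS = torch.tensor([-1.5, 0.0, 1.5])
BUCKET_DERIV = torch.tensor([-0.05, 0.15, 0.85, 1.05])

class FewBitGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save only the bucket index per element, not the input itself.
        # (A real implementation would bit-pack these codes.)
        codes = torch.bucketize(x, BREAKPOINTS.to(x.device)).to(torch.uint8)
        ctx.save_for_backward(codes)
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (codes,) = ctx.saved_tensors
        # Look up the piecewise-constant derivative approximation.
        deriv = BUCKET_DERIV.to(grad_out.device)[codes.long()]
        return grad_out * deriv

# Usage: gradients are computed from the quantized derivative.
x = torch.randn(4, 8, requires_grad=True)
FewBitGELU.apply(x).sum().backward()
print(x.grad.shape)
```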
