Memory Efficient LM Compression using Fisher Information from Low-Rank Representations
Abstract
Although modern language models (LMs) demonstrate excellent performance on diverse text processing tasks, the substantial GPU memory required to load and run inference with these models can be prohibitive for users. To compress and accelerate LMs, various techniques are used, such as quantization, distillation, pruning, and low-rank factorization. In this work, we focus on improving a method from the latter category, namely the recently proposed Fisher-Weighted Singular Value Decomposition (FWSVD). Despite its efficiency, FWSVD requires fine-tuning of the whole model on a downstream task. We introduce a simple yet powerful modification of FWSVD that enables compression of models that were previously out of reach for the original approach. By combining LoRA with FWSVD, we demonstrate that low-rank-based compression can be achieved without storing the full gradients, sometimes even outperforming the original full fine-tuning. We evaluate our proposed approach on various NLP tasks, including NLU, NER, text summarization, and QA, showing its effectiveness compared to strong baselines.
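To make the core idea concrete, below is a minimal PyTorch-style sketch of the Fisher-weighted factorization step. The function name and the per-row squared-gradient Fisher estimate passed in as `row_fisher` are illustrative assumptions, not the paper's exact implementation; in the proposed approach such an estimate would come from lightweight LoRA adapter training rather than from full-model fine-tuning gradients.

```python
import torch

def fisher_weighted_svd(W: torch.Tensor, row_fisher: torch.Tensor, rank: int):
    """Sketch of an FWSVD-style low-rank factorization of a weight matrix.

    W          : (out, in) weight matrix to compress
    row_fisher : (out,) per-row Fisher estimate, e.g. accumulated squared
                 gradients summed over the input dimension (here assumed to be
                 obtained from LoRA adapter training, per the paper's idea)
    rank       : target rank r of the factorization
    Returns A (out, r) and B (r, in) such that W is approximately A @ B.
    """
    # Scale rows by the square root of their Fisher weight so that rows the
    # downstream task is sensitive to are reconstructed more faithfully.
    d = row_fisher.clamp_min(1e-8).sqrt()                     # (out,)
    U, S, Vh = torch.linalg.svd(d[:, None] * W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]
    # Undo the row scaling on the left factor so that A @ B approximates W.
    A = (U_r * S_r) / d[:, None]                              # (out, r)
    B = Vh_r                                                  # (r, in)
    return A, B
```

Replacing a dense linear layer's weight with the two factors `A` and `B` reduces its parameter count from `out * in` to `(out + in) * r`; the weighting by `row_fisher` is what distinguishes FWSVD from a plain truncated SVD.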