Source
NLDB
DATE OF PUBLICATION
07/01/2025
Authors
Daniil Moskovskiy, Sergey Pletenev, Sergey Zagoruyko, Alexander Panchenko

Memory Efficient LM Compression using Fisher Information from Low-Rank Representations

Abstract

Although modern language models (LMs) demonstrate excellent performance in diverse text processing tasks, the substantial GPU memory required to load and run inference with these models can be prohibitive for users. To compress and accelerate LMs, various techniques are used, such as quantization, distillation, pruning, and low-rank factorization. In this work, we focus on improving a method from the latter category, namely the recent Fisher-Weighted Singular Value Decomposition (FWSVD) technique. Despite its efficiency, FWSVD requires fine-tuning of the whole model on a downstream task. We introduce a simple yet powerful modification of FWSVD that enables compression of models previously unavailable with the original approach. By combining LoRA with FWSVD, we demonstrate that low-rank-based compression can be achieved without storing the full gradients, sometimes even outperforming the original full fine-tuning. We evaluate our proposed approach on various NLP tasks, including NLU, NER, text summarization, and QA, showing its effectiveness compared to strong baselines.
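The page does not detail the algorithm, but the core FWSVD idea can be illustrated with a short sketch: per-row importances estimated from accumulated squared gradients (a Fisher information proxy) weight the rows of a layer's weight matrix before it is factorized with SVD. The sketch below is a minimal illustration under that assumption, not the authors' implementation; the function name `fwsvd_factorize`, the tensor shapes, and the placeholder for gradient accumulation are hypothetical, and the paper's actual contribution (estimating these importances from LoRA gradients rather than full-model gradients) is only indicated in comments.

```python
# Minimal sketch (not the authors' code) of Fisher-weighted SVD compression
# of a single linear layer. Names, shapes, and the way row importances are
# obtained are illustrative assumptions.

import torch


def fwsvd_factorize(W: torch.Tensor, row_fisher: torch.Tensor, rank: int):
    """Factorize W (out_dim x in_dim) into A (out_dim x rank) @ B (rank x in_dim),
    weighting each row of W by its estimated Fisher importance before the SVD."""
    i_hat = row_fisher.sqrt().clamp(min=1e-8)          # per-row weights
    U, S, Vh = torch.linalg.svd(i_hat[:, None] * W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
    A = (U * S) / i_hat[:, None]   # undo the row weighting on the left factor
    B = Vh
    return A, B                    # A @ B approximates W


# Hypothetical usage: grad_sq_accum stands in for squared gradients
# accumulated over fine-tuning batches (in the paper, obtained via LoRA
# rather than by storing full-model gradients).
W = torch.randn(768, 768)
grad_sq_accum = torch.rand(768, 768)
row_fisher = grad_sq_accum.sum(dim=1)          # per-row Fisher estimate
A, B = fwsvd_factorize(W, row_fisher, rank=64)
print((A @ B - W).norm() / W.norm())           # relative reconstruction error
```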
