Source
ICML LCFM
Publication date
13.07.2025
Authors
Andrew Argatkiny, Ilya Makarov

MatMuls are Enough for Efficient and Performant Linear-Time Attention

Abstract

Transformers, despite empowering the current AI revolution, are bottlenecked by suboptimal hardware utilization and the quadratic runtime complexity of softmax attention with respect to input sequence length. Many recent architectures aspire to bring the complexity down to a sub-quadratic level without compromising modeling quality. However, they are either much slower on all but very long sequences or rely on low-level code tailored to a narrow subset of modern hardware. To simultaneously achieve linear complexity, hardware efficiency, and portability, we completely eliminate softmax from self-attention; remove, modify, or rearrange other transformations in the Transformer block; and reduce the number of attention heads. The resulting architecture, DenseAttention Network, is composed entirely of dense matrix multiplications in the attention, which allows for efficient training and inference in both quadratic and linear modes. It performs similarly to the standard Transformer in language modeling and surpasses the previous Transformer-based SOTA by 5% on the challenging Long Range Arena benchmarks. The DenseAttention model, written in plain PyTorch, is up to 22% faster even on small context sizes, and faster by orders of magnitude on longer sequences, than a Transformer with a low-level FlashAttention kernel.
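A minimal sketch, not the authors' implementation, of why removing softmax enables both quadratic and linear modes: once attention is a pure chain of matrix products, associativity lets it be computed either as (QKᵀ)V in O(N²d) or as Q(KᵀV) in O(Nd²), i.e. linearly in sequence length N. All function and variable names below are illustrative assumptions.

```python
import torch

def dense_attention_quadratic(q, k, v):
    # (N, d) @ (d, N) -> (N, N), then (N, N) @ (N, d): O(N^2 * d) time and memory
    return (q @ k.transpose(-2, -1)) @ v

def dense_attention_linear(q, k, v):
    # (d, N) @ (N, d) -> (d, d), then (N, d) @ (d, d): O(N * d^2), linear in N
    return q @ (k.transpose(-2, -1) @ v)

# Without softmax, both orderings are mathematically identical
# (up to floating-point error), so either mode can be chosen
# depending on whether N or d dominates.
N, d = 1024, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
out_quad = dense_attention_quadratic(q, k, v)
out_lin = dense_attention_linear(q, k, v)
assert torch.allclose(out_quad, out_lin, rtol=1e-3, atol=1e-2)
```

With softmax in place, the row-wise normalization must be applied to the full N×N score matrix before multiplying by V, which blocks this reassociation; eliminating it is what makes the linear mode possible.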
