MatMuls are Enough for Linear-Time Dense Attention
Abstract
Transformers, despite empowering the current AI revolution, are bottlenecked by suboptimal hardware utilization and the quadratic runtime complexity of softmax attention w.r.t. input sequence length. Many recent architectures aspire to bring the complexity down to a sub-quadratic level without compromising modeling quality. However, they are either much slower on all but very long sequences or rely on low-level code tailored to a narrow subset of modern hardware. To simultaneously achieve linear complexity, hardware efficiency, and portability, we completely eliminate softmax from self-attention; remove, modify, or rearrange other transformations in the Transformer block; and reduce the number of attention heads. The resulting architecture, DenseAttention Network, is composed entirely of dense matrix multiplications in the attention, which allows for efficient training and inference in both quadratic and linear modes. It performs on par with the standard Transformer in language modeling and surpasses the previous Transformer-based SOTA by 5% on the challenging Long Range Arena benchmarks. The DenseAttention model, written in plain PyTorch, is up to 57% faster than a Transformer augmented with the low-level FlashAttention kernel even at a small context size of 512, and faster by orders of magnitude on longer sequences.
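The abstract's key claim rests on a simple algebraic fact: once softmax is removed, attention is a chain of plain matrix multiplications, so the same output can be computed in either quadratic or linear order by associativity. The sketch below illustrates only this fact, with hypothetical shapes and names; it is not the paper's exact DenseAttention formulation.

```python
import torch

# Hypothetical sizes for illustration: sequence length N, head dimension d.
N, d = 512, 64
Q = torch.randn(N, d)
K = torch.randn(N, d)
V = torch.randn(N, d)

# Quadratic mode: materialize the N x N score matrix, O(N^2 * d) work.
quadratic = (Q @ K.T) @ V

# Linear mode: associate the other way, materializing only a d x d state,
# O(N * d^2) work -- linear in sequence length.
linear = Q @ (K.T @ V)

# Both orderings yield the same result up to floating-point error.
assert torch.allclose(quadratic, linear, rtol=1e-4, atol=1e-4)
```

Because every step is a dense matmul, both modes map directly onto standard GPU/TPU kernels without custom low-level code, which is what the abstract refers to by hardware efficiency and portability.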