Source
ICLR / Workshop
DATE OF PUBLICATION
03/14/2024
Authors
Mikhail Burtsev, Yuri Kuratov, Olga Kardymon, Veniamin Fishman, Aleksei Shmelev

Recurrent memory augmentation of GENA-LM improves performance on long DNA sequence tasks

Abstract

DNA language models based on the transformer architecture represent a significant advance in computational genomics. However, these models face a critical limitation in handling input lengths comparable to those of individual vertebrate genes (ranging from 10⁴ to 10⁵ nucleotides) and complete genomes (typically around 10⁹ nucleotides). Currently, the publicly available transformer-based DNA language model with the longest sequence input, GENA-LM, is constrained to a maximum input length of only 3⋅10⁴ nucleotides. In this study, we investigate the efficacy of the Recurrent Memory Transformer (RMT) in enhancing GENA-LM on multiple genomic analysis tasks that require processing long DNA sequence inputs. Our results demonstrate that augmenting GENA-LM with RMT substantially improves performance, particularly on tasks such as species classification and prediction of epigenetic features. This underscores the significance of the recurrent memory approach in advancing computational genomics and its potential for addressing critical challenges associated with processing long sequence inputs.
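The core idea behind the RMT augmentation described above is to split a long input into fixed-size segments that fit the base model's context window and carry a small set of memory states from one segment to the next. The sketch below is a minimal, hypothetical illustration of this segment-recurrent loop (not the authors' code); `step` stands in for one forward pass of a memory-augmented transformer, and the names `process_long_sequence` and `toy_step` are assumptions for the example.

```python
# Hypothetical sketch of segment-level recurrence as used in a
# Recurrent Memory Transformer: a long token sequence is processed in
# fixed-size segments, and a compact memory state is passed between them.

def process_long_sequence(tokens, segment_len, memory, step):
    """Split `tokens` into segments of length `segment_len`; for each
    segment, call `step(segment, memory)`, which returns the segment's
    outputs and an updated memory carried into the next segment."""
    outputs = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        segment_out, memory = step(segment, memory)
        outputs.extend(segment_out)
    return outputs, memory

def toy_step(segment, memory):
    """Toy stand-in for a transformer pass: the 'memory' is just a
    running sum, showing how state flows across segment boundaries."""
    new_memory = memory + sum(segment)
    # Each output token is conditioned on the memory from prior segments.
    return [t + memory for t in segment], new_memory
```

In the real model the memory is a set of learned vectors prepended to each segment's input, but the control flow (segment, process, hand memory forward) is the same.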



