Source
ICLR / Workshop
DATE OF PUBLICATION
03/14/2024
Authors
Mikhail Burtsev, Yuri Kuratov, Olga Kardymon, Veniamin Fishman, Aleksei Shmelev

Recurrent memory augmentation of GENA-LM improves performance on long DNA sequence tasks

Abstract

DNA language models based on the transformer architecture represent a significant advance in computational genomics. However, these models face a critical limitation in handling input lengths comparable to those of individual vertebrate genes (ranging from 10⁴ to 10⁵ nucleotides) and complete genomes (typically around 10⁹ nucleotides). Currently, the publicly available transformer-based DNA language model with the longest sequence input, GENA-LM, is constrained to a maximum input length of only 3⋅10⁴ nucleotides. In this study, we investigate the efficacy of the Recurrent Memory Transformer (RMT) in enhancing GENA-LM on multiple genomic analysis tasks that require processing long DNA sequence inputs. Our results demonstrate that augmenting GENA-LM with RMT substantially improves performance, particularly on tasks such as species classification and prediction of epigenetic features. This underscores the significance of the recurrent memory approach in advancing computational genomics and its potential for addressing critical challenges associated with processing long sequence inputs.
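The core idea behind the RMT augmentation described above is to split a long input into fixed-size segments that fit the base model's context window and carry a small set of memory states from one segment to the next. The sketch below is a minimal, hypothetical illustration of this segment-recurrent loop (not the authors' code); `step` stands in for one forward pass of a memory-augmented transformer, and the names `process_long_sequence` and `toy_step` are assumptions for the example.

```python
# Hypothetical sketch of segment-level recurrence as used in a
# Recurrent Memory Transformer: a long token sequence is processed in
# fixed-size segments, and a compact memory state is passed between them.

def process_long_sequence(tokens, segment_len, memory, step):
    """Split `tokens` into segments of length `segment_len`; for each
    segment, call `step(segment, memory)`, which returns the segment's
    outputs and an updated memory carried into the next segment."""
    outputs = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        segment_out, memory = step(segment, memory)
        outputs.extend(segment_out)
    return outputs, memory

def toy_step(segment, memory):
    """Toy stand-in for a transformer pass: the 'memory' is just a
    running sum, showing how state flows across segment boundaries."""
    new_memory = memory + sum(segment)
    # Each output token is conditioned on the memory from prior segments.
    return [t + memory for t in segment], new_memory
```

In the real model the memory is a set of learned vectors prepended to each segment's input, but the control flow (segment, process, hand memory forward) is the same.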



