Source
ICLR / MLGenX
Date of publication
04/24/2025
Authors
Aleksei Shmelev, Artem Shadskiy, Yuri Kuratov, Mikhail Burtsev, Olga Kardymon, Veniamin Fishman

GENATATOR: de novo Gene Annotation With DNA Language Model

Abstract

Inference of gene structure and location based on genome sequences, also known as de novo gene annotation, is a critical first step in biological research. However, the rules by which gene structure is encoded in the DNA sequence are complex and poorly understood, often necessitating the use of costly transcriptomic data to achieve accurate gene annotation. Here, we present GENATATOR (Genome Annotator Using the GENA DNA Language Model), an advanced machine learning tool for inferring gene annotations directly from DNA sequences. Unlike previous approaches that rely on explicitly defined gene segmentation rules derived from protein-coding sequences, GENATATOR learns how to infer gene structure directly from the data. This enables GENATATOR to correctly segment a previously untraceable class of non-coding transcripts and to identify a subset of protein-coding genes missed by other models, achieving top performance in the gene segmentation benchmarks. Finally, with in-depth analysis of GENATATOR’s model embeddings and predictions, we reveal how DNA language models utilize memory to learn the biological rules underlying gene encoding.
