Memory transformer with hierarchical attention for long document processing
Transformers have attracted considerable research interest. To date, they have achieved state-of-the-art results across a wide range of natural language processing tasks, including sequence modeling tasks such as language understanding, text summarization, and translation, and new transformer variants continue to appear. Still, transformers have limitations in tasks that require processing long documents. This paper introduces a new transformer variant: a sentence-level transformer with global memory pooling and hierarchical attention for handling long text. We replace the self-attention of the vanilla transformer with multi-head attention between memory and a sequence, and add a decoder sequence selector on top of the encoder output. In our architecture, sentences are encoded in parallel and then summarized with soft attention at every decoding step. The proposed model was validated on a machine translation task. We hypothesize that attaching memory slots to each sequence improves translation quality, and that tuning the model on a context-aware data set using pre-trained sequence-level weights yields more precise translations and facilitates the translation of long documents. Results show that extending each sentence with a memory slot and applying attention over the encoder outputs improves translation results.
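As a rough illustration of the idea of summarizing per-sentence encodings with soft attention at a decoding step, the following minimal Python sketch weights one memory vector per sentence by its similarity to the current decoder state. It is not the paper's implementation: the function names (`softmax`, `select_context`), the scaled dot-product scoring, the single-head form, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_context(decoder_state, sentence_memories):
    """Soft attention over one memory vector per sentence (hypothetical helper).

    Returns the attention weights and the weighted sum of the memory vectors,
    which stands in for the document summary used at one decoding step.
    """
    dim = len(decoder_state)
    # Assumed scoring rule: scaled dot product between decoder state and memory.
    scores = [dot(decoder_state, m) / math.sqrt(dim) for m in sentence_memories]
    weights = softmax(scores)
    context = [
        sum(w * m[i] for w, m in zip(weights, sentence_memories))
        for i in range(dim)
    ]
    return weights, context


# Toy data: three sentences, each represented by a 2-dimensional memory slot.
sentence_memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = select_context(decoder_state, sentence_memories)
```

In this toy setup the first and third sentences score equally against the decoder state and receive the same weight, while the second sentence receives less. Because each sentence contributes a single vector, the cost of this step grows with the number of sentences rather than the number of tokens, which is one plausible reading of why such a summary suits long documents.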