ImproveYourVideos: Architectural Improvements for Text-to-Video Generation Pipeline
Abstract
In recent years, text-to-video models have made impressive strides toward producing high-quality results. However, the architectural aspects of these models remain largely unexplored: prior studies reuse well-known designs that vary little between models. This study addresses that gap by systematically exploring alternatives for the core building blocks of a text-to-video architecture, focusing on the temporal layer, the frame interpolation model, and the autoencoder. First, we propose and compare several alternative temporal consistency modules, referred to as temporal blocks, which improve both generation quality metrics and human evaluation scores. Second, we introduce a novel frame interpolation architecture that runs more than three times faster at inference than the standard masked frame interpolation approach. Finally, we evaluate different approaches for extending the MoVQ image autoencoder to the video domain. Experimental results show that the model with the proposed modifications strongly outperforms the model with classical blocks, both quantitatively and qualitatively. Our final model is available at https://github.com/ai-forever/KandinskyVideo .
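The abstract does not specify the internals of the proposed temporal blocks. As an illustration only, here is a minimal PyTorch sketch of one common design for such a module: per-pixel self-attention across the frame axis with a residual connection, so that a pretrained spatial network is left intact and only temporal mixing is added. The class name, layer layout, and hyperparameters are assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Hypothetical temporal block: self-attention along the frame axis,
    applied independently at every spatial location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention mixes only time.
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y = self.norm(y)
        y, _ = self.attn(y, y, y)
        y = y.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        # Residual connection preserves the spatial model's output when the
        # attention contribution is small.
        return x + y
```

In a text-to-image backbone extended to video, blocks like this are typically interleaved after existing spatial layers; the paper compares several such variants, but their exact form is given in the full text, not here.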