Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion.
Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated notable quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to CLIP image embeddings. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
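For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Fréchet distance between two sets of image features is commonly computed. It is not the evaluation code used for the 8.03 result. In practice, FID is computed on features from a pretrained Inception network; here random arrays stand in for those features, and the function names `frechet_distance` and `feature_statistics` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between the Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2. For positive semi-definite inputs these
    # eigenvalues are real and non-negative up to numerical noise, so we drop
    # the imaginary part and clip small negatives to zero.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


def feature_statistics(features):
    """Mean vector and covariance matrix of an (n_samples, dim) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)


# Assumption: synthetic features replace real Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
fake = rng.normal(0.5, 1.0, size=(2000, 8))

fid_same = frechet_distance(*feature_statistics(real), *feature_statistics(real))
fid_shift = frechet_distance(*feature_statistics(real), *feature_statistics(fake))
print(f"identical sets: {fid_same:.6f}, shifted sets: {fid_shift:.3f}")
```

A lower value means the two feature distributions are closer, which is why a lower FID is read as better generation quality. The distance is zero for identical statistics and grows as the means or covariances diverge.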