VQGAN: Taming Transformers for High-Resolution Image Synthesis [Paper Explained]

The authors introduce VQGAN, which combines the efficiency of convolutional approaches with the expressivity of transformers. VQGAN is essentially a convolutional autoencoder trained with an adversarial (GAN) loss: it learns a codebook of context-rich visual parts and uses it to quantize the bottleneck representation on every forward pass. A transformer (self-attention) model is then trained as a prior distribution over the codewords. Sampling from this prior produces plausible constellations of codewords, which are fed through the decoder to generate realistic images at arbitrary resolution.

Paper:
Code (with pretrained models):
Colab notebook:
Colab notebook to compare the first stage models in VQGAN and in DALL-E:
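To make the quantization step concrete, here is a minimal PyTorch sketch of how a learned codebook can quantize the encoder's bottleneck output via nearest-neighbor lookup with a straight-through gradient. The names (`Codebook`, `num_codes`, `code_dim`) are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class Codebook(nn.Module):
    """Sketch of codebook quantization, assuming a VQ-VAE-style setup."""

    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        # Learnable embedding table: each row is one codeword ("visual part").
        self.embedding = nn.Embedding(num_codes, code_dim)
        self.embedding.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: encoder bottleneck of shape (batch, code_dim, height, width)
        b, c, h, w = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, c)  # (b*h*w, code_dim)

        # Squared Euclidean distance from each spatial vector to every codeword.
        d = (z_flat.pow(2).sum(1, keepdim=True)
             - 2 * z_flat @ self.embedding.weight.t()
             + self.embedding.weight.pow(2).sum(1))

        indices = d.argmin(dim=1)  # nearest codeword index per spatial position
        z_q = self.embedding(indices).view(b, h, w, c).permute(0, 3, 1, 2)

        # Straight-through estimator: gradients flow from z_q back to the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(b, h, w)
```

The grid of indices returned here is what the transformer later models autoregressively: sampling a new grid of indices and decoding the corresponding codewords yields a new image.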
Abstract: