190 — StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
Generating high quality, high-detail images from scratch is tricky. Generating specific detailed images based on user input is even harder. And generating specific, detailed images from freeform text is an entirely different kind of story altogether.
StackGAN is able to convert text like “This bird is red and brown in color, with a stubby beak” into a detailed 256×256 image representation.
GANs are a natural candidate for this type of complex synthesis, and the authors use two. The first GAN (Stage-I) is responsible for “sketching” the output, incorporating the right colors and shapes as per the user’s request. The second GAN (Stage-II) refines the output of Stage-I to sharpen details and improve shape continuity.
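To make the data flow concrete, here is a minimal sketch of how the two stages chain together. It is not the authors' implementation: the two "generators" are simple placeholders standing in for learned networks, the discriminators are omitted entirely, and the function names and the 100-dimensional noise vector are hypothetical. The 64×64 intermediate size follows the paper; the point is only that Stage-II receives both the coarse Stage-I image and the text embedding again.

```python
import random

LOW_RES = 64     # Stage-I output size, as reported in the StackGAN paper
HIGH_RES = 256   # final Stage-II output size
NOISE_DIM = 100  # hypothetical choice, for illustration only


def stage1_generator(text_embedding, noise):
    """Placeholder for Stage-I: text + noise -> coarse 64x64 'sketch'."""
    # A real generator is a learned upsampling network. Here we fill a grid
    # with values derived from the inputs so the data flow stays visible.
    base = sum(text_embedding) / len(text_embedding)
    return [
        [base + noise[(r * LOW_RES + c) % len(noise)] for c in range(LOW_RES)]
        for r in range(LOW_RES)
    ]


def stage2_generator(low_res_image, text_embedding):
    """Placeholder for Stage-II: coarse image + the same text -> 256x256 image."""
    scale = HIGH_RES // LOW_RES
    # Stand-in for "re-reading" the text: a small text-dependent adjustment
    # added on top of a nearest-neighbour upsample of the Stage-I result.
    detail = max(text_embedding) - min(text_embedding)
    return [
        [low_res_image[r // scale][c // scale] + 0.01 * detail
         for c in range(HIGH_RES)]
        for r in range(HIGH_RES)
    ]


def stackgan_forward(text_embedding, seed=0):
    """Run both stages in sequence and return (sketch, refined)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    sketch = stage1_generator(text_embedding, noise)
    refined = stage2_generator(sketch, text_embedding)
    return sketch, refined
```

Calling `stackgan_forward([0.1, 0.5, -0.2])` returns a 64×64 grid and a 256×256 grid. In the real system each stage would be trained adversarially against its own discriminator.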
One might wonder why the role of the Stage-II GAN couldn’t simply be accomplished by adding more “upscaling” layers to the Stage-I GAN. In comparison with existing systems, the authors demonstrate that this stacked architecture of two GANs, each with a simple job, produces very realistic results, whereas a “deeper” single GAN generally results in unstable, nonsensical outputs where shapes fail to fully resolve.
The researchers demonstrate the StackGAN architecture on the CUB dataset (a ~12K image dataset of 200 bird species), where it dramatically outperforms other models (such as GAN-INT-CLS or GAWWN) in terms of image realism. (In fact, I first thought the bottom row of Figure 3 was a ground truth example, not the output of the network. So I guess this discriminator was fooled, at least.)