In this paper, we propose an end-to-end model for generating encoding vectors for sentences describing an image. We feed the encoded vectors from our network into a Stacked Generative Adversarial Network (StackGAN) to generate high-resolution images. We present different model architectures with word-level and character-level representations and provide an extensive qualitative and quantitative evaluation of the generated images.
The full project report can be found here.
In this paper, we propose an end-to-end hybrid CNN-RNN model to learn the relation between an image and its text description. The learned model is then used to train a StackGAN conditioned on an input caption, whose encoding is generated by our trained model.
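To make this concrete, the sketch below shows one plausible form of such a hybrid CNN-RNN caption encoder. It is a minimal illustration only: the layer sizes, kernel width, and the class name `HybridTextEncoder` are our own assumptions, not the exact architecture from the report.

```python
import torch
import torch.nn as nn

class HybridTextEncoder(nn.Module):
    """Sketch of a hybrid CNN-RNN caption encoder: 1-D convolutions
    extract local n-gram features from token embeddings, and a GRU
    aggregates them into a fixed-size sentence encoding."""

    def __init__(self, vocab_size, embed_dim=128, conv_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Convolution over the time axis captures short phrases.
        self.conv = nn.Conv1d(embed_dim, conv_dim, kernel_size=3, padding=1)
        self.rnn = nn.GRU(conv_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word- or character-level indices
        x = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)           # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))    # (batch, conv_dim, seq_len)
        x = x.transpose(1, 2)           # (batch, seq_len, conv_dim)
        _, h = self.rnn(x)              # h: (1, batch, hidden_dim)
        return h.squeeze(0)             # sentence encoding

# Usage: the returned vector plays the role of the caption embedding
# that conditions the StackGAN generator.
encoder = HybridTextEncoder(vocab_size=5000)
captions = torch.randint(0, 5000, (4, 16))  # batch of 4 dummy captions
embedding = encoder(captions)               # (4, 256)
```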
From the results, we conclude that, given the small number of images generated, the quantitative scores cannot be completely trusted. However, human evaluation suggests that images generated by GANs using our trained text encodings are on par with those from the pre-trained StackGAN model. Another noticeable limitation is that our text representations are currently trained on captions for a single class (dog or bird) only. As next steps for this project, we would like to train our embeddings on the entire COCO dataset, evaluate the generated images at the recommended sample size using both the Inception score and our own hybrid model, and compare the results against current state-of-the-art models.
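For context on the metric mentioned above: the Inception score is exp(E_x[KL(p(y|x) || p(y))]), computed from the class posteriors of a pre-trained Inception network, and it is typically reported over tens of thousands of samples, which is why a small generated set makes it unreliable. Below is a minimal sketch, assuming `probs` holds the Inception softmax outputs for the generated images; the random `probs` in the example is a stand-in for real classifier outputs.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (num_images, num_classes) softmax outputs p(y|x) from a
    pre-trained Inception network. Returns exp(E_x KL(p(y|x) || p(y)))."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # per-image KL terms
    return float(np.exp(kl.sum(axis=1).mean()))              # higher is better

# Example with random probabilities (real use: Inception-v3 outputs).
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=100)
print(inception_score(probs))
```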
Another interesting conclusion we draw from our results is that hybrid CNN-RNN word-level encodings with attention generate the best images for both StackGAN layer 1 and layer 2. Layer-1 images, as expected, provide a structural outline of the image for the caption, and layer-2 images fill in the final details, refining the skeleton drawn by the layer-1 generator.
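As an illustration of the attention component in the word-level variant, the sketch below pools per-word RNN states into a single caption vector. The scoring layer and dimensions are illustrative assumptions, not the report's exact design.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention over per-word RNN states: scores each word, then
    returns the softmax-weighted sum as the sentence encoding."""

    def __init__(self, hidden_dim=256):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, word_states):
        # word_states: (batch, seq_len, hidden_dim) per-word encodings
        weights = torch.softmax(self.score(word_states), dim=1)  # (batch, seq_len, 1)
        return (weights * word_states).sum(dim=1)                # (batch, hidden_dim)

# Usage: pool word-level hidden states into one caption vector.
pool = AttentionPool()
states = torch.randn(4, 16, 256)
sentence_vec = pool(states)  # (4, 256)
```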
Generative Adversarial Networks have revolutionized the world of image generation, and we would like to continue our efforts to bridge the gap between natural language and computer vision, improving image synthesis from captions and producing better models and results.