Extracting Semantic and Geometric Information in Images and Videos using GANs

Overview

Abstract

The success of Generative Adversarial Networks (GANs) has resulted in unprecedented quality for both image generation and manipulation. Recent state-of-the-art GANs (e.g., the StyleGAN series) have demonstrated outstanding results in photo-realistic image generation. In this dissertation, we explore the properties of the latent space of StyleGAN and its derivative architectures, including image manipulation, extraction of 3D properties, and various weakly supervised and unsupervised downstream tasks. First, we study the projection of images into StyleGAN's latent space and analyze the properties of images embedded in a proposed extended W+ latent space. Second, we demonstrate rich semantic interpretations of images in the latent space, which in turn builds a compelling semantic understanding of the latent space itself. Specifically, we combine W+ space optimization with noise space optimization and tensor manipulations to enable high-quality reconstruction and local editing of images. For example, we can perform image inpainting, where these regularized latent spaces reconstruct the image's content and the GAN prior fills in the details of the missing regions. Next, we study whether a 2D image-based GAN learns a meaningful semantic model and the 3D properties of an image. Using our analysis, we can extract a plausible interpretation of the 3D geometry, lighting, materials, and other semantic attributes of a source image by modeling the latent space with conditional continuous normalizing flows. As a result, we can perform non-linear sequential edits on the source image without degrading its quality or identity. Furthermore, we propose an unsupervised technique to extract underlying latent space properties, generalizing our analysis to unseen datasets where human knowledge is limited. Specifically, we use an information-rich visual-linguistic model, CLIP, trained on internet-scale image-text pairs. The proposed framework extracts, labels, and projects important directions into the GAN latent space without human supervision.
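
To make the embedding step above concrete, the following is a minimal sketch of optimization-based projection into an extended W+ space (one latent code per synthesis layer). The generator interface (a precomputed `mean_latent` attribute and a `synthesis(w_plus)` call), the layer count, and the loss weighting are illustrative assumptions, not the exact setup used in the dissertation.

```python
import torch
import lpips  # perceptual (LPIPS) loss; pip install lpips

def project_to_w_plus(generator, target, num_layers=18, steps=1000, lr=0.01, device="cuda"):
    """Optimize a per-layer (W+) latent code so the generator reproduces `target`.

    `target` is assumed to be a [1, 3, H, W] image tensor scaled to [-1, 1],
    matching the generator's output range.
    """
    percept = lpips.LPIPS(net="vgg").to(device)

    # Start from the (assumed precomputed, detached) average latent of shape [1, 512],
    # replicated once per synthesis layer to form the extended W+ code.
    w_avg = generator.mean_latent
    w_plus = w_avg.clone().repeat(1, num_layers, 1).requires_grad_(True)

    opt = torch.optim.Adam([w_plus], lr=lr)
    for _ in range(steps):
        synth = generator.synthesis(w_plus)  # assumed call: W+ code -> image
        loss = percept(synth, target).mean() + torch.nn.functional.mse_loss(synth, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_plus.detach()
```

In the same spirit, noise space optimization would add the generator's per-layer noise maps to the list of optimized variables, and inpainting would restrict the reconstruction loss to the known pixels so that the GAN prior fills in the masked region.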

Finally, inspired by the findings of our analysis, we investigate additional, previously unexplored questions: Can we perform foreground object segmentation? Can an image-based GAN be used to edit videos? Can we generate view-consistent, editable 3D animations? Investigating these research questions helps us use GANs to tackle a spectrum of tasks beyond the usual image generation task. Specifically, we propose a technique to segment foreground objects in generated images using the information stored in the StyleGAN feature maps; this framework can produce synthetic datasets for training existing supervised segmentation networks. Then, we study the regularized W+, activation S, and Fourier feature Ff spaces to embed and edit videos in StyleGAN3, an image-based variant of StyleGAN, and generate high-quality videos at 1024x1024 resolution from a single source image and a driving video. Finally, we propose a framework for domain adaptation in 3D-GANs that links the latent spaces of different models together. We build upon EG3D, a 3D-GAN derived from StyleGAN, to enable the generation, editing, and animation of personalized 3D avatars. Technically, we first propose a method to align the camera distributions of the two domains, i.e., faces and avatars. We then propose a method for domain adaptation in 3D-GANs using texture, geometric, and depth regularization, with an option to model more exaggerated geometries, and lastly a method to link and project real faces into the 3D artistic domain. These frameworks allow us to distill tools from an unconditional GAN for unsupervised image segmentation, video editing, and personalized 3D animation generation and manipulation with state-of-the-art performance, without requiring extra annotated object segmentation, video, or 3D data.
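
As an illustration of reusing generator features for segmentation, the sketch below clusters pixel-wise StyleGAN feature vectors into foreground and background. The k-means clustering and the image-border heuristic are simplifications chosen purely for illustration, and the assumed feature-map shapes are hypothetical; they are not the dissertation's actual trained framework.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

def foreground_mask_from_features(feature_maps, out_size=256, n_clusters=2):
    """feature_maps: list of tensors [1, C_i, H_i, W_i] hooked from several generator layers."""
    # Upsample all feature maps to a common resolution and stack them channel-wise.
    upsampled = [
        torch.nn.functional.interpolate(
            f, size=(out_size, out_size), mode="bilinear", align_corners=False
        )
        for f in feature_maps
    ]
    feats = torch.cat(upsampled, dim=1)[0]                       # [C_total, H, W]
    pixels = feats.permute(1, 2, 0).reshape(-1, feats.shape[0])  # [H*W, C_total]

    # Cluster per-pixel feature vectors into two groups (foreground vs. background).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        pixels.detach().cpu().numpy()
    )
    mask = labels.reshape(out_size, out_size)

    # Heuristic: the cluster dominating the image border is treated as background.
    border = np.concatenate([mask[0, :], mask[-1, :], mask[:, 0], mask[:, -1]])
    background_label = np.bincount(border).argmax()
    return (mask != background_label).astype(np.uint8)
```

Masks mined this way from generated images can then be paired with those images to form a synthetic dataset for training a standard supervised segmentation network.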

Brief Biography

Rameen Abdal is currently a Ph.D. candidate in Computer Science at the Visual Computing Center (VCC), KAUST, advised by Professor Peter Wonka. He received his Master's degree in Computer Science from KAUST in 2020 and his Bachelor's degree in Electronics and Communication Engineering from the National Institute of Technology, Srinagar, India, in 2018. During his Ph.D. studies at KAUST, he collaborated with Adobe Research and worked as a research intern at Snap Research. His research interests include representation learning, generative modeling, image and video editing, 3D content generation and editing, and animation using deep learning algorithms. His research has been published in top-tier conferences and journals such as CVPR, ICCV, SIGGRAPH, SIGGRAPH Asia, TOG, ECCV, and ICLR.
