
Generative Adversarial Networks (GANs)

  1. Auto-Regressive Models VS Variational Auto-Encoders VS GANs:
    Auto-Regressive Models define a tractable (discrete) density function and then optimize the likelihood of the training data:

    $$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \ldots, x_{i-1})$$
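
    For example, with \(n = 3\) this chain-rule factorization reads

    $$p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1)\, p_\theta(x_3 \mid x_1, x_2)$$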

    VAEs, in contrast, define an intractable (continuous) density function with a latent variable \(z\):

    $$p_\theta(x) = \int p_\theta(z) p_\theta(x|z) dz$$

    This integral cannot be optimized directly; instead, a lower bound on the likelihood is derived and optimized.
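
    That lower bound is the variational lower bound (ELBO), where \(q_\phi(z \mid x)\) denotes the approximate posterior (the encoder):

    $$\log p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x)\,\|\,p_\theta(z)\right)$$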

    GANs, on the other hand, reject explicitly defining a probability density function, in favor of only being able to sample from the distribution.

  2. Generative Adversarial Networks:
    GANs are a class of AI algorithms used in unsupervised machine learning, implemented by a system of two neural networks contesting with each other in a zero-sum game framework.

  3. Motivation:
    • Problem: we want to sample from a complex, high-dimensional training distribution, and there is no direct way of doing this.
    • Solution: sample from a simple distribution (e.g. random noise) and learn a transformation, parameterized by a neural network, that maps it to the training distribution (see the sketch at the end of this subsection).

    img

    • Generative VS Discriminative: discriminative models have had much more success than deep generative models, which suffer from the difficulty of approximating the many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and from the difficulty of leveraging the benefits of piecewise linear units in the generative context.
      GANs propose a new framework for estimating generative models that sidesteps these difficulties.
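
    As a minimal sketch (not from the notes) of the "sample noise, then transform" idea; PyTorch, the layer sizes, and the flattened 28x28 image shape are all assumptions made for illustration:

    ```python
    import torch
    import torch.nn as nn

    z_dim, img_dim = 100, 28 * 28   # latent size and flattened image size (illustrative)

    # Neural network that transforms simple noise into samples resembling the data distribution.
    generator = nn.Sequential(
        nn.Linear(z_dim, 256), nn.ReLU(),
        nn.Linear(256, img_dim), nn.Tanh(),   # outputs in [-1, 1], matching normalized images
    )

    z = torch.randn(16, z_dim)      # sample from a simple distribution (standard Gaussian)
    fake_images = generator(z)      # shape (16, 784): the transformation's samples
    ```

    The adversarial training described below is what pushes this learned transformation toward the true data distribution.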

  4. Structure:
    • Goal:
      estimating generative models that capture the training data distribution
    • Framework:
      an adversarial process in which two models are trained simultaneously: a generative model \(G\) that captures the data distribution, and a discriminative model \(D\) that estimates the probability that a sample came from the training data rather than from \(G\).
    • Training:
      \(G\) maximizes the probability of \(D\) making a mistake
  5. Training:
    Generator network: try to fool the discriminator by generating real-looking images
    Discriminator network: try to distinguish between real and fake images
    img

    • Train jointly in minimax game.
      • Minimax objective function:

        $$\min _{\theta_{g}} \max _{\theta_{d}}\left[\mathbb{E}_{x \sim p_{\text {data }}} \log D_{\theta_{d}}(x)+\mathbb{E}_{z \sim p(z)} \log \left(1-D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)\right]$$

        - Discriminator outputs likelihood in \((0,1)\) of real image
        - \(D_{\theta_{d}}(x)\): Discriminator output for real data \(\boldsymbol{x}\)
        - \(D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\): Discriminator output for generated fake data \(G(z)\)
        - Discriminator \((\theta_{d})\) wants to maximize the objective such that \(D(x)\) is close to \(1\) (real) and \(D(G(z))\) is close to \(0\) (fake)
        - Generator \((\theta_{g})\) wants to minimize the objective such that \(D(G(z))\) is close to \(1\) (the discriminator is fooled into thinking the generated \(G(z)\) is real)

    • Alternate between*:
      1. Gradient Ascent on Discriminator:

        $$\max _{\theta_{d}}\left[\mathbb{E}_{x \sim p_{\text {data}}} \log D_{\theta_{d}}(x)+\mathbb{E}_{z \sim p(z)} \log \left(1-D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)\right]$$

      2. Gradient Ascent on Generator (different objective):

        $$\max _{\theta_{g}} \mathbb{E}_{z \sim p(z)} \log \left(D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)$$

    GAN Training Algorithm:

    - Number of discriminator training steps \(k\) per generator step: some find \(k=1\) more stable, others use \(k>1\); there is no best rule.

    • Recent work (e.g. Wasserstein GAN) alleviates this problem and gives better training stability.

    img
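
    Below is a minimal sketch of this alternating training loop; PyTorch is an assumption (the notes do not prescribe a framework), and the tiny MLPs for \(G\) and \(D\) are placeholders so the snippet is self-contained:

    ```python
    import torch
    import torch.nn as nn

    z_dim = 100

    # Placeholder networks on flattened 28x28 images (illustrative sizes only).
    G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def train_step(real_batch, k=1):
        n = real_batch.size(0)
        ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

        # 1. Gradient ascent on the discriminator:
        #    maximize log D(x) + log(1 - D(G(z)))  (written here as minimizing the equivalent BCE).
        for _ in range(k):
            fake = G(torch.randn(n, z_dim)).detach()   # detach: do not update G in this step
            d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()

        # 2. Gradient ascent on the generator (non-saturating objective):
        #    maximize log D(G(z)), i.e. BCE of D(G(z)) against the "real" label.
        g_loss = bce(D(G(torch.randn(n, z_dim))), ones)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()
    ```

    Calling `train_step` once per minibatch of real images realizes the alternation; \(k\) is the number of discriminator steps taken per generator step.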

    Notes:

    • * Instead of minimizing the likelihood of the discriminator being correct, we now maximize the likelihood of the discriminator being wrong. The objective of fooling the discriminator is the same, but there is now a higher gradient signal for bad samples => works much better! This is standard in practice.
      • Originally, we would do gradient descent on the generator:

        $$\min _{\theta_{g}} \mathbb{E}_{z \sim p(z)} \log \left(1-D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)$$

        In practice, optimizing this generator objective does not work well.
        img

      • Now we are doing gradient ascent on the generator:

        $$\max _{\theta_{g}} \mathbb{E}_{z \sim p(z)} \log \left(D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)$$

        img
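
        One way to see the difference: writing \(s = D_{\theta_d}(G_{\theta_g}(z))\),

        $$\frac{\partial}{\partial s} \log(1 - s) = -\frac{1}{1 - s}, \qquad \frac{\partial}{\partial s} \log(s) = \frac{1}{s}$$

        Early in training the generator is poor, so \(s \approx 0\): the original objective's gradient magnitude is at its smallest (about \(1\)) exactly where the generator is worst, while the non-saturating objective's gradient is very large there, giving a strong learning signal for bad samples.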

    • Jointly training two networks is challenging and can be unstable. Choosing objectives with better loss landscapes helps training; this is an active area of research.
    • The representations have nice structure:
      • Average \(\boldsymbol{z}\) vectors, do arithmetic:
        img
      • Interpolating between random points in latent space is possible
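
      A minimal sketch of both operations (PyTorch assumed; `G` stands for a trained generator and the variable names in the arithmetic comment are hypothetical):

      ```python
      import torch

      def interpolate(G, z1, z2, steps=8):
          """Generate samples along the straight line between two latent vectors."""
          alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)   # (steps, 1)
          z = (1 - alphas) * z1 + alphas * z2                    # (steps, z_dim) interpolated codes
          with torch.no_grad():
              return G(z)

      # Latent arithmetic on averaged codes (hypothetical names, mirroring the figure above):
      # z_new = z_group_a.mean(0) - z_group_b.mean(0) + z_group_c.mean(0)
      # sample = G(z_new.unsqueeze(0))
      ```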

  6. Generative Adversarial Nets: Convolutional Architectures:
    - Discriminator is a standard convolutional network.
    - Generator is an upsampling network with fractionally-strided convolutions.
    img

    Architecture guidelines for stable Deep Convolutional GANs:

    • Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
    • Use batchnorm in both the generator and the discriminator.
    • Remove fully connected hidden layers for deeper architectures.
    • Use ReLU activation in generator for all layers except for the output, which uses Tanh.
    • Use LeakyReLU activation in the discriminator for all layers.
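
    As an illustration of these guidelines, here is a minimal DCGAN-style generator sketch; PyTorch and the specific layer sizes (100-d \(z\) to a 64x64 RGB image) are assumptions, not part of the notes:

    ```python
    import torch
    import torch.nn as nn

    class DCGANGenerator(nn.Module):
        def __init__(self, z_dim=100, channels=3, feat=64):
            super().__init__()
            self.net = nn.Sequential(
                # Fractionally-strided (transposed) convolutions instead of pooling/upsampling.
                nn.ConvTranspose2d(z_dim, feat * 8, 4, 1, 0, bias=False),     # 1x1  -> 4x4
                nn.BatchNorm2d(feat * 8), nn.ReLU(True),
                nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),  # 4x4  -> 8x8
                nn.BatchNorm2d(feat * 4), nn.ReLU(True),
                nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),  # 8x8  -> 16x16
                nn.BatchNorm2d(feat * 2), nn.ReLU(True),
                nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),      # 16x16 -> 32x32
                nn.BatchNorm2d(feat), nn.ReLU(True),
                nn.ConvTranspose2d(feat, channels, 4, 2, 1, bias=False),      # 32x32 -> 64x64
                nn.Tanh(),                                                    # output uses Tanh
            )

        def forward(self, z):
            # z has shape (batch, z_dim); reshape to (batch, z_dim, 1, 1) for the conv stack.
            return self.net(z.view(z.size(0), -1, 1, 1))

    # Example: DCGANGenerator()(torch.randn(8, 100)) has shape (8, 3, 64, 64).
    ```

    Each transposed convolution doubles the spatial resolution (after the initial 1x1 -> 4x4 projection), batchnorm is used throughout, ReLU is used everywhere except the Tanh output, and there are no fully connected hidden layers.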

  7. Pros, Cons and Research:
    • Pros:
      • Beautiful, state-of-the-art samples!
    • Cons:
      • Trickier / more unstable to train
      • Can’t solve inference queries such as \(p(x), p(z\vert x)\)
    • Active areas of research:
      • Better loss functions, more stable training (Wasserstein GAN, LSGAN, many others)
      • Conditional GANs, GANs for all kinds of applications
