Generative Models

Unlike discriminative models, generative models aim to learn the underlying distribution of each class. The objective is to model the joint distribution \( p(x, y) \) — for example, estimating the probability of generating a given image together with its label \( y \in \{+1, -1\} \).

Figure: Discriminative vs Generative Models.

Implicit vs Explicit Generative Models

Explicit models define a tractable (or approximable) likelihood \( p(x) \), whereas implicit models only define a sampling procedure. In practice, explicit models often produce lower-fidelity samples and can underperform implicit models on perceptual quality.

Variational Autoencoder (VAE)

A Variational Autoencoder is a type of generative model that learns a latent representation of data and can be used to generate similar data samples. While VAEs were originally used for images, changing the decoder enables them to generate text or other data types.

Main Motivation

Given a training dataset of faces, a neural network can be trained to model the distribution of the latent (hidden) semantic features that describe those faces. Once trained, we can sample new latent features and decode them to generate similar, realistic faces.

For the dataset \( X \), the goal is to maximize the likelihood of the data under the model: \( p(X) \). By marginalizing over latent variables \( z \), we write:

\[ p(X) = \prod_{i=1}^m p(x^{(i)}) = \prod_{i=1}^m \int p(x^{(i)}, z) dz \]

For a single data point:

\[ p(x) = \int p(x, z) dz \]

Computing this integral exactly is intractable. Thus, we use sampling-based techniques that focus on the regions of \( z \) that significantly contribute to the probability mass of \( p(x, z) \).

Sampling Techniques

1. Monte Carlo Sampling

Most values of \( z \) contribute negligibly to \( p(x, z) \). A naive Monte Carlo estimate draws samples from the prior and averages the likelihood:

\[ p(x) = \mathbb{E}_{z \sim p(z)}[p(x|z)] \approx \frac{1}{k} \sum_{i=1}^{k} p(x|z^{(i)}), \quad z^{(i)} \sim p(z) \]

Because most sampled \( z \) fall where \( p(x|z) \) is nearly zero, this estimator has very high variance.

2. Importance Sampling

Rather than sampling \( z \) uniformly, we sample from a distribution \( q(z) \) that is closer to the posterior. This improves the efficiency of the estimate:

\[ p(x) = \int \frac{p(x, z)}{q(z)} q(z) \, dz = \mathbb{E}_{z \sim q(z)}\left[\frac{p(x, z)}{q(z)}\right] \]

Approximating this expectation with \( k \) samples:

\[ p(x) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, z^{(i)})}{q(z^{(i)})}, \quad z^{(i)} \sim q(z) \]
Note: In VAEs, the proposal distribution \( q(z) \) is replaced by the encoder \( q_\phi(z|x) \).
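The contrast between the two estimators can be checked numerically. Below is a minimal sketch on an assumed toy model — \( p(z) = \mathcal{N}(0, 1) \), \( p(x|z) = \mathcal{N}(z, 0.1^2) \) — chosen because \( p(x) = \mathcal{N}(0, 1 + 0.1^2) \) and the exact posterior are available in closed form, so both estimates can be compared to the truth:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): p(z) = N(0, 1), p(x|z) = N(z, 0.1^2).
# Then p(x) = N(x; 0, 1 + 0.1^2) in closed form, so the estimates can be checked.
x = 1.5
sigma_lik = 0.1

def log_normal(v, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (v - mu)**2 / (2 * sigma**2)

true_px = np.exp(log_normal(x, 0.0, np.sqrt(1 + sigma_lik**2)))

k = 10_000
# Naive Monte Carlo: z ~ p(z); most samples miss the narrow likelihood peak.
z_prior = rng.standard_normal(k)
naive = np.mean(np.exp(log_normal(x, z_prior, sigma_lik)))

# Importance sampling: q(z) is taken as the exact posterior, which for this
# linear-Gaussian model is N(x / (1 + sigma^2), sigma^2 / (1 + sigma^2)).
mu_q = x / (1 + sigma_lik**2)
sigma_q = np.sqrt(sigma_lik**2 / (1 + sigma_lik**2))
z_q = rng.normal(mu_q, sigma_q, size=k)
# Average of p(x, z) / q(z) over samples from q.
w = np.exp(log_normal(z_q, 0.0, 1.0) + log_normal(x, z_q, sigma_lik)
           - log_normal(z_q, mu_q, sigma_q))
importance = np.mean(w)

print(true_px, naive, importance)
```

With \( q(z) \) equal to the true posterior, the ratio \( p(x, z)/q(z) \) is the constant \( p(x) \), so the importance-sampling estimate has essentially zero variance, while the naive estimate fluctuates noticeably even with many samples.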

Likelihood Estimator & ELBO Derivation

The marginal likelihood of a single data point \( x \) (parameterized by \( \theta \)) is:

\[ p(x; \theta) = \int p(x, z; \theta) dz \]

For i.i.d. data \( X = \{x^{(1)}, \dots, x^{(m)}\} \), the total data likelihood is:

\[ p(X; \theta) = \prod_{i=1}^{m} p(x^{(i)}; \theta) \quad \Rightarrow \quad \log p(X; \theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) \]

Step-by-Step Derivation Using Jensen's Inequality

  1. Rewrite the log-likelihood using importance sampling:
    \[ \log p(x; \theta) = \log \left( \int q_\phi(z|x) \cdot \frac{p(x, z)}{q_\phi(z|x)} dz \right) = \log \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \frac{p(x, z)}{q_\phi(z|x)} \right] \]
  2. Apply Jensen’s inequality (since log is concave):
    \[ \log \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \frac{p(x, z)}{q_\phi(z|x)} \right] \geq \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log \left( \frac{p(x, z)}{q_\phi(z|x)} \right) \right] \]
  3. Decompose the right-hand side (ELBO):
    \[ \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] - D_{KL}(q_\phi(z|x) \| p(z)) \]
  4. Condition for equality: the bound becomes an equality when \( q_\phi(z|x) = p(z|x) \), i.e., the approximate posterior equals the true posterior. Substituting \( p(z|x) \) for \( q_\phi(z|x) \) and using \( p(x, z) = p(x) \, p(z|x) \):
    \[ \mathbb{E}_{z \sim p(z|x)} \left[ \log \left( \frac{p(x, z)}{p(z|x)} \right) \right] = \mathbb{E}_{z \sim p(z|x)} \left[ \log p(x) \right] = \log p(x) \]
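The gap and the equality condition can both be verified numerically. The sketch below assumes a toy linear-Gaussian model — \( p(z) = \mathcal{N}(0, 1) \), \( p(x|z) = \mathcal{N}(z, 1) \) — where \( \log p(x) = \log \mathcal{N}(x; 0, 2) \) and the true posterior \( \mathcal{N}(x/2, 1/2) \) are known exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-Gaussian model (an assumption for illustration):
# p(z) = N(0, 1), p(x|z) = N(z, 1). Then log p(x) = log N(x; 0, 2) exactly,
# and the true posterior is N(x/2, 1/2), so the bound can be checked.
x = 0.8

def log_normal(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mu)**2 / (2 * var)

log_px = log_normal(x, 0.0, 2.0)

def elbo(mu_q, var_q, k=200_000):
    # Monte Carlo estimate of E_q[log p(x, z) - log q(z)].
    z = rng.normal(mu_q, np.sqrt(var_q), size=k)
    return np.mean(log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
                   - log_normal(z, mu_q, var_q))

loose = elbo(0.0, 1.0)      # q = prior: a poor posterior approximation
tight = elbo(x / 2, 0.5)    # q = exact posterior: the bound is tight

print(log_px, loose, tight)
```

When \( q \) equals the true posterior, the integrand \( \log(p(x, z)/q(z)) \) is the constant \( \log p(x) \), so `tight` matches `log_px` to floating-point precision, while the mismatched `loose` falls strictly below it by \( D_{KL}(q \| p(z|x)) \).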

Approximate Posterior via Neural Network

To approximate \( p(z|x) \), we define \( q_\phi(z|x) \) as a neural network that outputs a Gaussian distribution with mean and variance conditioned on \( x \):

\[ q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x)I) \]

We typically assume a standard normal prior:

\[ p(z) = \mathcal{N}(0, I) \]

The encoder (inference network) learns \( q_\phi(z|x) \), while the decoder models \( p_\theta(x|z) \).

Generation Process

Once trained, we can generate new samples as follows:

  1. Sample \( z \sim p(z) = \mathcal{N}(0, I) \)
  2. Generate \( x \sim p_\theta(x|z) \) using the decoder

The joint distribution used during training is:

\[ p(x, z) = p_\theta(x|z) \cdot p(z) \]
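The two-step generation procedure is ancestral sampling from this joint. A minimal sketch, assuming a stand-in linear-Gaussian decoder whose weights `W`, `b` are hypothetical placeholders for a trained network:

```python
import numpy as np

rng = np.random.default_rng(2)

d_latent, d_data = 8, 16
# Hypothetical "trained" decoder parameters; in a real VAE these come from
# maximizing the ELBO. Here they are random placeholders.
W = rng.standard_normal((d_data, d_latent)) * 0.1
b = np.zeros(d_data)

def decoder_mean(z):
    # Assumed observation model p_theta(x|z) = N(W z + b, I).
    return W @ z + b

# Step 1: sample z ~ p(z) = N(0, I).
z = rng.standard_normal(d_latent)
# Step 2: sample x ~ p_theta(x|z) using the decoder.
x = rng.normal(decoder_mean(z), 1.0)
print(x.shape)
```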

Final / Overall Loss Function (ELBO):

\[ \log p(x; \theta) \geq \mathbb{E}_{z \sim q_\phi(z|x)} \left[\log\left(\frac{p_\theta(x, z)}{q_\phi(z|x)}\right)\right] \]
\[ = \mathbb{E}_{z \sim q_\phi(z|x)} \left[\log p_\theta(x|z) + \log\left(\frac{p(z)}{q_\phi(z|x)}\right)\right] \]
\[ = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z)) \]

KL Divergence Loss:

Given:

\[ q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x)I), \quad p(z) = \mathcal{N}(z; 0, I) \]

KL divergence:

\[ D_{\text{KL}}(q_\phi(z|x) \| p(z)) = \int q_\phi(z|x) \log \left(\frac{q_\phi(z|x)}{p(z)}\right) dz = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(z)] \]

Using log-density of multivariate Gaussians:

\[ \log \mathcal{N}(z; \mu, \Sigma) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2}(z - \mu)^T \Sigma^{-1} (z - \mu) \]

For \( q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)I) \), the closed-form KL divergence:

\[ D_{\text{KL}}(q_\phi(z|x) \| p(z)) = \frac{1}{2} \sum_{j=1}^d \left( -\log \sigma_{\phi,j}^2(x) + \mu_{\phi,j}^2(x) + \sigma_{\phi,j}^2(x) - 1 \right) \]

Rewriting to match the ELBO loss expression (with negative sign):

\[ - D_{\text{KL}}(q_\phi(z|x) \| p(z)) = \frac{1}{2} \sum_{j=1}^d \left( 1 + \log \sigma_{\phi,j}^2(x) - \mu_{\phi,j}^2(x) - \sigma_{\phi,j}^2(x) \right) \]
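The closed-form expression can be validated against a direct Monte Carlo estimate of \( \mathbb{E}_{q}[\log q - \log p] \). A short sketch with hypothetical encoder outputs \( \mu_\phi(x) \) and \( \log \sigma_\phi^2(x) \):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical encoder outputs for one input x (values chosen for illustration).
mu = np.array([0.5, -1.0, 0.0])
log_var = np.array([0.2, -0.5, 0.0])
var = np.exp(log_var)

# Closed-form KL( N(mu, diag(var)) || N(0, I) ) from the formula above.
kl_closed = 0.5 * np.sum(-log_var + mu**2 + var - 1.0)

# Monte Carlo check: E_q[log q(z) - log p(z)].
k = 500_000
z = rng.normal(mu, np.sqrt(var), size=(k, 3))
log_q = np.sum(-0.5 * np.log(2 * np.pi * var) - (z - mu)**2 / (2 * var), axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - z**2 / 2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)
```

The two values agree up to Monte Carlo noise, and the KL is zero exactly when \( \mu = 0 \) and \( \sigma^2 = 1 \), i.e., when the posterior collapses onto the prior.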

Reconstruction Loss Gradient Derivation:

\[ \nabla_\phi \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] = \nabla_\phi \int q_\phi(z|x) \log p_\theta(x|z) \, dz \]
Assuming we can interchange the gradient and the integral:
\[ = \int \log p_\theta(x|z) \nabla_\phi q_\phi(z|x) \, dz \]
Using the log-derivative identity \( \nabla_\phi q_\phi(z|x) = q_\phi(z|x) \nabla_\phi \log q_\phi(z|x) \):
\[ = \int \log p_\theta(x|z) \, q_\phi(z|x) \, \nabla_\phi \log q_\phi(z|x) \, dz = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z) \, \nabla_\phi \log q_\phi(z|x)] \]
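This is the score-function (REINFORCE) estimator. A scalar sketch, under two stated assumptions: \( q_\phi(z|x) = \mathcal{N}(\phi, 1) \), and \( f(z) = z^2 \) standing in for \( \log p_\theta(x|z) \). Then \( \mathbb{E}_q[f(z)] = \phi^2 + 1 \), so the true gradient is \( 2\phi \):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumptions for illustration: q_phi(z|x) = N(phi, 1), and f(z) = z**2
# stands in for log p_theta(x|z). True gradient of E_q[z**2] is 2 * phi.
phi = 1.0
k = 200_000
z = rng.normal(phi, 1.0, size=k)

# For a unit-variance Gaussian, grad_phi log q_phi(z) = (z - phi).
grad_est = np.mean(z**2 * (z - phi))
print(grad_est)  # close to 2 * phi = 2.0, but the estimator is noisy
```

Even in this one-dimensional case the estimate fluctuates noticeably from run to run, which is the high-variance problem the reparameterization trick below addresses.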

Reparameterization Trick:

The expectation above cannot be differentiated through naively: sampling \( z \sim q_\phi(z|x) \) is not itself differentiable with respect to \( \phi \), and the score-function estimator derived above has high variance. To backpropagate through the stochastic node, VAEs use the reparameterization trick:

Figure: VAE using the reparameterization trick to enable backpropagation through the encoder.
\[ z = \mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
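The trick moves all randomness into \( \epsilon \), so \( z \) becomes a deterministic, differentiable function of \( \mu_\phi(x) \) and \( \sigma_\phi(x) \). A short sketch (the values of `mu` and `sigma` are hypothetical encoder outputs), confirming that the reparameterized samples have the intended mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical encoder outputs for one input x.
mu = np.array([0.3, -0.7])
sigma = np.array([0.5, 1.2])

# z = mu + sigma * eps, eps ~ N(0, I): the sampling noise is external to
# (mu, sigma), so gradients can flow through them.
k = 1_000_000
eps = rng.standard_normal((k, 2))
z = mu + sigma * eps

print(z.mean(axis=0), z.std(axis=0))  # approximately mu and sigma
```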

For Image Datasets: