GAN 초안

BeomseoChoi · Nov 20, 2024 · 8cf214c · 8cf214c
1 parent 0082a30
commit 8cf214c
Show file tree

Hide file tree

Showing 3 changed files with 276 additions and 0 deletions.
diff --git a/_posts/2024-11-20-GAN.md b/_posts/2024-11-20-GAN.md
@@ -0,0 +1,276 @@
+---
+layout: post
+title: Generative Adverserial Networks
+date: 2024-11-20 19:04:00+0900
+description: GANs
+tags: formatting links
+disqus_comments: true
+categories: Deep Learning
+---
+
+Generative Adverserial Networks를 논하기 전에 Likelohood-based learning에 대해 짚고 가야한다.
+지금까지 AR, VAE, NF는 모두 likelihood-based leraning이었다. 데이터가 충분하다면 MLE는 MVUE (Minimum Variance Unbiased Estimator)를 만족한다.
+
+MLE는 말 그대로 likelihood를 maximize한다. likelihood가 높으면 샘플 퀄리티가 좋아질까?
+"For imperfect models, achieving high log-likelihoods might not always imply good sample quality, and vice versa. (Theis et al., 2016)
+항상 그런건 아니라는 연구 결과가 있다.
+
+GAN은 likelihood-free learning이다.
+
+
+서로 다른 분포 P와 Q에서 샘플링을 한다. 샘플이 어느 분포에서부터 샘플되었는지 구분할 방법이 있을까?
+two-sample test로 가설검정하면 된다. T statistic이 일정 threshold를 넘으면 Null hypothesis를 reject한다.
+
+여기서 키 포인트는, Test statistic은 likelihood-free라는 거다. density를 포함하고 있지 않기 때문이다. 그저 샘플만으로 판단한다.
+
+생성모델은 P_data와 P_theta의 거리를 좁혀 학습된다. two-sample test를 minimize하는 generative 모델을 학습하면 된다.
+하지만 통계량을 최소화하는건 쉬운 일이 아니다. 두 분포 P, Q가 가우시안을 따른다고 해도 평균이 다를 수 있고, 평균이 같아도 분산이 다를 수 있다. 평균과 분산이 같더라도 분포가 다르면 최소화할 수 없다. 그러니까 단순히 평균과 분산같은 통계량으로는 판단하기 어렵다. 그래서 두 샘플 집합이 어느 분포에서 왔는지 자동으로 학습시킨다. 이게 GAN에서 discriminator라고 불리는 classifier다.
+
+
+
+$$
+\max_{\mathcal{D}_{\phi}}V(P_{\theta}, \mathcal{D}_{\phi}) = \mathbb{E}_{x \sim P_{data}}[\text{log}\mathcal{D}_{\phi}(x)] + \mathbb{E}_{x \sim P_{\theta}}[\text{log}(1 - \mathcal{D}_{\phi}(x))]
+$$
+
+$$\mathcal{D_{\phi}}$$가 베르누이를 따른다고 하자. $$P_{data}$$에서 뽑으면 1(real), $$P_{\theta}$$에서 샘플링한건 0(fake)라고 하자.
+
+$$\mathcal{D_{\phi}}$$의 목표는 주어진 샘플이 1인지 0인지 잘 판단하는 것이다. Optimal한 경우는 다음과 같다.
+
+$$
+\mathcal{D^{*}_{\phi}} = \frac{P_{data}}{P_{data} + P_{\theta}}.
+$$
+
+$$
+\text{If } P_{data} = P_{\theta} \text{, then} \mathcal{D^{*}_{\phi}} = \frac{1}{2}.
+$$
+
+유도는 변분(variational differenciate) 하면 나온다. 
+
+$$
+\min_{\mathcal{G_{\theta}}}\max_{\mathcal{D}_{\phi}}V(\mathcal{G_{\theta}}, \mathcal{D}_{\phi}) = \mathbb{E}_{x \sim P_{data}}[\text{log}\mathcal{D}_{\phi}(x)] + \mathbb{E}_{x \sim \mathcal{G_{\theta}}}[\text{log}(1 - \mathcal{D}_{\phi}(x))]
+$$
+
+기존 목표대로 statistic을 최소화하는 생성모델을 만들면 위 식이 된다. 이 식은 GAN의 loss function과 같다. GAN의 loss function에 $$P_{\theta}$$가 없다. 그 말은 density를 몰라도 된다는거다. 오직 샘플만 있다면 학습시킬 수 있다. 그렇기에 $$P_{\theta}$$는 implicit하게 $$P_{data}$$의 분포에 가까워질 것이다. 그래서 GAN은 implicit한 generative model이라고 말한다.
+
+$$
+D_{JSD}\left[p, q\right] = \frac{1}{2}\left(D_{KL}\left[p, \frac{p + q}{2}\right] + D_{KL}\left[q, \frac{p + q}{2}\right]\right)
+$$
+
+1. $$D_{JSD}\left[p, q\right] \ge 0$$
+2. $$D_{JSD}\left[p, q\right] = 0 \ iif p=q$$ 
+3. $$D_{JSD}\left[p, q\right] = D_{JSD}\left[q, p\right] $$ 
+
+Optimal한 $$\mathcal{D_{\phi}}$$를 가지고 있다면,
+
+$$
+V(\mathcal{G}_{\theta}, \mathcal{D}^{*}_{\mathcal{G}_{\theta}}(x)) = 2D_{JSD}\left[P_{data}, P_{\theta}\right] - \log{4}.
+$$
+
+
+Optimal한 $$\mathcal{G_{\theta}}$$도 가지고 있다면,
+
+$$
+V(\mathcal{G}^{*}_{\theta}, \mathcal{D}^{*}_{\mathcal{G}_{\theta}}(x)) = -\log{4}.
+$$
+
+GAN을 학습하는 방식은 다음과 같다.
+1. $$\mathcal{D}$$에서 $$x$$를 샘플링한다.
+2. $$\mathcal{G}$$에서 $$z$$를 샘플링한다.
+3. $$\phi$$를 gradient ascent로 업데이트한다.
+4. $$\theta$$를 gradient descent로 업데이트한다.
+5. 반복한다.
+
+Pros.
+1. No likelihood. Don't need to know the density.
+2. Flexibility for neural network architecture. We haven't constraint the architecture of G.
+3. Fast sampling. Sing forward pass through G.
+
+Cons.
+1. Very difficult to train in practice.
+
+GAN의 세 가지 Challenges가 있다.
+1. Unstable optimization
+2. Mode collapse
+3. Evaluation
+
+
+{% include figure.liquid path="assets/img/2024-11-20-GAN/loss.jpg" class="img-fluid rounded z-depth-1" zoomable=true %}
+첫 번째 문제점은 oscillating이 심하다는거다. 위 그래프 보고 어디서 멈춰야할지 감이 안온다.
+
+두 번째 문제점은 G collapses to one or few samples라는거다. D가 너무 학습이 잘돼서 G가 학습이 안된다.
+
+세 번째 문제점은 
+
+
+---
+
+Likelihood-free training이기 때문에 특별한 density의 architecture를 선택할 필요가 없다.
+
+f-divergence.
+지금까지 두 가지 divergnce가 나왔다. KLD, JSD.
+
+Divergence를 일반화한 방식이 f-divergence다. 
+
+$$
+D_{f}(p, q) = \mathbb{E}_{x \sim q}\left[f\left(\frac{p(x)}{q(x)}\right)\right] \text{, where f is convex and lower-semicontinuous with f(1) = 0.}
+$$
+
+{% include figure.liquid path="assets/img/2024-11-20-GAN/f-divergence.jpg" class="img-fluid rounded z-depth-1" zoomable=true %}
+
+Nowozin et al., 2016.
+
+
+f-divergence를 잘 유도하면 f-GAN 식이 나온다.
+
+$$
+\min_{\mathcal{G}_{\theta}}\max_{\mathcal{D}_{\phi}} F(\theta, \phi) = \mathbb{E}_{x \sim P_{data}}\left[\mathcal{T}_{\phi}(x)\right] - \mathbb{E}_{x \sim P_{\theta}}\left[f^{*}(\mathcal{T}_{\phi}(x))\right]
+$$
+
+막간 support
+
+$$ supp(p) = { x \in \mathbb{R}^{n} \mid p(x) > 0 }$$
+확률분포 P의 support는 P(x)가 0보다 큰 x 집합을 의미한다. 유니폼 분포 U[a, b]의 support는 [a, b]다.
+
+1. $$\frac{p(x)}{q(x)}$$에서 $$q(x) = 0$$이고 $$p(x)>0$$인 지점이 있다면 수렴하지 않고 발산하거나 정의되지 않음. Discontinuity 문제.
+2. $$q(x)$$의 support가 $$p(x)$$의 suport를 cover 해야함. ($$p(x) > 0 \longrightarrow q(x)>0$$). 그렇지 않으면 발산하거나 정의되지 않음. (Discontinuity).
+
+$$
+D_{f}(p, q)= \mathcal{E}_{x \sim q} \left[f\left(\frac{p(x)}{q(x)}\right)\right]
+$$
+
+p랑 q가 suupport를 share하지 않음. 학습 초반에 G가 생성하는 샘플이랑 훈련 데이터가 많이 달라서 발생함.
+
+needed a "smmother" distance D(p, q) that is defined when p and q have disjoint supports.
+support를 cover하지 않는, 즉 discontinuity problem이 arise할 때 어떻게 해결할 것인가 -> wasserstein distance.
+
+wasserstein distance
+
+$$
+D_{w}(p, q) = \inf\limits_{\gamma \in \prod} \mathbb{E}_{(x, y) \sim \gamma}\left[ \left\| x - y \right\| \right]
+$$
+
+$$
+\text{,where } \prod \text{contains all possible joint distributions of } (x, y).
+$$
+
+$$
+\text{marginal of } x \text{ is } p(x) = \int \gamma(x, y)dy.
+$$
+
+$$
+\text{marginal of } y \text{ is } p(y) = \int \gamma(x, y)dx.
+$$
+
+$$
+\gamma(y \mid x) \text{ : a probabilistic earth moving plan that warps } p(x) \text{ to } q(y).
+$$
+
+Kantorovich-Rubinstein duality.
+
+$$
+D_{w}(p, q) = \sup\limits_{\left\|f\right\|_{L} \le 1} \mathbb{E}_{x \sim p}\left[ f(x) \right] - \mathbb{E}_{x \sim q} \left[ f(x) \right]
+$$
+
+$$\left\|f\right\|_{L} \le 1$$ means the Lipschitz constant of $$f(x)$$ is 1.
+
+technically, $$\forall x, y: \mid f(x) - f(y) \mid \le \left\| x - y \right\|_{1}$$.
+Intuitively, $$f$$ cannot change too rapidly.
+
+scalar function f를 optimization하는 problem is equal to the wasserstein distance. (right-hand side를 최적화 -> wasserstein distance랑 같은 결과.)
+
+
+Wasserstein GAN with $$\mathcal{D}_{\phi}(x)$$ and $$\mathcal{G}_{\theta}(z)$$
+
+$$
+\min_{\theta} \max_{\phi} \mathbb{E}_{x \sim P_{data}} \left[ \mathcal{D}_{\phi}(x) \right] - \mathbb{E}_{q \sim P(z)} \left[ \mathcal{D}_{\phi} \left( \mathcal{G}_{\theta}(z) \right) \right]
+$$
+
+Lipschitzness of $$\mathcal{D}_{\phi}(x)$$ is enforced through weight clipping or gradient penalty on $$\mathcal{D}_{x}(x)$$
+
+
+Vanila GAN
+```python
+# Generator
+class Generator(nn.Module):
+    def __init__(self, latent_dim, output_size):
+        super(Generator, self).__init__()
+        self.model = nn.Sequential(
+            nn.Linear(latent_dim, 256),
+            nn.ReLU(inplace=True),
+            nn.Linear(256, 512),
+            nn.ReLU(inplace=True),
+            nn.Linear(512, 1024),
+            nn.ReLU(inplace=True),
+            nn.Linear(1024, output_size),
+            nn.Tanh()  # Output values in range [-1, 1]
+        )
+
+    def forward(self, z):
+        return self.model(z)
+
+# Discriminator
+class Discriminator(nn.Module):
+    def __init__(self, input_size):
+        super(Discriminator, self).__init__()
+        self.model = nn.Sequential(
+            nn.Linear(input_size, 1024),
+            nn.LeakyReLU(0.2, inplace=True),
+            nn.Linear(1024, 512),
+            nn.LeakyReLU(0.2, inplace=True),
+            nn.Linear(512, 256),
+            nn.LeakyReLU(0.2, inplace=True),
+            nn.Linear(256, 1),
+            nn.Sigmoid()  # Output single probability
+        )
+
+    def forward(self, x):
+        return self.model(x)
+```
+
+DCGAN
+```python
+# Generator
+class Generator(nn.Module):
+    def __init__(self, latent_dim):
+        super(Generator, self).__init__()
+        self.init_size = 8  # Starting image size: 8x8
+        self.fc = nn.Linear(latent_dim, 128 * self.init_size * self.init_size)
+
+        self.model = nn.Sequential(
+            nn.BatchNorm2d(128),
+            nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),  # Upsample to 16x16
+            nn.BatchNorm2d(128),
+            nn.ReLU(inplace=True),
+            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # Upsample to 32x32
+            nn.BatchNorm2d(64),
+            nn.ReLU(inplace=True),
+            nn.Conv2d(64, 3, kernel_size=3, stride=1, padding=1),  # Output RGB image
+            nn.Tanh()  # [-1, 1] output range
+        )
+
+    def forward(self, z):
+        out = self.fc(z)
+        out = out.view(out.size(0), 128, self.init_size, self.init_size)  # Reshape to (batch, 128, 8, 8)
+        return self.model(out)
+
+# Discriminator
+class Discriminator(nn.Module):
+    def __init__(self):
+        super(Discriminator, self).__init__()
+        self.model = nn.Sequential(
+            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),  # Downsample to 16x16
+            nn.LeakyReLU(0.2, inplace=True),
+            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # Downsample to 8x8
+            nn.BatchNorm2d(128),
+            nn.LeakyReLU(0.2, inplace=True),
+            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),  # Downsample to 4x4
+            nn.BatchNorm2d(256),
+            nn.LeakyReLU(0.2, inplace=True),
+            nn.Flatten(),
+            nn.Linear(256 * 4 * 4, 1),  # Fully connected layer
+            nn.Sigmoid()  # Output probability
+        )
+
+    def forward(self, img):
+        return self.model(img)
+```
diff --git a/assets/img/2024-11-20-GAN/f-divergence.png b/assets/img/2024-11-20-GAN/f-divergence.png
diff --git a/assets/img/2024-11-20-GAN/loss.png b/assets/img/2024-11-20-GAN/loss.png