Draft cleanup (in progress)

BeomseoChoi · Nov 21, 2024 · 1aabc89
1 parent 088ff0b
commit 1aabc89
Showing 1 changed file with 30 additions and 23 deletions.
diff --git a/_posts/2024-11-20-GAN.md b/_posts/2024-11-20-GAN.md
@@ -1,78 +1,85 @@
 ---
 layout: post
-title: Generative Adversarial Networks
+title: Generative Adversarial Networks (GANs)
 date: 2024-11-20 19:04:00+0900
 description: GANs
 tags: formatting links
 disqus_comments: true
 categories: Deep Learning
 ---
 
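-Before discussing Generative Adversarial Networks, we need to revisit likelihood-based learning.
-So far, AR, VAE, and NF have all been likelihood-based learning. Given enough data, the MLE is asymptotically efficient: under standard regularity conditions it behaves like an MVUE (Minimum Variance Unbiased Estimator) in the large-sample limit.
+A GAN is a generative model that learns the data distribution implicitly, through adversarial training.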
 
-As the name says, MLE maximizes the likelihood. Does a higher likelihood mean better sample quality?
-"For imperfect models, achieving high log-likelihoods might not always imply good sample quality, and vice versa." (Theis et al., 2016)
-Research shows that this is not always the case.
+## Generative model
+For an explanation of generative models, see here.
 
-A GAN uses likelihood-free learning.
+## Likelihood-free training
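+Autoregressive models, VAEs, normalizing flows, and diffusion models optimize their parameters with maximum likelihood estimation (MLE), maximizing the likelihood or a bound on it. Given enough data, the MLE is asymptotically efficient: under standard regularity conditions it is consistent and its variance approaches the Cramér-Rao lower bound, so in the large-sample limit it behaves like a Minimum Variance Unbiased Estimator (MVUE). A high likelihood is not the whole story, however.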
 
+>"For imperfect models, achieving high log-likelihoods might not always imply good sample quality, and vice versa." (Theis et al., 2016)
 
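-Suppose we sample from two different distributions P and Q. Is there a way to tell which distribution the samples were drawn from?
-A two-sample test does this through hypothesis testing. If the test statistic T exceeds a threshold, the null hypothesis is rejected.
-
-The key point here is that the test statistic is likelihood-free: it involves no density and decides from the samples alone.
-
-A generative model is trained by reducing the distance between P_data and P_theta, so we can train a generative model that minimizes a two-sample test statistic.
-However, minimizing a statistic is not easy. Even if the two distributions P and Q are both Gaussian, their means can differ, and even with equal means their variances can differ. Even when the mean and variance agree, the distributions themselves can differ, so matching them does not minimize the statistic. Simple statistics such as the mean and variance are therefore not enough to decide. Instead, we learn automatically which distribution each of the two sample sets came from. This is the classifier that a GAN calls the discriminator.
+A two-sample test is a statistical method for testing whether two independent groups of samples share the same statistical properties, that is, whether they come from the same distribution. If the test statistic exceeds a threshold, the null hypothesis is rejected in favor of the alternative hypothesis. The point to note is that this method does not use a likelihood: it needs only the samples and the test statistic. Because GAN training is likelihood-free in the same way, a GAN learns $$P_{data}$$ implicitly.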
 
+### Discriminator
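+To train without a likelihood, we can set an objective that minimizes a test statistic. Minimizing a test statistic is not easy, however. Two distributions from the same family (for example, both Gaussian) can have different means; with equal means they can have different variances; and with equal mean and variance they can still be different distributions. So instead we learn automatically which distribution each sample came from. This learned classifier is what a GAN calls the discriminator.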
 
+The discriminator $$\mathcal{D_{\phi}}$$ is trained with the objective below.
 
 $$
 \max_{\mathcal{D}_{\phi}}V(P_{\theta}, \mathcal{D}_{\phi}) = \mathbb{E}_{x \sim P_{data}}[\text{log}\mathcal{D}_{\phi}(x)] + \mathbb{E}_{x \sim P_{\theta}}[\text{log}(1 - \mathcal{D}_{\phi}(x))]
 $$
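
As a minimal sketch of how this objective can be estimated from samples alone, the Python snippet below averages the two log terms over discriminator outputs. The function name `discriminator_objective` and the example outputs are assumptions made for illustration; they do not come from the post.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Monte Carlo estimate of V.

    d_real: discriminator outputs D(x) for samples drawn from P_data.
    d_fake: discriminator outputs D(x) for samples drawn from P_theta.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs (probability that a sample is real).
confident = discriminator_objective([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])
undecided = discriminator_objective([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

A discriminator that separates the two sample sets well scores higher than one that outputs 0.5 everywhere; the latter gives exactly $$-\log 4$$. No density appears anywhere in the computation, only samples.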
 
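-Suppose $$\mathcal{D_{\phi}}$$ follows a Bernoulli distribution. Label a sample drawn from $$P_{data}$$ as 1 (real) and a sample drawn from $$P_{\theta}$$ as 0 (fake).
+The objective of $$\mathcal{D_{\phi}}$$ is to tell whether a given sample is real (1) or fake (0). $$\mathcal{D_{\phi}}(x)$$ estimates the parameter of a Bernoulli distribution, namely the probability that $$x$$ is real.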
 
-The goal of $$\mathcal{D_{\phi}}$$ is to decide correctly whether a given sample is 1 or 0. The optimal case is as follows.
+The optimal $$\mathcal{D_{\phi}}$$ is as follows.
 
 $$
 \mathcal{D^{*}_{\phi}} = \frac{P_{data}}{P_{data} + P_{\theta}}.
 $$
 
 $$
-\text{If } P_{data} = P_{\theta} \text{, then} \mathcal{D^{*}_{\phi}} = \frac{1}{2}.
+\text{If } P_{data} = P_{\theta} \text{, then } \mathcal{D^{*}_{\phi}} = \frac{1}{2}.
 $$
 
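-The derivation follows from taking the variational derivative.
+$$\mathcal{D^{*}_{\phi}}$$ follows from taking the functional (variational) derivative of the objective of $$\mathcal{D_{\phi}}$$ with respect to $$\mathcal{D_{\phi}}$$ and setting it to zero.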
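
One way to spell out that derivation, sketched under the assumption that both distributions have densities: write the objective as a single integral and maximize the integrand pointwise. The shorthand $$a = P_{data}(x)$$, $$b = P_{\theta}(x)$$, $$y = \mathcal{D}_{\phi}(x)$$ is introduced here only for illustration.

$$
V(P_{\theta}, \mathcal{D}_{\phi}) = \int \left[ P_{data}(x)\log\mathcal{D}_{\phi}(x) + P_{\theta}(x)\log(1 - \mathcal{D}_{\phi}(x)) \right] dx
$$

$$
\frac{\partial}{\partial y}\left[ a\log y + b\log(1 - y) \right] = \frac{a}{y} - \frac{b}{1 - y} = 0 \quad\Rightarrow\quad y = \frac{a}{a + b}
$$

The second derivative $$-a/y^{2} - b/(1-y)^{2}$$ is negative, so this stationary point is a maximum, which gives $$\mathcal{D}^{*}_{\phi}(x) = P_{data}(x) / (P_{data}(x) + P_{\theta}(x))$$.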
+
+### Generator
+
+The goal is to build a generative model whose objective minimizes a test statistic. Because the test statistic is hard to minimize directly, we introduced $$\mathcal{D_{\phi}}$$. Now we use $$\mathcal{D_{\phi}}$$ to build a generative model that minimizes it.
 
 $$
 \min_{\mathcal{G_{\theta}}}\max_{\mathcal{D}_{\phi}}V(\mathcal{G_{\theta}}, \mathcal{D}_{\phi}) = \mathbb{E}_{x \sim P_{data}}[\text{log}\mathcal{D}_{\phi}(x)] + \mathbb{E}_{x \sim \mathcal{G_{\theta}}}[\text{log}(1 - \mathcal{D}_{\phi}(x))]
 $$
 
-Building a generative model that minimizes the statistic, as originally intended, gives the expression above, which is the same as the GAN loss function. The GAN loss function contains no $$P_{\theta}$$, which means we do not need to know the density; samples alone are enough for training. $$P_{\theta}$$ will therefore approach $$P_{data}$$ implicitly, which is why a GAN is called an implicit generative model.
+In the objective of $$\mathcal{D_{\phi}}$$, the generator $$\mathcal{G_{\theta}}$$ takes the place of $$P_{\theta}$$. The expression above is the loss function used by GANs. A GAN is trained by having $$\mathcal{G_{\theta}}$$ and $$\mathcal{D_{\phi}}$$ compete in a minimax game, which is why it is called an adversarial network.
+
+## Jensen–Shannon divergence (JSD)
+
 
 $$
 D_{JSD}\left[p, q\right] = \frac{1}{2}\left(D_{KL}\left[p, \frac{p + q}{2}\right] + D_{KL}\left[q, \frac{p + q}{2}\right]\right)
 $$
 
 1. $$D_{JSD}\left[p, q\right] \ge 0$$
-2. $$D_{JSD}\left[p, q\right] = 0 \ iff p=q$$ 
-3. $$D_{JSD}\left[p, q\right] = D_{JSD}\left[q, p\right] $$ 
+2. $$D_{JSD}\left[p, q\right] = 0 \text{ iff } p=q$$
+3. $$D_{JSD}\left[p, q\right] = D_{JSD}\left[q, p\right] $$
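
The three properties can be checked numerically on small discrete distributions. The sketch below is illustrative only: the helper names `kl` and `jsd` and the example distributions `p` and `q` are assumptions, not part of the post.

```python
import math

def kl(p, q):
    """D_KL[p, q] for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

# Hypothetical distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

forward = jsd(p, q)   # property 1: non-negative
backward = jsd(q, p)  # property 3: equal to forward
same = jsd(p, p)      # property 2: zero when both arguments match
```

Unlike the KL divergence, the mixture $$\frac{p + q}{2}$$ is nonzero wherever either argument is, so the JSD stays finite even when `p` and `q` do not share the same support.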
 
-If we have the optimal $$\mathcal{D_{\phi}}$$,
+With an optimal discriminator, training a GAN brings $$P_{\theta}$$ toward $$P_{data}$$ by minimizing the JSD between them.
+If we have $$\mathcal{D^{*}_{\phi}}$$,
 
 $$
 V(\mathcal{G}_{\theta}, \mathcal{D}^{*}_{\mathcal{G}_{\theta}}(x)) = 2D_{JSD}\left[P_{data}, P_{\theta}\right] - \log{4}.
 $$
 
 
-If we also have the optimal $$\mathcal{G_{\theta}}$$,
+If we also have $$\mathcal{G}^{*}_{\theta}$$, so that $$P_{\theta} = P_{data}$$ and the JSD term vanishes,
 
 $$
 V(\mathcal{G}^{*}_{\theta}, \mathcal{D}^{*}_{\mathcal{G}_{\theta}}(x)) = -\log{4}.
 $$
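
Both identities can be sanity-checked on a finite support, where the expectations become exact sums. This is a sketch under assumed example distributions; `value_at_optimal_d`, `kl`, `jsd`, and the two probability lists are hypothetical names and values chosen for illustration.

```python
import math

def kl(p, q):
    """D_KL[p, q] for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

def value_at_optimal_d(p_data, p_theta):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_theta(x))."""
    total = 0.0
    for a, b in zip(p_data, p_theta):
        d_star = a / (a + b)
        total += a * math.log(d_star) + b * math.log(1.0 - d_star)
    return total

# Hypothetical distributions with full support on three outcomes.
p_data = [0.7, 0.2, 0.1]
p_theta = [0.1, 0.3, 0.6]

lhs = value_at_optimal_d(p_data, p_theta)          # V(G, D*)
rhs = 2 * jsd(p_data, p_theta) - math.log(4)       # 2 * JSD - log 4
at_optimum = value_at_optimal_d(p_data, p_data)    # generator matches the data
```

Here `lhs` and `rhs` agree up to floating-point error, and `at_optimum` equals $$-\log 4$$, the lowest value the generator can reach against an optimal discriminator.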
 
+## Training GAN
+
 A GAN is trained as follows.
 1. Sample $$x$$ from the dataset $$\mathcal{D}$$.
 2. Sample the noise $$z$$ that is fed to $$\mathcal{G}$$.