Add explanation of convergence

calvinytong · Sep 24, 2018 · b520a10
1 parent 39b7951
Showing 2 changed files with 51 additions and 0 deletions.
diff --git a/2018-09-17-metalearning.ipynb b/2018-09-17-metalearning.ipynb
@@ -52,6 +52,57 @@
     "    * At a high level: add some randomness to your actions; if the result was better than expected, do more of the same in the future; repeat\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Why Does This Work?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Convergence](images/convergence.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is a somewhat informal argument for why this method works. The claim is that the algorithm converges to a parameter vector that is close, in Euclidean distance, to each task's manifold of optimal solutions. Accordingly, we define the problem as\n",
+    "\n",
+    "$$\n",
+    "\\begin{equation*}\n",
+    "\\begin{aligned}\n",
+    "\\underset{\\phi}{\\text{minimize}} && \\mathbb{E}_\\tau \\left[\\frac{1}{2} D(\\phi, W_{\\tau})^2\\right]\n",
+    "\\end{aligned}\n",
+    "\\end{equation*}\n",
+    "$$\n",
+    "\n",
+    "where $W_{\\tau}$ is the set of optimal parameters for a task $\\tau$ and $D(\\phi, W_{\\tau})$ is the Euclidean distance from $\\phi$ to that set. We introduce the $\\frac{1}{2}$ to cancel the factor of 2 that appears when we differentiate the squared distance.\n",
+    "\n",
+    "To work with this objective, we need one fact from geometry. Let $S \\subset \\mathbb{R}^d$ be a non-pathological set and let $\\phi \\in \\mathbb{R}^d$. For such an appropriately well-behaved subset of $\\mathbb{R}^d$, the gradient of the squared distance function $D(\\phi, S)^2$ is $2(\\phi - proj_S(\\phi))$ wherever the projection is unique (TODO: Flesh this fact out more intuitively; it has to do with the distance to a set being the minimum, over all points in the set, of the distance to $\\phi$). Recall that the projection $proj_S(\\phi)$ is the point of $S$ closest to $\\phi$ in the Euclidean sense. With this fact in hand, we can rewrite the gradient of our objective function as\n",
+    "\n",
+    "$$\n",
+    "\\begin{aligned}\n",
+    "\\nabla_\\phi \\mathbb{E}_\\tau \\left[\\frac{1}{2} D(\\phi, W_{\\tau})^2\\right] &= \\mathbb{E}_\\tau \\left[\\frac{1}{2} \\nabla_\\phi D(\\phi, W_{\\tau})^2\\right] \\\\\n",
+    "&= \\mathbb{E}_\\tau \\left[\\phi - proj_{W_{\\tau}}(\\phi)\\right]\n",
+    "\\end{aligned}\n",
+    "$$\n",
+    "\n",
+    "So, for a sampled task $\\tau$ and step size $\\alpha$, we can rewrite our stochastic gradient update as\n",
+    "$$\n",
+    "\\begin{aligned}\n",
+    "\\phi &\\leftarrow \\phi - \\alpha \\nabla_\\phi \\frac{1}{2} D(\\phi, W_{\\tau})^2 \\\\\n",
+    "&= \\phi - \\alpha (\\phi - proj_{W_{\\tau}}(\\phi)) \\\\\n",
+    "&= (1-\\alpha) \\phi + \\alpha \\, proj_{W_{\\tau}}(\\phi)\n",
+    "\\end{aligned}\n",
+    "$$\n",
+    "\n",
+    "Even though we can't compute $proj_{W_{\\tau}}(\\phi)$ exactly, because that would require finding the set of minimizers for the given task, we can approximate it with gradient descent. So, in each iteration of Reptile, we sample a task $\\tau$ and replace $proj_{W_{\\tau}}(\\phi)$ with the result of running $k$ steps of gradient descent on that task, starting from $\\phi$."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

diff --git a/images/convergence.png b/images/convergence.png
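
The projection argument added to `2018-09-17-metalearning.ipynb` can be checked on a toy problem. The following is a minimal sketch, not code from the notebook. It assumes each task has the loss `0.5 * (a . w - b)**2`, so that its optimal set $W_{\tau}$ is the line `a . w = b` and the exact projection is available in closed form. All function names (`project_onto_line`, `inner_gradient_descent`, `reptile_step`, `run_reptile`) and all hyperparameter values are hypothetical, chosen only for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def project_onto_line(phi, a, b):
    """Exact Euclidean projection of phi onto W = {w : a . w = b}."""
    scale = (dot(a, phi) - b) / dot(a, a)
    return [p - scale * ai for p, ai in zip(phi, a)]


def inner_gradient_descent(phi, a, b, k, lr):
    """Run k gradient steps on the toy loss 0.5 * (a . w - b)**2, starting at phi."""
    w = list(phi)
    for _ in range(k):
        residual = dot(a, w) - b
        w = [wi - lr * residual * ai for wi, ai in zip(w, a)]
    return w


def reptile_step(phi, task, alpha, k, lr):
    """phi <- (1 - alpha) * phi + alpha * w, where w approximates proj_W(phi)."""
    a, b = task
    w = inner_gradient_descent(phi, a, b, k, lr)
    return [(1 - alpha) * p + alpha * wi for p, wi in zip(phi, w)]


def run_reptile(tasks, phi, iterations, alpha, k, lr, seed=0):
    """Sample one task per iteration and apply the interpolation update."""
    rng = random.Random(seed)
    for _ in range(iterations):
        phi = reptile_step(phi, rng.choice(tasks), alpha, k, lr)
    return phi


if __name__ == "__main__":
    # Two toy tasks (an assumption of this sketch): optimal sets x = 1 and y = 2.
    tasks = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
    phi = run_reptile(tasks, [0.0, 0.0], iterations=1000, alpha=0.1, k=5, lr=0.5)
    print(phi)  # should end up near the intersection point [1.0, 2.0]
```

In this toy setting, gradient descent moves only along `a`, so its result approaches the exact projection as `k` grows. For general task losses the `k`-step result is only an approximation of $proj_{W_{\tau}}(\phi)$, which is why the argument stays informal.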