---
typora-copy-images-to: ./img
---
GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.
# Spring 2018: Optimization for Machine Learning
[TOC]
- general optimization problem : $\min f(x)$ with $x\in\R^d$
- objective function : $f:\R^d\to\R$, typically assumed continuous and differentiable
- mathematical modeling : defining and measuring the machine learning model
- computational optimization : learning the model parameters
- convex set : $C$ such that the line segment between any two points lies in $C$, i.e. $\lambda x+(1-\lambda) y\in C$ for any $x, y\in C$ and any $0\le\lambda\le 1$ ; the boundary must belong to the set
  - intersection : $C=\cap_{i\in I} C_i$ is convex
  - projection onto $C$ : $P_C(x')=\arg\min_{y\in C}\norm{y-x'}$, unique and efficient to compute
- convex function : $dom(f)$ is a convex set and $f(\lambda x+(1-\lambda)y)\le\lambda f(x)+(1-\lambda)f(y)$ for any $x, y\in dom(f)$ and any $0\le \lambda \le 1$
  - geometrically : every chord lies above the graph of $f$
  - linear combination : $f=\sum_{i=1}^m \lambda_i f_i$ convex on $dom(f)=\cap_{i=1}^m dom(f_i)$ with $\lambda_i\in\R_+$
  - affine composition : $g(x)=Ax+b$ and $f$ convex, then $f\circ g$ convex on $dom(f\circ g)=\{x\in \R^m : g(x)\in dom(f)\}$
- norm : convex
  - positivity : $f(x)\ge 0$ ; without this, called a semi-norm
  - triangle inequality : $f(x+y)\le f(x)+f(y)$
  - homogeneity : $f(ax)=\abs{a}f(x)$
- convex optimization problem : $\min f(x)\;s.t.\; x\in C$ with $f$ and $C$ convex
  - local minimum : is a global minimum
- function graph : $\{(x,f(x))\mid x\in dom(f)\}$
- epigraph : $epi(f)=\{(x,\alpha)\in\R^{d+1}\mid x\in dom(f),\alpha\ge f(x)\}$ ; $f$ convex iff its epigraph is convex
- Jensen inequality : $f$ convex, $\lambda_i\in\R_+$ s.t. $\sum\lambda_i = 1$, then $f(\sum_{i=1}^m\lambda_i x_i)\le\sum_{i=1}^m\lambda_i f(x_i)$
- gradient : $dom(f)$ open, $f$ differentiable, $\nabla f(x)=(\frac{\partial f(x)}{\partial x_1},\ldots,\frac{\partial f(x)}{\partial x_d})$ exists at every point
  - first-order characterization : $f$ convex iff $dom(f)$ convex and $f(y)\ge f(x)+\nabla f(x)^\top(y-x)$ for all $x, y\in dom(f)$
- positive semidefinite : $M$ such that $x^\top M x\ge 0\;\forall x$, noted $M\succeq 0$
- hessian : $dom(f)$ open, $f$ twice differentiable, $\nabla^2 f(x)=\begin{pmatrix}\frac{\partial^2 f(x)}{\partial x_1\partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_1\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_d\partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_d\partial x_d} \end{pmatrix}$ exists at every point and is symmetric
  - second-order characterization : $f$ convex iff $dom(f)$ convex and $\nabla^2 f(x)\succeq 0$ for all $x\in dom(f)$
- local minimum : $f:dom(f)\to\R$, point $x$ s.t. $\exists\epsilon>0$ with $f(x)\le f(y)~~\forall y\in dom(f)$ satisfying $\norm{y-x}<\epsilon$
- global minimum : $f:dom(f)\to\R$ convex with local minimum $x^\star$, then $f(x^\star)\le f(y)~~\forall y\in dom(f)$
  - convex and differentiable : $\nabla f(x)=0$ implies $x$ is a global minimum
- strictly convex : $dom(f)$ convex and $f(\lambda x+(1-\lambda) y)< \lambda f(x)+(1-\lambda) f(y)$ for all $x\not=y\in dom(f)$ and $0< \lambda < 1$
  - at most one global minimum
- sublevel set : $f^{\le\alpha}=\{x\in dom(f):f(x)\le\alpha \}$
- Weierstrass theorem : $f$ convex, $dom(f)$ open, and some sublevel set nonempty and bounded implies $f$ has a global minimum
- gradient descent : $f$ convex and differentiable with global minimum $x^\star$, goal $f(x)-f(x^\star)\le\epsilon$ (see the sketch below)
  - iterative algorithm : $x_{t+1}=x_t-\gamma\nabla f(x_t)$ with stepsize $\gamma\ge 0$
  - trick : $2v^\top w=\norm{v}^2+\norm{w}^2-\norm{v-w}^2$, sum over steps
  - average error upper bound : $\sum_{t=0}^{T-1}(f(x_t)-f(x^\star))\le\frac{\gamma}{2}\sum_{t=0}^{T-1}\norm{\nabla f(x_t)}^2+\frac{1}{2\gamma}\norm{x_0-x^\star}^2$
  - last iterate is not necessarily the best one
  - stepsize is crucial
- $O(1/\epsilon^2)$ steps (the notation counts steps needed to reach an $\epsilon$ bound) : convex and differentiable with global minimum $x^\star$
  - suppose : $\norm{x_0-x^\star}\le R$, $\norm{\nabla f(x)}\le L$, stepsize $\gamma=\frac{R}{L\sqrt{T}}$
  - yields : $\frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^\star)\le\frac{RL}{\sqrt{T}}$, dimension independent, holds for best and average iterate
  - bounded gradient $\iff$ Lipschitz continuity of $f$
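A minimal sketch of vanilla gradient descent (Python with NumPy assumed); `grad_f`, `x0`, `gamma`, and `T` are illustrative placeholders, not fixed by the course material.

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, T):
    """Vanilla gradient descent: x_{t+1} = x_t - gamma * grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(T):
        x = x - gamma * grad_f(x)
        iterates.append(x.copy())
    return iterates  # the average or best iterate carries the guarantee

# example on f(x) = 1/2 ||x||^2, whose gradient is x
xs = gradient_descent(lambda x: x, x0=np.ones(3), gamma=0.5, T=20)
```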
- matrix (spectral) norm : $\norm{A}=\max_{\norm{x}=1}\norm{Ax}$
- smooth : convex and differentiable, $L\in\R_+$, $f(y)\le f(x)+\nabla f(x)^\top(y-x)+\frac{L}{2}\norm{x-y}^2~~\forall x, y\in \R^d$
  - geometrically : graph of $f$ below a not-too-steep tangential paraboloid
  - linear combination : $f_i$ $L_i$-smooth, $\lambda_i\in\R_+$, implies $f=\sum \lambda_i f_i$ smooth with $L=\sum \lambda_iL_i$
  - affine composition : $g(x)=Ax+b$ with $f$ $L$-smooth implies $f\circ g$ smooth with $L\norm{A}^2$
- $O(1/\epsilon)$ steps : convex and differentiable, $L$-smooth, with global minimum $x^\star$
  - suppose : $\gamma=\frac{1}{L}$
  - yields : monotone decrease $f(x_{t+1})\le f(x_t)-\frac{1}{2L}\norm{\nabla f(x_t)}^2$ and $f(x_T)-f(x^\star)\le\frac{L}{2T}\norm{x_0-x^\star}^2$
  - smoothness $\iff$ Lipschitz continuity of $\nabla f$ : $f$ convex and differentiable is $L$-smooth $\iff$ $\norm{\nabla f(x)-\nabla f(y)}\le L\norm{x-y}~~\forall x, y\in \R^d$
- convergence in one step : $f(x)=x^2$ ($L=2$ smooth) and $\gamma=\frac{1}{2}$, $x_{t+1}=0$
- exponential convergence : $f(x)=x^2$ (also $L=4$ smooth) and $\gamma=\frac{1}{4}$, $x_{t+1}=\frac{x_t}{2}$
- strong convexity : not too curved, not too flat, $f$ convex, differentiable, $f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{\mu}{2}\norm{x-y}^2~~\forall x,y\in\R^d~~\mu>0$, unique global minimum
  - step norm bound : $\norm{x_{t+1}-x^\star}^2\le 2\gamma(f(x^\star)-f(x_t))+\gamma^2\norm{\nabla f(x_t)}^2+(1-\mu\gamma)\norm{x_t-x^\star}^2$
- $O(\log(1/\epsilon))$ steps : convex, differentiable, $L$-smooth and $\mu$-strongly convex
  - suppose : $\gamma=\frac{1}{L}$
  - yields : squared distances decrease, $\norm{x_{t+1}-x^\star}^2\le (1-\frac{\mu}{L})\norm{x_t-x^\star}^2$ and $f(x_t)-f(x^\star)\le\frac{L}{2}(1-\frac{\mu}{L})^t\norm{x_0-x^\star}^2$
- separator matrix $W$ : $(Wx)_{d(x)}=\max_{j=0}^9 (Wx)_j~~\forall x\in P$
  - trivial separator : $(Wx)_i=(Wx)_j~~\forall i,j$
  - the loss $l$ has a global minimum $\iff$ all separators are trivial
- constrained optimization problem : minimize $f(x)$ s.t. $x\in X$ ; either transform it into an unconstrained problem (by penalization) or project the gradient steps
- projected gradient descent : project onto the convex set $X\subseteq\R^d$ after every step, $y_{t+1}=x_t-\gamma\nabla f(x_t)$, $x_{t+1}=\Pi_X(y_{t+1})=\arg\min_{x\in X}\norm{x-y_{t+1}}^2$ (see the sketch below)
  - projection properties, for $x\in X$ and $y\in\R^d$ : $(x-\Pi_X(y))^\top(y-\Pi_X(y))\le 0$ and $\norm{x-\Pi_X(y)}^2+\norm{y-\Pi_X(y)}^2\le\norm{x-y}^2$
- $O(1/\epsilon^2)$ steps : convex and differentiable with minimum $x^\star$ over $X$ closed
  - suppose : $\norm{x_0-x^\star}\le R$, $\norm{\nabla f(x)}\le L$, stepsize $\gamma=\frac{R}{L\sqrt{T}}$
  - yields : $\frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^\star)\le\frac{RL}{\sqrt{T}}$
- $O(1/\epsilon)$ steps : convex and differentiable with minimum $x^\star$ over $X$, $L$-smooth over $X$
  - suppose : $\gamma=\frac{1}{L}$
  - yields : monotone decrease $f(x_{t+1})\le f(x_t)-\frac{1}{2L}\norm{\nabla f(x_t)}^2+\frac{L}{2}\norm{y_{t+1}-x_{t+1}}^2$ and $f(x_T)-f(x^\star)\le\frac{L}{2T}\norm{x_0-x^\star}^2$
- $O(\log(1/\epsilon))$ steps : convex, differentiable, $L$-smooth over $X$ and $\mu$-strongly convex
  - suppose : $\gamma=\frac{1}{L}$
  - strengthened constrained vanilla bound : $\frac{1}{2\gamma}\left(\gamma^2\norm{\nabla f(\mathbf{x}_t)}^2+\norm{\mathbf{x}_t-\mathbf{x}^\star}^2-\norm{\mathbf{x}^+-\mathbf{x}^\star}^2-\norm{\mathbf{y}^+-\mathbf{x}^+}^2\right)-\frac{\mu}{2}\norm{\mathbf{x}_t-\mathbf{x}^\star}^2$
  - yields : $\norm{\mathbf{x}_{t+1}-\mathbf{x}^\star}^2\le\left(1-\frac{\mu}{L}\right)\norm{\mathbf{x}_t-\mathbf{x}^\star}^2$ and $f(\mathbf{x}_t)-f(\mathbf{x}^\star)\le\frac{L}{2}\left(1-\frac{\mu}{L}\right)^t\norm{\mathbf{x}_0-\mathbf{x}^\star}^2$
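A minimal projected gradient descent sketch; the Euclidean-ball projection, `grad_f`, and the fixed stepsize are illustrative assumptions, not the course's canonical example.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Projection onto the Euclidean ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def projected_gradient_descent(grad_f, project, x0, gamma, T):
    """y_{t+1} = x_t - gamma * grad f(x_t), then x_{t+1} = Pi_X(y_{t+1})."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(T):
        y = x - gamma * grad_f(x)
        x = project(y)
    return x

# minimize f(x) = ||x - c||^2 over the unit ball
c = np.array([2.0, 0.0])
x = projected_gradient_descent(lambda x: 2 * (x - c), project_ball, np.zeros(2), 0.1, 100)
```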
- projecting onto $\ell_1$-balls : $X = B_1(R) = \{\mathbf{x}\in\mathbb{R}^d : \norm{\mathbf{x}}_1=\sum_{i=1}^d \abs{x_i}\le R\}$ (see the sketch below)
  - w.l.o.g. : $R=1$, nonnegative entries $v_i\ge 0~~\forall i$, and $\mathbf{v}$ outside the simplex, $\sum_{i=1}^d v_i>1$
  - standard simplex : $\Delta_d := \{\mathbf{x}\in\mathbb{R}^d : \sum_{i=1}^d x_i = 1,\, x_i\ge 0~~\forall i\}$ with $\Pi_X(\mathbf{v})=\arg\min_{\mathbf{x}\in\Delta_d}\norm{\mathbf{x}-\mathbf{v}}^2$
  - optimality criterion for the constrained problem : $\nabla d_{\mathbf{v}}(\mathbf{x}^\star)^\top(\mathbf{x}-\mathbf{x}^\star)=2(\mathbf{x}^\star-\mathbf{v})^\top(\mathbf{x}-\mathbf{x}^\star)\ge 0\quad\forall \mathbf{x}\in\Delta_d$
  - unique index : $p\in\{1,\ldots, d\}$ with $\mathbf{v}$ ordered decreasingly s.t. $x_i^\star >0$ for $i\le p$ and $x_i^\star=0$ for $i >p$
  - nonzero entries : $x_i^\star = v_i - \Theta_p$ for $i\le p$ with $\Theta_p=\frac{1}{p}\left(\sum_{i=1}^p v_i - 1\right)$
  - summary : $d$ candidates for $\mathbf{x}^\star$, namely $\mathbf{x}^\star(p) := (v_1-\Theta_p,\ldots,v_p-\Theta_p,0,\ldots,0)$ for $p\in\{1,\ldots,d\}$ ; pick the $p$ whose $\mathbf{x}^\star(p)$ minimizes the distance to $\mathbf{v}$, in time $O(d\log d)$ (sorting, then checking incrementally)
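A short sketch of the sorting-based projection above (Python with NumPy assumed); `project_simplex` and `project_l1_ball` are illustrative names, not from the course material.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the standard simplex, O(d log d) via sorting."""
    d = v.size
    u = np.sort(v)[::-1]                          # sort decreasingly
    css = np.cumsum(u)
    # largest p such that u_p - (sum_{i<=p} u_i - 1)/p > 0
    p = np.nonzero(u - (css - 1.0) / np.arange(1, d + 1) > 0)[0][-1] + 1
    theta = (css[p - 1] - 1.0) / p
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, R=1.0):
    """Projection onto B_1(R): reduce to the simplex on |v|, then restore signs."""
    if np.abs(v).sum() <= R:
        return v.copy()                           # already inside the ball
    w = R * project_simplex(np.abs(v) / R)
    return np.sign(v) * w

x = project_l1_ball(np.array([0.8, -1.5, 0.3]), R=1.0)
```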
- proximal gradient descent : objective $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$ with $g$ a nice function and $h$ a simple additional term that is not nice or not differentiable (see the sketch below)
  - update : $\mathbf{x}_{t+1}=\arg\min_{\mathbf{y}} \frac{1}{2\gamma}\norm{\mathbf{y}-(\mathbf{x}_t-\gamma\nabla g(\mathbf{x}_t))}^2 + h(\mathbf{y})=\operatorname{prox}_{h,\gamma}(\mathbf{x}_t-\gamma\nabla g(\mathbf{x}_t))$ with $\operatorname{prox}_{h,\gamma}(\mathbf{z}) := \arg\min_{\mathbf{y}}\left\{\frac{1}{2\gamma}\norm{\mathbf{y}-\mathbf{z}}^2 + h(\mathbf{y})\right\}$, i.e. $\mathbf{x}_{t+1}=\mathbf{x}_t-\gamma G_{h,\gamma}(\mathbf{x}_t)$
  - generalized gradient : $G_{h,\gamma}(\mathbf{x}) := \frac{1}{\gamma}\left(\mathbf{x}-\operatorname{prox}_{h,\gamma}(\mathbf{x}-\gamma\nabla g(\mathbf{x}))\right)$
  - $h=0$ : gradient descent
  - $h=\iota_X$ : projected gradient descent with the indicator function $\mathbf{x}\mapsto\iota_X(\mathbf{x}) := \begin{cases} 0 & \text{if } \mathbf{x}\in X \\ +\infty & \text{otherwise}\end{cases}$
  - $O(1/\epsilon)$ steps : same as in the vanilla case, but now for any $h$
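A minimal proximal gradient (ISTA-style) sketch for $g(\mathbf x)=\frac12\norm{A\mathbf x-\mathbf b}^2$ and $h(\mathbf x)=\lambda\norm{\mathbf x}_1$, whose prox is soft-thresholding; `A`, `b`, `lam`, and the stepsize $1/L$ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau * ||.||_1 : shrink each coordinate towards zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, T=500):
    """Minimize 1/2 ||Ax - b||^2 + lam * ||x||_1 with stepsize gamma = 1/L."""
    L = np.linalg.norm(A, 2) ** 2          # smoothness constant of g: ||A||^2
    gamma = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(T):
        grad_g = A.T @ (A @ x - b)         # gradient of the smooth part
        x = soft_threshold(x - gamma * grad_g, gamma * lam)
    return x
```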
- subgradient : $\mathbf{g}\in\mathbb{R}^d$ is a subgradient of $f$ at $\mathbf{x}$ if $f(\mathbf{y})\ge f(\mathbf{x})+\mathbf{g}^\top(\mathbf{y}-\mathbf{x})\;\forall\mathbf{y}\in\operatorname{dom}(f)$
  - set of all subgradients (subdifferential) : $\partial f(\mathbf{x})\subseteq\mathbb{R}^d$
  - convexity : $f:\operatorname{dom}(f)\to\R$ convex iff $\operatorname{dom}(f)$ convex and $\partial f(\mathbf{x})\neq\emptyset\;\forall\mathbf{x}\in\operatorname{dom}(f)$
  - Lipschitz continuity : $f:\mathbb{R}^d\to\mathbb{R}$ convex, $B$ positive, $\norm{\mathbf{g}}\le B\;\forall\mathbf{x}\in\mathbb{R}^d\;\forall\mathbf{g}\in\partial f(\mathbf{x})$ is equivalent to $\abs{f(\mathbf{x})-f(\mathbf{y})}\le B\norm{\mathbf{x}-\mathbf{y}}\;\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^d$
  - update : $\mathbf{x}_{t+1}=\mathbf{x}_t-\gamma\mathbf{g}_t$ for $\mathbf{g}_t\in\partial f(\mathbf{x}_t)$
  - $O(1/\epsilon^2)$ steps : convex, relaxing differentiability
    - suppose : $\norm{x_0-x^\star}\le R$, stepsize $\gamma=\frac{R}{B\sqrt{T}}$
    - yields : $\frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^\star)\le\frac{RB}{\sqrt{T}}$
- Nesterov theorem : for any $T\le d-1$ and starting point $\mathbf{x}_0$, there exists a convex, $B$-Lipschitz $f$ over $\mathbb{R}^d$ s.t. any (sub)gradient method has objective error at least $f(\mathbf{x}_T)-f(\mathbf{x}^\star)\ge\frac{RB}{2(1+\sqrt{T+1})}$
- SGD : sum-structured objective $f(\mathbf{x})=\frac{1}{n}\sum_{i=1}^n f_i(\mathbf{x})$, each step $n$ times cheaper than a full gradient (see the sketch below)
  - update : $\mathbf{x}_{t+1}=\mathbf{x}_t-\gamma_t\nabla f_i(\mathbf{x}_t)$ for $i\in[n]$ uniformly at random
  - stochastic gradient : $\mathbf{g}_t=\nabla f_i(\mathbf{x}_t)$
  - unbiased : $\mathrm{E}[\mathbf{g}_t\mid\mathbf{x}_t]=\nabla f(\mathbf{x}_t)$
  - comparison of the bounded-gradient assumptions (Jensen's inequality) :
    - GD : $\norm{\frac{1}{n}\sum_i\nabla f_i(\mathbf{x})}^2\le B_{\mathrm{GD}}^2\quad\forall\mathbf{x}$
    - SGD : $\frac{1}{n}\sum_i\norm{\nabla f_i(\mathbf{x})}^2\le B_{\mathrm{SGD}}^2\quad\forall\mathbf{x}$
  - tower rule : $\mathrm{E}[\mathrm{E}[x\mid y]]=\mathrm{E}[x]$
  - $O(1/\epsilon^2)$ steps : convex and differentiable
    - suppose : $\norm{x_0-x^\star}\le R$, $\mathrm{E}[\norm{\mathbf{g}_t}^2]\le B^2$, stepsize $\gamma=\frac{R}{B\sqrt{T}}$
    - yields : $\frac{1}{T}\sum_{t=0}^{T-1}\mathrm{E}[f(\mathbf{x}_t)]-f(\mathbf{x}^\star)\le\frac{RB}{\sqrt{T}}$
    - extends to constrained optimization
  - $O(1/\epsilon)$ steps : $\mu$-strongly convex and differentiable
    - time-varying stepsize : $\gamma_t=\frac{2}{\mu(t+1)}$, decreasing over time
    - yields : $\mathrm{E}\left[f\left(\frac{2}{T(T+1)}\sum_{t=1}^T t\cdot\mathbf{x}_t\right)-f(\mathbf{x}^\star)\right]\le\frac{2B^2}{\mu(T+1)}$
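A minimal SGD sketch on the least-squares sum $f(\mathbf x)=\frac{1}{n}\sum_i \frac12(a_i^\top\mathbf x-b_i)^2$; the data `A`, `b`, the constant stepsize `gamma`, and the averaging choice are illustrative assumptions.

```python
import numpy as np

def sgd_least_squares(A, b, gamma, T, seed=0):
    """SGD on f(x) = 1/n sum_i 1/2 (a_i^T x - b_i)^2, one random row per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    avg = np.zeros(d)                      # running average of the iterates
    for t in range(T):
        i = rng.integers(n)                # i in [n] uniformly at random
        g = (A[i] @ x - b[i]) * A[i]       # stochastic gradient of f_i
        x = x - gamma * g
        avg += (x - avg) / (t + 1)
    return avg                             # the average iterate carries the guarantee
```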
- stochastic subgradient descent
  - update : $\mathbf{x}_{t+1} := \mathbf{x}_t-\gamma_t\mathbf{g}_t$ for $\mathbf{g}_t\in\partial f_i(\mathbf{x}_t)$ and $i\in[n]$ uniformly at random
  - unbiased : $\mathrm{E}[\mathbf{g}_t\mid\mathbf{x}_t]\in\partial f(\mathbf{x}_t)$
  - $O(1/\epsilon^2)$ steps : same as above, using the subgradient property
- mini-batch : $\tilde{\mathbf{g}}_t=\frac{1}{m}\sum_{j=1}^m\mathbf{g}_t^j$
  - $m=1$ : SGD
  - $m=n$ : GD
  - variance intuition : $\mathrm{E}[\norm{\tilde{\mathbf{g}}_t-\nabla f(\mathbf{x}_t)}^2]\le\frac{B^2}{m}$
- Newton-Raphson method (see the sketch below)
  - goal : find a zero of a differentiable $f:\mathbb{R}\to\mathbb{R}$
  - update : $x_{t+1}=x_t-\frac{f(x_t)}{f'(x_t)}$
  - $O(\log\log(1/\epsilon))$ steps : local result, $f$ convex, open ball $X\subseteq\operatorname{dom}(f)$ with center $\mathbf{x}^\star$
    - bounded inverse Hessians : $\norm{\nabla^2 f(\mathbf{x})^{-1}}\le\frac{1}{\mu}\;\forall\mathbf{x}\in X$
    - Lipschitz continuous Hessians : $\norm{\nabla^2 f(\mathbf{x})-\nabla^2 f(\mathbf{y})}\le L\norm{\mathbf{x}-\mathbf{y}}\;\forall\mathbf{x},\mathbf{y}\in X$
    - give : $\norm{\mathbf{x}_{t+1}-\mathbf{x}^\star}\le\frac{L}{2\mu}\norm{\mathbf{x}_t-\mathbf{x}^\star}^2$
    - yields : $\norm{\mathbf{x}_T-\mathbf{x}^\star}<\frac{2\mu}{L}\left(\frac{1}{2}\right)^{2^T}$ for $\norm{\mathbf{x}_0-\mathbf{x}^\star}<\frac{\mu}{L}$
    - lemma : $\int_0^1\nabla^2 f(\mathbf{x}+t(\mathbf{y}-\mathbf{x}))(\mathbf{y}-\mathbf{x})\,dt=\nabla f(\mathbf{y})-\nabla f(\mathbf{x})$
    - strong convexity : enforces bounded inverse Hessians, if $f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x})+\frac{\mu}{2}\norm{\mathbf{x}-\mathbf{y}}^2\;\forall\mathbf{x},\mathbf{y}\in X$ then $\norm{\nabla^2 f(\mathbf{x})^{-1}}\le 1/\mu\;\forall\mathbf{x}\in X$
  - equivalently : search for a zero of the derivative, $x_{t+1}=x_t-\frac{f'(x_t)}{f''(x_t)}$ in $1$-dim or $\mathbf{x}_{t+1}=\mathbf{x}_t-\nabla^2 f(\mathbf{x}_t)^{-1}\nabla f(\mathbf{x}_t)$ in $d$-dim
  - nondegenerate quadratic : $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^\top M\mathbf{x}-\mathbf{q}^\top\mathbf{x}+c$ with $M$ symmetric and invertible, $\mathbf{q}\in\mathbb{R}^d$, $c\in\R$, solved in one step $\mathbf{x}_1=\mathbf{x}^\star$, the unique solution $\mathbf{x}^\star=M^{-1}\mathbf{q}$ of $\nabla f(\mathbf{x})=0$
  - affine invariance : $A\in\mathbb{R}^{d\times d}$ invertible, $\mathbf{b}\in\mathbb{R}^d$, $g(\mathbf{y})=A\mathbf{y}+\mathbf{b}$, with Newton step $N_h(\mathbf{x})=\mathbf{x}-\nabla^2 h(\mathbf{x})^{-1}\nabla h(\mathbf{x})$ one gets $N_{f\circ g}=g^{-1}\circ N_f\circ g$, so the method does not suffer from coordinate scale differences
  - alternative interpretation : minimize the local second-order Taylor approximation, $\nabla^2 f(\mathbf{x}_t)\succ 0$ invertible, $\mathbf{x}_{t+1}=\arg\min_{\mathbf{x}\in\mathbb{R}^d} f(\mathbf{x}_t)+\nabla f(\mathbf{x}_t)^\top(\mathbf{x}-\mathbf{x}_t)+\frac{1}{2}(\mathbf{x}-\mathbf{x}_t)^\top\nabla^2 f(\mathbf{x}_t)(\mathbf{x}-\mathbf{x}_t)$
  - inversion of the Hessian : computational bottleneck
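A minimal $d$-dimensional Newton sketch that solves the Newton system instead of inverting the Hessian; `grad_f` and `hess_f` are assumed user-supplied callables.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, T=20):
    """Newton's method: x_{t+1} = x_t - (hess f(x_t))^{-1} grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g = grad_f(x)
        H = hess_f(x)
        # solve H s = g rather than forming the inverse explicitly
        x = x - np.linalg.solve(H, g)
    return x

# nondegenerate quadratic f(x) = 1/2 x^T M x - q^T x : converges in one step
M = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
x_star = newton_method(lambda x: M @ x - q, lambda x: M, np.zeros(2), T=1)
```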
- secant method : derivative-free (see the sketch below)
  - difference approximation : $x_{t+1}=x_t-f(x_t)\frac{x_t-x_{t-1}}{f(x_t)-f(x_{t-1})}$ for $\abs{x_t-x_{t-1}}$ small
  - approximating the second derivative : $x_{t+1}=x_t-f'(x_t)\frac{x_t-x_{t-1}}{f'(x_t)-f'(x_{t-1})}$
  - secant condition : $H_t=\frac{f'(x_t)-f'(x_{t-1})}{x_t-x_{t-1}}\approx f''(x_t)\iff f'(x_t)-f'(x_{t-1})=H_t(x_t-x_{t-1})$
  - update : $\mathbf{x}_{t+1}=\mathbf{x}_t-H_t^{-1}\nabla f(\mathbf{x}_t)$
- quasi-Newton : join the secant condition with a first-order Taylor approximation, $\nabla f(\mathbf{x}_t)-\nabla f(\mathbf{x}_{t-1})=H_t(\mathbf{x}_t-\mathbf{x}_{t-1})\approx\nabla^2 f(\mathbf{x}_t)(\mathbf{x}_t-\mathbf{x}_{t-1})$
  - finding a good matrix $H_t^{-1}$ : L-BFGS
  - Newton's method is quasi-Newton iff $f$ is a nondegenerate quadratic : not a generalization, but a family of related algorithms
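A 1-d secant sketch for minimizing a differentiable $f$ by finding a zero of $f'$; `df` is an assumed callable for the derivative.

```python
import math

def secant_minimize(df, x_prev, x, T=30):
    """Secant method applied to f': approximates Newton without needing f''."""
    for _ in range(T):
        d, d_prev = df(x), df(x_prev)
        if d == d_prev:                    # flat secant step, stop early
            break
        x_prev, x = x, x - d * (x - x_prev) / (d - d_prev)
    return x

# minimize f(x) = x^2 + exp(x), whose derivative is 2x + exp(x)
x_min = secant_minimize(lambda x: 2 * x + math.exp(x), -1.0, 0.0)
```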
- Frank-Wolfe (see the sketch below)
  - update : $\mathbf{s}=\operatorname{LMO}(\nabla f(\mathbf{x}_t))$, $\mathbf{x}_{t+1}=(1-\gamma)\mathbf{x}_t+\gamma\mathbf{s}$
  - stepsize : $\gamma=\frac{2}{t+2}$
    - line-search variant : $\gamma_t=\arg\min_{\gamma\in[0,1]} f((1-\gamma)\mathbf{x}_t+\gamma\mathbf{s})$
    - gap-based variant : $\gamma_t=\min\left\{\frac{g(\mathbf{x}_t)}{L\norm{\mathbf{s}-\mathbf{x}_t}^2}, 1\right\}$
  - linear minimization oracle : $\operatorname{LMO}(\mathbf{g})=\arg\min_{\mathbf{s}\in X}\langle\mathbf{s},\mathbf{g}\rangle$
  - properties : always feasible, reduces the non-linear problem to linear subproblems, projection-free, sparse iterates
  - duality gap : $g(\mathbf{x})=\langle\mathbf{x}-\mathbf{s},\nabla f(\mathbf{x})\rangle$ with certificate $g(\mathbf{x})=\max_{\mathbf{s}\in X}\langle\mathbf{x}-\mathbf{s},\nabla f(\mathbf{x})\rangle\ge\langle\mathbf{x}-\mathbf{x}^\star,\nabla f(\mathbf{x})\rangle\ge f(\mathbf{x})-f(\mathbf{x}^\star)$
    - convergence : convex and $L$-smooth, choosing any of the above stepsizes yields $g(\mathbf{x}_t)\le\frac{27/4\, C_f}{T+1}$
  - $O(1/\epsilon)$ steps : convex and $L$-smooth
    - stepsize : any of the above
    - yields : $f(\mathbf{x}_T)-f(\mathbf{x}^\star)\le\frac{2L\operatorname{diam}(X)^2}{T+1}$ with $\operatorname{diam}(X)=\max_{\mathbf{x},\mathbf{y}\in X}\norm{\mathbf{x}-\mathbf{y}}$
  - affine invariant : same as Newton, but for constrained problems
    - curvature constant : $C_f=\sup_{\mathbf{x},\mathbf{s}\in X,\gamma\in[0,1],\mathbf{y}=\mathbf{x}+\gamma(\mathbf{s}-\mathbf{x})}\frac{2}{\gamma^2}\left(f(\mathbf{y})-f(\mathbf{x})-\langle\mathbf{y}-\mathbf{x},\nabla f(\mathbf{x})\rangle\right)$, which gives $C_f\le L\operatorname{diam}(X)^2$ for any norm
  - extensions
    - approximate LMO : additive or multiplicative accuracy
    - randomized LMO
    - unconstrained : matching pursuit variants
  - use cases : when projection is more costly than solving a linear problem (L1 ball, matrix completion, relaxations of combinatorial problems)
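A minimal Frank-Wolfe sketch on the $\ell_1$-ball $B_1(R)$, whose LMO just returns a signed, scaled basis vector at the largest-magnitude gradient coordinate; `grad_f`, `R`, and the default stepsize $2/(t+2)$ follow the notes, the rest is illustrative.

```python
import numpy as np

def lmo_l1_ball(g, R):
    """LMO(g) = argmin_{||s||_1 <= R} <s, g> : a vertex of the l1 ball."""
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -R * np.sign(g[i])
    return s

def frank_wolfe(grad_f, d, R, T=200):
    x = np.zeros(d)                        # feasible starting point
    for t in range(T):
        s = lmo_l1_ball(grad_f(x), R)
        gamma = 2.0 / (t + 2)              # default stepsize from the notes
        x = (1 - gamma) * x + gamma * s
    g = grad_f(x)
    gap = (x - lmo_l1_ball(g, R)) @ g      # duality gap: certificate of suboptimality
    return x, gap

x, gap = frank_wolfe(lambda x: 2 * (x - 0.3), d=5, R=1.0)
```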
- coordinate descent : update one coordinate at a time (see the sketch below)
  - update : $\mathbf{x}_{t+1}=\mathbf{x}_t+\gamma\mathbf{e}_{i_t}$ for $i_t\in[d]$
    - gradient-based stepsize : $\mathbf{x}_{t+1}=\mathbf{x}_t-\frac{1}{L}\nabla_{i_t}f(\mathbf{x}_t)\mathbf{e}_{i_t}$ for $L$-smooth $f$
    - exact coordinate minimization : $\arg\min_{\gamma\in\mathbb{R}} f(\mathbf{x}_t+\gamma\mathbf{e}_{i_t})$, solved in closed form
  - randomized update : $\mathbf{x}_{t+1}=\mathbf{x}_t-\frac{1}{L}\nabla_{i_t}f(\mathbf{x}_t)\mathbf{e}_{i_t}$ for $i_t\in[d]$ uniformly at random, faster than GD if a step is cheaper than a full gradient
    - coordinate-wise smoothness : $f(\mathbf{x}+\gamma\mathbf{e}_i)\le f(\mathbf{x})+\gamma\nabla_i f(\mathbf{x})+\frac{L}{2}\gamma^2\;\forall\mathbf{x}\in\mathbb{R}^d,\forall\gamma\in\mathbb{R},\forall i$
    - equivalent to coordinate-wise Lipschitz gradient : $\abs{\nabla_i f(\mathbf{x}+\gamma\mathbf{e}_i)-\nabla_i f(\mathbf{x})}\le L\abs{\gamma}\;\forall\mathbf{x}\in\mathbb{R}^d,\forall\gamma\in\mathbb{R},\forall i$
  - $O(\log(1/\epsilon))$ steps : coordinate-wise smooth $L$, strongly convex $\mu$
    - assume : $\gamma=1/L$, $\mathbf{x}_{t+1}=\mathbf{x}_t-\frac{1}{L}\nabla_{i_t}f(\mathbf{x}_t)\mathbf{e}_{i_t}$ with $i_t$ uniformly at random
    - yields : $\mathrm{E}[f(\mathbf{x}_t)-f^\star]\le\left(1-\frac{\mu}{dL}\right)^t[f(\mathbf{x}_0)-f^\star]$
    - lemma : strongly convex functions satisfy $\frac{1}{2}\norm{\nabla f(\mathbf{x})}^2\ge\mu(f(\mathbf{x})-f^\star)\;\forall\mathbf{x}$
  - Polyak-Lojasiewicz inequality (PL) : $\frac{1}{2}\norm{\nabla f(\mathbf{x})}^2\ge\mu(f(\mathbf{x})-f^\star)\;\forall\mathbf{x}$
  - sampling : uniform is not always best
    - individual smoothness constant $L_i$ for coordinate $i$ : $f(\mathbf{x}+\gamma\mathbf{e}_i)\le f(\mathbf{x})+\gamma\nabla_i f(\mathbf{x})+\frac{L_i}{2}\gamma^2$
    - selection probability : $P[i_t=i]=\frac{L_i}{\sum_i L_i}$
    - stepsize : $\gamma=1/L_{i_t}$
    - converges : $\mathrm{E}[f(\mathbf{x}_t)-f^\star]\le\left(1-\frac{\mu}{d\overline{L}}\right)^t[f(\mathbf{x}_0)-f^\star]$ with $\overline{L}=\frac{1}{d}\sum_{i=1}^d L_i$ and often $\overline{L}\ll L=\max_i L_i$
  - steepest coordinate descent : $i_t=\arg\max_{i\in[d]}\abs{\nabla_i f(\mathbf{x}_t)}$, useful under hardware memory limits (GPU)
    - same convergence as the random rule : $\max_i\abs{\nabla_i f(\mathbf{x})}^2\ge\frac{1}{d}\sum_i\abs{\nabla_i f(\mathbf{x})}^2$, max vs average
    - stepsize : $\gamma=1/L$
    - yields : $\mathrm{E}[f(\mathbf{x}_t)-f^\star]\le\left(1-\frac{\mu}{dL}\right)^t[f(\mathbf{x}_0)-f^\star]$
    - can be faster : strong convexity measured with respect to the $\ell_1$-norm instead of the Euclidean norm, $f(\mathbf{y})\ge f(\mathbf{x})+\langle\nabla f(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle+\frac{\mu_1}{2}\norm{\mathbf{y}-\mathbf{x}}_1^2$
      - can see : $\frac{\mu}{d}\le\mu_1\le\mu$, by equivalence of norms
      - stepsize : $\gamma=1/L$
      - yields : $\mathrm{E}[f(\mathbf{x}_t)-f^\star]\le\left(1-\frac{\mu_1}{L}\right)^t[f(\mathbf{x}_0)-f^\star]$, dimension-free
      - lemma : strongly convex w.r.t. the $\ell_1$-norm with $\mu_1$ satisfies $\frac{1}{2}\norm{\nabla f(\mathbf{x})}_\infty^2\ge\mu_1(f(\mathbf{x})-f^\star)$
  - non-smooth objective : CD can fail, getting permanently stuck
    - alternative : non-smooth part separable over coordinates, $f(\mathbf{x})=g(\mathbf{x})+h(\mathbf{x})$ with $h(\mathbf{x})=\sum_i h_i(x_i)$
    - state-of-the-art for GLMs : $f(\mathbf{x})=g(A\mathbf{x})+\sum_i h_i(x_i)$ with regularization
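A minimal randomized coordinate descent sketch on a coordinate-wise smooth quadratic $f(\mathbf x)=\frac12\mathbf x^\top M\mathbf x-\mathbf q^\top\mathbf x$; the inputs `M`, `q` and the uniform sampling are illustrative assumptions.

```python
import numpy as np

def coordinate_descent(M, q, T=1000, seed=0):
    """Randomized CD on f(x) = 1/2 x^T M x - q^T x (M symmetric positive definite)."""
    rng = np.random.default_rng(seed)
    d = q.size
    L = np.max(np.diag(M))                 # coordinate-wise smoothness constant
    x = np.zeros(d)
    for _ in range(T):
        i = rng.integers(d)                # pick a coordinate uniformly at random
        grad_i = M[i] @ x - q[i]           # i-th partial derivative
        x[i] -= grad_i / L                 # gradient-based stepsize 1/L
    return x
```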
- momentum : $\mathbf{x}_{t+1}=\mathbf{x}_t-\gamma\nabla f(\mathbf{x}_t)+\nu[\mathbf{x}_t-\mathbf{x}_{t-1}]$ (see the sketch below)
- accelerated gradient method AGD
  - update : $\mathbf{x}_{t+1}=\tau\mathbf{z}_t+(1-\tau)\mathbf{y}_t$, $\mathbf{y}_{t+1}=\mathbf{x}_{t+1}-\frac{1}{L}\nabla f(\mathbf{x}_{t+1})$, $\mathbf{z}_{t+1}=\mathbf{z}_t-\gamma\nabla f(\mathbf{x}_{t+1})$ with $\mathbf{x}_0=\mathbf{y}_0=\mathbf{z}_0$
  - $O(1/\sqrt{\epsilon})$ steps : convex, differentiable, $L$-smooth
    - assume : $\norm{\mathbf{x}_0-\mathbf{x}^\star}\le R$, $\abs{f(\mathbf{x}_0)-f(\mathbf{x}^\star)}\le d$
    - stepsize : $\gamma=\frac{R}{\sqrt{dL}}$ with $\tau$ s.t. $\frac{1-\tau}{\tau}=\gamma L$
    - satisfies : $f\left(\frac{1}{T}\sum_{t=0}^{T-1}\mathbf{x}_t\right)-f(\mathbf{x}^\star)\le\frac{d}{2}$
  - strongly convex : $O(\sqrt{\frac{L}{\mu}})$ updates suffice for $\norm{\mathbf{x}-\mathbf{x}^\star}^2\le\frac{1}{2}\norm{\mathbf{x}_0-\mathbf{x}^\star}^2$
  - $O(\log(1/\epsilon))$ convergence in the iterates : by repeatedly restarting AGD, $\mu$-strongly convex and $L$-smooth, with constant $\sqrt{\frac{L}{\mu}}$ instead of the $\frac{L}{\mu}$ of GD
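A minimal sketch of the momentum update above (heavy-ball form); `grad_f`, `gamma`, and `nu` are assumed inputs.

```python
import numpy as np

def momentum_gd(grad_f, x0, gamma, nu, T=100):
    """Gradient descent with momentum: x_{t+1} = x_t - gamma*grad + nu*(x_t - x_{t-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(T):
        x, x_prev = x - gamma * grad_f(x) + nu * (x - x_prev), x
    return x

# quadratic example: f(x) = 1/2 ||x||^2, gradient x
x = momentum_gd(lambda x: x, np.ones(3), gamma=0.1, nu=0.9)
```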
- conjugate : $f^*(\mathbf{y}) := \max_{\mathbf{x}} \mathbf{x}^\top\mathbf{y}-f(\mathbf{x})$, a.k.a. Legendre transform or Fenchel conjugate
  - always convex : even if $f$ is not, as a point-wise maximum of affine functions of $\mathbf{y}$
  - Fenchel's inequality : $f(\mathbf{x})+f^*(\mathbf{y})\ge\mathbf{x}^\top\mathbf{y}$ for any $\mathbf{x},\mathbf{y}$
  - conjugate of the conjugate : $f^{**}\le f$ with equality iff $f$ closed and convex ($\mathbf{y}\in\partial f(\mathbf{x})\Leftrightarrow\mathbf{x}\in\partial f^*(\mathbf{y})\Leftrightarrow f(\mathbf{x})+f^*(\mathbf{y})=\mathbf{x}^\top\mathbf{y}$)
  - separable functions : $f(\mathbf{u},\mathbf{v})=f_1(\mathbf{u})+f_2(\mathbf{v})$ gives $f^*(\mathbf{w},\mathbf{z})=f_1^*(\mathbf{w})+f_2^*(\mathbf{z})$
  - indicator function : $\iota_C(\mathbf{x}) := \begin{cases} 0 & \mathbf{x}\in C \\ +\infty & \text{otherwise}\end{cases}$ with conjugate the support function $f^*(\mathbf{y})=\max_{\mathbf{x}\in C}\mathbf{y}^\top\mathbf{x}$
  - norm : $f(\mathbf{x})=\norm{\mathbf{x}}$ gives $f^*(\mathbf{y})=\iota_{\{\mathbf{z}:\norm{\mathbf{z}}_*\le 1\}}(\mathbf{y})$
    - dual pair : $\norm{\cdot}_1\leftrightarrow\norm{\cdot}_\infty$
- GLM : $\min_{\mathbf{x}\in\mathbb{R}^d,\mathbf{w}\in\mathbb{R}^n} f(\mathbf{w})+g(\mathbf{x})$ s.t. $\mathbf{w}=A\mathbf{x}$
  - Lagrange dual : $\mathcal{L}(\mathbf{u})=\min_{\mathbf{x}\in\mathbb{R}^d,\mathbf{w}\in\mathbb{R}^n} f(\mathbf{w})+g(\mathbf{x})+\mathbf{u}^\top(A\mathbf{x}-\mathbf{w})=-f^*(\mathbf{u})-g^*(-A^\top\mathbf{u})$
  - dual problem : $\max_{\mathbf{u}\in\mathbb{R}^n}\left[\mathcal{L}(\mathbf{u})=-f^*(\mathbf{u})-g^*(-A^\top\mathbf{u})\right]$
- Lasso : $\min_{\mathbf{x}\in\mathbb{R}^d}\frac{1}{2}\norm{A\mathbf{x}-\mathbf{b}}^2+\lambda\norm{\mathbf{x}}_1$ (see the duality-gap sketch below)
  - dual : $\max_{\mathbf{u}\in\mathbb{R}^n}-f^*(\mathbf{u})-g^*(-A^\top\mathbf{u})\Leftrightarrow\max_{\mathbf{u}\in\mathbb{R}^n}\frac{1}{2}\norm{\mathbf{b}}^2-\frac{1}{2}\norm{\mathbf{b}-\mathbf{u}}^2$ s.t. $\norm{-A^\top\mathbf{u}/\lambda}_\infty\le 1\Leftrightarrow\min_{\mathbf{u}\in\mathbb{R}^n}\norm{\mathbf{b}-\mathbf{u}}^2$ s.t. $\norm{A^\top\mathbf{u}}_\infty\le\lambda$
- why duality : certificate of the current optimization quality, $f(A\overline{\mathbf{x}})+g(\overline{\mathbf{x}})\ge\min_{\mathbf{x}\in\mathbb{R}^d} f(A\mathbf{x})+g(\mathbf{x})\ge\max_{\mathbf{u}\in\mathbb{R}^n}-f^*(\mathbf{u})-g^*(-A^\top\mathbf{u})\ge-f^*(\overline{\mathbf{u}})-g^*(-A^\top\overline{\mathbf{u}})$, stopping criterion, dual sometimes easier to solve
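A sketch of the certificate idea for the Lasso, under the assumption that scaling the residual $A\bar{\mathbf x}-\mathbf b$ to satisfy $\norm{A^\top\mathbf u}_\infty\le\lambda$ gives a dual-feasible point; the function and variable names are illustrative.

```python
import numpy as np

def lasso_duality_gap(A, b, lam, x):
    """Primal minus dual objective at a scaled residual: an upper bound on suboptimality."""
    r = A @ x - b                                        # primal residual
    primal = 0.5 * r @ r + lam * np.abs(x).sum()
    # scale the residual so that the dual constraint ||A^T u||_inf <= lam holds
    scale = min(1.0, lam / max(np.max(np.abs(A.T @ r)), 1e-15))
    u = scale * r
    dual = -0.5 * u @ u - u @ b                          # -f*(u), with g*(-A^T u) = 0 once feasible
    return primal - dual                                 # >= primal suboptimality, usable as stopping criterion
```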
- zero-order optimization : gradient-free, blackbox
  - random search : no communication needed, competitive for reinforcement learning and hyperparameter optimization (see the sketch below)
    - update : $\mathbf{x}_{t+1}=\mathbf{x}_t+\gamma\mathbf{d}_t$, picking a random direction $\mathbf{d}_t\in\mathbb{R}^d$ and doing line-search $\gamma=\arg\min_{\gamma\in\mathbb{R}} f(\mathbf{x}_t+\gamma\mathbf{d}_t)$
    - convergence : same as GD up to a slow-down factor $d$
      - convex : rate $\mathcal{O}(dL/\varepsilon)$
      - strongly convex : $\mathcal{O}(dL\log(1/\varepsilon))$
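A minimal random-search sketch; the Gaussian direction and the crude grid stand-in for exact line-search are illustrative choices, not the course's precise setup.

```python
import numpy as np

def random_search(f, x0, T=200, seed=0):
    """Gradient-free search: random direction plus an approximate line-search."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    gammas = np.linspace(-1.0, 1.0, 41)    # crude stand-in for exact line-search
    for _ in range(T):
        d = rng.standard_normal(x.size)    # random direction
        candidates = [x + g * d for g in gammas]
        x = min(candidates, key=f)         # keep the best point along the line
    return x

x = random_search(lambda z: np.sum((z - 1.0) ** 2), np.zeros(5))
```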
- reinforcement learning : $\mathbf{s}_{t+1}=f(\mathbf{s}_t,\mathbf{a}_t,\mathbf{e}_t)$
  - state : $\mathbf{s}_t$
  - action : $\mathbf{a}_t$
  - noise : $\mathbf{e}_t$
  - control policy : $\mathbf{a}_t=\pi(\mathbf{a}_1,\ldots,\mathbf{a}_{t-1},\mathbf{s}_0,\ldots,\mathbf{s}_t)$
  - maximize reward : $\max_{\mathbf{a}_t}\mathrm{E}_{\mathbf{e}_t}\left[\sum_{t=0}^N R_t(\mathbf{s}_t,\mathbf{a}_t)\right]$ s.t. $\mathbf{s}_{t+1}=f(\mathbf{s}_t,\mathbf{a}_t,\mathbf{e}_t)$ and $\mathbf{s}_0$ given
- AdaGrad : strong performance in practice (see the sketch below)
  - update : $[\mathbf{x}_{t+1}]_i=[\mathbf{x}_t]_i-\frac{\gamma}{\sqrt{[G_t]_i}}[\mathbf{g}_t]_i$ for each feature $i$, with $[G_t]_i := \sum_{s=0}^t([\mathbf{g}_s]_i)^2$ and $\mathbf{g}_t$ a stochastic gradient
  - variants : Adadelta, Adam, RMSprop
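A minimal AdaGrad sketch using the per-coordinate update above; `grad_f` is an assumed (stochastic) gradient callable, and `eps` is a small constant commonly added for numerical stability, not part of the formula above.

```python
import numpy as np

def adagrad(grad_f, x0, gamma=0.1, T=500, eps=1e-12):
    """AdaGrad: per-coordinate stepsize gamma / sqrt(sum of squared past gradients)."""
    x = np.asarray(x0, dtype=float)
    G = np.zeros_like(x)                   # running sum of squared gradients
    for _ in range(T):
        g = grad_f(x)                      # (stochastic) gradient at x
        G += g ** 2
        x = x - gamma * g / (np.sqrt(G) + eps)
    return x

x = adagrad(lambda x: 2 * (x - 3.0), np.zeros(4))
```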
- trends
  - privacy in ML
  - decentralized training
  - ML for trust : intrusion
  - trust in ML : adversarial attacks
    - training : $\min_{\mathbf{w}}(f_{\mathbf{w}}(\mathbf{x}_i)-y_i)^2$, change the model via $\nabla_{\mathbf{w}} f$
    - attacking : $\min_{\mathbf{x}_i}(f_{\mathbf{w}}(\mathbf{x}_i)-y_i)^2$, change the data via $\nabla_{\mathbf{x}_i} f$
- tricks
  - feature hashing
  - limited-precision operations
- parallel
  - CoCoA : communication-efficient distributed optimization
  - primal-dual
- autoML
  - hyper-parameter optimization
  - learning to learn
  - neural architecture search