---
typora-copy-images-to: ./img
---

$$ \def\R{\mathbb R} \def\abs#1{\left |#1\right |} \def\norm#1{\left|\left|#1\right|\right|} $$

CS439

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Spring 2018: Optimization for Machine Learning

[TOC]

Convexity

  • general optimization problem : $\text{minimize}\;f(x)\;\text{with}\;x\in\R^d$
  • objective function : $f:\R^d\to\R$, typically assume $f $ continuous and differentiable
  • mathematical modeling : defining and measuring machine learning model
  • computational optimization : learning model parameters
  • convex set : if the line segment between any two points lies in $C$, i.e. $\lambda x+(1-\lambda) y\in C$ for any $x, y\in C$ and any $0\le\lambda\le 1$; the border needs to be in the set
    • intersection : $C=\cap_{i\in I} C_i$ convex
    • projections onto $C$ : $P_C(x')=\arg\min_{y\in C}\norm{y-x'}$ unique, efficient to compute
  • convex function : if $dom(f)$ convex set and $f(\lambda x+(1-\lambda)y)\le\lambda f(x)+(1-\lambda)f(y)$ for any $x, y\in dom(f)$ and any $0\le \lambda \le 1$
    • geometrically : the line segment between any two points on the graph lies above the graph of $f $
    • linear : $f=\sum_{i=1}^m \lambda_i f_i$ convex on $dom(f)=\cap_{i=1}^m dom(f_i)$ with $\lambda_i\in\R_+$
    • affine : $g(x)=Ax+b$ and $f $ convex, $f\circ g$ convex on $dom(f\circ g)=\{x\in \R^m : g(x)\in dom(f)\}$
  • norm : convex
    • positive definiteness : $f(x)\ge 0$ and $f(x)=0$ only if $x=0$, without the latter called a semi-norm
    • triangle inequality : $f(x+y)\le f(x)+f(y)$
    • homogeneity : $f(ax)=\abs{a}f(x)$
  • convex optimization problems : $\min f(x)\;s.t.\; x\in C$ with $f $ and $C$ convex
    • local minimum : global minimum
  • function graph : $\{(x,f(x))\mid x\in dom(f)\}$
  • epigraph : $epi(f)=\{(x,\alpha)\in\R^{d+1}\mid x\in dom(f),\alpha\ge f(x)\}$, $f $ convex iff epigraph convex
  • jensen inequality : $f $ convex, $\lambda_i\in\R_+$ s.t. $\sum\lambda_i = 1$, $f(\sum_{i=1}^m\lambda_i x_i)\le\sum_{i=1}^m\lambda_i f(x_i)$
  • gradient : $dom(f)$ open, $f $ differentiable, $\nabla f(x)=(\frac{\partial f(x)}{\partial x_1},\ldots,\frac{\partial f(x)}{\partial x_d})$ exists at every point
    • $f $ convex : iff $dom(f)$ convex and $f(y)\ge f(x)+\nabla f(x)^\top(y-x)$ for all $x,y\in dom(f)$
  • positive semidefinite : if $x^\top M x\ge 0\;\forall x$, noted $M\succeq 0$
  • hessian : $dom(f)$ open, twice differentiable, $\nabla^2 f(x)=\begin{pmatrix}\frac{\partial^2 f(x)}{\partial x_1\partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_1\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_d\partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_d\partial x_d} \end{pmatrix}$ exists at every point and symmetric
    • $f $ convex : iff $dom(f)$ convex and $\nabla^2 f(x)\succeq 0$ for all $x\in dom(f)$
  • local minimum : $f:dom(f)\to\R$, point $x$ s.t. $\exists\epsilon>0$ with $f(x)\le f(y)~~\forall y\in dom(f)$ satisfying $\norm{y-x}<\epsilon$
  • global minimum : $f:dom(f)\to\R$ convex with local minimum $x^*$, $f(x^*)\le f(y)~~\forall y\in dom(f)$
    • convex and differentiable : $\nabla f(x)=0$ implies global minimum
  • strictly convex : if $dom(f)$ convex and $f(\lambda x+(1-\lambda) y)< \lambda f(x)+(1-\lambda) f(y)$ for all $x\not=y\in dom(f)$ and $0< \lambda < 1$
    • at most one global minimum
  • sublevel sets : $f^{\le\alpha}=\{x\in dom(f):f(x)\le\alpha \}$
  • Weierstrass theorem : $dom(f)$ open and convex with a nonempty and bounded sublevel set implies $f $ has a global minimum
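
A minimal numeric sanity check of the first- and second-order characterizations above, assuming NumPy; the quadratic $f(x)=\frac12 x^\top A x$ with $A\succeq 0$ is a made-up example, not taken from the course:

```python
import numpy as np

# Made-up convex example: f(x) = 0.5 * x^T A x with A = M^T M positive semidefinite.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M.T @ M                                   # PSD by construction

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
hessian = A                                   # constant Hessian of a quadratic

# Second-order characterization: Hessian PSD (eigenvalues >= 0 up to round-off).
assert np.all(np.linalg.eigvalsh(hessian) >= -1e-10)

# First-order characterization: f(y) >= f(x) + grad(x)^T (y - x) on sampled points.
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-10
```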

Gradient descent

  • gradient descent : convex, differentiable, global minimum $x^*$, goal $f(x)-f(x^*)\le\epsilon$ (sketch at the end of this section)
    • iterative algorithm : $x_{t+1}=x_t-\gamma\nabla f(x_t)$ with stepsize $\gamma\ge 0$
    • trick : $2v^\top w=\norm{v}^2+\norm{w}^2-\norm{v-w}^2$, sum over steps
    • average error upper bound : $\sum_{t=0}^{T-1}(f(x_t)-f(x^*))\le\frac{\gamma}{2}\sum_{t=0}^{T-1}\norm{\nabla f(x_t)}^2+\frac{1}{2\gamma}\norm{x_0-x^*}^2$
    • last iterate not necessarily best one
    • stepsize crucial
    • $O(1/\epsilon^2)$ steps (notation for the number of steps to reach an $\epsilon$ bound) : convex and differentiable with global minimum $x^*$
      • suppose : $\norm{x_0-x^*}\le R$, $\norm{\nabla f(x)}\le L$, stepsize $\gamma=\frac{R}{L\sqrt{T}}$
      • yields : $\frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^*)\le\frac{RL}{\sqrt{T}}$, dimension independent, holds for best and average iterate
      • bounded gradient $\iff$ Lipschitz continuity of $f $
  • matrix (spectral) norm : $\norm{A}=\max_{\norm{x}=1}\norm{Ax}$
  • smooth : convex and differentiable, $L\in R_+$, $f(y)\le f(x)+\nabla f(x)^\top(y-x)+\frac{L}{2}\norm{x-y}^2~~\forall x, y\in \R^d$
    • geometrically : graph of $f $ below not-too-steep tangential paraboloid
    • linear : $f_i$ $L_i$-smooth, $\lambda_i\in \R_+$, implies $f=\sum \lambda_i f_i$ smooth with $L=\sum \lambda_iL_i$
    • affine : $g(x)=Ax+b$ with $f $ $L$ smooth implies $f\circ g$ smooth with $L\norm {A}^2$
    • $O(1/\epsilon)$ steps : convex and differentiable with global minimum $x^*$
      • suppose : $\gamma=\frac{1}{L}$
      • yields : monotone decreasing $f(x_{t+1})\le f(x_t)-\frac{1}{2L}\norm{\nabla f(x_t)}^2$ and $f(x_T)-f(x^*)\le\frac{L}{2T}\norm{x_0-x^*}^2$
      • smoothness $\iff$ Lipschitz continuity of $\nabla f$
      • $f $ convex, differentiable, smooth with parameter $L$ $\iff$ $\norm{\nabla f(x)-\nabla f(y)}\le L\norm {x-y}~~\forall x, y\in \R^d$
  • convergence in one step : $f(x)=x^2$ ($L=2$ smooth) and $\gamma=\frac{1}{2}$, $x_{t+1}=0$
  • exponential convergence : $f(x)=x^2$ (also $L=4$ smooth, a valid but looser constant) and $\gamma=\frac{1}{4}$, $x_{t+1}=\frac{x_t}{2}$
  • strong convexity : not too curved, not too flat, $f$ convex, differentiable, $f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{\mu}{2}\norm{x-y}^2~~\forall x,y\in\R^d~~\mu>0$, unique global minimum
    • step norm bound : $\norm{x_{t+1}-x^*}^2\le 2\gamma(f(x^*)-f(x_t))+\gamma^2\norm{\nabla f(x_t)}^2+(1-\mu\gamma)\norm{x_t-x^*}^2$
    • $O(\log(1/\epsilon))$ steps : convex, differentiable, $L$ smooth and $\mu$ strongly convex
      • suppose : $\gamma=\frac{1}{L}$
      • yields : squared distances decreasing $\norm{x_{t+1}-x^*}^2\le (1-\frac{\mu}{L})\norm{x_t-x^*}^2$ and $f(x_t)-f(x^*)\le\frac{L}{2}(1-\frac{\mu}{L})^t\norm{x_0-x^*}^2$
  • separator matrix $W$ (digit classification example with classes $0,\ldots,9$; $d(x)$ is the digit of example $x$, $P$ the set of examples) : $(Wx)_{d(x)}=\max_{j=0}^9 (Wx)_j~~\forall x\in P$
    • trivial separator : $(Wx)_i=(Wx)_j~~\forall i,j$
    • loss $l$ has a global minimum $\iff$ all separators are trivial
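
A minimal sketch of the iteration $x_{t+1}=x_t-\gamma\nabla f(x_t)$ with the $\gamma=1/L$ stepsize from the smooth case, assuming NumPy; the least-squares objective and all problem data are illustrative, not from the notes:

```python
import numpy as np

# Illustrative smooth convex objective: f(x) = 0.5 * ||A x - b||^2,
# gradient A^T (A x - b), smoothness constant L = ||A^T A|| (spectral norm).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)               # spectral norm gives a smoothness constant

x = np.zeros(10)
gamma = 1.0 / L                              # stepsize from the O(1/eps) smooth analysis
for t in range(500):
    x = x - gamma * grad(x)                  # x_{t+1} = x_t - gamma * grad f(x_t)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(f(x) - f(x_star))                      # suboptimality, should be close to 0
```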

Projected gradient descent

  • constrained optimization problem : minimize $f(x)$ s.t. $x\in X$, either transform it into an unconstrained problem (by penalization) or project back onto $X$ after each gradient step

  • projected gradient descent : project onto convex set $X\subseteq\R^d$ after every step, $y_{t+1}=x_t-\gamma\nabla f(x_t)$, $x_{t+1}=\Pi_X(y_{t+1})=\arg\min_{x\in X}\norm{x-y_{t+1}}^2$

    • for $x\in X$ and $y\in\R^d$ : $(x-\Pi_X(y))^\top(y-\Pi_X(y))\le 0$ and $\norm{x-\Pi_X(y)}^2+\norm{y-\Pi_X(y)}^2\le\norm{x-y}^2$

    • $O(1/\epsilon^2)$ steps : convex and differentiable with minimum $x^*$ over $X$ closed

      • suppose : $\norm{x_0-x^*}\le R$, $\norm{\nabla f(x)}\le L$, stepsize $\gamma=\frac{R}{L\sqrt{T}}$
      • yields : $\frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^*)\le\frac{RL}{\sqrt{T}}$
    • $O(1/\epsilon)$ steps : convex and differentiable with minimum $x^*$ over $X$, $L$-smooth over $X$

      • suppose : $\gamma=\frac{1}{L}$
      • yields : monotone decreasing $f(x_{t+1})\le f(x_t)-\frac{1}{2L}\norm{\nabla f(x_t)}^2+\frac{L}{2}\norm{y_{t+1}-x_{t+1}}^2$ and $f(x_T)-f(x^*)\le\frac{L}{2T}\norm{x_0-x^*}^2$
    • $O(\log(1/\epsilon))$ steps : convex, differentiable, $L$-smooth over $X$ and $\mu$ strongly convex

      • suppose : $\gamma=\frac{1}{L}$
      • strengthen constrained vanilla bound (with $\mathbf{x}^+=\mathbf{x}_{t+1}$, $\mathbf{y}^+=\mathbf{y}_{t+1}$) : $\frac{1}{2\gamma}\left(\gamma^2\norm{\nabla f(\mathbf{x}_t)}^2+\norm{\mathbf{x}_t-\mathbf{x}^\star}^2-\norm{\mathbf{x}^+-\mathbf{x}^\star}^2-\norm{\mathbf{y}^+-\mathbf{x}^+}^2\right)-\frac{\mu}{2}\norm{\mathbf{x}_t-\mathbf{x}^\star}^2$
      • yields : $\norm{\mathbf{x}_{t+1}-\mathbf{x}^\star}^2\leq\left(1-\frac{\mu}{L}\right)\norm{\mathbf{x}_t-\mathbf{x}^\star}^2$ and $f(\mathbf{x}_t)-f(\mathbf{x}^\star)\leq\frac{L}{2}\left(1-\frac{\mu}{L}\right)^t\norm{\mathbf{x}_0-\mathbf{x}^\star}^2$
    • projecting onto $l_1$-balls : $X=B_1(R)=\left\{\mathbf{x}\in\mathbb{R}^d:\norm{\mathbf{x}}_1=\sum_{i=1}^d\abs{x_i}\leq R\right\}$

      • w.l.o.g. : $R=1$, positive face only $v_i\ge 0~~\forall i$, outside simplex $\sum_{i=1}^d v_i>1$
      • standard simplex : $\Delta_d:=\left\{\mathbf{x}\in\mathbb{R}^d:\sum_{i=1}^d x_i=1,\,x_i\geq 0~~\forall i\right\}$ with $\Pi_X(\mathbf{v})=\arg\min_{\mathbf{x}\in\Delta_d}\norm{\mathbf{x}-\mathbf{v}}^2$
        • optimality criterion for constrained : $\nabla d_{\mathbf{v}}(\mathbf{x}^\star)^T(\mathbf{x}-\mathbf{x}^\star)=2(\mathbf{x}^\star-\mathbf{v})^T(\mathbf{x}-\mathbf{x}^\star)\geq 0\quad\forall\mathbf{x}\in\Delta_d$
        • unique index : $p\in\{1,\ldots, d\}$ with $\bf v$ ordered decreasingly s.t. $x_i^\star >0$ for $i\le p$ and $x_i^\star=0$ for $i >p$
        • nonzero : $x _ { i } ^ { \star } = v _ { i } - \Theta _ { p } \quad i \leq p$ with $\Theta _ { p } = \frac { 1} { p } \left( \sum _ { i = 1} ^ { p } v _ { i } - 1\right)$
        • summary : $d$ candidates for $\bf x^\star$, namely $\mathbf{x}^\star(p):=\left(v_1-\Theta_p,\cdots,v_p-\Theta_p,0,\dots,0\right)\quad p\in\{1,\ldots,d\}$, pick the $p$ s.t. $\mathbf{x}^\star(p)$ minimizes the distance to $\bf v$, in time $O(d\log d)$ (sorting, checking incrementally; see the sketch at the end of this section)
  • proximal gradient descent : objective $f ( x ) = g ( x ) + h ( x )$ with $g$ a nice (differentiable) function and $h$ a simple additional term, possibly not nice or not differentiable

    • update : $\mathbf{x}_{t+1}=\arg\min_{\mathbf{y}}\frac{1}{2\gamma}\norm{\mathbf{y}-(\mathbf{x}_t-\gamma\nabla g(\mathbf{x}_t))}^2+h(\mathbf{y})=\operatorname{prox}_{h,\gamma}(\mathbf{x}_t-\gamma\nabla g(\mathbf{x}_t))$ with $\operatorname{prox}_{h,\gamma}(\mathbf{z}):=\arg\min_{\mathbf{y}}\left\{\frac{1}{2\gamma}\norm{\mathbf{y}-\mathbf{z}}^2+h(\mathbf{y})\right\}$, $\mathbf{x}_{t+1}=\mathbf{x}_t-\gamma G_{h,\gamma}(\mathbf{x}_t)$
    • generalized gradient : $G _ { h ,\gamma } ( \mathbf { x } ) : = \frac { 1} { \gamma } \left( \mathbf { x } - \text{prox} _ { h ,\gamma } ( \mathbf { x } - \gamma \nabla g ( \mathbf { x } ) ) \right)$
      • $h=0$ : gradient descent
      • $h=\iota_X$ : projected gradient descent with $\mathbf{x}\mapsto\iota_X(\mathbf{x}):=\left\{\begin{array}{ll} 0 & \text{if } \mathbf{x}\in X \\ +\infty & \text{otherwise}\end{array}\right.$
    • $O(1/\epsilon)$ steps : same as vanilla case but now for any $h$
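
A sketch of the $O(d\log d)$ sorting-based projection onto the standard simplex described above, assuming NumPy (the reduction from the full $\ell_1$-ball to the simplex via signs and scaling is the w.l.o.g. step mentioned earlier); projected gradient descent would call this projection after every gradient step:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}, following the
    construction above: sort v decreasingly, find the largest p with v_p > Theta_p,
    subtract Theta_p from the first p coordinates and set the rest to zero."""
    u = np.sort(v)[::-1]                              # v ordered decreasingly
    css = np.cumsum(u)
    p = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1] + 1
    theta = (css[p - 1] - 1.0) / p                    # Theta_p = (sum_{i<=p} v_i - 1) / p
    return np.maximum(v - theta, 0.0)

# Quick feasibility check on a random point outside the simplex.
rng = np.random.default_rng(0)
x = project_simplex(rng.standard_normal(5))
assert abs(x.sum() - 1.0) < 1e-9 and np.all(x >= 0)
```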

Subgradient descent

  • subgradient : $\mathbf{g}\in\mathbb{R}^d$ if $f(\mathbf{y})\geq f(\mathbf{x})+\mathbf{g}^T(\mathbf{y}-\mathbf{x})\;\forall\mathbf{y}\in\operatorname{dom}(f)$
    • set of all possible : $\partial f ( \mathbf { x } ) \subseteq \mathbb { R } ^ { d }$
    • convexity : $f:\operatorname{dom}(f)\rightarrow\R$ convex iff $\operatorname{dom}(f)$ convex and $\partial f(\mathbf{x})\neq\emptyset\;\forall\mathbf{x}\in\operatorname{dom}(f)$
    • lipschitz continuity : $f:\mathbb{R}^d\rightarrow\mathbb{R}$ convex, $B$ positive, $\norm{\mathbf{g}}\leq B\;\forall\mathbf{x}\in\mathbb{R}^d\;\forall\mathbf{g}\in\partial f(\mathbf{x})$ equivalent to $\abs{f(\mathbf{x})-f(\mathbf{y})}\leq B\norm{\mathbf{x}-\mathbf{y}}\;\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^d$
    • update : $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } - \gamma \mathbf { g } _ { t }$ for $\mathbf { g } _ { t } \in \partial f \left( \mathbf { x } _ { t } \right)$
    • $O(1/\epsilon^2)$ steps : convex, relaxing the differentiability assumption
      • suppose : $\norm{x_0-x^*}\le R$, stepsize $\gamma=\frac{R}{B\sqrt{T}}$
      • yields : $\frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^*)\le\frac{RB}{\sqrt{T}}$
  • Nesterov theorem : for any $T\leq d-1$ and starting point $\mathbf{x}_0$ (with $\norm{\mathbf{x}_0-\mathbf{x}^\star}\le R$), there exists a convex, $B$-Lipschitz $f$ over $\R^d$ s.t. any (sub)gradient method has objective error at least $f(\mathbf{x}_T)-f(\mathbf{x}^\star)\geq\frac{RB}{2(1+\sqrt{T+1})}$
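
A sketch of the subgradient update $\mathbf{x}_{t+1}=\mathbf{x}_t-\gamma\mathbf{g}_t$, assuming NumPy; $f(x)=\norm{Ax-b}_1$ is a made-up non-differentiable example, $A^\top\operatorname{sign}(Ax-b)$ is one valid subgradient, and the stepsize constants are guessed rather than tuned:

```python
import numpy as np

# Made-up non-differentiable convex objective: f(x) = ||A x - b||_1.
# One valid subgradient at x is A^T sign(A x - b) (any sign choice at zeros works).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 5))
b = rng.standard_normal(40)

f = lambda x: np.sum(np.abs(A @ x - b))
subgrad = lambda x: A.T @ np.sign(A @ x - b)

T = 2000
gamma = 1.0 / np.sqrt(T)             # R/(B*sqrt(T))-style constant stepsize, constants guessed
x = np.zeros(5)
best = x.copy()
for t in range(T):
    x = x - gamma * subgrad(x)
    if f(x) < f(best):               # the last iterate need not be the best one
        best = x.copy()
print(f(best))
```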

Stochastic gradient descent

  • SGD : sum-structured objective $f ( \mathbf { x } ) = \frac { 1 } { n } \sum _ { i = 1 } ^ { n } f _ { i } ( \mathbf { x } )$, each step $n$ times cheaper than a full-gradient step (sketch at the end of this section)
    • update : $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } - \gamma _ { t } \nabla f _ { i } \left( \mathbf { x } _ { t } \right)$ for $i \in [ n ]$ uniformly at random
    • stochastic gradient : $\mathbf { g } _ { t } = \nabla f _ { i } \left( \mathbf { x } _ { t } \right)$
    • unbiased : $\mathrm { E } \left[ \mathbf { g } _ { t } | \mathbf { x } _ { t } \right] = \nabla f \left( \mathbf { x } _ { t } \right)$
    • comparison : by Jensen's inequality $B_{\mathrm{GD}}\le B_{\mathrm{SGD}}$
      • GD : $\norm{\frac{1}{n}\sum_i\nabla f_i(\mathbf{x})}^2\leq B_{\mathrm{GD}}^2\quad\forall\mathbf{x}$
      • SGD : $\frac{1}{n}\sum_i\norm{\nabla f_i(\mathbf{x})}^2\leq B_{\mathrm{SGD}}^2\quad\forall\mathbf{x}$
    • tower rule : $\mathrm{E}[\mathrm{E}[x\mid y]]=\mathrm{E}[x]$
    • $O(1/\epsilon^2)$ steps : convex and differentiable
      • suppose : $\norm{x_0-x^*}\le R$, $\mathbb { E } \left[ | \mathbf { g } _ { t } | ^ { 2 } \right] \leq B ^ { 2 }$, stepsize $\gamma=\frac{R}{B\sqrt{T}}$
      • yields : $\frac { 1 } { T } \sum _ { t = 0 } ^ { T - 1 } \mathrm { E } \left[ f \left( \mathbf { x } _ { t } \right) \right] - f \left( \mathbf { x } ^ { \star } \right) \leq \frac { R B } { \sqrt { T } }$
    • extends to constrained optimization
    • $O(1/\epsilon)$ steps : strongly convex $\mu$ and differentiable
      • time-varying stepsize : $\gamma _ { t } = \frac { 2 } { \mu ( t + 1 ) }$ decreasing over time
      • yields : $E \left[ f \left( \frac { 2 } { T ( T + 1 ) } \sum _ { t = 1 } ^ { T } t \cdot \mathbf { x } _ { t } \right) - f \left( \mathbf { x } ^ { \star } \right) \right] \leq \frac { 2 B ^ { 2 } } { \mu ( T + 1 ) }$
  • stochastic subgradient descent
    • update : $\mathbf { x } _ { t + 1 } : = \mathbf { x } _ { t } - \gamma _ { t } \mathbf { g } _ { t }$ for $\mathbf { g } _ { t } \in \partial f _ { i } \left( \mathbf { x } _ { t } \right)$ and $i \in [ n ]$ uniformly at random
    • unbiased : $\mathrm { E } \left[ \mathbf { g } _ { t } | \mathbf { x } _ { t } \right] \in \partial f \left( \mathbf { x } _ { t } \right)$
    • $O(1/\epsilon^2)$ steps : same by using subgradient property
  • mini-batch : $\tilde { \mathbf { g } } _ { t } = \frac { 1 } { m } \sum _ { j = 1 } ^ { m } \mathbf { g } _ { t } ^ { j }$
    • $m=1$ : SGD
    • $m=n$ : GD
    • variance intuition : $\mathbb { E } \left[ | \mathbf { \tilde { g } } _ { t } - \nabla f \left( \mathbf { x } _ { t } \right) | ^ { 2 } \right] \leq \frac { B ^ { 2 } } { m }$
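
A sketch of mini-batch SGD on a finite-sum least-squares objective, assuming NumPy; the data are illustrative, the stepsize follows the $\gamma_t=\frac{2}{\mu(t+1)}$ rule from the strongly convex case, and the weighted average iterate is the one appearing in the bound above:

```python
import numpy as np

# Finite sum f(x) = (1/n) * sum_i f_i(x) with f_i(x) = 0.5*(a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d, m = 200, 10, 8                             # n terms, dimension d, mini-batch size m
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
mu = np.linalg.eigvalsh(A.T @ A / n)[0]          # strong convexity constant of f

T = 3000
x = np.zeros(d)
x_avg = np.zeros(d)
for t in range(1, T + 1):
    idx = rng.choice(n, size=m, replace=False)   # mini-batch drawn uniformly at random
    g = A[idx].T @ (A[idx] @ x - b[idx]) / m     # unbiased stochastic gradient of f
    x = x - (2.0 / (mu * (t + 1))) * g           # gamma_t = 2 / (mu * (t + 1))
    x_avg += 2.0 * t / (T * (T + 1)) * x         # weighted average (2/(T(T+1))) * sum_t t*x_t

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(f(x_avg) - f(x_star))                      # suboptimality of the averaged iterate
```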

Newton's method

  • Newton-Raphson method

    • goal : find a zero of a differentiable $f : \mathbb { R } \rightarrow \mathbb { R }$

    • update : $x _ { t + 1 } = x _ { t } - \frac { f \left( x _ { t } \right) } { f ^ { \prime } \left( x _ { t } \right) }$

    • $O(\log\log(1/\epsilon))$ steps : local, convex $f$, open ball $X \subseteq \operatorname { dom } ( f )$ with center $\mathbf { x } ^ { \star }$

      • bounded inverse Hessians : $\norm{\nabla^2 f(\mathbf{x})^{-1}}\leq\frac{1}{\mu}\;\forall\mathbf{x}\in X$
      • Lipschitz continuous Hessians : $\norm{\nabla^2 f(\mathbf{x})-\nabla^2 f(\mathbf{y})}\leq L\norm{\mathbf{x}-\mathbf{y}}\;\forall\mathbf{x},\mathbf{y}\in X$
      • give : $\norm{\mathbf{x}_{t+1}-\mathbf{x}^\star}\leq\frac{L}{2\mu}\norm{\mathbf{x}_t-\mathbf{x}^\star}^2$
      • yields : $\norm{\mathbf{x}_T-\mathbf{x}^\star}<\frac{2\mu}{L}\left(\frac{1}{2}\right)^{2^T}$ for $\norm{\mathbf{x}_0-\mathbf{x}^\star}<\frac{\mu}{L}$
      • lemma : $\int _ { 0 } ^ { 1 } \nabla ^ { 2 } f ( \mathbf { x } + t ( \mathbf { y } - \mathbf { x } ) ) ( \mathbf { y } - \mathbf { x } ) d t = \nabla f ( \mathbf { y } ) - \nabla f ( \mathbf { x } )$
      • strong convexity : enforce bounded inverse Hessians, if $f(\mathbf{y})\geq f(\mathbf{x})+\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x})+\frac{\mu}{2}\norm{\mathbf{x}-\mathbf{y}}^2\;\forall\mathbf{x},\mathbf{y}\in X$ then $\norm{\nabla^2 f(\mathbf{x})^{-1}}\leq 1/\mu\;\forall\mathbf{x}\in X$
    • equivalently : search for a zero of derivative, $x _ { t + 1 } = x _ { t } - \frac { f ^ { \prime } \left( x _ { t } \right) } { f ^ { \prime \prime } \left( x _ { t } \right) }$ in $1$-dim or $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } - \nabla ^ { 2 } f \left( \mathbf { x } _ { t } \right) ^ { - 1 } \nabla f \left( \mathbf { x } _ { t } \right)$ in $d$-dim

    • nondegenerate quadratic : $f ( \mathbf { x } ) = \frac { 1 } { 2 } \mathbf { x } ^ { T } M \mathbf { x } - \mathbf { q } ^ { T } \mathbf { x } + c$ with $M$ symmetric invertible, $\mathbf { q } \in \mathbb { R } ^ { d }$, $c \in R$, solved in one step $\mathbf { x } _ { 1 } = \mathbf { x } ^ { \star }$ as unique solution $\mathbf { x } ^ { \star } = M ^ { - 1 } \mathbf { q }$ of $\nabla f(x)=0$

    • affine invariance : $A \in \mathbb { R } ^ { d \times d }$ invertible, $\mathbf { b } \in \mathbb { R } ^ { d }$, $g ( \mathbf { y } ) = A \mathbf { y } + \mathbf { b }$ with newton step $N _ { h } ( \mathbf { x } ) = \mathbf { x } - \nabla ^ { 2 } h ( \mathbf { x } ) ^ { - 1 } \nabla h ( \mathbf { x } )$ give $N _ { f \circ g } = g ^ { - 1 } \circ N _ { f } \circ g$, does not suffer from coordinate scale difference

    • alternative interpretation : minimize local second-order Taylor approximation, $\nabla ^ { 2 } f \left( \mathbf { x } _ { t } \right) \succ 0$ invertible, $\mathbf { x } _ { t + 1 } = \arg \min _ { \mathbf { x } \in \mathbb { R } ^ { d } } f \left( \mathbf { x } _ { t } \right) + \nabla f \left( \mathbf { x } _ { t } \right) ^ { T } \left( \mathbf { x } - \mathbf { x } _ { t } \right)+ \frac { 1 } { 2 } \left( \mathbf { x } - \mathbf { x } _ { t } \right) ^ { T } \nabla ^ { 2 } f \left( \mathbf { x } _ { t } \right) \left( \mathbf { x } - \mathbf { x } _ { t } \right)$

    • inversion of Hessian : computational bottleneck
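
A sketch of the $d$-dimensional Newton update, solving the linear system $\nabla^2 f(\mathbf{x}_t)\,\mathbf{s}=\nabla f(\mathbf{x}_t)$ instead of forming the inverse; the $\ell_2$-regularized logistic objective and its data are made up for illustration (NumPy assumed):

```python
import numpy as np

# Made-up strongly convex objective: l2-regularized logistic loss,
# f(x) = sum_i [log(1 + exp(a_i^T x)) - y_i * a_i^T x] + (mu/2) * ||x||^2.
rng = np.random.default_rng(0)
n, d, mu = 100, 5, 0.1
A = rng.standard_normal((n, d))
y = (rng.random(n) < 0.5).astype(float)

def grad(x):
    p = 1.0 / (1.0 + np.exp(-A @ x))             # predicted probabilities
    return A.T @ (p - y) + mu * x

def hessian(x):
    p = 1.0 / (1.0 + np.exp(-A @ x))
    return (A * (p * (1 - p))[:, None]).T @ A + mu * np.eye(d)

x = np.zeros(d)
for t in range(10):
    # Newton step: solve H(x) s = grad(x) rather than inverting the Hessian.
    x = x - np.linalg.solve(hessian(x), grad(x))

print(np.linalg.norm(grad(x)))                   # drops to ~machine precision (quadratic convergence)
```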

Quasi-Newton methods

  • secant method : derivative-free
    • difference approximation : $x _ { t + 1 } = x _ { t } - f \left( x _ { t } \right) \frac { x _ { t } - x _ { t - 1 } } { f \left( x _ { t } \right) - f \left( x _ { t - 1 } \right) }$ for $| x _ { t } - x _ { t - 1 } |$ small
    • second derivative : $x _ { t + 1 } = x _ { t } - f ^ { \prime } \left( x _ { t } \right) \frac { x _ { t } - x _ { t - 1 } } { f ^ { \prime } \left( x _ { t } \right) - f ^ { \prime } \left( x _ { t - 1 } \right) }$
    • secant condition : $H _ { t } = \frac { f ^ { \prime } \left( x _ { t } \right) - f ^ { \prime } \left( x _ { t - 1 } \right) } { x _ { t } - x _ { t - 1 } } \approx f ^ { \prime \prime } \left( x _ { t } \right)\iff f ^ { \prime } \left( x _ { t } \right) - f ^ { \prime } \left( x _ { t - 1 } \right) = H _ { t } \left( x _ { t } - x _ { t - 1 } \right)$
    • update : $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } - H _ { t } ^ { - 1 } \nabla f \left( \mathbf { x } _ { t } \right)$
  • quasi-newton : join secant condition with first-order Taylor, $\nabla f \left( \mathbf { x } _ { t } \right) - \nabla f \left( \mathbf { x } _ { t - 1 } \right) = H _ { t } \left( \mathbf { x } _ { t } - \mathbf { x } _ { t - 1 } \right) \approx \nabla ^ { 2 } f \left( \mathbf { x } _ { t } \right) \left( \mathbf { x } _ { t } - \mathbf { x } _ { t - 1 } \right)$
    • finding a good matrix $H _ { t } ^ { - 1 }$ : L-BFGS
    • Newton's method is quasi-Newton iff $f$ nondegenerate quadratic : quasi-Newton methods are not a generalization of Newton's method but a family of related algorithms
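
A one-dimensional sketch of the secant update applied to $f'$ (the "second derivative" variant above); the test function $f(x)=\sin(x)+x^2$ and the starting points are made up:

```python
import math

# Secant variant of Newton for 1-d minimization: approximate f''(x_t) by the
# finite difference H_t = (f'(x_t) - f'(x_{t-1})) / (x_t - x_{t-1}).
fprime = lambda x: math.cos(x) + 2.0 * x      # derivative of f(x) = sin(x) + x^2 (f'' = 2 - sin(x) > 0)

x_prev, x = 1.0, 0.5                          # the method needs two starting points
for t in range(20):
    if x == x_prev:                           # converged to machine precision
        break
    H = (fprime(x) - fprime(x_prev)) / (x - x_prev)   # secant approximation of f''(x_t)
    x_prev, x = x, x - fprime(x) / H                  # x_{t+1} = x_t - f'(x_t) / H_t
print(x, fprime(x))                           # f'(x) ~ 0 at the minimizer
```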

Frank-Wolfe

  • Frank-Wolfe
    • update : $\mathbf { s } = \operatorname { LMO } \left( \nabla f \left( \mathbf { x } _ { t } \right) \right)$, $\mathbf { x } _ { t + 1 } = ( 1 - \gamma ) \mathbf { x } _ { t } + \gamma \mathbf { s }$

    • stepsize : $\gamma = \frac { 2 } { t + 2 }$

      • line-search variant : $\gamma _ { t } = \arg \min _ { \gamma \in [ 0,1 ] } f \left( ( 1 - \gamma ) \mathbf { x } _ { t } + \gamma \mathbf { s } \right)$
      • gap-based variant : $\gamma_t=\min\left\{\frac{g(\mathbf{x}_t)}{L\norm{\mathbf{s}-\mathbf{x}_t}^2},1\right\}$
    • linear minimization oracle : $\operatorname{LMO}(\mathbf{g})=\arg\min_{\mathbf{s}\in X}\langle\mathbf{s},\mathbf{g}\rangle$

    • properties : always feasible, reduce non-linear to linear, projection-free, sparse iterates

    • duality gap : $g ( \mathbf { x } ) = \langle \mathbf { x } - \mathbf { s } , \nabla f ( \mathbf { x } ) \rangle$ with certificate $g ( \mathbf { x } ) = \max _ { \mathbf { s } \in X } \langle \mathbf { x } - \mathbf { s } , \nabla f ( \mathbf { x } ) \rangle \geq \left\langle \mathbf { x } - \mathbf { x } ^ { \star } , \nabla f ( \mathbf { x } ) \right\rangle \geq f ( \mathbf { x } ) - f \left( \mathbf { x } ^ { \star } \right)$

      • convergence : convex and smooth $L$, choosing any stepsize yields $g \left( \mathbf { x } _ { t } \right) \leq \frac { 27 / 4 C _ { f } } { T + 1 }$


    • $O(1/\epsilon)$ steps : convex and smooth $L$

      • stepsize : any of above
      • time-varying stepsize : $f(\mathbf{x}_T)-f(\mathbf{x}^\star)\leq\frac{2L\operatorname{diam}(X)^2}{T+1}$ with $\operatorname{diam}(X)=\max_{\mathbf{x},\mathbf{y}\in X}\norm{\mathbf{x}-\mathbf{y}}$
    • affine invariant : same as Newton but for constrained problems

      • curvature constant : $C _ { f } = \sup _ { \mathbf { x } , \mathbf { s } \in X , \gamma \in [ 0,1 ], \mathbf { y } = \mathbf { x } + \gamma ( \mathbf { s - x } ) } \frac { 2 } { \gamma ^ { 2 } } ( f ( \mathbf { y } ) - f ( \mathbf { x } ) - \langle \mathbf { y } - \mathbf { x } , \nabla f ( \mathbf { x } ) \rangle )$ give $C _ { f } \leq L \operatorname { diam } ( X ) ^ { 2 }$ for any norm
    • extensions

      • approximate LMO : additive or multiplicative accuracy
      • randomized LMO
      • unconstrained : matching pursuit variants
    • use cases : when projection more costly than solving linear problem (L1, matrix completion, relaxation of combinatorial problems)
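
A sketch of Frank-Wolfe over an $\ell_1$-ball, assuming NumPy; the least-squares objective, the radius and the iteration count are illustrative. The LMO over the $\ell_1$-ball puts all its mass on a single coordinate, which is what makes the iterates sparse:

```python
import numpy as np

# min 0.5*||A x - b||^2 over the l1-ball of radius R, with the classic 2/(t+2) stepsize.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
b = rng.standard_normal(30)
R = 2.0

grad = lambda x: A.T @ (A @ x - b)

def lmo_l1(g, R):
    """argmin_{||s||_1 <= R} <s, g>: put everything on the largest-|g_i| coordinate."""
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -R * np.sign(g[i])
    return s

x = np.zeros(20)
for t in range(200):
    g = grad(x)
    s = lmo_l1(g, R)
    gap = (x - s) @ g                        # duality gap certificate: gap >= f(x) - f(x*)
    x = (1 - 2.0 / (t + 2)) * x + (2.0 / (t + 2)) * s
print(gap)                                   # decreases like O(1/t)
```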

Coordinate descent

  • coordinate descent : one coordinate at a time
    • update : $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } + \gamma \mathbf { e } _ { i _ { t } }$ for $i _ { t } \in [ d ]$
      • gradient-based stepsize : $\mathbf{x}_{t+1}=\mathbf{x}_t-\frac{1}{L}\nabla_{i_t} f(\mathbf{x}_t)\,\mathbf{e}_{i_t}$ for $L$-smooth
      • exact coordinate minimization : $\arg \min _ { \gamma \in \mathbb { R } } f \left( \mathbf { x } _ { t } + \gamma \mathbf { e } _ { i _ { t } } \right)$ solve closed form
    • randomized update : $\mathbf{x}_{t+1}=\mathbf{x}_t-\frac{1}{L}\nabla_{i_t} f(\mathbf{x}_t)\,\mathbf{e}_{i_t}$ for $i_t\in[d]$ uniformly at random, faster than GD if a step is cheaper than a full gradient (sketch at the end of this section)
    • coordinate-wise smoothness : $f(\mathbf{x}+\gamma\mathbf{e}_i)\leq f(\mathbf{x})+\gamma\nabla_i f(\mathbf{x})+\frac{L}{2}\gamma^2\;\forall\mathbf{x}\in\mathbb{R}^d,\forall\gamma\in\mathbb{R},\forall i$
      • equivalent to coordinate-wise Lipschitz gradient : $\abs{\nabla_i f(\mathbf{x}+\gamma\mathbf{e}_i)-\nabla_i f(\mathbf{x})}\leq L\abs{\gamma}\;\forall\mathbf{x}\in\mathbb{R}^d,\forall\gamma\in\mathbb{R},\forall i$
    • $O(\log(1/\epsilon))$ steps : coordinate-wise smooth $L$, strongly convex $\mu$
      • assume : $\gamma=1/L$, $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } - \frac { 1 } { L } \nabla _ { i _ { t } } f \left( \mathbf { x } _ { t } \right) \mathbf { e } _ { i _ { t } }$ uniformly at random
      • yields : $\mathrm { E } \left[ f \left( \mathbf { x } _ { t } \right) - f ^ { \star } \right] \leq \left( 1 - \frac { \mu } { d L } \right) ^ { t } \left[ f \left( \mathbf { x } _ { 0 } \right) - f ^ { \star } \right]$
      • lemma : strongly convex satisfy $\frac{1}{2}\norm{\nabla f(\mathbf{x})}^2\geq\mu(f(\mathbf{x})-f^\star)\;\forall\mathbf{x}$
    • Polyak-Lojasiewicz inequality PL : $\frac{1}{2}\norm{\nabla f(\mathbf{x})}^2\geq\mu(f(\mathbf{x})-f^\star)\;\forall\mathbf{x}$
    • sampling : uniformly not always best
      • individual smoothness constants $L_i$ for coordinate $i$ : $f \left( \mathbf { x } + \gamma \mathbf { e } _ { i } \right) \leq f ( \mathbf { x } ) + \gamma \nabla _ { i } f ( \mathbf { x } ) + \frac { L _ { i } } { 2 } \gamma ^ { 2 }$
      • selection probability : $P \left[ i _ { t } = i \right] = \frac { L _ { i } } { \sum _ { i } L _ { i } }$
      • stepsize : $\gamma=1 / L _ { i_t }$
      • converge : $\mathrm { E } \left[ f \left( \mathbf { x } _ { t } \right) - f ^ { \star } \right] \leq \left( 1 - \frac { \mu } { d \overline { L } } \right) ^ { t } \left[ f \left( \mathbf { x } _ { 0 } \right) - f ^ { \star } \right]$ with $\overline { L } = \frac { 1 } { d } \sum _ { i = 1 } ^ { d } L _ { i }$ and often $\overline { L } \ll L = \max _ { i } L _ { i }$
    • steepest coordinate descent : $i_t=\arg\max_{i\in[d]}\abs{\nabla_i f(\mathbf{x}_t)}$, useful for hardware-limited memory (GPU)
      • same convergence as random : $\max _ { i } | \nabla _ { i } f ( \mathbf { x } ) | ^ { 2 } \geq \frac { 1 } { d } \sum _ { i } | \nabla _ { i } f ( \mathbf { x } ) | ^ { 2 }$, max vs avg
      • stepsize : $\gamma=1/L$
      • yields : $\mathrm { E } \left[ f \left( \mathbf { x } _ { t } \right) - f ^ { \star } \right] \leq \left( 1 - \frac { \mu } { d L } \right) ^ { t } \left[ f \left( \mathbf { x } _ { 0 } \right) - f ^ { \star } \right] $
      • can be faster : strong convexity measured with respect to the $\ell_1$-norm instead of euclidean, $f(\mathbf{y})\geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle+\frac{\mu_1}{2}\norm{\mathbf{y}-\mathbf{x}}_1^2$
        • can see : $\frac { \mu } { d } \leq \mu_1 \leq \mu$, equivalence of norms
        • stepsize : $\gamma=1/L$
        • yields : $\mathbb { E } \left[ f \left( \mathbf { x } _ { t } \right) - f ^ { \star } \right] \leq \left( 1 - \frac { \mu _ { 1 } } { L } \right) ^ { t } \left[ f \left( \mathbf { x } _ { 0 } \right) - f ^ { \star } \right]$, dimension-free
        • lemma : strongly convex w.r.t $\ell_1$-norm $\mu_1$ satisfies $\frac{1}{2}\norm{\nabla f(\mathbf{x})}_\infty^2\geq\mu_1(f(\mathbf{x})-f^\star)$
    • non-smooth objective : CD can fail, getting permanently stuck at a non-optimal point
      • alternative : non-smooth part separable over coordinate, $f ( \mathbf { x } ) = g ( \mathbf { x } ) + h ( \mathbf { x } )$ with $h ( \mathbf { x } ) = \sum _ { i } h _ { i } \left( x _ { i } \right)$
      • state-of-the-art for GLM : $f ( \mathbf { x } ) = g ( A \mathbf { x } ) + \sum _ { i } h _ { i } \left( x _ { i } \right)$ with regularization
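
A sketch of randomized coordinate descent with the importance sampling $P[i_t=i]\propto L_i$ and stepsize $1/L_{i_t}$ described above, assuming NumPy; the least-squares data are illustrative, and the residual is maintained incrementally so that one step costs $O(n)$ instead of a full gradient:

```python
import numpy as np

# f(x) = 0.5*||A x - b||^2 with coordinate-wise smoothness L_i = ||A[:, i]||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 15))
b = rng.standard_normal(60)
d = A.shape[1]

L = np.sum(A ** 2, axis=0)                # L_i = ||a_i||^2 for column a_i
probs = L / L.sum()                       # selection probability P[i_t = i] proportional to L_i

x = np.zeros(d)
residual = A @ x - b                      # maintained incrementally, each step costs O(n)
for t in range(5000):
    i = rng.choice(d, p=probs)
    g_i = A[:, i] @ residual              # partial derivative grad_i f(x)
    step = g_i / L[i]                     # stepsize 1/L_i (exact minimization along a coordinate here)
    x[i] -= step
    residual -= step * A[:, i]            # keep A x - b up to date

print(np.linalg.norm(A.T @ residual))     # full gradient norm, should be small
```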

Accelerated gradient descent


  • momentum : $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } - \gamma \nabla f \left( \mathbf { x } _ { t } \right) + \nu \left[ \mathbf { x } _ { t } - \mathbf { x } _ { t - 1 } \right]$
  • accelerated gradient method AGD
    • update : $\mathbf { x } _ { t + 1 } = \tau \mathbf { z } _ { t } + ( 1 - \tau ) \mathbf { y } _ { t }$, $\mathbf { y } _ { t + 1 } = \mathbf { x } _ { t + 1 } - \frac { 1 } { L } \nabla f \left( \mathbf { x } _ { t + 1 } \right)$, $\mathbf { z } _ { t + 1 } = \mathbf { z } _ { t } - \gamma \nabla f \left( \mathbf { x } _ { t + 1 } \right)$ with $\mathbf { x } _ { 0 } = \mathbf { y } _ { 0 } = \mathbf { z } _ { 0 }$
    • $O(1/\sqrt{\epsilon})$ : convex, differentiable, smooth $L$
      • assume : $\norm{\mathbf{x}_0-\mathbf{x}^\star}\leq R$, $\abs{f(\mathbf{x}_0)-f(\mathbf{x}^\star)}\leq d$
      • stepsize : $\gamma = \frac { R } { \sqrt { d L } }$ with $\tau$ s.t. $\frac { 1 - \tau } { \tau } = \gamma L$
      • satisfies : $f \left( \frac { 1 } { T } \sum _ { t = 0 } ^ { T - 1 } \mathbf { x } _ { t } \right) - f \left( \mathbf { x } ^ { \star } \right) \leq \frac { d } { 2 }$
    • strongly convex : $O(\sqrt{\frac{L}{\mu}})$ updates suffice to ensure $\norm{\mathbf{x}-\mathbf{x}^\star}^2\leq\frac{1}{2}\norm{\mathbf{x}_0-\mathbf{x}^\star}^2$
    • $O(\log(1/\epsilon))$ convergence in iterates : by repeatedly restarting AGD, strongly convex $\mu$, $L$-smooth, with condition-number dependence $\sqrt{\frac{L}{\mu}}$ instead of the $\frac{L}{\mu}$ of GD
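
A convergent sketch of acceleration in the common Nesterov-momentum form (closely related to, but not literally, the $x/y/z$ scheme above), assuming NumPy; the least-squares data and the $t/(t+3)$ momentum coefficient are standard illustrative choices rather than the notes' parameters:

```python
import numpy as np

# Accelerated gradient descent in momentum form on f(x) = 0.5*||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)              # smoothness constant

x = y_prev = np.zeros(10)
for t in range(300):
    y = x - grad(x) / L                     # gradient step from the extrapolated point x
    x = y + (t / (t + 3.0)) * (y - y_prev)  # momentum / extrapolation step
    y_prev = y

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(f(y) - f(x_star))                     # decays like O(1/T^2) instead of O(1/T) for plain GD
```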

Duality

  • conjugate : $f ^ { * } ( \mathbf { y } ) : = \max _ { \mathbf { x } } \mathbf { x } ^ { T } \mathbf { y } - f ( \mathbf { x } )$ a.k.a. Legendre transform or Fenchel conjugate
    • always convex : even if $f$ not as point-wise maximum of convex (affine) functions in $\mathbf{y}$
    • Fenchel's inequality : $f ( \mathbf { x } ) + f ^ { * } ( \mathbf { y } ) \geq \mathbf { x } ^ { T } \mathbf { y }$ for any $\mathbf { x } , \mathbf { y }$
    • conjugate of conjugate : $f ^ { * * } \leq f$ with equality only if $f$ closed and convex ($\mathbf { y } \in \partial f ( \mathbf { x } ) \Leftrightarrow \mathbf { x } \in \partial f ^ { * } ( \mathbf { y } )\Leftrightarrow f ( \mathbf { x } ) + f ^ { * } ( \mathbf { y } ) = \mathbf { x } ^ { T } \mathbf { y }$)
    • separable functions : $f ( \mathbf { u } , \mathbf { v } ) = f _ { 1 } ( \mathbf { u } ) + f _ { 2 } ( \mathbf { v } )$ give $f ^ { * } ( \mathbf { w } , \mathbf { z } ) = f _ { 1 } ^ { * } ( \mathbf { w } ) + f _ { 2 } ^ { * } ( \mathbf { z } )$
    • indicator function : $\iota_C(\mathbf{x}):=\left\{\begin{array}{ll} 0 & \mathbf{x}\in C \\ +\infty & \text{otherwise}\end{array}\right.$ with conjugate (support function) $f^*(\mathbf{y})=\max_{\mathbf{x}\in C}\mathbf{y}^T\mathbf{x}$
    • norm : $f(\mathbf{x})=\norm{\mathbf{x}}$ give $f^*(\mathbf{y})=\iota_{\{\mathbf{z}:\norm{\mathbf{z}}_*\leq 1\}}(\mathbf{y})$
      • dual : $\norm{\cdot}_1\leftrightarrow\norm{\cdot}_\infty$
    • GLM : $\min _ { \mathbf { x } \in \mathbb { R } ^ { d } , \mathbf { w } \in \mathbb { R } ^ { n } } f ( \mathbf { w } ) + g ( \mathbf { x } )$ s.t. $\mathbf { w } = A \mathbf { x }$
      • Lagrange dual : $\mathcal { L } ( \mathbf { u } ) = \min _ { \mathbf { x } \in \mathbb { R } ^ { d } , \mathbf { w } \in \mathbb { R } ^ { n } } f ( \mathbf { w } ) + g ( \mathbf { x } ) + \mathbf { u } ^ { T } ( \mathbf { w } - A \mathbf { x } )= - f ^ { * } ( \mathbf { u } ) - g ^ { * } \left( - A ^ { T } \mathbf { u } \right)$
      • dual problem : $\max _ { \mathbf { u } \in \mathbb { R } ^ { n } } \left[ \mathcal { L } ( \mathbf { u } ) = - f ^ { * } ( \mathbf { u } ) - g ^ { * } \left( - A ^ { T } \mathbf { u } \right) \right]$
    • Lasso : $\min_{\mathbf{x}\in\mathbb{R}^d}\frac{1}{2}\norm{A\mathbf{x}-\mathbf{b}}^2+\lambda\norm{\mathbf{x}}_1$
      • dual : $\max_{\mathbf{u}\in\mathbb{R}^n}-f^*(\mathbf{u})-g^*(-A^T\mathbf{u})\Leftrightarrow\max_{\mathbf{u}\in\mathbb{R}^n}\frac{1}{2}\norm{\mathbf{b}}^2-\frac{1}{2}\norm{\mathbf{b}-\mathbf{u}}^2$ s.t. $\norm{-A^T\mathbf{u}/\lambda}_\infty\leq 1\Leftrightarrow\min_{\mathbf{u}\in\mathbb{R}^n}\norm{\mathbf{b}-\mathbf{u}}^2$ s.t. $\norm{A^T\mathbf{u}}_\infty\leq\lambda$
    • why : certificate of current optimization quality $f ( A \overline { \mathbf { x } } ) + g ( \overline { \mathbf { x } } )\geq \min _ { \mathbf { x } \in \mathbb { R } ^ { d } } f ( A \mathbf { x } ) + g ( \mathbf { x } )\geq \max _ { \mathbf { u } \in \mathbb { R } ^ { n } } - f ^ { * } ( \mathbf { u } ) - g ^ { * } \left( - A ^ { T } \mathbf { u } \right) \geq - f ^ { * } ( \overline { \mathbf { u } } ) - g ^ { * } \left( - A ^ { T } \overline { \mathbf { u } } \right)$, stopping criterion, dual sometimes easier
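
A sketch of using the Lasso dual above as a stopping certificate, assuming NumPy; the data are illustrative, and the rescaled-residual dual candidate $\mathbf{u}=s\,(\mathbf{b}-A\mathbf{x})$ is a common construction assumed here rather than taken from the notes:

```python
import numpy as np

# Duality-gap certificate for the Lasso, using the dual above:
# D(u) = 0.5*||b||^2 - 0.5*||b - u||^2   subject to   ||A^T u||_inf <= lambda.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
lam = 0.5

def primal(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def dual_from_primal(x):
    r = b - A @ x                                    # residual-based dual candidate ...
    u = r * min(1.0, lam / np.max(np.abs(A.T @ r)))  # ... rescaled to be dual-feasible
    return 0.5 * b @ b - 0.5 * (b - u) @ (b - u)

x = 0.01 * rng.standard_normal(20)                   # any current iterate of whatever solver is used
gap = primal(x) - dual_from_primal(x)                # upper bound on primal(x) - primal(x*)
print(gap)
```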

Gradient-free

  • zero-order optimization : gradient-free, blackbox
  • random search : no communication needed, competitive method for reinforcement learning, hyperparameter optimization
    • update : $\mathbf { x } _ { t + 1 } = \mathbf { x } _ { t } + \gamma \mathbf { d } _ { t }$ picking random direction $\mathbf { d } _ { t } \in \mathbb { R } ^ { d }$ and do line-search $\gamma = \arg \min _ { \gamma \in \mathbb { R } } \mathrm { f } \left( \mathbf { x } _ { t } + \gamma \mathbf { d } _ { t } \right)$
    • converge : same as GD up to slow-down factor $d$
    • convex : rate $\mathcal { O } ( d L / \varepsilon )$
    • strongly convex : $\mathcal { O } ( d L \log ( 1 / \varepsilon ) )$
  • reinforcement learning : $\mathbf { s } _ { t + 1 } = f \left( \mathbf { s } _ { t } , \mathbf { a } _ { t } , \mathbf { e } _ { t } \right)$
    • state : $\mathbf { s } _ { t }$
    • action : $\mathbf a _ { t }$
    • noise : $\mathbf e_t$
    • control policy : $\mathbf { a } _ { t } = \pi \left( \mathbf { a } _ { 1 } , \dots , \mathbf { a } _ { t - 1 } , \mathbf { s } _ { 0 } , \ldots , \mathbf { s } _ { t } \right)$
    • maximize reward : $\max _ { \mathbf { a } _ { t } } \mathrm { E } _ { \mathbf { e } _ { t } } \left[ \sum _ { t = 0 } ^ { N } R _ { t } \left( \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) \right]$ s.t. $\mathbf { s } _ { t + 1 } = f \left( \mathbf { s } _ { t } , \mathbf { a } _ { t } , \mathbf { e } _ { t } \right)$ and $\mathbf s_0$ given
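
A sketch of random search with exact line search, assuming NumPy; the quadratic objective is illustrative and chosen so that the line search has a closed form, so no gradient evaluations are needed:

```python
import numpy as np

# Random search: pick a random direction d_t, then line-search the stepsize.
# For f(x) = 0.5*||A x - b||^2 the exact line search along d is gamma* = -r.(A d)/||A d||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 8))
b = rng.standard_normal(40)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)

x = np.zeros(8)
for t in range(2000):
    d_t = rng.standard_normal(8)                 # random direction
    Ad = A @ d_t
    r = A @ x - b
    gamma = -(r @ Ad) / (Ad @ Ad)                # closed-form exact line search for this quadratic
    x = x + gamma * d_t

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(f(x) - f(x_star))                          # converges, roughly d times slower than GD
```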

Adaptive methods

  • adagrad : strong performance in practice
    • update : $\left[ \mathbf { x } _ { t + 1 } \right] _ { i } = \left[ \mathbf { x } _ { t } \right] _ { i } - \frac { \gamma } { \sqrt { \left[ G _ { t } \right] _ { i } } } \left[ \mathbf { g } _ { t } \right] _ { i }$ for each feature $i$ as $\left[ G _ { t } \right] _ { i } : = \sum _ { s = 0 } ^ { t } \left( \left[ \mathbf { g } _ { s } \right] _i \right) ^ { 2 }$ picking stochastic gradient $\mathbf g_t$
    • variants : adadelta, adam, RMSprop
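
A sketch of the AdaGrad update above on a made-up least-squares stream, assuming NumPy; the base stepsize $\gamma$ and the small constant added to avoid dividing by zero are illustrative choices:

```python
import numpy as np

# AdaGrad: per-coordinate stepsizes gamma / sqrt(G_t[i]), where G_t[i] accumulates
# the squared stochastic gradients seen so far for coordinate i.
rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

x = np.zeros(d)
G = np.zeros(d)
gamma, eps = 0.5, 1e-8                       # eps avoids division by zero in the first steps
for t in range(3000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ x - b[i])             # stochastic gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
    G += g ** 2                              # [G_t]_i = sum_s ([g_s]_i)^2
    x -= gamma * g / np.sqrt(G + eps)        # coordinate-wise adaptive step

print(np.linalg.norm(A.T @ (A @ x - b) / n)) # gradient norm of the averaged objective
```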

In practice

  • trends
    • privacy in ML
    • decentralized training
    • ML for trust : intrusion detection
    • trust in ML : adversarial attacks
      • training : $\min _ { \mathbf { w } } \left( f _ { \mathbf { w } } \left( \mathbf { x } _ { i } \right) - y _ { i } \right) ^ { 2 }$ change model $\nabla _ { w } f$
      • attacking : $\min _ { \mathbf { x } _ { i } } \left( f _ { \mathbf { w } } \left( \mathbf { x } _ { i } \right) - y _ { i } \right) ^ { 2 }$ change data $\nabla _ { \mathbf { x } _ { i } } f$
  • tricks
    • feature hashing
    • limited precision operations
  • parallel
    • CoCoA : communication efficient distributed optimization
  • primal-dual
  • autoML
    • hyper-parameter optimization
    • learning to learn
    • neural architecture search