Paco Zamora Martinez edited this page Sep 17, 2015 · 27 revisions

Introduction

Related to the module require("aprilann.ann.optimizer").

The optimizer is an object which implements the learning algorithm. Every class in ann.optimizer is an optimizer. Several learning hyperparameters are available, depending on the selected optimizer. These learning hyperparameters are known as options, and can be set globally (for all the connection weight layers of the ANN) or layerwise (for a specific connection weights object, identified by its name). Optimizers implement the following API.

Interface of ann.optimizer classes

clone

other = opt:clone()

Returns a deep copy of the caller object.

get_option

value = opt:get_option(option)

Returns the global value of a given learning option name.

set_option

opt:set_option(option, value)

Sets the global value of a given learning option name.

set_layerwise_option

opt:set_layerwise_option(layer, option, value)

Sets a layerwise option for the given layer name.

get_layerwise_option

value = opt:get_layerwise_option(layer, option)

Returns the layerwise option of the given layer name.

get_option_of

value = opt:get_option_of(layer, option)

Returns the option which is applicable to the given layer name. If a layerwise option was previously defined, the method returns its value. Otherwise, the value of the global option will be returned.
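The lookup order described above can be sketched in plain Lua. This is only an illustration of the semantics, not the library's implementation: new_options is a hypothetical helper, and the layer names "w1" and "w2" are made up for the example.

```lua
-- Plain-Lua sketch of the option lookup order (layerwise first, then global).
-- new_options is a hypothetical helper, NOT part of ann.optimizer.
local function new_options()
  local self = { global = {}, layerwise = {} }

  function self:set_option(option, value)
    self.global[option] = value
  end

  function self:set_layerwise_option(layer, option, value)
    self.layerwise[layer] = self.layerwise[layer] or {}
    self.layerwise[layer][option] = value
  end

  function self:get_option_of(layer, option)
    local layer_opts = self.layerwise[layer]
    if layer_opts and layer_opts[option] ~= nil then
      return layer_opts[option]   -- a layerwise value takes precedence
    end
    return self.global[option]    -- otherwise fall back to the global value
  end

  return self
end

local opts = new_options()
opts:set_option("learning_rate", 0.01)
opts:set_layerwise_option("w1", "learning_rate", 0.1)
print(opts:get_option_of("w1", "learning_rate")) -- layerwise value
print(opts:get_option_of("w2", "learning_rate")) -- global value
```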

execute

loss,gradients,... = opt:execute(eval,cnn)

This is the core function of optimizer objects. It receives two parameters and returns the same values as those returned by the eval function. The parameters are:

  • eval is a Lua function which receives two inputs:

    1. A table of candidate weights to compute the loss.

    2. The iteration number of the algorithm.

    It must return at least the following two values. It can return more values, and all of them will be returned by the call to execute:

    1. The loss of a bunch of data (a mini-batch).

    2. A gradients matrix dictionary (as returned by the ANN components method compute_gradients()). It must be a table of names=>matrix.

  • cnn is a dictionary of matrices with the connection weight objects, like the dictionary returned by the ANN components method copy_weights(). It must be a table of names=>matrix. The execute method can modify the matrices contained in cnn by following the given gradients, and it can also build a bunch of weight candidates and call the eval function to test their loss.
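The execute contract can be illustrated with a toy gradient-descent optimizer in plain Lua. This is a minimal sketch under stated assumptions: weights are plain numbers instead of matrices, toy_execute is a hypothetical name that is not part of ann.optimizer, and the learning rate is passed as an argument rather than read from an option.

```lua
-- Toy illustration of the execute contract: call eval, update cnn in place,
-- and return every value that eval returned.
-- toy_execute is a hypothetical function, NOT the library's execute method.
local iteration = 0
local function toy_execute(eval, cnn, learning_rate)
  iteration = iteration + 1
  local function update(loss, gradients, ...)
    for name, g in pairs(gradients) do
      cnn[name] = cnn[name] - learning_rate * g  -- modifies cnn in place
    end
    return loss, gradients, ...                  -- same values as eval
  end
  return update(eval(cnn, iteration))
end

-- Loss L(w) = (w - 3)^2 for a single weight named "w1".
local cnn = { w1 = 0.0 }
local function eval(weights, it)
  local diff = weights.w1 - 3.0
  return diff * diff, { w1 = 2.0 * diff }, "extra value"
end

local loss, gradients, extra = toy_execute(eval, cnn, 0.1)
print(loss, gradients.w1, extra, cnn.w1)
```

The third value returned by eval reaches the caller unchanged, which is how extra information (for example, the network output) can be passed through execute.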

Coupling ANNs with optimizer

Using a trainer object

The trainable.supervised_trainer class is the default coupling between the optimizer, the ANN component and the loss function. The optimizer can be set in the constructor:

> trainer = trainable.supervised_trainer(ann_component, ann.loss.mse(),
                                       bunch_size, ann.optimizer.sgd())

The hyperparameters of optimizer objects can be modified through the trainer object:

  • trainer:set_option(option,value): sets a global learning option value.

  • value=trainer:get_option(option): gets a global learning option value.

  • trainer:set_layerwise_option(layer_match,option,value): sets a layerwise learning option value of all the connection weight objects whose name matches the given layer_match Lua pattern string.

  • value=trainer:get_option_of(layer,option): gets the option value applicable to the given layer.

trainer = trainable.supervised_trainer(ann_component, ann.loss.mse(),
                                       bunch_size, ann.optimizer.sgd())
trainer:build()
trainer:set_option("learning_rate", number)
trainer:set_option("momentum", number)
-- it is recommended not to apply regularization to bias connections
trainer:set_layerwise_option("w.*", "weight_decay", number)
trainer:set_layerwise_option("w.*", "max_norm_penalty", number)
trainer:set_layerwise_option("w.*", "L1_norm", number)

Doing this, when you call trainer:train_dataset(...) or trainer:validate_dataset(...), the object implements a default eval function for the optimizer.
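Because the layer_match argument of trainer:set_layerwise_option is a Lua pattern string, it can help to check which names a pattern selects before using it. The following plain-Lua sketch assumes the hypothetical weight names "w1", "b1", "w2" and "b2", and assumes an unanchored string.match test; how the trainer applies the pattern internally is not stated here.

```lua
-- Plain-Lua sketch: which connection names does a layer_match pattern select?
-- The names are hypothetical, and the unanchored string.match test is an
-- assumption about how the pattern is applied.
local names = { "w1", "b1", "w2", "b2" }

local function matching_layers(layer_match)
  local selected = {}
  for _, name in ipairs(names) do
    if name:match(layer_match) then
      selected[#selected + 1] = name
    end
  end
  return selected
end

print(table.concat(matching_layers("w.*"), " "))  -- names containing "w"
print(table.concat(matching_layers("^b"), " "))   -- names starting with "b"
```

Under this assumption, a pattern such as "w.*" would also select any name that merely contains a "w", so anchoring it as "^w" may be safer when names are not under your control.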

From scratch

First, you need to construct the ANN component, the loss function and the optimizer instance, and then configure the optimizer.

local thenet = ann.mlp.all_all.generate("256 inputs 128 tanh 10 log_softmax")
local thenet,cnns = thenet:build()
local loss = ann.loss.multi_class_cross_entropy()
local opt  = ann.optimizer.sgd()
opt:set_option("learning_rate", 0.01)

The initialization of the weights is required:

local rnd = random(1234)
for _,w in pairs(cnns) do w:uniformf(-0.1,0.1,rnd) end

The training is performed over random data, generated on-the-fly:

local M = matrix.col_major -- ANNs need col_major matrices
local weight_grads = {} -- upvalue for eval function
for i=1,1000 do
  local input  = M(1,256):uniformf(0,1,rnd)
  local target = M(1,10):zeros():set(1, rnd:randInt(1,10), 1.0)
  opt:execute(function(weights,it)
                    if cnns ~= weights then thenet:build(weights) cnns=weights end
                    thenet:reset(it)
                    local out = thenet:forward(input)
                    local tr_loss,tr_matrix = loss:compute_loss(out,target)
                    thenet:backprop(loss:gradient(out,target))
                    -- weight_grads is a plain Lua table, so zero each matrix in it
                    for _,g in pairs(weight_grads) do g:zeros() end
                    weight_grads = thenet:compute_gradients(weight_grads)
                    return tr_loss,weight_grads
               end,
               cnns)
end

All this code could be modified in many ways, to customize or implement your own methods.

NOTE: the full code of this example is available at EXAMPLES/optimizer-from-scratch.lua

Stochastic Gradient Descent

The ann.optimizer.sgd class trains the neural network following the Stochastic Gradient Descent algorithm. It incorporates regularization and momentum hyperparameters.

> opt = ann.optimizer.sgd()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)

Its options are (default values are indicated with = symbol):

  • learning_rate: the learning rate controls the portion of the gradient used to update the weights.

  • decay=1e-05: controls the decay of learning rate.

  • momentum=0: an inertial hyperparameter which applies a portion of the weight update of the previous iteration.

  • weight_decay=0: an L2 regularization term.

  • L1_norm=0: an L1 regularization term, a naive implementation which truncates weights to zero to avoid zero crossing.

  • max_norm_penalty=0: a constraint penalty based on the two-norm of the weights.

The algorithm uses the following learning rule:

w = w' - lr'*( grad(L)/grad(w') + weight_decay*w' + L1_norm*sign(w') ) + momentum*(w' - w'')

where w, w' and w'' are the weight values at the next, current, and previous iterations; lr' is the learning_rate, and grad(L)/grad(w') is the gradient of the loss function at the given weight. The L1 regularization is performed following the truncated gradient algorithm. After this learning rule, the max_norm_penalty constraint is applied, forcing the 2-norm of the input weights of every neuron to be less than the given parameter.
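The learning rule can be written out for a single scalar weight in plain Lua. This is a sketch for illustration only: the library operates on matrices, sgd_update and sign are hypothetical helpers, and the zero truncation of the L1 term and the max_norm_penalty constraint are deliberately left out.

```lua
-- Scalar sketch of the SGD learning rule with weight decay, L1 term and
-- momentum. sgd_update and sign are hypothetical helpers; L1 truncation
-- and the max-norm constraint are NOT implemented here.
local function sign(x)
  if x > 0 then return 1 elseif x < 0 then return -1 else return 0 end
end

local function sgd_update(w_cur, w_prev, grad, opt)
  local penalty = opt.weight_decay * w_cur + opt.L1_norm * sign(w_cur)
  return w_cur - opt.learning_rate * (grad + penalty)
         + opt.momentum * (w_cur - w_prev)
end

local opt = { learning_rate = 0.1, momentum = 0.5,
              weight_decay = 0.01, L1_norm = 0.0 }
-- current weight 1.0, previous weight 0.9, gradient 0.5
local w_next = sgd_update(1.0, 0.9, 0.5, opt)
print(w_next)
```

With these values the gradient term moves the weight down by 0.051 while the momentum term pushes it up by 0.05, so the two nearly cancel.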

> opt = ann.optimizer.sgd() -- an instance

Averaged Stochastic Gradient Descent

The ann.optimizer.asgd class trains the neural network following the Averaged Stochastic Gradient Descent algorithm, taken from Leon Bottou, Stochastic Gradient Descent Tricks, Microsoft Research, 2012. It incorporates regularization and learning rate decay hyperparameters.

> opt = ann.optimizer.asgd()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)

Its options are (default values are indicated with = symbol):

  • learning_rate: the learning rate controls the portion of the gradient used to update the weights.

  • lr_decay=0.75: controls the learning rate decay. The effective learning rate is computed as lr / (1 + lr*t)^lr_decay, where t is the number of presentations, not to be confused with the number of epochs. The number of presentations is usually higher than the number of epochs.

  • t0=0: when to start averaging. Before t0, the algorithm behaves as SGD. As before, t0 is given in number of presentations, i.e. one epoch corresponds to t0 = ceil( number_of_patterns / bunch_size ). It is recommended to set this option to at least one epoch.

  • weight_decay=0: an L2 regularization term.
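The two formulas given in the options above can be evaluated directly in plain Lua. This is a small sketch in which effective_learning_rate and presentations_per_epoch are hypothetical helper names, and the dataset size (1000 patterns) and bunch size (32) are made-up values.

```lua
-- Sketch of the ASGD learning rate schedule and of t0 expressed in
-- presentations. Both helper names are hypothetical.
local function effective_learning_rate(lr, lr_decay, t)
  return lr / (1 + lr * t) ^ lr_decay
end

local function presentations_per_epoch(number_of_patterns, bunch_size)
  return math.ceil(number_of_patterns / bunch_size)
end

-- One epoch of 1000 patterns with bunch_size = 32.
local t0 = presentations_per_epoch(1000, 32)
for _, t in ipairs({ 0, t0, 10 * t0 }) do
  print(t, effective_learning_rate(0.1, 0.75, t))
end
```

At t = 0 the effective learning rate equals learning_rate, and it decreases monotonically as the number of presentations grows.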

> opt = ann.optimizer.asgd() -- an instance

Resilient backpropagation (RProp)

The ann.optimizer.rprop class trains the neural network following the RProp algorithm. It incorporates an option to set the number of iterations done with every mini-batch of patterns. This optimizer has a step size hyperparameter for each weight parameter in the model. All options are global; layer-wise options do not make sense here.

> opt = ann.optimizer.rprop()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)

Its options are (default values are indicated with = symbol):

  • initial_step=0.1: the initial step size of every weight.

  • eta_plus=1.2: the factor by which the step is increased when the gradient sign is the same between two iterations.

  • eta_minus=0.5: the factor by which the step is decreased when the gradient sign changes between two iterations.

  • max_step=50: the maximum value for the weight step hyperparameter.

  • min_step=1e-05: the minimum value for the weight step hyperparameter.

  • niter=1: the number of iterations done with one mini-batch of patterns.

The algorithm modifies the step(i) parameter of a weight at iteration i:

          | step(i-1) * eta_minus, if sign(grad(L)/grad(w')) <> sign(grad(L)/grad(w''))
step(i) = | step(i-1) * eta_plus,  if sign(grad(L)/grad(w')) == sign(grad(L)/grad(w''))

It updates the weight following this equation:

w = w' - sign(grad(L)/grad(w')) * step(i)

In the above equations, w, w' and w'' are the weight values at the next, current, and previous iterations, and sign(grad(L)/grad(.)) is the sign of the gradient of the loss function at the given weight. Note that the step is saturated at the values max_step and min_step.
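For a single scalar weight, the step adaptation and the weight update described above can be sketched in plain Lua. The helpers rprop_update and sign are hypothetical, the library applies the rule to whole matrices, and the sketch follows only the two cases given in this section (a zero gradient is treated like any other sign value).

```lua
-- Scalar sketch of the RProp step adaptation with saturation.
-- rprop_update and sign are hypothetical helpers, NOT the library's code.
local function sign(x)
  if x > 0 then return 1 elseif x < 0 then return -1 else return 0 end
end

local function rprop_update(w_cur, step, grad_cur, grad_prev, opt)
  if sign(grad_cur) == sign(grad_prev) then
    step = step * opt.eta_plus    -- same sign: grow the step
  else
    step = step * opt.eta_minus   -- sign changed: shrink the step
  end
  -- saturate the step between min_step and max_step
  step = math.max(opt.min_step, math.min(opt.max_step, step))
  return w_cur - sign(grad_cur) * step, step
end

local opt = { eta_plus = 1.2, eta_minus = 0.5, max_step = 50, min_step = 1e-05 }
local w, step = rprop_update(1.0, 0.1, 0.3, 0.7, opt)  -- same sign
print(w, step)
w, step = rprop_update(w, step, -0.2, 0.3, opt)        -- sign flips
print(w, step)
```

Only the sign of the gradient is used, so the magnitude of the update depends entirely on the per-weight step.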

> opt = ann.optimizer.rprop() -- an instance

Conjugate Gradient (CG)

The ann.optimizer.cg class trains the neural network following the CG algorithm. It is a second order Hessian Free optimizer. Its convergence is usually faster than that of the SGD algorithm.

> opt = ann.optimizer.cg()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)

Its options are (default values are indicated with = symbol):

  • rho=0.01: Constant for Wolfe-Powell conditions (global option).
  • sig=0.5: Constant for Wolfe-Powell conditions (global option).
  • int=0.1: Reevaluation limit (global option).
  • ext=3: Maximum number of extrapolations (global option).
  • max_iter: Maximum number of iterations (global option).
  • max_eval=1.25*max_iter: Maximum number of evaluations (global option).
  • ratio=100: Maximum slope ratio (global option).
  • weight_decay: Weights L2 regularization (global and layer-wise option).
  • L1_norm: Weight L1 regularization (global and layer-wise option).
  • max_norm_penalty: Weight max norm upper bound (global and layer-wise option).

This implementation is a rewrite of the Torch 7 optim package, which is itself a rewrite of minimize.m written by Carl E. Rasmussen.

> opt = ann.optimizer.cg() -- an instance

Quickprop

The ann.optimizer.quickprop class trains the neural network following the Quickprop algorithm. It is a second order optimizer which uses a quadratic approximation to speed up learning convergence. It is usually faster than SGD, but can suffer from chaotic oscillations.

> opt = ann.optimizer.quickprop()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)

Its options are (default values are indicated with = symbol):

  • learning_rate: It is mandatory to be given.
  • mu=1.75: Maximum growth factor.
  • epsilon=1e-04: Bootstrap factor.
  • max_step=1000: Maximum step value.
  • weight_decay: Weights L2 regularization.
  • L1_norm: Weight L1 regularization.
  • max_norm_penalty: Weight max norm upper bound.
> opt = ann.optimizer.quickprop() -- an instance

AdaDelta

The ann.optimizer.adadelta class trains the neural network following the AdaDelta algorithm. It is a method which dynamically adapts the learning rate using only first order information. It is as simple as SGD, and appears to be more robust.

> opt = ann.optimizer.adadelta()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)

Its options are (default values are indicated with = symbol):

  • decay=0.95: Decay of the accumulated gradient and updates.
  • epsilon=1e-06: To avoid numerical issues.
  • weight_decay=0.0: Weights L2 regularization.
  • max_norm_penalty=0.0: Weight max norm upper bound.
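The roles of the decay and epsilon options can be illustrated with a scalar sketch of the AdaDelta update as published by Zeiler (2012). Whether ann.optimizer.adadelta follows it step by step is an assumption; adadelta_step and the state fields Eg2 and Edw2 are hypothetical names, and weight_decay and max_norm_penalty are left out.

```lua
-- Scalar sketch of the published AdaDelta update. adadelta_step and the
-- state fields are hypothetical names; regularization is NOT included.
local function adadelta_step(state, w, grad, opt)
  -- running average of squared gradients
  state.Eg2 = opt.decay * state.Eg2 + (1 - opt.decay) * grad * grad
  -- update scaled by the ratio of RMS(previous updates) to RMS(gradients)
  local dw = -math.sqrt(state.Edw2 + opt.epsilon)
             / math.sqrt(state.Eg2 + opt.epsilon) * grad
  -- running average of squared updates
  state.Edw2 = opt.decay * state.Edw2 + (1 - opt.decay) * dw * dw
  return w + dw
end

local opt = { decay = 0.95, epsilon = 1e-06 }
local state = { Eg2 = 0.0, Edw2 = 0.0 }
local w = 1.0
for i = 1, 10 do
  w = adadelta_step(state, w, 2.0 * w, opt)  -- gradient of L(w) = w^2
end
print(w)
```

No learning rate appears in the update: the step size comes from the two running averages, and epsilon keeps the first steps from being exactly zero.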
> opt = ann.optimizer.adadelta() -- an instance