# ann.optimizer
Related to the module `require("aprilann.ann.optimizer")`.
The optimizer is an object which implements the learning algorithm. Every class
in `ann.optimizer` is an optimizer. Several learning hyperparameters are
available, depending on the selected optimizer. These learning hyperparameters
are known as options, and can be set globally (for all the connection weight
layers of the ANN) or layerwise (for a concrete connection weights object,
identified by its name). Optimizers implement the following API; a short usage
sketch is shown after the list.
- `other = opt:clone()`: returns a deep copy of the caller object.
- `value = opt:get_option(option)`: returns the global value of the given
  learning option name.
- `opt:set_option(option, value)`: sets the global value of the given learning
  option name.
- `opt:set_layerwise_option(layer, option, value)`: sets a layerwise option for
  the given layer name.
- `value = opt:get_layerwise_option(layer, option)`: returns the layerwise
  option of the given layer name.
- `value = opt:get_option_of(layer, option)`: returns the option value
  applicable to the given layer name. If a layerwise option was previously
  defined, the method returns its value; otherwise, the value of the global
  option is returned.
- `loss,gradients,... = opt:execute(eval,cnn)`: this is the core function of
  optimizer objects. It receives two parameters and returns the same values as
  returned by the `eval` function. The parameters are:
  - `eval`: a Lua function which receives two inputs: a table of candidate
    weights with which to compute the loss, and the iteration number of the
    algorithm. It must return at least the first two values of the list below,
    and it may return more; all of them will be returned by the call to
    `execute`:
    - The loss of a bunch of data (a mini-batch).
    - A gradients matrix dictionary (as returned by the ANN components method
      `compute_gradients()`). It must be a table of `name=>matrix` pairs.
  - `cnn`: a dictionary of matrices with the connection weight objects, as the
    dictionary returned by the ANN components method `copy_weights()`. It must
    be a table of `name=>matrix` pairs. The `execute` method can modify the
    matrices contained in `cnn` by following the given gradients, and it can
    use a bunch of candidate weights, calling the `eval` function to test the
    loss.
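Putting the option-related methods together, the following lines sketch their
use on an `ann.optimizer.sgd` instance; the connection weights name `w1` is
only an illustrative placeholder:

```lua
> opt = ann.optimizer.sgd()
> opt:set_option("learning_rate", 0.01)                  -- global option
> opt:set_layerwise_option("w1", "weight_decay", 1e-04)  -- only for the layer named "w1"
> lr = opt:get_option("learning_rate")                   -- global value (0.01)
> wd = opt:get_option_of("w1", "weight_decay")           -- the layerwise value wins over the global one
> opt2 = opt:clone()                                     -- deep copy, keeping all the options
```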
The `trainable.supervised_trainer` class is the default coupling between the
optimizer, the ANN component and the loss function. The optimizer can be set in
the constructor:
```lua
> trainer = trainable.supervised_trainer(ann_component, ann.loss.mse(),
                                         bunch_size, ann.optimizer.sgd())
```
The hyperparameters of `optimizer` objects can be modified through the
`trainer` object:

- `trainer:set_option(option,value)`: sets a global learning option value.
- `value=trainer:get_option(option)`: gets a global learning option value.
- `trainer:set_layerwise_option(layer_match,option,value)`: sets a layerwise
  learning option value for all the connection weight objects whose name
  matches the given `layer_match` Lua pattern string.
- `value=trainer:get_option_of(layer,option)`: gets the option value applicable
  to the given layer.
```lua
trainer = trainable.supervised_trainer(ann_component, ann.loss.mse(),
                                       bunch_size, ann.optimizer.sgd())
trainer:build()
trainer:set_option("learning_rate", number)
trainer:set_option("momentum", number)
-- it is recommended not to apply regularization to bias connections
trainer:set_layerwise_option("w.*", "weight_decay", number)
trainer:set_layerwise_option("w.*", "max_norm_penalty", number)
trainer:set_layerwise_option("w.*", "L1_norm", number)
```
Doing this, when you call `trainer:train_dataset(...)` or
`trainer:validate_dataset(...)`, the trainer object implements a default `eval`
function for the optimizer.
It is also possible to use an optimizer from scratch, without the trainer
abstraction. First, you need to construct the ANN component, the loss function
and the optimizer instance, and to configure the optimizer.
```lua
local thenet = ann.mlp.all_all.generate("256 inputs 128 tanh 10 log_softmax")
local thenet,cnns = thenet:build()
local loss = ann.loss.multi_class_cross_entropy()
local opt = ann.optimizer.sgd()
opt:set_option("learning_rate", 0.01)
```
The initialization of the weights is required:
```lua
local rnd = random(1234)
for _,w in pairs(cnns) do w:uniformf(-0.1,0.1,rnd) end
```
The training is performed over random data, generated on-the-fly:
```lua
local M = matrix.col_major -- ANNs need col_major matrices
local weight_grads = {}    -- upvalue for the eval function
for i=1,1000 do
  local input  = M(1,256):uniformf(0,1,rnd)
  local target = M(1,10):zeros():set(1, rnd:randInt(1,10), 1.0)
  opt:execute(function(weights,it)
                if cnns ~= weights then thenet:build(weights) cnns=weights end
                thenet:reset(it)
                local out = thenet:forward(input)
                local tr_loss,tr_matrix = loss:compute_loss(out,target)
                thenet:backprop(loss:gradient(out,target))
                -- reset the gradient matrices before accumulating new gradients
                for _,g in pairs(weight_grads) do g:zeros() end
                weight_grads = thenet:compute_gradients(weight_grads)
                return tr_loss,weight_grads
              end,
              cnns)
end
```
All this code can be modified in many ways to customize or implement your own
methods.
NOTE: the full code of this example is available at:
EXAMPLES/optimizer-from-scratch.lua
The `ann.optimizer.sgd` class trains the neural network following the
Stochastic Gradient Descent algorithm. It incorporates regularization and
momentum hyperparameters.
```lua
> opt = ann.optimizer.sgd()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)
```
Its options are (default values are indicated with the `=` symbol):

- `learning_rate`: the learning rate controls the portion of the gradient used
  to update the weights.
- `decay=1e-05`: controls the decay of the learning rate.
- `momentum=0`: an inertial hyperparameter which applies a portion of the
  weight update of the previous iteration.
- `weight_decay=0`: an L2 regularization term.
- `L1_norm=0`: an L1 regularization term; a naive implementation with ZERO
  truncation to avoid crossing ZERO.
- `max_norm_penalty=0`: a constraint penalty based on the two-norm of the
  weights.
The algorithm uses the following learning rule:

```
w = w' - lr'*( grad(L)/grad(w') + weight_decay*w' + L1_norm*sign(w') ) + momentum*(w' - w'')
```

where `w`, `w'` and `w''` are the weight values at the next, current, and
previous iterations; `lr'` is the `learning_rate`; and `grad(L)/grad(w')` is
the gradient of the loss function at the given weight. The L1 regularization is
performed following the truncated gradient algorithm. After this learning rule,
the `max_norm_penalty` constraint is applied, forcing the 2-norm of the input
weights of every neuron to be less than the given parameter.
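As an illustration only, the following minimal sketch applies the rule above to
a single scalar weight using plain Lua numbers; the real optimizer works on
whole weight matrices and also applies the `decay` and `max_norm_penalty`
options, which are omitted here:

```lua
-- Scalar sketch of the SGD learning rule above (not the APRIL-ANN code).
-- w_curr, w_prev: weight values at the current and previous iterations;
-- grad: gradient of the loss with respect to the current weight value.
local function sgd_step(w_curr, w_prev, grad, lr, weight_decay, L1_norm, momentum)
  local sign = (w_curr > 0 and 1) or (w_curr < 0 and -1) or 0
  return w_curr
    - lr * ( grad + weight_decay * w_curr + L1_norm * sign )
    + momentum * (w_curr - w_prev)
end

-- example: one step with learning_rate=0.01, weight_decay=1e-04, momentum=0.9
print(sgd_step(0.5, 0.45, 0.2, 0.01, 1e-04, 0.0, 0.9))
```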
```lua
> opt = ann.optimizer.sgd() -- an instance
```
The `ann.optimizer.asgd` class trains the neural network following the Averaged
Stochastic Gradient Descent algorithm, taken from Leon Bottou, Stochastic
Gradient Descent Tricks, Microsoft Research, 2012. It incorporates
regularization and learning rate decay hyperparameters.
```lua
> opt = ann.optimizer.asgd()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)
```
Its options are (default values are indicated with the `=` symbol):

- `learning_rate`: the learning rate controls the portion of the gradient used
  to update the weights.
- `lr_decay=0.75`: controls the learning rate decay. The effective learning
  rate is computed as `lr / (1 + lr*t)^lr_decay`, being `t` the number of
  presentations, which must not be confused with the number of epochs; the
  number of presentations will be higher than the number of epochs.
- `t0=0`: when to start averaging. Before `t0`, the algorithm behaves as plain
  SGD. As before, `t0` is indicated in number of presentations, i.e. one epoch
  is `t0 = ceil( number_of_patterns / bunch_size )`. It is recommended to set
  this option to at least one epoch (see the sketch after this list).
- `weight_decay=0`: an L2 regularization term.
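Purely as an illustration of the two formulas above, the lines below compute a
one-epoch `t0` and the effective learning rate in plain Lua; `num_patterns` and
`bunch_size` are placeholder values:

```lua
local lr, lr_decay = 0.01, 0.75      -- learning_rate and lr_decay options
local num_patterns = 1000            -- placeholder: patterns in the training set
local bunch_size   = 32              -- placeholder: mini-batch size

-- t0 equal to one epoch, measured in presentations (mini-batches)
local t0 = math.ceil(num_patterns / bunch_size)

-- effective learning rate after t presentations
local function effective_lr(t) return lr / (1 + lr*t)^lr_decay end

print(t0, effective_lr(t0))
```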
```lua
> opt = ann.optimizer.asgd() -- an instance
```
The `ann.optimizer.rprop` class trains the neural network following the RProp
algorithm. It incorporates an option to set the number of iterations done with
every mini-batch of patterns. This optimizer has a step size hyperparameter for
each weight parameter in the model. All options are global; layer-wise options
don't make sense here.
```lua
> opt = ann.optimizer.rprop()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)
```
Its options are (default values are indicated with the `=` symbol):

- `initial_step=0.1`: the initial step size of every weight.
- `eta_plus=1.2`: controls the proportion by which the step is increased when
  the gradient sign is the same between two iterations.
- `eta_minus=0.5`: controls the proportion by which the step is decreased when
  the gradient sign changes between two iterations.
- `max_step=50`: the maximum value for the weight step hyperparameter.
- `min_step=1e-05`: the minimum value for the weight step hyperparameter.
- `niter=1`: the number of iterations done with one mini-batch of patterns.
The algorithm modifies the `step(i)` parameter of a weight at iteration `i`:

```
          | step(i-1) * eta_minus, iff sign(grad(L)/grad(w')) <> sign(grad(L)/grad(w''))
step(i) = |
          | step(i-1) * eta_plus,  iff sign(grad(L)/grad(w')) == sign(grad(L)/grad(w''))
```

It updates the weight following this equation:

```
w = w' - sign(grad(L)/grad(w')) * step(i)
```
In the above equations, `w`, `w'` and `w''` are the weight values at the next,
current, and previous iterations, and `sign(grad(L)/grad(.))` is the sign of
the gradient of the loss function at the given weight. Note that the step is
saturated with the values `max_step` and `min_step`.
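As an illustration only, here is a minimal scalar sketch of the step adaptation
and weight update described above, written with plain Lua numbers; it is not
the actual APRIL-ANN implementation, and it ignores the `niter` option:

```lua
local function sign(x) return (x > 0 and 1) or (x < 0 and -1) or 0 end

-- grad_curr, grad_prev: loss gradients at the current and previous iterations
local function rprop_step(w_curr, grad_curr, grad_prev, step_prev,
                          eta_plus, eta_minus, max_step, min_step)
  local step
  if sign(grad_curr) == sign(grad_prev) then
    step = step_prev * eta_plus   -- same sign: increase the step
  else
    step = step_prev * eta_minus  -- sign change: decrease the step
  end
  step = math.max(min_step, math.min(max_step, step)) -- saturate the step
  return w_curr - sign(grad_curr) * step, step
end

-- example: the gradient sign changed between two iterations
print(rprop_step(0.5, -0.1, 0.2, 0.1, 1.2, 0.5, 50, 1e-05))
```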
```lua
> opt = ann.optimizer.rprop() -- an instance
```
The `ann.optimizer.cg` class trains the neural network following the CG
(conjugate gradient) algorithm. It is a second order Hessian-free optimizer.
Its convergence is usually faster than that of the SGD algorithm.
```lua
> opt = ann.optimizer.cg()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)
```
Its options are (default values are indicated with the `=` symbol):

- `rho=0.01`: constant for the Wolfe-Powell conditions (global option).
- `sig=0.5`: constant for the Wolfe-Powell conditions (global option).
- `int=0.1`: reevaluation limit (global option).
- `ext=3`: maximum number of extrapolations (global option).
- `max_iter`: maximum number of iterations (global option).
- `max_eval=1.25*max_iter`: maximum number of evaluations (global option).
- `ratio=100`: maximum slope ratio (global option).
- `weight_decay`: weights L2 regularization (global and layer-wise option).
- `L1_norm`: weights L1 regularization (global and layer-wise option).
- `max_norm_penalty`: weights max norm upper bound (global and layer-wise
  option).
This implementation is a rewrite of the Torch 7 `optim` package, which is
itself a rewrite of `minimize.m` written by Carl E. Rasmussen.
```lua
> opt = ann.optimizer.cg() -- an instance
```
The `ann.optimizer.quickprop` class trains the neural network following the
Quickprop algorithm. It is a second order optimizer which uses a quadratic
approximation to speed up the learning convergence (a sketch of the classic
rule is shown after the options list below). It is usually faster than SGD,
but it can suffer from chaotic oscillations.
```lua
> opt = ann.optimizer.quickprop()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)
```
Its options are (default values are indicated with the `=` symbol):

- `learning_rate`: it is mandatory to give this option.
- `mu=1.75`: maximum growth factor.
- `epsilon=1e-04`: bootstrap factor.
- `max_step=1000`: maximum step value.
- `weight_decay`: weights L2 regularization.
- `L1_norm`: weights L1 regularization.
- `max_norm_penalty`: weights max norm upper bound.
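As background only, the sketch below shows a simplified form of the classic
Quickprop rule (Fahlman, 1988) for a single scalar weight: the new update is
derived from a secant approximation built with the current and previous
gradients, and its growth is limited by `mu`. How `learning_rate`, `epsilon`
and `max_step` enter the actual APRIL-ANN update is implementation specific and
is not reproduced here:

```lua
-- Simplified classic Quickprop rule for one scalar weight (a sketch, not the
-- APRIL-ANN implementation).
local function quickprop_delta(grad_curr, grad_prev, delta_prev, mu, lr)
  if delta_prev ~= 0 and grad_prev ~= grad_curr then
    -- secant (quadratic) approximation of the loss along this weight
    local delta = delta_prev * grad_curr / (grad_prev - grad_curr)
    -- limit the growth to mu times the previous update
    if math.abs(delta) > mu * math.abs(delta_prev) then delta = mu * delta_prev end
    return delta
  else
    -- fall back to a plain gradient descent step
    return -lr * grad_curr
  end
end
```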
```lua
> opt = ann.optimizer.quickprop() -- an instance
```
The `ann.optimizer.adadelta` class trains the neural network following the
AdaDelta algorithm. It is a method which dynamically adapts the learning rate
using only first order information (a sketch of the standard update is shown
after the options list below). It is as simple as SGD, and appears to be more
robust.
```lua
> opt = ann.optimizer.adadelta()
> loss,gradients = opt:execute(function(weights,it)
                                 -- HERE YOUR CODE
                                 return loss,gradients
                               end,
                               -- The connections
                               cnns)
```
Its options are (default values are indicated with the `=` symbol):

- `decay=0.95`: decay of the accumulated gradients and updates.
- `epsilon=1e-06`: to avoid numerical issues.
- `weight_decay=0.0`: weights L2 regularization.
- `max_norm_penalty=0.0`: weights max norm upper bound.
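For illustration, here is a minimal scalar sketch of the standard AdaDelta
update (Zeiler, 2012) using the `decay` and `epsilon` options described above;
the actual APRIL-ANN implementation works on whole weight matrices and also
applies `weight_decay` and `max_norm_penalty`, which are omitted here:

```lua
-- Running averages kept per weight: Eg2 for squared gradients, Edx2 for
-- squared updates (both start at zero).
local function adadelta_step(w, grad, Eg2, Edx2, decay, epsilon)
  Eg2 = decay * Eg2 + (1 - decay) * grad * grad
  -- adaptive step: ratio between the RMS of past updates and the RMS of gradients
  local dx = - math.sqrt(Edx2 + epsilon) / math.sqrt(Eg2 + epsilon) * grad
  Edx2 = decay * Edx2 + (1 - decay) * dx * dx
  return w + dx, Eg2, Edx2
end

-- example: a few steps on a single weight with a constant gradient
local w, Eg2, Edx2 = 0.5, 0.0, 0.0
for i = 1, 3 do
  w, Eg2, Edx2 = adadelta_step(w, 0.2, Eg2, Edx2, 0.95, 1e-06)
  print(i, w)
end
```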
```lua
> opt = ann.optimizer.adadelta() -- an instance
```