Here is a walk-through of the main classes in `minnn.py`.
`xp` is an alias for the numerical processing library that we're using to make computation efficient. By default we use `numpy`, a widely used numerical library that you may know of already. For brief tutorials, you can check the links provided at the end of this page. Alternatively, you can use `cupy`, an interface that is basically identical to `numpy` but allows computation on the GPU, which can be useful for speed purposes.

- The choice of computation library can be specified by the environment variable `WHICH_XP` (a selection sketch is given after this list).
- For this assignment, using `numpy` on the CPU will already be fast enough (around 6s per iteration in our runs). In our final testing we will probably use `numpy`. (Nevertheless, please feel free to use `cupy` if you find it much faster.)
- In the `to-be-implemented` parts, please simply use `xp` to denote `numpy` or `cupy`.
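As a rough illustration, the alias might be selected along the following lines; the actual logic in `minnn.py` may differ in detail.

```python
# Minimal sketch of selecting the backend via the WHICH_XP environment
# variable; the exact selection logic in minnn.py may differ.
import os

if os.environ.get("WHICH_XP", "numpy").lower() in ("cupy", "cp"):
    import cupy as xp   # GPU backend with a numpy-compatible interface
else:
    import numpy as xp  # default CPU backend
```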
The `Tensor` class is a tensor data structure, with the underlying data stored in a multidimensional array.

- This class is very similar to `torch.Tensor`.
- `Tensor.data` is the field that contains the main data for this tensor; it is an `xp.ndarray`. Updates to the parameters should change this data directly.
- `Tensor.grad` is the field for storing the gradient of this tensor. It can hold three types of values (illustrated after this list):
  - `None`: denotes a zero gradient.
  - `xp.ndarray`: should be the same size as `Tensor.data`, denoting a dense gradient.
  - `Dict[int, xp.ndarray]`: a simple simulation of sparse gradients for 2D matrices (embeddings). The `int` key denotes the index into the first dimension, while the value is an `xp.ndarray` of shape `Tensor.data.shape[1]`, denoting the gradient of the slice at that index.
- `Tensor.op` is the `Op` (see below) that generated this `Tensor`; if it is `None`, the tensor was usually not calculated by an operation but given as input.
- `Parameter` is a simple sub-class of `Tensor`, denoting persistent model parameters.
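As a small illustration (with made-up values), the three possible representations of `Tensor.grad` look like this:

```python
import numpy as xp  # or cupy

data = xp.zeros((4, 3))            # Tensor.data for a [4,3] parameter

grad_zero = None                   # no gradient accumulated yet
grad_dense = xp.ones((4, 3))       # dense gradient, same shape as data
grad_sparse = {                    # simulated sparse gradient for a 2D matrix:
    0: xp.ones(3),                 #   key   = index into the first dimension
    2: xp.full(3, 0.5),            #   value = gradient of that slice, shape data.shape[1]
}
```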
The `Op` class implements an operation that is part of a `ComputationGraph`.

- `Op.forward` and `Op.backward`: these are the forward and backward methods that calculate the operation itself and its gradient.
- `Op.ctx`: this field is populated during the `forward` operation to store all the relevant values (input, output, intermediate) that are needed in `backward` to calculate gradients. We provide two helper methods, `store_ctx` and `get_ctx`, for this, but please feel free to store things in your own way; we will not check `Op.ctx`. (A stand-in sketch of this pattern follows this list.)
- `Op.full_forward`: a simple wrapper around the actual `forward` that adds one convenience: it records `Tensor.op` on the outputted `Tensor`, so you do not need to do this in `forward`.
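The way `forward` and `backward` cooperate through the context typically looks like the sketch below. This is only a stand-in: `SketchOp` and its helpers imitate what `store_ctx`/`get_ctx` are described to do, and it uses plain arrays rather than `Tensor` objects.

```python
# Hedged stand-in for the Op forward/backward + ctx pattern; not the
# actual Op interface from minnn.py.
import numpy as xp

class SketchOp:
    def __init__(self):
        self.ctx = {}

    def store_ctx(self, **kw):            # stash values needed by backward
        self.ctx.update(kw)

    def get_ctx(self, *names):            # read them back later
        return tuple(self.ctx[n] for n in names)

class SketchSquare(SketchOp):
    def forward(self, x):
        y = x * x
        self.store_ctx(x=x, y=y)          # keep input/output for backward
        return y

    def backward(self, grad_y):
        x, = self.get_ctx("x")
        return 2.0 * x * grad_y           # d(x*x)/dx = 2x
```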
The `ComputationGraph` class keeps track of the current computational graph.

- It simply contains a list of `Op`s, which are registered in `Op.__init__`.
- During the forward pass, these `Op`s are appended incrementally in calculation order; during the backward pass (see the function `backward`), they are visited in reverse order.
The `Initializer` class is simply a collection of initializer methods that produce an `xp.ndarray` according to the specified shape and other parameters such as initializer ranges.
- `Model` maintains a collection of `Parameter`s. We provide `add_parameters` as a shortcut for making a new `Parameter` and adding it to the model.
- `Trainer` takes a `Model` and handles the update of the parameters. `Trainer.update` denotes one update step, which is implemented in the sub-classes.
- `SGDTrainer` is a simple SGD trainer. Notice that here we check whether `Tensor.grad` is sparse (simulated by a Python dictionary) or not, and update accordingly (see the sketch after this list). (In our environment with CPU, enabling sparse updates is much faster, but not necessarily with GPU.)
- Notice that at the end of each `update`, we also clear the gradients (simply by setting `Tensor.grad = None`). These could usually be two separate steps, but we combine them here for convenience.
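For reference, the dense/sparse branch in an SGD step roughly looks like the following. This is illustrative only, not the exact `SGDTrainer` code; the function name and arguments here are made up.

```python
# Rough sketch of an SGD step that handles both dense and simulated sparse
# gradients; the real SGDTrainer is organized differently.
import numpy as xp

def sgd_step(param_data, grad, lrate=0.1):
    if grad is None:                          # zero gradient: nothing to do
        return
    if isinstance(grad, dict):                # simulated sparse gradient
        for idx, g_row in grad.items():
            param_data[idx] -= lrate * g_row  # update only the touched slices
    else:                                     # dense gradient
        param_data -= lrate * grad

# After the update, the trainer clears gradients: p.grad = None
```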
- `reset_computation_graph` discards the previous `ComputationGraph` (together with the previous `Op`s and intermediate `Tensor`s) and makes a new one. This should be called at the start of each computation loop.
- `forward` gets the `xp.ndarray` value of a `Tensor`. Since we calculate everything greedily, this step simply retrieves `Tensor.data`.
- `backward` assigns a scalar gradient `alpha` to a tensor and runs backpropagation in the reversed order of the `Op` list stored inside the `ComputationGraph`.
- The remaining `Op*` classes are all sub-classes of `Op`, each denoting a specific function. We provide some operations and ask you to implement the rest.
- Take `OpDropout` as an example: here we implement inverted dropout, which scales values by `1/(1-drop)` in the forward pass. In `forward` (if training), we obtain a `mask` using `xp.random` and multiply the input by it. All the intermediate values (including input and output) are stored using `store_ctx`. In `backward`, we obtain the gradient of the output `Tensor` by retrieving the previously stored values. Then the calculated gradients are assigned to the input `Tensor` by `accumulate_grad`. (A hedged sketch of the underlying math follows this list.)
- Finally, there are some shortcut functions to make things more convenient.
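The math behind inverted dropout, on plain arrays, looks roughly like this. The real `OpDropout` works on `Tensor` objects, stores these values with `store_ctx`, and accumulates the input gradient with `accumulate_grad`; the function names below are only for illustration.

```python
# Hedged sketch of inverted dropout on plain arrays; not the actual OpDropout.
import numpy as xp

def dropout_forward(x, drop=0.5, training=True):
    if not training or drop <= 0.0:
        return x, None
    # Keep each element with probability (1-drop), rescaling by 1/(1-drop)
    mask = (xp.random.random_sample(x.shape) > drop) / (1.0 - drop)
    return x * mask, mask

def dropout_backward(grad_out, mask):
    # The same (already scaled) mask is applied to the upstream gradient
    return grad_out if mask is None else grad_out * mask
```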
Notably, `minnn.py` is not completely implemented, and there are some parts that you will need to finish. For all of the parts below, there are tests in `test_minnn.py`, which will allow you to test whether each individual part is working properly.
- `accumulate_grad` accepts one (dense) `xp.ndarray` and accumulates it into the `Tensor`'s dense gradient (`xp.ndarray`). (A small sketch of the intended semantics follows this list.)
- `accumulate_grad_sparse` accepts a list of `(index, xp.ndarray)` pairs and accumulates them into the `Tensor`'s simulated sparse gradient (`dict`).
- We will check the gradients before and after these methods. Notice that we reuse `Tensor.grad` for both dense and (simulated) sparse gradients, so please do not apply both at the same time. (See also `get_dense_grad` for how to convert from simulated sparse gradients to dense ones.)
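A hedged sketch of the accumulation semantics on standalone values (the real methods live on `Tensor` and modify `Tensor.grad` in place; these helper names are made up):

```python
# Dense gradients add elementwise; simulated sparse gradients add per index.
import numpy as xp

def accumulate_dense(old_grad, g):
    return g.copy() if old_grad is None else old_grad + g

def accumulate_sparse(old_grad, idx_grad_pairs):
    d = {} if old_grad is None else old_grad
    for idx, g_row in idx_grad_pairs:
        d[idx] = d[idx] + g_row if idx in d else g_row.copy()
    return d
```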
- The Xavier/Glorot uniform initializer accepts inputs `shape` and `gain`, and outputs an `xp.ndarray` whose shape is `shape`. (`gain` simply means that we finally scale the weights by this value.) A hedged sketch of the formula is given after this list.
- See *Understanding the difficulty of training deep feedforward neural networks* - Glorot, X. & Bengio, Y. (2010) for details about Xavier/Glorot initialization, and this blog for more details about initialization in general.
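The core of Xavier/Glorot uniform initialization is sampling from `U(-a, a)` with `a = gain * sqrt(6 / (fan_in + fan_out))`. How `fan_in` and `fan_out` are derived from `shape` is a convention choice (the split used below is an assumption); check `test_minnn.py` for the exact convention expected.

```python
# Hedged sketch of Xavier/Glorot uniform initialization; not the exact
# Initializer code from minnn.py.
import numpy as xp

def xavier_uniform_sketch(shape, gain=1.0):
    fan_out = shape[0]
    fan_in = int(xp.prod(xp.asarray(shape[1:]))) if len(shape) > 1 else shape[0]
    a = gain * xp.sqrt(6.0 / (fan_in + fan_out))
    return xp.random.uniform(-a, a, size=shape).astype(xp.float32)
```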
We provide an implementation of `SGDTrainer`; please implement a similar one for `MomentumTrainer`, which is SGD with momentum.

- Notice that there can be some variations here. You can implement according to this formula: `m <- mrate*m + (1-mrate)*g`, `p <- p - lrate*m` (a hedged sketch is given after this list), but if you find something that works better, feel free to use that as well.
- Notice that for `update_sparse`, we still need to update the parameters if there is a historical `m`, even if there are no gradients for the current step.
- Please remember to clear gradients (by setting `p.grad = None`) at the end of `update`, similar to `SGDTrainer`.
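A dense momentum step following the formula above might look like this; the function name, argument layout, and how the momentum buffer is stored are illustrative only, not the `MomentumTrainer` interface.

```python
# Hedged sketch of one dense momentum update: m <- mrate*m + (1-mrate)*g,
# p <- p - lrate*m.
import numpy as xp

def momentum_step(p, g, m, lrate=0.1, mrate=0.9):
    """p: parameter data, g: dense gradient, m: momentum buffer (same shape)."""
    m = mrate * m + (1.0 - mrate) * g   # update momentum buffer
    p -= lrate * m                      # in-place parameter update
    return m                            # caller keeps m for the next step

# Tiny usage example:
p, m = xp.ones(3), xp.zeros(3)
m = momentum_step(p, xp.array([0.5, -0.5, 1.0]), m)
```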
- Please implement the `forward` and `backward` methods for these `Op`s. (A sketch of the underlying math for `OpDot` and `OpTanh` is given after this list.)
- `OpLookup` represents a "lookup" operation: it accepts a `Tensor` matrix `W_emb` (`[N,D]`) and a list of word indices (`[n]`), and returns another `Tensor` matrix (`[n,D]`).
- `OpDot` represents a matrix-vector multiplication: it accepts a `Tensor` matrix `W` (`[M,N]`) and a `Tensor` vector (`[N]`), and returns another `Tensor` vector (`[M]`).
- `OpTanh` calculates a tanh: it accepts any `Tensor` and returns another one with the same shape.
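The underlying math on plain arrays is sketched below. The real `Op`s work on `Tensor` objects, store intermediates with `store_ctx`, and accumulate gradients into the inputs via `accumulate_grad`; the variable names here are only for illustration.

```python
import numpy as xp

# OpTanh: y = tanh(x); backward: dL/dx = dL/dy * (1 - y^2)
x = xp.array([0.0, 0.5, -1.0])
y = xp.tanh(x)
grad_y = xp.ones_like(y)           # pretend upstream gradient
grad_x = grad_y * (1.0 - y * y)

# OpDot: h = W @ v with W [M,N] and v [N];
# backward: dL/dW = outer(dL/dh, v), dL/dv = W.T @ dL/dh
W = xp.ones((2, 3))
v = xp.array([1.0, 2.0, 3.0])
h = W @ v
grad_h = xp.ones_like(h)
grad_W = xp.outer(grad_h, v)
grad_v = W.T @ grad_h
```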
- The only external library allowed is `numpy`/`cupy`. No other libraries can be utilized, for example `pytorch` or other tools.
- With the default settings of `classifier.py`, the accuracies on sst are around 41 (dev) / 42 (test).
- In `classifier.py`, we also provide an option `do_gradient_check` to do gradient checking with finite differences, which can be utilized for debugging.
- Please do not change any other existing parts of `minnn.py` (other than the `to-be-implemented` ones) or the method signatures (names and argument names). But please feel free to add any helper functions as long as they do not conflict with existing ones.
- One thing to notice is the difference between `Tensor` and `xp.ndarray`. The general rule of thumb is that the return value of an `Op*`'s `forward` should be a `Tensor`. Nevertheless, in `Op.ctx`, we can store both `Tensor`s and `xp.ndarray`s. In addition, please check the type hints of the arguments and the other provided `Op*` classes for reference.