⚠️ This tutorial assumes you have done dslr and that you have a good understanding of the basics of machine learning and linear algebra.
- Introduction 👋
- Layers 📚
- Forward Propagation ➡️
- Backward Propagation 🔙
- Early stopping 🛑
- Complex optimizations 🚀
- Resources 📖
multilayer-perceptron gives us a problem to solve that is fairly similar to the one we had in dslr, if not simpler: we are given a dataset of fine-needle aspirates of breast mass, and we have to predict whether the mass is benign or malignant.
The project is not really about the dataset, but rather about the implementation of a neural network, and how modular it is. Here, we will cover the various "modules" of the neural network, and how they interact with each other.
💡 I recommend you watch sentdex's Neural Networks from Scratch first, as it provides great explanations on each component of a NN and their respective roles.
A neural network is made of layers, and layers are made of neurons, which are the basic unit of computation in a neural network.
A layer is characterized by its input and output sizes (e.g. 4 neurons in, 3 neurons out), along with its weights and biases.
The weights and biases are the parameters of the layer, and they are what the neural network will tweak during its training.
There are several types of layers, but we will explain only two of them here.
A dense layer, or fully connected layer, is a type of layer where each neuron is connected to each neuron in the previous layer.
Each weight in the layer is associated with a connection between two neurons (i.e. how much the output of a neuron in the previous layer impacts a neuron in the current one).
Activations are actually functions, as they do not depend on any external weight but only on the output of the neurons. However, we will treat them as some kind of intermediate layer here, as it will be more relevant when we get to forward and back propagation.
In dslr, our activation function was the sigmoid function, which squashes the output of the neuron between 0 and 1.
It is now considered a bit old-fashioned, so you would rather use the ReLU function, which is much simpler and faster to compute.
💡 The ReLU function is defined as $f(x) = \max(0, x)$. Simpler than $\frac{1}{1 + e^{-x}}$, to be honest.
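For reference, both activations might look like this in NumPy (a minimal sketch; the function names are my own):

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    # Squashes every value into (0, 1).
    return 1 / (1 + np.exp(-x))

def relu(x: np.ndarray) -> np.ndarray:
    # Keeps positive values as-is, zeroes out the rest.
    return np.maximum(0, x)
```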
Forward propagation is the simple process of passing an input through the layers of the neural network.
You pass an input to the input layer, that will pass it to a hidden layer that will compute an output, passed to an activation layer, passed to another hidden layer, and so on, until you reach the output layer.
The output layer will then give the prediction of the neural network.
Simple.
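To make this concrete, here is a minimal sketch of what that chaining could look like with the modular design described above. The class layout and the $Y = XW + B$ convention (one sample per row, weights of shape `(n_in, n_out)`) are assumptions, not requirements:

```python
import numpy as np

class Dense:
    def __init__(self, n_in: int, n_out: int):
        # Small random weights, zero biases.
        self.weights = np.random.randn(n_in, n_out) * 0.1
        self.biases = np.zeros(n_out)

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weights + self.biases

class ReLU:
    def forward(self, x: np.ndarray) -> np.ndarray:
        return np.maximum(0, x)

# The output of each layer is the input of the next.
layers = [Dense(4, 3), ReLU(), Dense(3, 2)]
x = np.random.randn(8, 4)  # a batch of 8 samples with 4 features each
for layer in layers:
    x = layer.forward(x)
print(x.shape)  # (8, 2)
```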
With forward propagation, we now have a prediction. However, we need to know how good this prediction is, and how we can improve it.
In dslr, we directly calculated the derivative of the loss w.r.t. the weights and biases, and updated them accordingly, because each weight had a direct impact on the output.
However, in a neural network with multiple layers, the weights of the first layer have an indirect impact on the output, and it gets trickier to derive.
That is where backpropagation and the chain rule come in.
We basically decompose the task of computing the derivatives per layer, and pass them sequentially from the output layer to the input layer.
That is to say, each element will need to implement the derivative of its output error w.r.t. its input: given $\frac{\partial E}{\partial Y}$, it must compute $\frac{\partial E}{\partial X}$, which then becomes the output gradient of the element before it.
Forward and backward propagation visualized for one layer.
💡 Back propagation is one of the hardest concepts of this project, so be sure to check out the resources, particularly this video that decomposes the problem amazingly, also available as an article.
Naturally, the first derivative we will need to compute is the derivative of the loss w.r.t. the output of the neural network.
Here, the network's output passes through a softmax function, which turns a vector of raw scores into a vector of probabilities that sum to 1 (e.g. $[2.0, 1.0, 0.1]$ becomes roughly $[0.66, 0.24, 0.10]$).
Then, it is passed to a cross-entropy loss function, which computes the difference between the predicted probabilities and the actual probabilities (e.g. the actual result was the one-hot vector $[0, 1, 0]$).
The derivative of the cross-entropy loss w.r.t. its input is quite tricky, and the softmax function is even trickier.
However, the derivative of the softmax function + cross-entropy loss is quite simple, and is given by the formula $\frac{\partial E}{\partial X} = Y_{pred} - Y_{true}$.
💡 Capital letters denote vectors, and normal letters are scalars. $Y$ is a vector $[y_1, y_2, y_3]$, for example.
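A quick numerical sketch of this result (the helper names are mine):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw network output
y_true = np.array([0.0, 1.0, 0.0])  # one-hot target

y_pred = softmax(scores)                 # ≈ [0.66, 0.24, 0.10]
loss = -np.sum(y_true * np.log(y_pred))  # cross-entropy loss

# Gradient of the loss w.r.t. the raw scores, in one subtraction:
grad = y_pred - y_true                   # ≈ [0.66, -0.76, 0.10]
```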
We will now see how to compute these gradients for each type of layer. In the following, we will call $\frac{\partial E}{\partial Y}$ the output gradient of an element, and $\frac{\partial E}{\partial X}$ its input gradient.
⚠️ Here, all the derivatives are given already reduced, but you should watch the whole process of deriving them in the video mentioned above.
We just saw that the derivative of the softmax + cross-entropy loss is $Y_{pred} - Y_{true}$.
The derivative of activation layers is given by its output gradient, multiplied element-wise by the derivative of the activation function used.
Let's see some derivatives.
The derivative of the sigmoid function is given by: $f'(x) = f(x)(1 - f(x))$
💡 Here for instance, $\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} \odot f(x)(1 - f(x))$.
The derivative of the ReLU function is quite simple: $f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$
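In code, an activation layer could then implement the element-wise rule above as follows (a sketch; the `forward`/`backward` interface is assumed):

```python
import numpy as np

class Sigmoid:
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.out = 1 / (1 + np.exp(-x))  # keep f(x) for the backward pass
        return self.out

    def backward(self, output_gradient: np.ndarray) -> np.ndarray:
        # dE/dX = dE/dY ⊙ f(x)(1 - f(x))
        return output_gradient * self.out * (1 - self.out)

class ReLU:
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.x = x  # keep the input for the backward pass
        return np.maximum(0, x)

    def backward(self, output_gradient: np.ndarray) -> np.ndarray:
        # dE/dX = dE/dY ⊙ (1 where x > 0, 0 elsewhere)
        return output_gradient * (self.x > 0)
```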
Dense layers are a bit more complex, because aside from computing the input gradient based on the output gradient, we also need to compute the weights and biases gradients in order to update them.
Here, we assume the weights are a matrix of shape $(\text{input size} \times \text{output size})$, so that the forward pass computes $Y = XW + B$.
The input gradient is given by: $\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} W^T$
The weights gradient is given by: $\frac{\partial E}{\partial W} = X^T \frac{\partial E}{\partial Y}$
The biases gradient is simply the output gradient: $\frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y}$ (summed over the batch when working with batches).
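Put together, a dense layer might look like this (a sketch assuming the $Y = XW + B$ convention above, with one sample per row):

```python
import numpy as np

class Dense:
    def __init__(self, n_in: int, n_out: int, learning_rate: float = 0.01):
        self.weights = np.random.randn(n_in, n_out) * 0.1
        self.biases = np.zeros(n_out)
        self.learning_rate = learning_rate

    def forward(self, x: np.ndarray) -> np.ndarray:
        self.x = x  # keep the input for the backward pass
        return x @ self.weights + self.biases

    def backward(self, output_gradient: np.ndarray) -> np.ndarray:
        # Gradients of the loss w.r.t. the parameters.
        weights_gradient = self.x.T @ output_gradient
        biases_gradient = output_gradient.sum(axis=0)  # the only sum needed
        # Gradient of the loss w.r.t. the input, for the previous layer.
        input_gradient = output_gradient @ self.weights.T
        # Vanilla gradient descent update.
        self.weights -= self.learning_rate * weights_gradient
        self.biases -= self.learning_rate * biases_gradient
        return input_gradient
```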
Now, let's take the following neural network:
- Input layer of size 4
- ReLU activation layer
- Hidden layer of size 3
- ReLU activation layer
- Output layer of size 2
- Softmax activation layer
During a single epoch, backpropagation will look like this:
- Compute the output of the neural network and pass it through softmax to get $Y_{pred}$.
- Compute the loss gradient $Y_{pred} - Y_{true}$.
- Pass it to the output layer to get the gradient of its loss w.r.t. its input.
- Pass this gradient to the ReLU activation layer to get the gradient of its loss w.r.t. its input.
- Pass this gradient to the hidden layer to get the gradient of its loss w.r.t. its input, and allow it to update its weights and biases.
- And so on, until we reach the input layer.
💡 You might end up fighting with NumPy's broadcasting when working with batches. Do not sum or average the batches without reason. You should be able to implement this without doing any other sum than the biases update's.
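With every element exposing `forward` and `backward`, one training step over a batch boils down to two loops (a sketch reusing the hypothetical classes above):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Row-wise softmax for a batch of score vectors.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def train_step(layers: list, x_batch: np.ndarray, y_true: np.ndarray) -> None:
    # Forward pass: chain the layers.
    output = x_batch
    for layer in layers:
        output = layer.forward(output)
    y_pred = softmax(output)

    # Backward pass: start from the softmax + cross-entropy gradient
    # and walk the layers in reverse, each one feeding the previous.
    gradient = y_pred - y_true
    for layer in reversed(layers):
        gradient = layer.backward(gradient)
```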
When training a neural network, it is good practice to keep track of the loss on the training data, obviously, but also on some validation data, that is data that the neural network has never seen before.
Typically, if we plot the training loss and the validation loss, we will see that the training loss will decrease over time, while the validation loss will decrease at first, but then increase again.
This is called overfitting, and it happens when the neural network learns the training data too well, and starts to "learn the noise" in the data.
Although the loss on the training data is still decreasing, it is basically useless to keep going, as it will never perform that well on new data.
That is why we use early stopping, which consists of stopping the training when the validation loss starts to increase again.
It is fairly simple to implement; you just need to add the following logic to your training loop:
- Initialize a counter and a variable to keep track of the best loss.

```python
patience_counter = 0
best_loss = float("inf")
```

- If the validation loss is lower than the best loss so far, save it and reset the counter.

```python
if val_loss < best_loss:
    best_loss = val_loss
    patience_counter = 0
```

- If it is higher, increment the counter.

```python
else:
    patience_counter += 1
```

- If the counter reaches a certain threshold, stop the training.

```python
if patience_counter >= patience:
    break
```
💡 You can also save the model's weights when the validation loss is the lowest, and load them back when you stop the training.
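Putting it together with that tip, the check inside the training loop might look like this (a sketch; `train_one_epoch`, `compute_loss` and the weight getters/setters are hypothetical helpers):

```python
patience = 10  # epochs to wait before giving up
patience_counter = 0
best_loss = float("inf")
best_weights = None

for epoch in range(max_epochs):
    train_one_epoch(model, x_train, y_train)
    val_loss = compute_loss(model, x_val, y_val)

    if val_loss < best_loss:
        best_loss = val_loss
        patience_counter = 0
        best_weights = model.get_weights()  # snapshot the best model so far
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break  # validation loss has stopped improving

model.set_weights(best_weights)  # roll back to the best epoch
```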
That's it!
Vanilla gradient descent consists of simply updating the weights and biases based on the gradients.
The concept of momentum introduces some inertia in the updates, by adding a fraction of the previous update to the current update.
To implement it, you can simply store the previous update in a variable, and update it as follows:
```python
if velocity is None:
    velocity = np.zeros_like(weights)
velocity = momentum * velocity + learning_rate * gradient
weights -= velocity
```
Note that the only difference with vanilla gradient descent is the addition of the `momentum * velocity` term, as vanilla gradient descent would simply be:

```python
weights -= learning_rate * gradient
```
RMSprop is a more complex optimization algorithm that adapts the learning rate based on the gradients.
It is based on AdaGrad, which adapts the learning rate based on the sum of the squared gradients. However, AdaGrad is considered too aggressive, as it makes the learning rate decrease too quickly.
RMSprop introduces a decay factor to the sum of the squared gradients, which allows the learning rate to decrease more slowly.
We can introduce some "velocity" term as we did before, that is defined as follows:
$v_t = \beta v_{t-1} + (1 - \beta) {\nabla J(\theta)}^2$
Where:
- $v_t$ is the velocity at time $t$
- $\beta$ is the decay factor (around 0.9)
- $\nabla J(\theta)$ is the gradient of the loss w.r.t. the weights
💡 ${\nabla J(\theta)}^2$ is of course a Hadamard product; we could have noted it $\nabla J(\theta) \odot \nabla J(\theta)$.
Then, we can update the weights as follows:
$\theta = \theta - \frac{\eta}{\sqrt{v_t} + \epsilon} \nabla J(\theta)$
As you can see, the velocity is based on the recent squared gradients: the larger they have been, the smaller the effective learning rate becomes.
Where:
- $\eta$ is the learning rate
- $\epsilon$ is a small number (e.g. $10^{-7}$), to avoid division by zero
The implementation logic is the same as before, only the update formula changes!
You can check out this interesting article for a more in-depth explanation.
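A sketch of the corresponding update, mirroring the momentum snippet above (variable names assumed):

```python
if velocity is None:
    velocity = np.zeros_like(weights)
# Exponentially decaying average of the squared gradients.
velocity = beta * velocity + (1 - beta) * gradient ** 2
# Scale the step down for weights with large recent gradients.
weights -= learning_rate * gradient / (np.sqrt(velocity) + epsilon)
```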
Adam is a combination of momentum and RMSprop, and is considered one of the best optimization algorithms for neural networks. It stands for Adaptive Moment Estimation.
It introduces two moment terms, that we will denote $m_t$ and $v_t$. The first is a momentum-like moving average of the gradient:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(\theta)$
Where:
- $m_t$ is the momentum at time $t$
- $\beta_1$ is its decay factor (around 0.9)
- $\nabla J(\theta)$ is the gradient of the loss w.r.t. the weights

The second is the velocity term from RMSprop:
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) {\nabla J(\theta)}^2$
Where:
- $v_t$ is the velocity at time $t$
- $\beta_2$ is its decay factor (around 0.999)
- $\nabla J(\theta)$ is the gradient of the loss w.r.t. the weights
Then, we can update the weights as follows:
$\theta = \theta - \frac{\eta}{\sqrt{v_t} + \epsilon} m_t$
Where:
- $\eta$ is the learning rate
- $\epsilon$ is a small number (e.g. $10^{-7}$), to avoid division by zero
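A sketch of the corresponding update (variable names assumed). Note that the original Adam paper also bias-corrects $m_t$ and $v_t$ to compensate for their initialization at zero, which is included below:

```python
if m is None:
    m = np.zeros_like(weights)
    v = np.zeros_like(weights)
t += 1  # timestep, starting at 1
# Momentum-style first moment and RMSprop-style second moment.
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient ** 2
# Bias correction: m and v start at zero, so early estimates are too small.
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
```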
- 📺 YouTube − Neural Networks from Scratch : understand the various components of a neural network and how to create one in a modular way.
- 📺 YouTube − Backpropagation, step-by-step | DL3 : understand how backpropagation works.
- 📺 YouTube − Backpropagation calculus | DL4 : understand the math behind backpropagation.
- 📺 YouTube − Backpropagation Algorithm | Neural Networks : get a good insight into the chain rule's logic in backpropagation.
- 📺 YouTube − Neural Network from Scratch | Mathematics & Python Code : understand the actual implementation of backpropagation in code.
- 📖 Medium − Neural Network from scratch in Python : the corresponding article to the previous video.
- 📖 Pinecone − Cross-Entropy Loss: make predictions with confidence : the full derivation of the softmax + cross-entropy loss.
- 📺 YouTube − Optimization for Deep Learning (Momentum, RMSprop, AdaGrad, Adam)
- 📖 Medium − Understanding RMSProp: A Simple Guide to One of Deep Learning’s Powerful Optimizers