# Workshop #5 Supervised Learning: FFBPN

Posted:2021-04-19 08:43:57　Click:245

# An introduction to backpropagation

Neural networks, like any other supervised learning algorithms, learn to map an input to an output based on some provided examples of (input, output) pairs, called the training set. In particular, neural networks performs this mapping by processing the input through a set of transformations. A neural network is composed of several layers, and each of these layers are made of units (also called neurons) as illustrated below:

In the picture above, the input is transformed first through the hidden layer 1, then the second one and finally an output is predicted. Each transformation is controlled by a set of weights (and biases). During training, to indeed learn something, the network needs to adjust these weights to minimize the error (also called the loss function) between the expected outputs and the ones it maps from the given inputs. Using gradient descent as an optimization algorithm, the weights are updated at each iteration as:

where L is the loss function and $\fn_phv&space;\large&space;\epsilon$ is the learning rate.

As we can see above, the gradient of the loss with respect to the weight is substracted from the weight at each iteration. This is the so called gradient descent. The gradient $\dpi{100}&space;\large&space;\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w}$ can be interpreted as a measure of the contribution of the weight to the loss. Therefore the larger is this gradient (in absolute value), the more the weight is updated during an iteration of gradient descent.

The minimization of the loss function task ends up being related to the evaluation of the gradients described above. We will review 3 proposals to perform this evaluation:

• Analytical calculation of the gradients.
• Approximation of the gradients as being: $\dpi{100}&space;\large&space;\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w}&space;=&space;\frac{L(w&space;+&space;\delta&space;w)&space;-&space;L(w)}{\delta&space;w}$, where $\dpi{300}&space;\large&space;\dpi{100}&space;\large&space;\delta&space;w&space;\rightarrow&space;0$.
• Backpropagation or reverse mode autodiff.

Before going into the details of these proposals, we will first clearly define our problem and simplify it for the sake of the discussion.

### Problem Definition

Let's discuss a little bit about how the input is transformed to produce the hidden layer representation. In neural network, a layer is obtained by performing two operations on the previous layer:

• First the previous layer is tranformed via a linear operation: the value of the previous layer is multiplied by a weight, and a bias is added to it. It gives: $\dpi{100}&space;\large&space;z=&space;xw&space;+&space;b$, where $\dpi{100}&space;\large&space;x$ is the value of the previous layer unit, $\dpi{100}&space;\large&space;w$ and $\dpi{100}&space;\large&space;b$ are respectively the weight and the bias discussed above.
• Second, the previous linear operation is used as an input to the activation function of the unit. This activation is generally chosen to introduce non linearity in order to solve complex tasks. Here we will simply consider that this activation function is a sigmoid function: $\dpi{100}&space;\large&space;\sigma(x)&space;=&space;\frac{1}{1+\exp(-x)}$. As a consequence the value y of a layer can be written as: $\dpi{100}&space;\large&space;y&space;=&space;\sigma&space;(z)&space;=&space;\sigma&space;(xw&space;+&space;b)$.

So now, considering our case, with an input layer, one hidden layer, and an output layer, all made of a single unit and respectively called x, h and y, we can write:

$\dpi{100}&space;\large&space;h&space;=&space;\sigma&space;(xw_1&space;+&space;b_1)$, where $\dpi{100}&space;\large&space;w_1$ and $\dpi{100}&space;\large&space;b_1$ are respectively the weight and the bias used to compute the hidden unit from the input.

$\dpi{100}&space;\large&space;y&space;=&space;\sigma&space;(hw_2&space;+&space;b_2)$, where $\dpi{100}&space;\large&space;w_2$ and $\dpi{100}&space;\large&space;b_2$ are respectively the weight and the bias used to compute the output from the hidden unit.

From now on, we are able to calculate the output y based on the input x, through a set of transformations. This is the so called forward propagation since this calculation goes forward inside the network.

We now need to compare this predicted ouptut to the true one $\dpi{100}&space;\large&space;y_T$. As explained earlier, we use a loss function to quantify the error that the network does while prediciting. Here we will consider as a loss function the squared error defined as :

$\dpi{100}&space;\large&space;L&space;=&space;\frac{1}{2}&space;(y-y_T)^2$

As discussed earlier, the weights (and biases) need to be updated according to the gradient of this loss function L with respect to these weights (and biases). Therefore the challenge here is to evaluate these gradients. The first approach would be to manually derive them.

### Analytical differentiation

Although this method is cumbersome and error prone, it is worth to investigate to better understand the challenge Here we simplify a lot the problem since we have a single hidden layer and a single unit per layer. And yet the analytical derivation requires quite some attention.

$\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w_2}&space;=&space;\frac{1}{2}&space;\frac{\partial&space;\left((y-y_T)^2\right)}{\partial&space;w_2}=\frac{1}{2}&space;\left(\frac{\partial&space;\left(y^2\right)}{\partial&space;w_2}&space;-&space;2y_T&space;\frac{\partial&space;y}{\partial&space;w_2}\right)$

Knowing that $\dpi{100}&space;\large&space;y&space;=&space;\sigma&space;(hw_2&space;+b_2)$, we get :

$\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w_2}&space;=&space;\frac{1}{2}&space;\left(\frac{\partial&space;\left(\frac{1}{\left(&space;1&space;+&space;\exp(-hw_2&space;-b_2\right))^2}\right)}{\partial&space;w_2}&space;-&space;2y_T&space;\frac{\partial&space;\frac{1}{1&space;+&space;\exp(-hw_2&space;-&space;b_2)}}{\partial&space;w_2}\right)=h\frac{&space;\exp(-hw_2&space;-&space;b_2)}{\left(1+\exp(-hw_2&space;-&space;b_2)\right)^2}&space;\left(\frac{1}{1+\exp(-hw_2&space;-&space;b_2)}&space;-&space;y_T\right)$

Here we derived the gradient with respect to $w_2$, and the calculation for the one with respect with $\dpi{100}&space;\large&space;w_1$ would be even more tedious. Therefore such an analytical approach would be very hard to implement for a complex network. In addition, computing wise this approach would be quite inefficient since we could not leverage the fact that the gradients share some common definition as we will soon discuss. A way more easy way to get these gradients would be to use a numerical approximation.

### Numerical differentiation

Trading accuracy for simplicity, we can obtain the gradient using the following :

$\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w}&space;\simeq&space;\frac{L(w&space;+&space;\delta&space;w)&space;-&space;L(w)}{\delta&space;w}$, with $\dpi{100}&space;\large&space;\delta&space;w&space;\rightarrow&space;0$

As suggested above, although simpler than the analytical derivation, this numerical differentiation is also less precise. In addition for every gradients to evaluate, we would have to calculate the loss function at least once (doing one forward pass by weights and biases). For a neural network with 1 million weigth parameters, it would requires 1 million forward passes, which is definitely not efficient to compute. Let's now come to a better solution and the core of this article by reviewing the backpropagation approach.

### Backpropagation

Before speaking in more details about what backpropagation is, let's first introduce the computational graph that leads to the evaluation of the loss function.

The nodes in this graph correspond to all the values that are computed in order to get the loss L. If a variable is computed by applying an operation to another variable, an edge is drawn between the two variable nodes. Looking at this graph, and making use of the chain rule of calculus, we can express the gradient of L with respect to the weights (or biases) as:

$\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w_2}&space;=&space;\frac{\partial&space;L}{\partial&space;y}&space;\frac{\partial&space;y}{\partial&space;w_2}&space;=&space;\frac{\partial&space;L}{\partial&space;y}&space;\frac{\partial&space;y}{\partial&space;z_2}\frac{\partial&space;z_2}{\partial&space;w_2}$

Same goes for the weight $\dpi{100}&space;\large&space;w_1$:

$\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w_1}&space;=&space;\frac{\partial&space;L}{\partial&space;y}&space;\frac{\partial&space;y}{\partial&space;w_1}&space;=&space;\frac{\partial&space;L}{\partial&space;y}&space;\frac{\partial&space;y}{\partial&space;z_2}&space;\frac{\partial&space;z_2}{\partial&space;w_1}&space;=&space;\frac{\partial&space;L}{\partial&space;y}&space;\frac{\partial&space;y}{\partial&space;z_2}&space;\frac{\partial&space;z_2}{\partial&space;h}&space;\frac{\partial&space;h}{\partial&space;w_1}&space;=&space;\frac{\partial&space;L}{\partial&space;y}&space;\frac{\partial&space;y}{\partial&space;z_2}&space;\frac{\partial&space;z_2}{\partial&space;h}&space;\frac{\partial&space;h}{\partial&space;z_1}&space;\frac{\partial&space;z_1}{\partial&space;w_1}$

One very important thing to notice here is that the evaluation of the gradient $\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w_1}$ can reuse some of the calculations perfomed during the evaluation of the gradient $\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w_2}$. It is even clearer if we evaluate the gradient $\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;b_1}$:

$\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;b_1}&space;=&space;\frac{\partial&space;L}{\partial&space;y}&space;\frac{\partial&space;y}{\partial&space;z_2}&space;\frac{\partial&space;z_2}{\partial&space;h}&space;\frac{\partial&space;h}{\partial&space;z_1}&space;\frac{\partial&space;z_1}{\partial&space;b_1}$

We see that the first four terms on the righ hand of the equation are the same than the one from $\dpi{100}&space;\large&space;\frac{\partial&space;L}{\partial&space;w_1}$.

As you can see in the equations above, we calculate the gradient starting from the end of the computational graph, and proceed backward to get the gradient of the loss with respect to the weights (or the biases). This backward evaluation gives its name to the algoritm: backpropagation. The backpropagation algorithm can be illustrated by the image below:

In pratice, one iteration of gradient descent would now require one forward pass, and only one pass in the reverse direction computing all the partial derivatives starting from the output node. It is therefore way more efficient than the previous approaches. In the original paper about backpropagation published in 1986 , the authors (among which Geoffrey Hinton) used for the first time backpropagation to allow internal hidden units to learn features of the task domain.

# Backpropagation Parameters

Refer to the website：https://keras.io/api/