If we take a look at the linear relationship in each synapse, for both inputs, in the next figure:

We can see that, for the gradient with respect to the bias b2, we first transpose the column of 1s and place it before the back-propagated error, as we intend to aggregate all the errors committed in the different observations into a single value to update this bias. On the other hand, for the a2W2 term we simply make a matrix multiplication of the back-propagated error by a column of 1s, as we want to keep track of each individual back-propagated error to continue backwards.

We already know that addGates act like distributors, and that the local derivative in both cases is 1.

That is why we have 3 dimensions, one for each output of the hidden neurons; each of them adds up the error committed across the different observations.

We are splitting the back-propagated error of each observation among the different hidden neurons; that is why we have a (4×3) matrix (4 observations and 3 neurons).

Figure 4.

We are applying exactly the same procedure we applied in the first addGate.

The last step is another mulGate:

The last transpose of X is used to combine the back-propagated error of each observation with every input of that observation, for every neuron in the hidden layer.

3. More details
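The backward steps described above can be sketched in NumPy to make the matrix shapes concrete. This is only an illustrative sketch under stated assumptions, not the article's own implementation: the sigmoid activation, the squared-error loss, the linear output, the XOR-style data, and every name (`forward`, `backward`, `delta3`, and so on) are hypothetical choices. Only the shapes (4 observations, 2 inputs, 3 hidden neurons) come from the text.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation; the text does not fix one in this section.
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    z2 = X @ W1 + b1      # (4x2)(2x3) + (1x3) broadcast over rows -> (4x3)
    a2 = sigmoid(z2)      # hidden activations, (4x3)
    z3 = a2 @ W2 + b2     # (4x3)(3x1) + (1x1) -> (4x1)
    return a2, z3

def backward(X, y, W1, b1, W2, b2):
    ones = np.ones((X.shape[0], 1))       # column of 1s, one per observation
    a2, z3 = forward(X, W1, b1, W2, b2)

    # Assumed loss: L = 0.5 * sum((z3 - y)^2), so dL/dz3 = z3 - y.
    delta3 = z3 - y                       # (4x1) back-propagated error

    # Output addGate: local derivative is 1 on both branches.
    grad_b2 = ones.T @ delta3             # (1x4)(4x1) -> (1x1), sums over observations
    delta_a2W2 = delta3                   # (4x1), passed backwards unchanged

    # Output mulGate a2 @ W2.
    grad_W2 = a2.T @ delta_a2W2           # (3x4)(4x1) -> (3x1), one entry per hidden neuron
    delta_a2 = delta_a2W2 @ W2.T          # (4x1)(1x3) -> (4x3), error split across neurons

    # Through the assumed sigmoid: a2 * (1 - a2) is its derivative.
    delta_z2 = delta_a2 * a2 * (1.0 - a2)  # (4x3)

    # Hidden addGate: same procedure as the first addGate.
    grad_b1 = ones.T @ delta_z2           # (1x4)(4x3) -> (1x3)

    # Last mulGate X @ W1: the transpose of X pairs each error with each input.
    grad_W1 = X.T @ delta_z2              # (2x4)(4x3) -> (2x3)

    return {"W1": grad_W1, "b1": grad_b1, "W2": grad_W2, "b2": grad_b2}

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # 4 observations, 2 inputs
y = np.array([[0.0], [1.0], [1.0], [0.0]])                      # hypothetical targets
W1 = rng.normal(size=(2, 3))
b1 = np.zeros((1, 3))
W2 = rng.normal(size=(3, 1))
b2 = np.zeros((1, 1))

grads = backward(X, y, W1, b1, W2, b2)
for name, g in grads.items():
    print(name, g.shape)
```

Note that `ones.T @ delta` sums the errors over the observations; dividing by the number of observations would turn that sum into a true average, which only rescales the update.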