for each neuron of each layer.
So now variables with only a superscript letter, such as δᴸ and wˡ, are vectors and matrices, whereas variables with both superscript and subscript letters, such as δᵢˡ and wˡᵢᵣ, are single values.
Since the output layer now has multiple neurons, the single-output loss function ℒ(y, ŷ) = ½(y − ŷ)² is no longer sufficient.
We have to account for all output neurons.
So we define the cost function as half the sum of the squared errors over all output neurons: C = ½ ∑ᵢ (yᵢ − ŷᵢ)².
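As a small illustration of this cost, here is a minimal sketch in plain Python. The function names `cost` and `cost_gradient` are hypothetical, chosen only for this example. The gradient line relies on the fact that differentiating C = ½ ∑ᵢ (yᵢ − ŷᵢ)² with respect to ŷᵢ gives ŷᵢ − yᵢ.

```python
def cost(y, y_hat):
    """C = 1/2 * sum_i (y_i - y_hat_i)^2, summed over all output neurons."""
    return 0.5 * sum((y_i - yh_i) ** 2 for y_i, yh_i in zip(y, y_hat))


def cost_gradient(y, y_hat):
    """Vector of dC/dy_hat_i = y_hat_i - y_i, one entry per output neuron."""
    return [yh_i - y_i for y_i, yh_i in zip(y, y_hat)]


y = [1.0, 0.0]        # target values for two output neurons
y_hat = [0.5, 0.5]    # hypothetical network outputs

print(cost(y, y_hat))           # 0.25
print(cost_gradient(y, y_hat))  # [-0.5, 0.5]
```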
So the formulas that we have computed so far take a vector form:
δᴸ = ∇aC ⊙ gᴸ′(zᴸ)
δˡ = ((wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ gˡ′(zˡ)
where ∇aC is the vector of partial derivatives of the cost C with respect to the network outputs aᵢᴸ, so its i-th component is ∂C/∂aᵢᴸ.
The ⊙ operator is the element-wise (Hadamard) product: it multiplies two vectors or matrices of the same shape entry by entry.
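To make the ⊙ operator concrete, the sketch below contrasts it with the ordinary dot product. The helper names and the numeric values are made up for illustration; they stand in for ∂C/∂aᵢᴸ and g′(zᵢᴸ) at an output layer with three neurons.

```python
def hadamard(u, v):
    """Element-wise product u ⊙ v of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [u_i * v_i for u_i, v_i in zip(u, v)]


def dot(u, v):
    """Ordinary dot product, shown only for contrast: it returns a scalar."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


grad_a_C = [0.5, -1.0, 2.0]    # hypothetical values of dC/da_i at the output layer
g_prime_z = [0.25, 0.5, 0.0]   # hypothetical values of g'(z_i) at the output layer

delta_L = hadamard(grad_a_C, g_prime_z)  # one error term per output neuron
print(delta_L)                   # [0.125, -0.5, 0.0]
print(dot(grad_a_C, g_prime_z))  # -0.375
```

The element-wise product keeps one value per neuron, which is what the error vector needs; the dot product would collapse everything into a single number.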
The Back Propagation Algorithm
Input x: Set the input x as the activation a¹ of layer 1.
Forward: For each layer l = 2,3,…,L we compute zˡ = wˡ aˡ⁻¹+bˡ and aˡ=gˡ(zˡ).
Output error δᴸ: At the output layer we compute the vector δᴸ = ∇aC ⊙ gᴸ′(zᴸ).
This will be the start of the back propagation.
Back propagation: We move backward; for each layer l = L−1, L−2, L−3, …, 2 we compute the error of that layer: δˡ = ((wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ gˡ′(zˡ).
Then we update the weights and biases of each layer using the gradient descent formula: wˡ⁺ᵢᵣ = wˡᵢᵣ − α · δˡᵢ · aᵣˡ⁻¹ and bˡ⁺ᵢ = bˡᵢ − α · δˡᵢ, where α is the learning rate and wˡ⁺ and bˡ⁺ are the updated values of wˡ and bˡ at layer l after each iteration.
Output: At the end we will have the weights w and biases b at each layer that have been computed to minimize the cost function C.
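The steps above can be sketched in plain Python as follows. This is only an illustrative sketch under several assumptions that the text does not fix: every layer uses the sigmoid as its activation gˡ, weights and biases start from uniform random values, training uses a single example, and all function names are hypothetical. Nested lists are used instead of a matrix library so that `weights[k][i][r]` lines up with the index notation wˡᵢᵣ.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def cost(y, y_hat):
    """C = 1/2 * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * sum((y_i - yh_i) ** 2 for y_i, yh_i in zip(y, y_hat))


def init_network(sizes, seed=0):
    """weights[k][i][r] links neuron r of one layer to neuron i of the next."""
    rng = random.Random(seed)
    weights = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [[rng.uniform(-1, 1) for _ in range(n_out)] for n_out in sizes[1:]]
    return weights, biases


def forward(weights, biases, x):
    """Forward pass: z = w a_prev + b and a = g(z); keeps every z and a."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w_ir * a_r for w_ir, a_r in zip(row, a_prev)) + b_i
             for row, b_i in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(z_i) for z_i in z])
    return zs, activations


def backprop_step(weights, biases, x, y, alpha):
    """One gradient descent update, in place, for a single example (x, y)."""
    zs, activations = forward(weights, biases, x)
    # Output error: delta^L = grad_a C ⊙ g'(z^L), where dC/da_i = a_i - y_i.
    delta = [(a_i - y_i) * sigmoid_prime(z_i)
             for a_i, y_i, z_i in zip(activations[-1], y, zs[-1])]
    deltas = [delta]
    # Backward: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ g'(z^l).
    for k in range(len(weights) - 1, 0, -1):
        w_next, delta_next = weights[k], deltas[0]
        delta = [sum(w_next[i][r] * delta_next[i] for i in range(len(w_next)))
                 * sigmoid_prime(zs[k - 1][r])
                 for r in range(len(zs[k - 1]))]
        deltas.insert(0, delta)
    # Update: w_ir <- w_ir - alpha * delta_i * a_r and b_i <- b_i - alpha * delta_i.
    for k, delta in enumerate(deltas):
        for i, d_i in enumerate(delta):
            for r, a_r in enumerate(activations[k]):
                weights[k][i][r] -= alpha * d_i * a_r
            biases[k][i] -= alpha * d_i


# Usage: a 2-3-2 network trained on one made-up example.
weights, biases = init_network([2, 3, 2])
x, y = [0.5, -0.2], [1.0, 0.0]
cost_before = cost(y, forward(weights, biases, x)[1][-1])
for _ in range(200):
    backprop_step(weights, biases, x, y, alpha=0.5)
cost_after = cost(y, forward(weights, biases, x)[1][-1])
print(cost_before, cost_after)  # the cost should shrink as training proceeds
```

All the errors δˡ are computed before any weight is changed, so the backward pass always uses the same weights as the forward pass. A matrix library would replace the nested loops with a few matrix products, at the price of having to track the dimensions of each operand.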
Conclusion
Back propagation can be tricky to understand and trickier still to implement in code, where it is easy to get tangled up in matrices, vectors, and their dimensions.
However, it is worth the effort for beginners to build a solid intuition for this technique, as it will help them acquire in-depth knowledge of neural networks.