7 DIY ML/AI
We assume a background similar to that found here. A neural network is a series of composed functions whereby parent nodes of the network graph are linearly combined via weights, then acted on by an activation function to obtain the child nodes. Take the example below: a network with two inputs, $X_1$ and $X_2$, two hidden layers of two nodes each, $(H_{11}, H_{12})$ and $(H_{21}, H_{22})$, and a single output, $\eta$.
Usually, the nodes are added in so-called layers. For the network above, the first hidden layer is

$$
H_{1j} = g\!\left(W_{0,1,j} + W_{1,1,j} X_1 + W_{2,1,j} X_2\right), \quad j = 1, 2,
$$

where $W_{i,l,j}$ is the weight attached to parent $i$ (with $i = 0$ an intercept) for node $j$ of layer $l$, and $g$ is the activation function; the second layer nodes $H_{2j}$ and the output $\eta$ are defined analogously from the nodes of the layer before them.
A typical activation function is the rectified linear unit (ReLU), defined as $g(x) = \max(0, x)$.
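For concreteness, here is a minimal NumPy sketch of ReLU and its derivative (the names relu and relu_deriv are ours, and the derivative at $x = 0$ is set to 0 by convention); it is illustrative only and separate from the worked example below.

import numpy as np

## ReLU and its derivative; the value at x = 0 is a convention
relu = lambda x: np.maximum(0.0, x)
relu_deriv = lambda x: (x > 0).astype(float)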
Let's consider fitting the network above using gradient descent and obtaining the derivative via the chain rule. Consider the contribution of a single row of data to a squared error (SE) loss function,

$$
L = (Y - \eta)^2,
$$

where $Y$ is the observed outcome and $\eta$ is the output of the network. The chain rule decomposes the derivative with respect to a first layer weight, say $W_{1,1,1}$, into the factors $\partial L / \partial \eta$, $\partial \eta / \partial H_{2j}$, $\partial H_{2j} / \partial H_{11}$, and $\partial H_{11} / \partial W_{1,1,1}$.
These get multiplied together, using matrix multiplication when required, to form the derivative for $W_{1,1,1}$:

$$
\frac{\partial L}{\partial W_{1,1,1}}
= \frac{\partial L}{\partial \eta}
\left[\sum_{j=1}^{2} \frac{\partial \eta}{\partial H_{2j}} \frac{\partial H_{2j}}{\partial H_{11}}\right]
\frac{\partial H_{11}}{\partial W_{1,1,1}}.
$$
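Spelled out for this network (these expressions simply restate the chain rule factors that the code below computes):

$$
\begin{aligned}
\frac{\partial L}{\partial \eta} &= -2 (Y - \eta), \\
\frac{\partial \eta}{\partial H_{2j}} &= g'\!\left(W_{0,3,1} + W_{1,3,1} H_{21} + W_{2,3,1} H_{22}\right) W_{j,3,1}, \\
\frac{\partial H_{2j}}{\partial H_{11}} &= g'\!\left(W_{0,2,j} + W_{1,2,j} H_{11} + W_{2,2,j} H_{12}\right) W_{1,2,j}, \\
\frac{\partial H_{11}}{\partial W_{1,1,1}} &= g'\!\left(W_{0,1,1} + W_{1,1,1} X_1 + W_{2,1,1} X_2\right) X_1.
\end{aligned}
$$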
Let's try coding it for this parameter. We're going to create the model by just hard coding the network. For the activation we use the sigmoid, $g(x) = e^x / (1 + e^x)$, which is differentiable everywhere and satisfies $g'(x) = g(x)\{1 - g(x)\}$.
import numpy as np

## Define our activation function and its derivative
g = lambda x: np.exp(x) / (1 + np.exp(x))
g_deriv = lambda x: g(x) * (1 - g(x))
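As a quick check that g_deriv matches the slope of g (a small aside, not part of the original walkthrough), we can compare it against a finite difference at an arbitrary point:

## Analytic versus numerical derivative of the activation
x0 = 0.5
print(g_deriv(x0), (g(x0 + 1e-6) - g(x0)) / 1e-6)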
## Here's one row of data
Y, X1, X2 = 100, 2, 3

## Creating some random initialized weights
## Adding 1 to the dims so that the notation agrees
W = np.random.normal(size = [3, 4, 3])

## Feed forward through the network
H11 = g(W[0,1,1] + W[1,1,1] * X1 + W[2,1,1] * X2)
H12 = g(W[0,1,2] + W[1,1,2] * X1 + W[2,1,2] * X2)
H21 = g(W[0,2,1] + W[1,2,1] * H11 + W[2,2,1] * H12)
H22 = g(W[0,2,2] + W[1,2,2] * H11 + W[2,2,2] * H12)
ETA = g(W[0,3,1] + W[1,3,1] * H21 + W[2,3,1] * H22)
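Since the combinations entering $g$ are linear, the same feed forward can also be written with matrix multiplication. Here's a minimal sketch using slices of the same W; the names H1, H2, and ETA2 are ours, and ETA2 should reproduce ETA up to floating point:

## The same feed forward via matrix multiplication
## Row 0 of W holds the intercepts, rows 1:3 the slopes
H1 = g(W[0, 1, 1:3] + np.array([X1, X2]) @ W[1:3, 1, 1:3])
H2 = g(W[0, 2, 1:3] + H1 @ W[1:3, 2, 1:3])
ETA2 = g(W[0, 3, 1] + H2 @ W[1:3, 3, 1])
print(ETA2 - ETA)  ## should be 0 up to floating point error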
## The loss for this row of data
L = (Y - ETA) ** 2

## Our chain rule sequence of derivatives,
## calculated via backprop
dL_dETA = -2 * (Y - ETA)
dETA_dH2 = g_deriv(W[0,3,1] + W[1,3,1] * H21 + W[2,3,1] * H22) * np.array((W[1,3,1], W[2,3,1]))
dH2_dH11 = np.array((
    g_deriv(W[0,2,1] + W[1,2,1] * H11 + W[2,2,1] * H12) * W[1,2,1],
    g_deriv(W[0,2,2] + W[1,2,2] * H11 + W[2,2,2] * H12) * W[1,2,2]
))
dH11_dW111 = g_deriv(W[0,1,1] + W[1,1,1] * X1 + W[2,1,1] * X2) * X1

## Here's the backprop derivative calculation
dL_dW111 = dL_dETA * np.sum(dETA_dH2 * dH2_dH11) * dH11_dW111
print(dL_dW111)
## Let's approximate the derivative numerically
e = 1e-6

## Perturb W111 a little bit
W[1,1,1] -= e

## Feed forward through the network with the perturbed W111
H11 = g(W[0,1,1] + W[1,1,1] * X1 + W[2,1,1] * X2)
H12 = g(W[0,1,2] + W[1,1,2] * X1 + W[2,1,2] * X2)
H21 = g(W[0,2,1] + W[1,2,1] * H11 + W[2,2,1] * H12)
H22 = g(W[0,2,2] + W[1,2,2] * H11 + W[2,2,2] * H12)
ETA = g(W[0,3,1] + W[1,3,1] * H21 + W[2,3,1] * H22)

## Calculate the new loss
Le = (Y - ETA) ** 2

## Here's the approximate derivative
print((L - Le) / e)
0.0032232195647262773
0.0032232492230832577

The two printed values, the backpropagation derivative and its numerical approximation, agree closely, which is a useful check on the chain rule calculation.
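To actually fit the network, as framed at the start of this discussion, we would compute the analogous derivative for every weight and repeatedly take gradient descent steps. Here's a minimal sketch of a single update for this one weight; alpha is an assumed learning rate, not a value from the original:

## Undo the perturbation from the numerical check
W[1,1,1] += e
## One gradient descent step for W111
alpha = 0.01
W[1,1,1] = W[1,1,1] - alpha * dL_dW111
## A full fit would loop: feed forward, backprop every weight,
## update all of the weights, and repeat until convergence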
Now let’s calculate the derivative