7  DIY ML/AI

We assume a background similar to that found here. A neural network is a series of composed functions: the parent nodes of the network's graph are linearly combined via weights, then acted on by an activation function to obtain the child nodes. Take the example below.

Usually, the nodes are added in so-called layers: $(X_1, X_2)$ is the input layer, $(H_{11}, H_{12})$ is the first hidden layer, $(H_{21}, H_{22})$ is the second hidden layer, and $Y$ is the output layer. Imagine plugging an $X_1$ and $X_2$ into this network. It would feed forward through the network as

$$
\begin{aligned}
H_{11} &= g_1(W_{011} + W_{111} X_1 + W_{211} X_2) \\
H_{12} &= g_1(W_{012} + W_{112} X_1 + W_{212} X_2) \\
H_{21} &= g_2(W_{021} + W_{121} H_{11} + W_{221} H_{12}) \\
H_{22} &= g_2(W_{022} + W_{122} H_{11} + W_{222} H_{12}) \\
\hat\eta &= g_3(W_{031} + W_{131} H_{21} + W_{231} H_{22})
\end{aligned}
$$

where the $g_k$ are specified activation functions and $\hat\eta$ is our estimate of $Y$. Typically, we would use a different activation function for the output layer than for the hidden layers, and the hidden layers would all share the same activation function. So, for example, if $Y$ were binary, like a hypertension diagnosis, then $g_1 = g_2$ and $g_3$ would be a sigmoid.
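For reference, the sigmoid mentioned here (and used as the activation in the code below) is

$$
\mathrm{sigmoid}(x) = \frac{e^x}{1 + e^x},
$$

which has the convenient derivative $\mathrm{sigmoid}(x)\,(1 - \mathrm{sigmoid}(x))$.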

A typical activation function is the rectified linear unit (ReLU), defined as $g(x) = x \, I(x > 0)$. The neural network is typically fit via a gradient-based method, such as gradient descent, assuming a loss function. The loss function is usually based on maximum likelihood.
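To make the ReLU and a single gradient descent update concrete, here is a minimal sketch; the names relu, relu_deriv, learning_rate and the numeric values are made up for illustration and are not part of the model fit below.

import numpy as np

## ReLU and its derivative, following g(x) = x I(x > 0); the derivative at 0 is taken to be 0
relu = lambda x: x * (x > 0)
relu_deriv = lambda x: 1.0 * (x > 0)

## A single gradient descent update of a weight w using the gradient of the loss:
## w <- w - learning_rate * dL/dw
w, dL_dw, learning_rate = 0.5, 0.2, 0.1   ## made-up values for illustration
w = w - learning_rate * dL_dw

print(relu(np.array([-1.0, 0.0, 2.0])), relu_deriv(np.array([-1.0, 0.0, 2.0])), w)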

Let’s consider fitting the network above using gradient descent, obtaining the derivatives via the chain rule. Consider the contribution of a single row of data to a squared error (SE) loss function, $L_i(w) = (y_i - \hat\eta_i)^2$, where $\hat\eta_i$ is the feed forward of our neural network for row $i$. Let’s look at the derivative with respect to $W_{111}$, where we drop the subscript $i$. First note that only these arrows involve $W_{111}$.

$$
\frac{\partial L}{\partial W_{111}} =
\frac{\partial L}{\partial \hat\eta}
\frac{\partial \hat\eta}{\partial H_2}
\frac{\partial H_2}{\partial H_{11}}
\frac{\partial H_{11}}{\partial W_{111}}
$$

where $H_2 = (H_{21}, H_{22})^t$. These parts are:

$$
\begin{aligned}
\frac{\partial L}{\partial \hat\eta} &= -2 (Y - \hat\eta) \\
\frac{\partial \hat\eta}{\partial H_2} &= g_3'(W_{031} + W_{131} H_{21} + W_{231} H_{22}) \, (W_{131}, W_{231}) \\
\frac{\partial H_2}{\partial H_{11}} &= \left[ g_2'(W_{021} + W_{121} H_{11} + W_{221} H_{12}) W_{121} ,\;
g_2'(W_{022} + W_{122} H_{11} + W_{222} H_{12}) W_{122} \right]^t \\
\frac{\partial H_{11}}{\partial W_{111}} &= g_1'(W_{011} + W_{111} X_1 + W_{211} X_2) X_1
\end{aligned}
$$

These get multiplied together, using matrix multiplication when required, to form the derivative with respect to $W_{111}$. This is repeated for all of the weight parameters. Notice this requires keeping track of which nodes have $W_{111}$ in their parent chain, and that the calculation travels backwards through the network. For this reason, it is called backpropagation.
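Concretely, the second and third terms combine as an inner product (a $1 \times 2$ row vector times a $2 \times 1$ column vector), so the full derivative is

$$
\frac{\partial L}{\partial W_{111}} = -2 (Y - \hat\eta) \left( \frac{\partial \hat\eta}{\partial H_2} \cdot \frac{\partial H_2}{\partial H_{11}} \right) \frac{\partial H_{11}}{\partial W_{111}},
$$

which shows up as the np.sum(dETA_dH2 * dH2_dH11) factor in the code below.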

Let’s try coding it for this parameter. We’re going to create the model by simply hard coding the network.

import numpy as np

## Define our activation function (here a sigmoid) and its derivative
g = lambda x : np.exp(x) / (1 + np.exp(x))
g_deriv = lambda x: g(x) * (1 - g(x))

## Here's one row of data
Y, X1, X2 = 100, 2, 3

## Creating some randomly initialized weights
## Padding the dimensions so that the indexing agrees with the notation above
W = np.random.normal( size = [3, 4, 3] )

H11 = g(W[0,1,1] + W[1,1,1] * X1  + W[2,1,1] * X2)
H12 = g(W[0,1,2] + W[1,1,2] * X1  + W[2,1,2] * X2) 
H21 = g(W[0,2,1] + W[1,2,1] * H11 + W[2,2,1] * H12) 
H22 = g(W[0,2,2] + W[1,2,2] * H11 + W[2,2,2] * H12) 
ETA = g(W[0,3,1] + W[1,3,1] * H21 + W[2,3,1] * H22)

## The squared error loss for this row of data
L = (Y - ETA) ** 2

## Backprop: calculate each term in the chain rule sequence of derivatives
dL_dETA  = -2 * (Y - ETA)

dETA_dH2 = g_deriv(W[0,3,1] + W[1,3,1] * H21 + W[2,3,1] * H22) * np.array((W[1,3,1],  W[2,3,1]))

dH2_dH11 = np.array( 
        ( g_deriv(W[0,2,1] + W[1,2,1] * H11 + W[2,2,1] * H12 ) * W[1,2,1], 
          g_deriv(W[0,2,2] + W[1,2,2] * H11 + W[2,2,2] * H12 ) * W[1,2,2] 
        ) 
)

dH11_dW111 = g_deriv(W[0,1,1] + W[1,1,1] * X1 + W[2,1,1] * X2) * X1

## Here's the backpropagated derivative calculation
dL_dW111 = dL_dETA * np.sum(dETA_dH2 * dH2_dH11) * dH11_dW111

print(dL_dW111)

## Let's approximate the derivative numerically
e = 1e-6

## Perturb W111 down by a small amount e
W[1,1,1] -= e

## Feed forward through the network with the perturbed W111
H11 = g(W[0,1,1] + W[1,1,1] * X1  + W[2,1,1] * X2)
H12 = g(W[0,1,2] + W[1,1,2] * X1  + W[2,1,2] * X2) 
H21 = g(W[0,2,1] + W[1,2,1] * H11 + W[2,2,1] * H12) 
H22 = g(W[0,2,2] + W[1,2,2] * H11 + W[2,2,2] * H12) 
ETA = g(W[0,3,1] + W[1,3,1] * H21 + W[2,3,1] * H22)

## Calculate the new loss
Le = (Y - ETA) ** 2

## Here's the approximate derivative
print( (L - Le) / e )
0.0032232195647262773
0.0032232492230832577
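The analytic and numerically approximated derivatives agree closely. The same check can be wrapped in a small helper that approximates the derivative for any weight by a central difference; the numeric_deriv function below is a sketch of my own (not from the original code), reusing the g, Y, X1, X2 and W defined above.

## A hypothetical helper: central difference approximation of dL/dW[i, j, k]
## for the hard coded network above (assumes g is defined as above)
def numeric_deriv(W, Y, X1, X2, i, j, k, e = 1e-6):
    def loss(W):
        H11 = g(W[0,1,1] + W[1,1,1] * X1  + W[2,1,1] * X2)
        H12 = g(W[0,1,2] + W[1,1,2] * X1  + W[2,1,2] * X2)
        H21 = g(W[0,2,1] + W[1,2,1] * H11 + W[2,2,1] * H12)
        H22 = g(W[0,2,2] + W[1,2,2] * H11 + W[2,2,2] * H12)
        ETA = g(W[0,3,1] + W[1,3,1] * H21 + W[2,3,1] * H22)
        return (Y - ETA) ** 2
    Wp, Wm = W.copy(), W.copy()
    Wp[i,j,k] += e
    Wm[i,j,k] -= e
    return (loss(Wp) - loss(Wm)) / (2 * e)

## For example, checking W111 again (W111 was nudged down by e above, so the value differs slightly)
print(numeric_deriv(W, Y, X1, X2, 1, 1, 1))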

Now let’s calculate the derivative