# Neural networks
## Basics
Let’s start by relating neural networks to regression. Consider a simple case where we have two nodes, an intercept (a node that is always 1) and a regressor $X$, pointing to an outcome $Y$, as drawn below.
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import sklearn as skl
plt.figure(figsize=[2, 2])
#G = nx.Graph()
G = nx.DiGraph()
G.add_node("1", pos = (0, 1) )
G.add_node("X", pos = (1, 1) )
G.add_node("Y", pos = (.5, 0))
G.add_edge("1", "Y")
G.add_edge("X", "Y")
nx.draw(G,
nx.get_node_attributes(G, 'pos'),
with_labels=True,
font_weight='bold',
node_size = 2000,
node_color = "lightblue",
linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])
plt.show()

To interpret this diagram as a neural network, consider the following rule:
:::{note}
Parent nodes that point to a child node are multiplied by weights, then added together, then operated on by an activation function to form the child node.
:::
If the parent nodes point to the outcome, then the nodes are combined and then operated on by a known function, called the activation function, to form a prediction. So, in this case, the diagram is saying that the intercept node (labeled 1) and the regressor node (labeled X) are multiplied by weights, $w_0$ and $w_1$, added together, and then operated on by an activation function, $g$, to form the prediction

$$
\hat Y = g(w_0 + w_1 X).
$$
:::{note}
We have not yet specified:

* The loss function, i.e. how to measure the difference between $\hat Y$ and $Y$.
* The way the loss function combines subjects; we have multiple BMIs and SBPs.
* How we obtain the weights, $w_0$ and $w_1$; this is done by minimizing the loss function using an algorithm.
:::
So, imagine the case where $g$ is the identity function and the loss is the squared error, $\sum_i (Y_i - \hat Y_i)^2$. Then minimizing the loss is exactly least squares, and so, presuming our optimization algorithm works well, it should be identical to linear regression.
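To make this concrete, here is a minimal sketch, under the assumption of some simulated data and a hand-picked learning rate, of fitting this single-node network by gradient descent and comparing the result to `sklearn`'s `LinearRegression`:

```python
## A sketch: a single node with identity activation and squared error loss,
## fit by gradient descent. The simulated data, learning rate, and number of
## iterations are illustrative choices, not part of the original example.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1234)
x = rng.uniform(size=100)
y = 1 + 2 * x + rng.normal(scale=0.1, size=100)

w0, w1 = 0.0, 0.0                      # weights: intercept and slope
lr = 0.5                               # learning rate
for _ in range(2000):
    resid = (w0 + w1 * x) - y          # identity activation: yhat = w0 + w1 * x
    w0 -= lr * 2 * resid.mean()        # gradient of the mean squared error in w0
    w1 -= lr * 2 * (resid * x).mean()  # ... and in w1

fit = LinearRegression().fit(x.reshape(-1, 1), y)
print([w0, w1])
print([fit.intercept_, fit.coef_[0]])  # should be nearly identical
```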
Consider a different setting. Imagine if our activation function is the sigmoid, $g(z) = 1 / (1 + e^{-z})$. Then

$$
\hat Y = \frac{1}{1 + e^{-(w_0 + w_1 X)}},
$$

which is the logistic regression prediction with intercept $w_0$ and slope $w_1$. Further, if we specify that the loss function is the binary cross entropy,

$$
-\sum_i \left[ Y_i \log(\hat Y_i) + (1 - Y_i) \log(1 - \hat Y_i) \right],
$$

then minimizing our loss function is identical to maximizing the likelihood for logistic regression. For example, with weights $w_0 = -4$ and $w_1 = 0.1$ and an input of $X = 30$, the predicted probability is:
1 / (1 + np.exp(-(-4 + .1 * 30)))
0.2689414213699951
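To make the loss concrete, here is a small sketch that computes the binary cross entropy directly in numpy; the outcomes, regressor values, and weights below are made up for illustration:

```python
## A sketch of the binary cross entropy; the outcomes y, regressor values x,
## and weights below are made-up illustrative values.
import numpy as np

w0, w1 = -4, 0.1
x = np.array([20.0, 25.0, 30.0, 35.0])
y = np.array([0, 0, 1, 1])
yhat = 1 / (1 + np.exp(-(w0 + w1 * x)))                       # sigmoid predictions
bce = -np.sum(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))  # the loss
print(bce)
```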
## More layers
Of course, there’d be no point in using NNs for problems that we can just solve with generalized linear models. NNs get better when we add more layers, since then they can discover interactions and non-linearities. Consider the following model. Notice we stop explicitly adding the bias (intercept) term / node; in general, assume the bias term is included unless otherwise specified.
#plt.figure(figsize=[2, 2])
G = nx.DiGraph()
G.add_node("X1", pos = (0, 3) )
G.add_node("X2", pos = (1, 3) )
G.add_node("H11", pos = (0, 2) )
G.add_node("H12", pos = (1, 2) )
G.add_node("H21", pos = (0, 1) )
G.add_node("H22", pos = (1, 1) )
G.add_node("Y", pos = (.5, 0))
G.add_edges_from([ ("X1", "H11"), ("X1", "H12"), ("X2", "H11"), ("X2", "H12")])
G.add_edges_from([("H11", "H21"), ("H11", "H22"), ("H12", "H21"), ("H12", "H22")])
G.add_edges_from([("H21", "Y"), ("H22", "Y")])
nx.draw(G,
nx.get_node_attributes(G, 'pos'),
with_labels=True,
font_weight='bold',
node_size = 2000,
node_color = "lightblue",
linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 3.3])
ax.set_ylim([-.3, 3.3])
plt.show()

Usually, the nodes are added in so-called layers. For the network above, the inputs $(X_1, X_2)$ feed a first hidden layer, which feeds a second hidden layer, which feeds the output:

$$
H_{1j} = g_1\left(\sum_{i} W^{(1)}_{ji} X_i\right), \quad
H_{2j} = g_2\left(\sum_{i} W^{(2)}_{ji} H_{1i}\right), \quad
\hat Y = g_3\left(\sum_{i} W^{(3)}_{i} H_{2i}\right),
$$

where the $g_k$ are activation functions, the $W^{(k)}$ are weights, and the bias terms are left implicit.
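As a sketch of what this composition looks like computationally, here is a forward pass through the network drawn above; the weight values and the choice of ReLU and sigmoid activations are arbitrary assumptions for illustration:

```python
## A sketch of the forward pass for the network above (X1, X2 -> H11, H12 ->
## H21, H22 -> Y). The weights and the ReLU / sigmoid activations are
## arbitrary illustrative choices; biases are omitted for brevity.
import numpy as np

relu = lambda v: np.maximum(0, v)
sigmoid = lambda v: 1 / (1 + np.exp(-v))

x = np.array([1.0, 2.0])                  # inputs X1, X2
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])  # input layer -> first hidden layer
W2 = np.array([[1.0, 0.4], [-0.6, 0.9]])  # first -> second hidden layer
W3 = np.array([0.7, -1.2])                # second hidden layer -> output

h1 = relu(W1 @ x)        # H11, H12
h2 = relu(W2 @ h1)       # H21, H22
yhat = sigmoid(W3 @ h2)  # the prediction
print(yhat)
```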
## Activation functions
The output activation function tends to be based on the structure of the outcome. For example, a binary outcome would likely have a sigmoidal, or other function mapping from the real line to the unit interval, so that the prediction can be interpreted as a probability. The hidden-layer activation functions are usually chosen for flexibility and computational convenience; a very common choice is the rectified linear unit (ReLU),

$$
g(x) = \max(0, x) = x \, I(x > 0).
$$
Plotted, this is:
plt.plot( [-1, 0, 1], [0, 0, 1], linewidth = 4);

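For reference, here is a quick sketch evaluating the ReLU and sigmoid activations over a grid of points; the grid endpoints are an arbitrary choice:

```python
## Sketch: the ReLU and sigmoid activations evaluated on an arbitrary grid.
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-3, 3, 200)
plt.plot(t, np.maximum(0, t), label="ReLU")
plt.plot(t, 1 / (1 + np.exp(-t)), label="sigmoid")
plt.legend();
```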
If a bias term is included, then the fact that the ReLU is centered at zero isn’t important, since the intercept term effectively shifts the function around. These kinds of spline terms are incredibly flexible. Just to show you an example, let’s fit the sine function using a collection of shifted ReLUs. This is just

$$
Y = \sin(X) + \epsilon
$$

being fit with

$$
\hat Y = w_0 + \sum_{i=1}^{D} w_i (X - k_i) I(X > k_i),
$$

where the $k_i$ are fixed knot points and the weights are estimated by least squares.
## Generate some data, a sine function on 0,4*pi
n = 1000
x = np.linspace(0, 4 * np.pi, n)
y = np.sin(x) + .2 * np.random.normal(size = n)
## Generate the spline regressors
df = 30
knots = np.linspace(x.min(), x.max(), df)
xmat = np.zeros((n, df))
for i in range(0, df): xmat[:,i] = (x - knots[i]) * (x > knots[i])
## Fit them
from sklearn.linear_model import LinearRegression
yhat = LinearRegression().fit(xmat, y).predict(xmat)
## Plot them versus the data
plt.plot(x, y);
plt.plot(x, yhat);

This corresponds to a network like the one depicted below if there were only three knots, i.e. three hidden nodes, each applying a ReLU shifted by its knot point.
G = nx.DiGraph()
G.add_node("X", pos = (1, 2) )
G.add_node("H11", pos = (0, 1) )
G.add_node("H12", pos = (1, 1) )
G.add_node("H13", pos = (2, 1) )
G.add_node("Y", pos = (1, 0))
G.add_edges_from([("X", "H11"), ("X", "H12"), ("X", "H13")])
G.add_edges_from([("H11", "Y"), ("H12", "Y"), ("H13", "Y")])
nx.draw(G,
nx.get_node_attributes(G, 'pos'),
with_labels=True,
font_weight='bold',
node_size = 2000,
node_color = "lightblue",
linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 3.3])
ax.set_ylim([-.3, 3.3])
plt.show()

We can actually fit this function way better using splines and a little bit more care. However, this helps show how even one layer of ReLU-activated nodes can start to fit complex shapes.
## Optimization
One of the last bits of the puzzle we have to figure out is how to obtain the weights. A good strategy would be to minimize the loss function. However, it’s hard to minimize directly. If we had a derivative, we could try the following. Let $L(w)$ be the loss as a function of the vector of weights $w$, and let $w^{(k)}$ be the value of the weights at iteration $k$ of an algorithm. Then update

$$
w^{(k+1)} = w^{(k)} - \epsilon \nabla_w L(w^{(k)}).
$$

What does this do? It moves the parameters by a small amount, governed by $\epsilon$ (the learning rate), in the direction of the negative gradient, which is the direction in which the loss decreases fastest locally.
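A minimal sketch of this update rule on a toy one-parameter problem; the loss $L(w) = (w - 3)^2$, the starting value, and the learning rate are assumptions chosen only to show the mechanics:

```python
## Gradient descent on the toy loss L(w) = (w - 3)^2, whose minimizer is w = 3.
## The starting value and learning rate are arbitrary illustrative choices.
w = 0.0
eps = 0.1                # the learning rate (epsilon)
for k in range(100):
    grad = 2 * (w - 3)   # dL/dw
    w = w - eps * grad   # the gradient descent update
print(w)                 # should be very close to 3
```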
How do we get the gradient? Consider the following. If we write the loss out, it is a function applied to a function applied to a function, and so on:

$$
L\Big(Y,\; g_3\big(W^{(3)} g_2\big(W^{(2)} g_1\big(W^{(1)} X\big)\big)\big)\Big).
$$

Or, a series of function compositions. Recall from calculus, if we want the derivative of composed functions we have a really simple rule called the chain rule:

$$
\frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x).
$$

I.e. if $h = f \circ g$, then the derivative of $h$ is the derivative of $f$ evaluated at $g(x)$, times the derivative of $g$, and this extends to any number of nested compositions.
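As a quick numerical sanity check of the chain rule (the particular composition $h(x) = \sin(x^2)$ and the evaluation point are arbitrary choices):

```python
## Numerically checking the chain rule for h(x) = sin(x^2), whose derivative
## is cos(x^2) * 2x. The composition and the point x = 1.5 are arbitrary choices.
import numpy as np

x = 1.5
analytic = np.cos(x ** 2) * 2 * x                              # chain rule
delta = 1e-6
numeric = (np.sin((x + delta) ** 2) - np.sin(x ** 2)) / delta  # finite difference
print(analytic, numeric)                                       # nearly equal
```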
If we use the chain rule on our composed loss functions, we wind up bookkeeping backwards through our neural network. That is why it’s called backwards propagation (backprop).
So, our algorithm goes something like this. Given a loss function, data, and a learning rate $\epsilon$, set the weights to some starting values and then:

0. Calculate the predictions $\hat Y$ and the loss $L$.
1. Use back propagation to get a numerical approximation to the gradient, $\nabla_w L$.
2. Update the weights, $w \leftarrow w - \epsilon \nabla_w L$.
3. Go to step 0 and repeat until the loss stops improving.
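Putting the pieces together, here is a minimal sketch of this loop for the single sigmoid node with binary cross entropy loss from earlier, compared against `sklearn`'s (essentially unpenalized) `LogisticRegression`. The simulated data, learning rate, and iteration count are illustrative assumptions, and for this one-node network the gradient has a simple closed form, which stands in for full backpropagation:

```python
## A sketch of the full loop for a single sigmoid node with binary cross
## entropy loss, i.e. logistic regression. Simulated data, learning rate,
## and iteration count are illustrative assumptions; for this tiny network
## the gradient formula below plays the role of backpropagation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2024)
x = rng.uniform(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 3 * x))))  # binary outcomes

w0, w1 = 0.0, 0.0
eps = 0.5                                    # learning rate
for k in range(5000):
    yhat = 1 / (1 + np.exp(-(w0 + w1 * x)))  # forward pass
    grad0 = np.mean(yhat - y)                # d(mean BCE)/d w0
    grad1 = np.mean((yhat - y) * x)          # d(mean BCE)/d w1
    w0 -= eps * grad0                        # gradient descent update
    w1 -= eps * grad1

fit = LogisticRegression(C=1e6).fit(x.reshape(-1, 1), y)  # ~ unpenalized MLE
print([w0, w1])
print([fit.intercept_[0], fit.coef_[0][0]])               # should be close
```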