# Neural networks
## Basics
Let’s start by relating neural networks to regression. Consider a simple case where we have two nodes, an intercept (a node that is always 1) and a regressor $X$, pointing to an outcome $Y$, as drawn below.
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import sklearn as skl
plt.figure(figsize=[2, 2])
#G = nx.Graph()
G = nx.DiGraph()
G.add_node("1", pos = (0, 1) )
G.add_node("X", pos = (1, 1) )
G.add_node("Y", pos = (.5, 0))
G.add_edge("1", "Y")
G.add_edge("X", "Y")
nx.draw(G,
nx.get_node_attributes(G, 'pos'),
with_labels=True,
font_weight='bold',
node_size = 2000,
node_color = "lightblue",
linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])
plt.show()

To interpret this diagram as a neural network, consider the following rule:
:::{note}
Parent nodes that point to a child node are multiplied by weights, then added together, then operated on by an activation function to form the child node.
:::
If the parent nodes point to the outcome, then the nodes are combined and then operated on by a known function, called the activation function, to form a prediction. So, in this case, the diagram is saying that the intercept node (labeled 1) and the regressor node (labeled X) are multiplied by weights, $w_0$ and $w_1$, added together, and then operated on by an activation function, $g$, to form the prediction

$$
\hat Y = g(w_0 + w_1 X).
$$
:::{note}
We have not yet specified:

* The loss function, i.e. how to measure the difference between $\hat Y$ and $Y$.
* The way the loss function combines subjects; we have multiple BMIs and SBPs.
* How we obtain the weights, $w_0$ and $w_1$; this is done by minimizing the loss function using an algorithm.
:::
So, imagine the case where $g$ is the identity function and the loss is the squared error, $\sum_i (Y_i - \hat Y_i)^2$. Then minimizing the loss is exactly least squares, and so, presuming our optimization algorithm works well, it should be identical to linear regression.
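To make this concrete, here is a minimal sketch, under the assumption of some simulated data and a hand-picked learning rate, of fitting this single-node network by gradient descent and comparing the result to `sklearn`'s `LinearRegression`:

```python
## A sketch: a single node with identity activation and squared error loss,
## fit by gradient descent. The simulated data, learning rate, and number of
## iterations are illustrative choices, not part of the original example.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1234)
x = rng.uniform(size=100)
y = 1 + 2 * x + rng.normal(scale=0.1, size=100)

w0, w1 = 0.0, 0.0                      # weights: intercept and slope
lr = 0.5                               # learning rate
for _ in range(2000):
    resid = (w0 + w1 * x) - y          # identity activation: yhat = w0 + w1 * x
    w0 -= lr * 2 * resid.mean()        # gradient of the mean squared error in w0
    w1 -= lr * 2 * (resid * x).mean()  # ... and in w1

fit = LinearRegression().fit(x.reshape(-1, 1), y)
print([w0, w1])
print([fit.intercept_, fit.coef_[0]])  # should be nearly identical
```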
Consider a different setting. Imagine if our activation function is the sigmoid, $g(z) = 1 / (1 + e^{-z})$. Then

$$
\hat Y = \frac{1}{1 + e^{-(w_0 + w_1 X)}},
$$

which is the logistic regression prediction with intercept $w_0$ and slope $w_1$. Further, if we specify that the loss function is the binary cross entropy,

$$
-\sum_i \left[ Y_i \log(\hat Y_i) + (1 - Y_i) \log(1 - \hat Y_i) \right],
$$

then minimizing our loss function is identical to maximizing the likelihood for logistic regression. For example, with weights $w_0 = -4$ and $w_1 = 0.1$ and an input of $X = 30$, the predicted probability is:
1 / (1 + np.exp(-(-4 + .1 * 30)))
0.2689414213699951
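To make the loss concrete, here is a small sketch that computes the binary cross entropy directly in numpy; the outcomes, regressor values, and weights below are made up for illustration:

```python
## A sketch of the binary cross entropy; the outcomes y, regressor values x,
## and weights below are made-up illustrative values.
import numpy as np

w0, w1 = -4, 0.1
x = np.array([20.0, 25.0, 30.0, 35.0])
y = np.array([0, 0, 1, 1])
yhat = 1 / (1 + np.exp(-(w0 + w1 * x)))                       # sigmoid predictions
bce = -np.sum(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))  # the loss
print(bce)
```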
## More layers
Of course, there’d be no point in using NNs for problems that we can just solve with generalized linear models. NNs get better when we add more layers, since then they can discover interactions and non-linearities. Consider the following model. Notice we stop explicitly adding the bias (intercept) term / node; in general, assume the bias term is included unless otherwise specified.
#plt.figure(figsize=[2, 2])
G = nx.DiGraph()
G.add_node("X1", pos = (0, 3) )
G.add_node("X2", pos = (1, 3) )
G.add_node("H11", pos = (0, 2) )
G.add_node("H12", pos = (1, 2) )
G.add_node("H21", pos = (0, 1) )
G.add_node("H22", pos = (1, 1) )
G.add_node("Y", pos = (.5, 0))
G.add_edges_from([ ("X1", "H11"), ("X1", "H12"), ("X2", "H11"), ("X2", "H12")])
G.add_edges_from([("H11", "H21"), ("H11", "H22"), ("H12", "H21"), ("H12", "H22")])
G.add_edges_from([("H21", "Y"), ("H22", "Y")])
nx.draw(G,
nx.get_node_attributes(G, 'pos'),
with_labels=True,
font_weight='bold',
node_size = 2000,
node_color = "lightblue",
linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 3.3])
ax.set_ylim([-.3, 3.3])
plt.show()

Usually, the nodes are added in so-called layers. For the network above, the inputs $(X_1, X_2)$ feed a first hidden layer, which feeds a second hidden layer, which feeds the output:

$$
H_{1j} = g_1\left(\sum_{i} W^{(1)}_{ji} X_i\right), \quad
H_{2j} = g_2\left(\sum_{i} W^{(2)}_{ji} H_{1i}\right), \quad
\hat Y = g_3\left(\sum_{i} W^{(3)}_{i} H_{2i}\right),
$$

where the $g_k$ are activation functions, the $W^{(k)}$ are weights, and the bias terms are left implicit.
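As a sketch of what this composition looks like computationally, here is a forward pass through the network drawn above; the weight values and the choice of ReLU and sigmoid activations are arbitrary assumptions for illustration:

```python
## A sketch of the forward pass for the network above (X1, X2 -> H11, H12 ->
## H21, H22 -> Y). The weights and the ReLU / sigmoid activations are
## arbitrary illustrative choices; biases are omitted for brevity.
import numpy as np

relu = lambda v: np.maximum(0, v)
sigmoid = lambda v: 1 / (1 + np.exp(-v))

x = np.array([1.0, 2.0])                  # inputs X1, X2
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])  # input layer -> first hidden layer
W2 = np.array([[1.0, 0.4], [-0.6, 0.9]])  # first -> second hidden layer
W3 = np.array([0.7, -1.2])                # second hidden layer -> output

h1 = relu(W1 @ x)        # H11, H12
h2 = relu(W2 @ h1)       # H21, H22
yhat = sigmoid(W3 @ h2)  # the prediction
print(yhat)
```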
## Activation functions
The output activation function tends to be based on the structure of the outcome. For example, a binary outcome would likely have a sigmoidal, or other function mapping from the real line to the unit interval, so that the prediction can be interpreted as a probability. The hidden-layer activation functions are usually chosen for flexibility and computational convenience; a very common choice is the rectified linear unit (ReLU),

$$
g(x) = \max(0, x) = x \, I(x > 0).
$$
Plotted, this is:
plt.plot( [-1, 0, 1], [0, 0, 1], linewidth = 4);

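For reference, here is a quick sketch evaluating the ReLU and sigmoid activations over a grid of points; the grid endpoints are an arbitrary choice:

```python
## Sketch: the ReLU and sigmoid activations evaluated on an arbitrary grid.
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-3, 3, 200)
plt.plot(t, np.maximum(0, t), label="ReLU")
plt.plot(t, 1 / (1 + np.exp(-t)), label="sigmoid")
plt.legend();
```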
If a bias term is included, then the fact that the ReLU is centered at zero isn’t important, since the intercept term effectively shifts the function around. These kinds of spline terms are incredibly flexible. Just to show you an example, let’s fit the sine function using a collection of shifted ReLUs. This is just

$$
Y = \sin(X) + \epsilon
$$

being fit with

$$
\hat Y = w_0 + \sum_{i=1}^{D} w_i (X - k_i) I(X > k_i),
$$

where the $k_i$ are fixed knot points and the weights are estimated by least squares.
## Generate some data, a sine function on 0,4*pi
n = 1000
x = np.linspace(0, 4 * np.pi, n)
y = np.sin(x) + .2 * np.random.normal(size = n)
## Generate the spline regressors
df = 30
knots = np.linspace(x.min(), x.max(), df)
xmat = np.zeros((n, df))
for i in range(0, df): xmat[:,i] = (x - knots[i]) * (x > knots[i])
## Fit them
from sklearn.linear_model import LinearRegression
yhat = LinearRegression().fit(xmat, y).predict(xmat)
## Plot them versus the data
plt.plot(x, y);
plt.plot(x, yhat);

This corresponds to a network like the one depicted below if there were only three knots, i.e. three hidden nodes, each applying a ReLU shifted by its knot point.
G = nx.DiGraph()
G.add_node("X", pos = (1, 2) )
G.add_node("H11", pos = (0, 1) )
G.add_node("H12", pos = (1, 1) )
G.add_node("H13", pos = (2, 1) )
G.add_node("Y", pos = (1, 0))
G.add_edges_from([("X", "H11"), ("X", "H12"), ("X", "H13")])
G.add_edges_from([("H11", "Y"), ("H12", "Y"), ("H13", "Y")])
nx.draw(G,
nx.get_node_attributes(G, 'pos'),
with_labels=True,
font_weight='bold',
node_size = 2000,
node_color = "lightblue",
linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 3.3])
ax.set_ylim([-.3, 3.3])
plt.show()

We can actually fit this function way better using splines and a little bit more care. However, this helps show how even one layer of ReLU-activated nodes can start to fit complex shapes.
## Optimization
One of the last bits of the puzzle we have to figure out is how to obtain the weights. A good strategy would be to minimize the loss function. However, it’s hard to minimize directly. If we had a derivative, we could try the following. Let $L(w)$ be the loss as a function of the vector of weights $w$, and let $w^{(k)}$ be the value of the weights at iteration $k$ of an algorithm. Then update

$$
w^{(k+1)} = w^{(k)} - \epsilon \nabla_w L(w^{(k)}).
$$

What does this do? It moves the parameters by a small amount, governed by $\epsilon$ (the learning rate), in the direction of the negative gradient, which is the direction in which the loss decreases fastest locally.
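A minimal sketch of this update rule on a toy one-parameter problem; the loss $L(w) = (w - 3)^2$, the starting value, and the learning rate are assumptions chosen only to show the mechanics:

```python
## Gradient descent on the toy loss L(w) = (w - 3)^2, whose minimizer is w = 3.
## The starting value and learning rate are arbitrary illustrative choices.
w = 0.0
eps = 0.1                # the learning rate (epsilon)
for k in range(100):
    grad = 2 * (w - 3)   # dL/dw
    w = w - eps * grad   # the gradient descent update
print(w)                 # should be very close to 3
```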
How do we get the gradient? Consider the following. If we write the loss out, it is a function applied to a function applied to a function, and so on:

$$
L\Big(Y,\; g_3\big(W^{(3)} g_2\big(W^{(2)} g_1\big(W^{(1)} X\big)\big)\big)\Big).
$$

Or, a series of function compositions. Recall from calculus, if we want the derivative of composed functions we have a really simple rule called the chain rule:

$$
\frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x).
$$

I.e. if $h = f \circ g$, then the derivative of $h$ is the derivative of $f$ evaluated at $g(x)$, times the derivative of $g$, and this extends to any number of nested compositions.
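As a quick numerical sanity check of the chain rule (the particular composition $h(x) = \sin(x^2)$ and the evaluation point are arbitrary choices):

```python
## Numerically checking the chain rule for h(x) = sin(x^2), whose derivative
## is cos(x^2) * 2x. The composition and the point x = 1.5 are arbitrary choices.
import numpy as np

x = 1.5
analytic = np.cos(x ** 2) * 2 * x                              # chain rule
delta = 1e-6
numeric = (np.sin((x + delta) ** 2) - np.sin(x ** 2)) / delta  # finite difference
print(analytic, numeric)                                       # nearly equal
```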
If we use the chain rule on our composed loss functions, we wind up bookkeeping backwards through our neural network. That is why it’s called backwards propagation (backprop).
So, our algorithm goes something like this. Given a loss function, data, and a learning rate $\epsilon$, set the weights to some starting values and then:

0. Calculate the predictions $\hat Y$ and the loss $L$.
1. Use back propagation to get a numerical approximation to the gradient, $\nabla_w L$.
2. Update the weights, $w \leftarrow w - \epsilon \nabla_w L$.
3. Go to step 0 and repeat until the loss stops improving.
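Putting the pieces together, here is a minimal sketch of this loop for the single sigmoid node with binary cross entropy loss from earlier, compared against `sklearn`'s (essentially unpenalized) `LogisticRegression`. The simulated data, learning rate, and iteration count are illustrative assumptions, and for this one-node network the gradient has a simple closed form, which stands in for full backpropagation:

```python
## A sketch of the full loop for a single sigmoid node with binary cross
## entropy loss, i.e. logistic regression. Simulated data, learning rate,
## and iteration count are illustrative assumptions; for this tiny network
## the gradient formula below plays the role of backpropagation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2024)
x = rng.uniform(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 3 * x))))  # binary outcomes

w0, w1 = 0.0, 0.0
eps = 0.5                                    # learning rate
for k in range(5000):
    yhat = 1 / (1 + np.exp(-(w0 + w1 * x)))  # forward pass
    grad0 = np.mean(yhat - y)                # d(mean BCE)/d w0
    grad1 = np.mean((yhat - y) * x)          # d(mean BCE)/d w1
    w0 -= eps * grad0                        # gradient descent update
    w1 -= eps * grad1

fit = LogisticRegression(C=1e6).fit(x.reshape(-1, 1), y)  # ~ unpenalized MLE
print([w0, w1])
print([fit.intercept_[0], fit.coef_[0][0]])               # should be close
```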