Regression through the origin
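In this notebook, we investigate a simple problem where we’d like to use one scaled regressor to predict another. That is, let $Y_1, \ldots, Y_n$ be a collection of variables we’d like to predict and $X_1, \ldots, X_n$ be the predictors. Consider minimizing

$$
l = \sum_i (Y_i - \beta X_i)^2 = \| Y - \beta X \|^2 .
$$

Taking a derivative of $l$ with respect to $\beta$ yields

$$
l' = -2 \sum_i (Y_i - \beta X_i) X_i .
$$

If we set this equal to zero and solve for $\beta$ we obtain the classic solution:

$$
\hat \beta = \frac{\sum_i Y_i X_i}{\sum_i X_i^2} = \frac{\langle Y, X \rangle}{\| X \|^2} .
$$

Note further, if we take a second derivative we get

$$
l'' = 2 \sum_i X_i^2 ,
$$

which is strictly positive unless all of the $X_i$ are zero, so $\hat \beta$ is a minimum.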
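As a quick sanity check on the derivation, here is a minimal sketch that evaluates the closed-form slope, `sum(y * x) / sum(x ** 2)`, on synthetic data. The data, the seed, and the helper names `loss` and `loss_deriv` are assumptions made for illustration; they are not part of the notebook. The sketch checks that the first derivative vanishes at the estimate, that the second derivative is positive, and that a brute-force grid search lands on the same slope.

```python
import numpy as np

# Synthetic data, used only for illustration (not the oasis data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)


def loss(beta):
    """Sum of squared errors for a line through the origin."""
    return np.sum((y - beta * x) ** 2)


def loss_deriv(beta):
    """First derivative of the loss with respect to beta."""
    return -2 * np.sum((y - beta * x) * x)


# Closed-form solution from setting the derivative to zero.
beta_hat = np.sum(y * x) / np.sum(x ** 2)

# The second derivative does not depend on beta.
second_deriv = 2 * np.sum(x ** 2)

# Compare against a brute-force search over a fine grid of candidate slopes.
grid = np.linspace(beta_hat - 1, beta_hat + 1, 2001)
grid_best = grid[np.argmin([loss(b) for b in grid])]

print(beta_hat, loss_deriv(beta_hat), second_deriv, grid_best)
```

The derivative at `beta_hat` should be zero up to floating point error, and the grid minimizer should agree with `beta_hat` to within the grid spacing.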
First let’s load the necessary packages.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Now let’s download and read in the data.
dat = pd.read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/oasis.csv")
dat.head()
| | FLAIR | PD | T1 | T2 | FLAIR_10 | PD_10 | T1_10 | T2_10 | FLAIR_20 | PD_20 | T1_20 | T2_20 | GOLD_Lesions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.143692 | 1.586219 | -0.799859 | 1.634467 | 0.437568 | 0.823800 | -0.002059 | 0.573663 | 0.279832 | 0.548341 | 0.219136 | 0.298662 | 0 |
| 1 | 1.652552 | 1.766672 | -1.250992 | 0.921230 | 0.663037 | 0.880250 | -0.422060 | 0.542597 | 0.422182 | 0.549711 | 0.061573 | 0.280972 | 0 |
| 2 | 1.036099 | 0.262042 | -0.858565 | -0.058211 | -0.044280 | -0.308569 | 0.014766 | -0.256075 | -0.136532 | -0.350905 | 0.020673 | -0.259914 | 0 |
| 3 | 1.037692 | 0.011104 | -1.228796 | -0.470222 | -0.013971 | -0.000498 | -0.395575 | -0.221900 | 0.000807 | -0.003085 | -0.193249 | -0.139284 | 0 |
| 4 | 1.580589 | 1.730152 | -0.860949 | 1.245609 | 0.617957 | 0.866352 | -0.099919 | 0.384261 | 0.391133 | 0.608826 | 0.071648 | 0.340601 | 0 |
It’s almost always a good idea to plot the data before fitting the model.
x = dat.T2
y = dat.PD
plt.plot(x, y, 'o')
[Figure: scatter plot of PD against T2]

Now, let’s center the data so that it seems more reasonable to have the line go through the origin. Notice that the middle of the data, in both x and y, is now at (0, 0).
x = x - np.mean(x)
y = y - np.mean(y)
plt.plot(x, y, 'o')
[Figure: scatter plot of centered PD against centered T2]
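To illustrate why centering makes a line through the origin reasonable, the sketch below uses synthetic data with a nonzero intercept (an assumption for illustration; the variable names `x_raw`, `y_raw`, `xc`, and `yc` are hypothetical and not from the notebook). It checks that the centered variables have mean zero, and that the through-the-origin slope on the centered data matches the slope of an ordinary least squares line with an intercept, here obtained from `np.polyfit`.

```python
import numpy as np

# Synthetic data with a nonzero intercept (illustrative, not the oasis data).
rng = np.random.default_rng(1)
x_raw = rng.normal(loc=2.0, size=300)
y_raw = 1.5 + 0.8 * x_raw + rng.normal(scale=0.3, size=300)

# Center both variables so that the point of means sits at the origin.
xc = x_raw - np.mean(x_raw)
yc = y_raw - np.mean(y_raw)

# Slope of the line through the origin on the centered data.
b_centered = np.sum(yc * xc) / np.sum(xc ** 2)

# Slope and intercept of a least squares line fit to the raw data.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(x_raw, y_raw, deg=1)

print(np.mean(xc), np.mean(yc))  # both zero up to floating point error
print(b_centered, slope)         # the two slopes agree up to rounding
```

On this synthetic example, the fitted intercept also equals `np.mean(y_raw) - b_centered * np.mean(x_raw)`, which is what one would expect if centering and fitting an intercept are two views of the same fit.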

Here’s our slope estimate according to our formula.
# Least squares slope for a line through the origin
b = sum(y * x) / sum(x ** 2)
b
0.7831514763655999
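One way to cross-check a hand-computed slope like this is against a general least squares solver. The sketch below is an assumption-laden illustration on synthetic data (not the oasis data): it passes a design matrix with a single column and no column of ones to `np.linalg.lstsq`, which corresponds to a fit with no intercept, and compares the result to the formula.

```python
import numpy as np

# Synthetic data, used only for illustration.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.3, size=100)

# Slope from the closed-form expression.
b_formula = np.sum(y * x) / np.sum(x ** 2)

# Design matrix with one column and no column of ones, i.e. no intercept.
X = x.reshape(-1, 1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b_lstsq = coef[0]

print(b_formula, b_lstsq)  # the two estimates agree up to rounding
```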
Let’s plot the fitted line to see how it did. It looks good. Next, let’s see if we can fit a line that doesn’t necessarily have to go through the origin.
plt.plot(x, y, 'o')
t = np.array([-1.5, 2.5])
plt.plot(t, t * b)
[Figure: centered data with the fitted line through the origin]
