Regression through the origin
In this notebook, we investigate a simple problem where we’d like to use one scaled regressor to predict another. That is, let $Y_1, \ldots, Y_n$ be a collection of variables we’d like to predict and $X_1, \ldots, X_n$ be predictors. Consider minimizing

$$
l = \sum_{i=1}^n (Y_i - \beta X_i)^2 = \|Y - \beta X\|^2 .
$$

Taking a derivative of $l$ with respect to $\beta$ yields

$$
l' = -2 \sum_{i=1}^n (Y_i - \beta X_i) X_i .
$$

If we set this equal to zero and solve for $\beta$ we obtain the classic solution:

$$
\hat \beta = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2} = \frac{\langle Y, X \rangle}{\|X\|^2} .
$$

Note further, if we take a second derivative we get

$$
l'' = 2 \sum_{i=1}^n X_i^2 ,
$$

which is strictly positive unless all of the $X_i$ are zero (a case of zero variation in the predictor where regression is uninteresting), so the solution above is a minimum.

Regression through the origin is a very useful version of regression, but it’s quite limited in its application. Rarely do we want to fit a line that is forced to go through the origin, or stated equivalently, rarely do we want a prediction algorithm for $Y$ that is simply a scale change of $X$. Typically, we at least also want an intercept. In the example that follows, we’ll address this by centering the data so that the origin is the mean of the $Y$ and the mean of the $X$. As it turns out, this is the same as fitting the intercept, but we’ll do that more formally in the next section.
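Before turning to real data, the closed-form slope can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are simulated (they are not the OASIS data used later), and the names `x_sim`, `y_sim`, and `loss` are hypothetical helpers introduced here. It compares the formula with NumPy’s least squares solver using a single column and no intercept, and confirms that moving the slope slightly in either direction increases the sum of squares.

```python
import numpy as np

# Simulated data (an assumption for illustration only)
rng = np.random.default_rng(0)
x_sim = rng.normal(size=100)
y_sim = 2.0 * x_sim + rng.normal(scale=0.5, size=100)

# Closed-form slope for regression through the origin
beta_hat = np.sum(y_sim * x_sim) / np.sum(x_sim ** 2)

# Generic least squares with one column and no intercept column
beta_lstsq = np.linalg.lstsq(x_sim[:, None], y_sim, rcond=None)[0][0]

def loss(beta):
    """Sum of squared errors for a line through the origin with slope beta."""
    return np.sum((y_sim - beta * x_sim) ** 2)

print(beta_hat, beta_lstsq)
# The loss should be smaller at beta_hat than at nearby slopes
print(loss(beta_hat) < loss(beta_hat - 0.01), loss(beta_hat) < loss(beta_hat + 0.01))
```

The two estimates should agree up to floating point error, since both minimize the same criterion.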
First let’s load the necessary packages.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Now let’s download and read in the data.
dat = pd.read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/oasis.csv")
pd.set_option('display.float_format', '{:.2f}'.format)
dat.head()
It’s almost always a good idea to plot the data before fitting the model.
x = dat.T2
y = dat.PD
plt.plot(x, y, 'o')
Now, let’s center the data as we mentioned so that it seems more reasonable to have the line go through the origin. Notice here that the middle of the data, in both $x$ and $y$, is right at (0, 0).
x = x - np.mean(x)
y = y - np.mean(y)
plt.plot(x, y, 'o')
Here’s our slope estimate according to our formula.
b = sum(y * x) / sum(x ** 2)
b
0.7831514763655999
Let’s plot the fitted line to see how it did. It looks good. Now let’s see if we can fit a line that doesn’t necessarily have to go through the origin.
plt.plot(x, y, 'o')
t = np.array([-1.5, 2.5])
plt.plot(t, t * b)
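The earlier claim that centering gives the same slope as fitting an intercept can be checked with a quick sketch. This is an assumption-laden illustration, not the formal treatment: the data are simulated with a nonzero intercept, the names `x_raw`, `y_raw`, `xc`, and `yc` are hypothetical, and `np.polyfit` with `deg=1` stands in for a line fit with an intercept.

```python
import numpy as np

# Simulated data with a nonzero intercept (an assumption for illustration only)
rng = np.random.default_rng(1)
x_raw = rng.normal(loc=3.0, size=200)
y_raw = 1.5 + 0.8 * x_raw + rng.normal(scale=0.3, size=200)

# Center both variables, then apply the through-the-origin formula
xc = x_raw - np.mean(x_raw)
yc = y_raw - np.mean(y_raw)
slope_centered = np.sum(yc * xc) / np.sum(xc ** 2)

# Fit a line with an intercept to the uncentered data;
# polyfit returns coefficients from highest degree down: [slope, intercept]
slope_fit, intercept_fit = np.polyfit(x_raw, y_raw, deg=1)

print(slope_centered, slope_fit)
# The fitted intercept should equal mean(y) - slope * mean(x)
print(intercept_fit, np.mean(y_raw) - slope_centered * np.mean(x_raw))
```

If the two slopes match, the fitted line passes through the point of means, which is why centering lets a line through the origin do the work of an intercept.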