Regression through the origin


In this notebook, we investigate a simple problem where we'd like to use a scaled version of one variable to predict another. That is, let $Y_1, \ldots, Y_n$ be a collection of variables we'd like to predict and $X_1, \ldots, X_n$ be the corresponding predictors. Consider minimizing

$$l = \sum_i (Y_i - \beta X_i)^2 = \|Y - \beta X\|^2.$$

Taking the derivative of $l$ with respect to $\beta$ yields

$$l' = -\sum_i 2 (Y_i - \beta X_i) X_i.$$

If we set this equal to zero and solve for $\beta$, we obtain the classic solution:

$$\hat \beta = \frac{\sum_i Y_i X_i}{\sum_i X_i^2} = \frac{\langle Y, X \rangle}{\|X\|^2}.$$
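As a quick sanity check of this formula, here is a minimal sketch on synthetic data. The simulated `X` and `Y`, the true slope of 2, and the variable names are all assumptions made for illustration; they are not the data analyzed later in this notebook. The sketch compares the closed-form slope with a generic least-squares solver and confirms that the derivative of the loss vanishes at the estimate.

```python
import numpy as np

# Synthetic data: an assumption for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=50)
Y = 2.0 * X + rng.normal(scale=0.5, size=50)

# Closed-form slope: <Y, X> / ||X||^2
beta_hat = np.dot(Y, X) / np.dot(X, X)

# Same fit from a generic least-squares solver, treating X as a
# one-column design matrix (so no intercept is included).
beta_lstsq = np.linalg.lstsq(X[:, None], Y, rcond=None)[0][0]

# The derivative of the loss evaluated at beta_hat should be
# zero up to floating-point error.
derivative_at_beta_hat = -2.0 * np.sum((Y - beta_hat * X) * X)

print(beta_hat, beta_lstsq, derivative_at_beta_hat)
```

The two slope estimates should agree to within floating-point error, since both minimize the same sum of squares.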

Note further that if we take a second derivative, we get

$$l'' = \sum_i 2 X_i^2$$

which is strictly positive unless all of the $X_i$ are zero (a case of zero variation in the predictor, where regression is uninteresting). So $\hat \beta$ is indeed a minimum.

Regression through the origin is a very useful version of regression, but it's quite limited in its application. Rarely do we want to fit a line that is forced to go through the origin or, stated equivalently, rarely do we want a prediction algorithm for $Y$ that is simply a scale change of $X$. Typically, we at least also want an intercept. In the example that follows, we'll address this by centering the data so that the origin is at the mean of the $Y$ and the mean of the $X$. As it turns out, this gives the same slope as fitting a line with an intercept, but we'll show that more formally in the next section.
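The centering claim can be checked numerically before it is derived. The following is a small sketch on synthetic data (the simulated values, the true intercept of 1.5 and the true slope of 0.8 are assumptions for illustration). It compares the through-the-origin slope on centered data with the slope from `np.polyfit`, which fits a line with an intercept.

```python
import numpy as np

# Synthetic data with a nonzero intercept: an assumption for illustration only.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.3, size=100)

# Regression through the origin on the centered data.
xc = x - np.mean(x)
yc = y - np.mean(y)
slope_centered = np.sum(yc * xc) / np.sum(xc ** 2)

# Ordinary line fit with an intercept; polyfit returns the
# coefficients from highest degree to lowest: [slope, intercept].
slope_with_intercept, intercept = np.polyfit(x, y, deg=1)

# For contrast: regression through the origin on the raw, uncentered data.
slope_raw_origin = np.sum(y * x) / np.sum(x ** 2)

print(slope_centered, slope_with_intercept, slope_raw_origin)
```

The first two slopes should match, while the slope from the uncentered through-the-origin fit differs because that line is forced through $(0, 0)$ rather than through the means of the data.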

First let’s load the necessary packages.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now let’s download and read in the data.

dat = pd.read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/oasis.csv")
pd.set_option('display.float_format', '{:.2f}'.format)
dat.head()

It’s almost always a good idea to plot the data before fitting the model.

x = dat.T2
y = dat.PD
plt.plot(x, y, 'o')
<Figure size 640x480 with 1 Axes>

Now, let’s center the data, as mentioned above, so that it is more reasonable to have the line go through the origin. Notice that the middle of the data, in both $Y$ and $X$, is now right at $(0, 0)$.

x = x - np.mean(x)
y = y - np.mean(y)
plt.plot(x, y, 'o')
<Figure size 640x480 with 1 Axes>

Here’s our slope estimate according to our formula.

b = sum(y * x) / sum(x ** 2)
b
0.7831514763655999

Let’s plot the fitted line to see how it did. It looks good. Next, we’ll see whether we can fit a line that doesn’t necessarily have to go through the origin.

plt.plot(x, y, 'o')
t = np.array([-1.5, 2.5])
plt.plot(t, t * b)
<Figure size 640x480 with 1 Axes>