Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Regression and FFTs

Open In Colab

Regression and FFTs

Recall regression through the origin. If yy and xx are nn-vectors of the same length, the minimizer of

yβx2||y - \beta x ||^2

is β^=<x,y>/x2\hat \beta = <x, y> / ||x||^2. Note, if x=1||x|| = 1 then the estimate is just β^=<x,y>\hat \beta = <x, y>. Now consider a second variable, ww, such that <x,w>=0<x, w> = 0 and w=1||w|| = 1. Consider now the least squares model

yβxγw2.||y - \beta x - \gamma w||^2.

We argued that the best estimate for β\beta now first gets rid of ww be regressing it out of yy and xx. So, consider that

y<w,y>wβ(x<w,x>w)2=y<w,y>wβx2.||y - <w, y> w - \beta (x - <w, x> w)||^2 = ||y - <w, y> w - \beta x||^2.

Thus, now the best estimate of β\beta is

<y<w,y>w,x>=<y,x>.<y - <w, y> w, x> = <y, x>.

Or, in other words, if xx and ww are orthogonal then the coefficient estimate for xx with ww included is the same as the coefficient of xx by itself. This extends to more than two regressors.

If you have a collection of nn mutually orthogonal vectors of norm one, they are called an orthonormal basis. For an orthonomal basis, 1. the coefficients are just the inner products between the regressors and the outcome and 2. inclusion or exclusion of other elemenents of the basis doesn’t change a basis elements estimated coefficients.

It’s important to note, that this works quite generally. For example, for complex numbers as well as real. So, for example, consider the possibility that xx is e2πimk/ne^{-2\pi i m k / n} for m=0,,n1m=0,\ldots, n-1 for a particular value of kk. Vectors like this are orthogonal for different values of kk and all have norm 1. We have already seen that the Fourier coefficient is

fk=<y,x>=m=0n1yme2πimk/n=m=0n1ymcos(2πmk/n)+im=0n1ymsin(2πmk/n)f_k = <y, x> = \sum_{m=0}^{n-1} y_m e^{-2\pi i m k / n} = \sum_{m=0}^{n-1} y_m \cos(-2\pi m k / n) + i \sum_{m=0}^{n-1} y_m \sin(-2\pi m k / n)

where ymy_m is element mm of yy. Thus, the Fourier coefficients are exactly just least squares coefficients applied in the complex space. Thus we have that

fk=ak+ibkf_k = a_k + i b_k

where aka_k and bkb_k are the coefficients from linear models with just the sine and cosine terms. Of course, we don’t actually fit Fourier transforms this way, since there’s a much faster way to do, aptly named the fast Fourier transform (FFT). However, knowing how fast discrete Fourier transforms relate to linear models allows us to use them in creative ways, like putting them into models with other covariates, or in logistic regression models.

Let’s numerically look at FFTs and linear models using covid case counts in Italy as an example.

import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import statsmodels.api as sm
dat = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
dat.head()
Loading...
## Get Italy, drop everyrthing except dates, convert to long (unstack converts to tuple)
y=dat[dat['Country/Region'] == 'Italy'].drop(["Province/State", "Country/Region", "Lat", "Long"], axis=1).unstack()
## convert from tuple to array
y = np.asarray(y)  
## get case counts instead of cumulative counts
y = y[1 : y.size] - y[0 : (y.size - 1)]
## get the first non zero entry
y =  y[np.min(np.where(y !=  0)) : y.size]
plt.plot(y)

<Figure size 640x480 with 1 Axes>

Let’s look at a smoothed version of it and then take the residual. The residual is where we’d like to look at some oscillatory behavior.

n = y.size
t = np.arange(0, n, 1)
lowess = sm.nonparametric.lowess
yhat = lowess(y, t, frac=.05,return_sorted=False)
plt.plot(y)
plt.plot(yhat)
<Figure size 640x480 with 1 Axes>
## We're interested in the residual
e = y - yhat
plt.plot(e)
<Figure size 640x480 with 1 Axes>

Let’s manually create our Fourier bases. We’re just going to pick some periods to investigate. We’ll pick a fast varying and slow varying.


## Create 4 elements
## Orthonormal basis (note dividing by sqrt(n/2) makes them norm 1)
c5  = np.cos(-2 * np.pi * t * 5 / n  ) / np.sqrt(n /2)
c20 = np.cos(-2 * np.pi * t * 20 / n ) / np.sqrt(n /2)
s5  = np.sin(-2 * np.pi * t * 5  / n  )/ np.sqrt(n /2)
s20 = np.sin(-2 * np.pi * t * 20 / n  ) / np.sqrt(n /2)
fig, axs = plt.subplots(2, 2)
axs[0,0].plot(t, c5)
axs[0,1].plot(t, c20)
axs[1,0].plot(t, s5)
axs[1,1].plot(t, s20)
plt.show()
<Figure size 640x480 with 4 Axes>

Let’s verify that they are indeed orthonormal. That is, we want to show that <xi,xj>=I(i=j)<x_i, x_j> = I(i =j). We also show that they are all mean 0.

## Verify that they are orthonormal mean 0, round to 6 decimal places
np.around( [
 np.sum(c5),
 np.sum(c20),
 np.sum(s5),
 np.sum(s20),
 np.sum(c5 * c5),
 np.sum(c20 * c20),
 np.sum(s5 * s5),
 np.sum(s20 * s20),
 np.sum(c5 * s5),
 np.sum(c5 * s20),
 np.sum(c5 * c20),
 np.sum(s5 * s20),
], 6)
array([-0., 0., 0., -0., 1., 1., 1., 1., 0., 0., -0., -0.])

Let’s take the FFT, the fast (discrete) Fourier transform th way one would normally do it. First, we use FFT in numpy. Then, there’s a convenient method, fftfreq, which gives the associated frequencies with each element of the transform. Finally, we plot the spectral density, which is the sum of the real and complex Fourier coefficients. Sorting the elements first is necessary to connect the dots on the plot. Interestingly, once we remove the trend from the Italy data, there’s some very noticeable spikes in the spectral density, which implies large coefficients on that specific frequency. This is possibly some reporting issue.

f = np.fft.fft(e)
w = np.fft.fftfreq(n)
ind = w.argsort()
f = f[ind] 
w = w[ind]
plt.plot(w, f.real**2 + f.imag**2)
<Figure size 640x480 with 1 Axes>

Now let’s manually find the coefficients using our constructed bases and the formula that the coefficients.

[
 np.sum(c5 * e) * np.sqrt(n / 2),
 np.sum(c20 * e) * np.sqrt(n / 2),
 np.sum(s5 * e) * np.sqrt(n / 2),
 np.sum(s20 * e) * np.sqrt(n / 2),
] 
[-42402.16539776623, -961069.7936856844, 62406.76098672958, 1902286.8621751775]
sreg = linear_model.LinearRegression()
x=np.c_[c5, c20, s5, s20]
fit = sreg.fit(x, y)
fit.coef_ * np.sqrt(n/2)
array([ 1222376.2412703 , -1286448.11961955, -4881386.48874237, 1827887.03044428])
x=np.c_[c5, s5]
fit = sreg.fit(x, y)
fit.coef_ * np.sqrt(n/2)
array([ 1222376.2412703 , -4881386.48874237])
test = np.where( np.abs(f.real / np.sum(c5 * y) / np.sqrt(n / 2) - 1) < 1e-5) 
[test, f.real[test], w[test], 5 / n]
[(array([], dtype=int64),), array([], dtype=float64), array([], dtype=float64), 0.004409171075837742]
f.imag[test]
array([], dtype=float64)