Regression and FFTs

Recall regression through the origin. If $y$ and $x$ are $n$-vectors, the minimizer of

$$ ||y - \beta x ||^2 $$

is $\hat \beta = <x, y> / ||x||^2$. Note that if $||x|| = 1$, the estimate is simply $\hat \beta = <x, y>$. Now consider a second variable, $w$, such that $<x, w> = 0$ and $||w|| = 1$, and the least squares criterion

$$ ||y - \beta x - \gamma w||^2. $$

We argued that the best estimate of $\beta$ now first gets rid of $w$ by regressing it out of $y$ and $x$. Since $<w, x> = 0$, regressing $w$ out of $x$ leaves $x$ unchanged, so

$$ ||y - <w, y> w - \beta (x - <w, x> w)||^2 = ||y - <w, y> w - \beta x||^2. $$

Thus, the best estimate of $\beta$ is now

$$ <y - <w, y> w, x> / ||x||^2 = <y, x> / ||x||^2. $$

Or, in other words, if $x$ and $w$ are orthogonal then the coefficient estimate for $x$ with $w$ included is the same as the coefficient of $x$ by itself. This extends to more than two regressors.
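
As a quick numerical sanity check of this claim (a toy example with made-up vectors, not part of the analysis below), we can build a $w$ orthogonal to a given $x$ and verify that the least squares coefficient of $x$ is the same with or without $w$ in the model. A minimal sketch using numpy:

import numpy as np

np.random.seed(1234)
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.random.normal(size=4)
w = w - np.dot(w, x) / np.dot(x, x) * x   ## remove the component along x, so <x, w> = 0
w = w / np.linalg.norm(w)                 ## rescale so ||w|| = 1
ytoy = np.random.normal(size=4)
beta_alone = np.dot(x, ytoy) / np.dot(x, x)                      ## coefficient of x by itself
beta_joint = np.linalg.lstsq(np.c_[x, w], ytoy, rcond=None)[0]   ## coefficients with w included
[beta_alone, beta_joint]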

If you have a collection of $n$ mutually orthogonal vectors of norm one, they are called an orthonormal basis. For an orthonormal basis: 1. the coefficients are just the inner products between the regressors and the outcome, and 2. inclusion or exclusion of other elements of the basis doesn’t change a basis element’s estimated coefficient.
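
The same point can be checked for a whole orthonormal basis. Here is a small sketch (again with simulated data) that builds orthonormal columns with a QR decomposition and verifies that the regression-through-the-origin coefficients equal the inner products, and that dropping basis elements leaves the remaining coefficients unchanged:

import numpy as np

np.random.seed(5678)
Q, _ = np.linalg.qr(np.random.normal(size=(20, 3)))       ## 3 orthonormal columns
ytoy = np.random.normal(size=20)
full = np.linalg.lstsq(Q, ytoy, rcond=None)[0]             ## coefficients using all three regressors
inner = Q.T @ ytoy                                         ## the inner products <x_i, y>
partial = np.linalg.lstsq(Q[:, :2], ytoy, rcond=None)[0]   ## drop the third basis element
[full, inner, partial]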

It’s important to note that this works quite generally, for complex vectors as well as real ones. So, for example, consider the vector $x$ with elements $e^{-2\pi i m k / n}$ for $m=0,\ldots, n-1$, for a particular value of $k$. Vectors like this are orthogonal for different values of $k$, and dividing them by $\sqrt{n}$ makes them norm 1. We have already seen that the Fourier coefficient is

$$ f_k = <y, x> = \sum_{m=0}^{n-1} y_m e^{-2\pi i m k / n} = \sum_{m=0}^{n-1} y_m \cos(-2\pi m k / n) + i \sum_{m=0}^{n-1} y_m \sin(-2\pi m k / n) $$

where $y_m$ is element $m$ of $y$. Thus, up to this normalization, the Fourier coefficients are exactly least squares coefficients computed in the complex domain. In particular, we have that

$$ f_k = a_k + i b_k $$

where $a_k$ and $b_k$ are the coefficients from linear models with just the cosine and sine terms, respectively. Of course, we don’t actually fit Fourier transforms this way, since there’s a much faster way to do it, aptly named the fast Fourier transform (FFT). However, knowing how discrete Fourier transforms relate to linear models allows us to use them in creative ways, like putting them into models with other covariates, or into logistic regression models.
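
Before turning to data, here is a small check (on a simulated sequence, with arbitrary toy names) that numpy’s FFT at a given $k$ is exactly the complex sum above, i.e. the cosine coefficient plus $i$ times the sine coefficient:

import numpy as np

np.random.seed(42)
ntoy, k = 16, 3
m = np.arange(ntoy)
ytoy = np.random.normal(size=ntoy)
fk = np.fft.fft(ytoy)[k]                               ## the DFT coefficient at frequency k
ak = np.sum(ytoy * np.cos(-2 * np.pi * m * k / ntoy))  ## cosine (real) part
bk = np.sum(ytoy * np.sin(-2 * np.pi * m * k / ntoy))  ## sine (imaginary) part
[fk, ak + 1j * bk]                                     ## these two agree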

Let’s numerically look at FFTs and linear models using COVID-19 case counts in Italy as an example.

import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import statsmodels.api as sm
dat = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
dat.head()
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 209322 209340 209358 209362 209369 209390 209406 209436 209451 209451
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 334391 334408 334408 334427 334427 334427 334427 334427 334443 334457
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 271441 271448 271463 271469 271469 271477 271477 271490 271494 271496
3 NaN Andorra 42.50630 1.521800 0 0 0 0 0 0 ... 47866 47875 47875 47875 47875 47875 47875 47875 47890 47890
4 NaN Angola -11.20270 17.873900 0 0 0 0 0 0 ... 105255 105277 105277 105277 105277 105277 105277 105277 105288 105288

5 rows × 1147 columns

## Get Italy, drop everything except the dates, and convert to long format (unstack returns a series indexed by (date, row) tuples)
y = dat[dat['Country/Region'] == 'Italy'].drop(["Province/State", "Country/Region", "Lat", "Long"], axis=1).unstack()
## convert from a pandas series to a numpy array
y = np.asarray(y)  
## difference the cumulative counts to get daily case counts
y = y[1 : y.size] - y[0 : (y.size - 1)]
## start the series at the first non-zero entry
y = y[np.min(np.where(y != 0)) : y.size]
plt.plot(y)
[Plot: daily COVID-19 case counts in Italy]

Let’s look at a smoothed version of the series and then take the residual. The residual is where we’d like to look for oscillatory behavior.

n = y.size
t = np.arange(0, n, 1)
lowess = sm.nonparametric.lowess
yhat = lowess(y, t, frac=0.05, return_sorted=False)
plt.plot(y)
plt.plot(yhat)
[Plot: daily case counts with the lowess smoother overlaid]
## We're interested in the residual
e = y - yhat
plt.plot(e)
[Plot: residuals after removing the lowess trend]

Let’s manually create our Fourier bases. We’re just going to pick a couple of periods to investigate: one slowly varying and one quickly varying (5 and 20 cycles over the record), with a cosine and sine term for each.

## Create 4 basis elements
## Orthonormal basis (note dividing by sqrt(n/2) makes them norm 1)
c5  = np.cos(-2 * np.pi * t * 5 / n) / np.sqrt(n / 2)
c20 = np.cos(-2 * np.pi * t * 20 / n) / np.sqrt(n / 2)
s5  = np.sin(-2 * np.pi * t * 5 / n) / np.sqrt(n / 2)
s20 = np.sin(-2 * np.pi * t * 20 / n) / np.sqrt(n / 2)
fig, axs = plt.subplots(2, 2)
axs[0,0].plot(t, c5)
axs[0,1].plot(t, c20)
axs[1,0].plot(t, s5)
axs[1,1].plot(t, s20)
plt.show()
[Figure: the four basis elements, cosines (top row) and sines (bottom row) at frequencies 5 and 20]

Let’s verify that they are indeed orthonormal. That is, we want to show that $<x_i, x_j> = I(i = j)$. We also show that they all have mean 0.

## Verify that they are orthonormal and mean 0; round to 6 decimal places
np.around( [
 np.sum(c5),
 np.sum(c20),
 np.sum(s5),
 np.sum(s20),
 np.sum(c5 * c5),
 np.sum(c20 * c20),
 np.sum(s5 * s5),
 np.sum(s20 * s20),
 np.sum(c5 * s5),
 np.sum(c5 * s20),
 np.sum(c5 * c20),
 np.sum(s5 * s20),
], 6)
array([-0.,  0.,  0., -0.,  1.,  1.,  1.,  1.,  0.,  0., -0., -0.])
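
Equivalently (just a more compact way to write the same check, stacking the four vectors into a matrix called B here), the Gram matrix should be the $4 \times 4$ identity:

B = np.c_[c5, c20, s5, s20]
np.around(B.T @ B, 6)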

Let’s take the FFT, the fast (discrete) Fourier transform, the way one would normally do it. First, we use numpy’s FFT. Then there’s a convenient function, fftfreq, which gives the frequency associated with each element of the transform. Finally, we plot the spectral density, which is the sum of the squared real and imaginary parts of the Fourier coefficients. Sorting the elements by frequency first is necessary to connect the dots on the plot. Interestingly, once we remove the trend from the Italy data, there are some very noticeable spikes in the spectral density, which imply large coefficients at those specific frequencies. This is possibly some reporting issue.

f = np.fft.fft(e)
w = np.fft.fftfreq(n)
ind = w.argsort()
f = f[ind] 
w = w[ind]
plt.plot(w, f.real**2 + f.imag**2)
[Plot: spectral density of the residuals versus frequency]
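
To see which frequencies the spikes sit at, one option is to sort the spectral density and look at the largest few entries. A small sketch (sd and top are just scratch names; the exact values depend on the data downloaded above):

sd = f.real ** 2 + f.imag ** 2
top = np.argsort(-sd)[:6]   ## indices of the six largest spectral density values
np.c_[w[top], sd[top]]      ## their frequencies (cycles per day) and magnitudes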

Now let’s manually find the coefficients using our constructed basis and the fact that, for an orthonormal basis, the coefficients are just the inner products with the outcome. We multiply by $\sqrt{n/2}$ to put them back on the scale of the unnormalized FFT.

[
 np.sum(c5 * e) * np.sqrt(n / 2),
 np.sum(c20 * e) * np.sqrt(n / 2),
 np.sum(s5 * e) * np.sqrt(n / 2),
 np.sum(s20 * e) * np.sqrt(n / 2),
] 
[-42402.16539776623, -961069.7936856844, 62406.76098672958, 1902286.8621751775]
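
These rescaled inner products should match the real and imaginary parts of the unnormalized FFT of the residual at $k = 5$ and $k = 20$. A quick check (recomputing the FFT of e into f2 here, since f was reordered by frequency above):

f2 = np.fft.fft(e)
[
 [f2[5].real,  np.sum(c5 * e) * np.sqrt(n / 2)],
 [f2[20].real, np.sum(c20 * e) * np.sqrt(n / 2)],
 [f2[5].imag,  np.sum(s5 * e) * np.sqrt(n / 2)],
 [f2[20].imag, np.sum(s20 * e) * np.sqrt(n / 2)],
]

Next, let’s get the same coefficients from sklearn’s linear regression, fit to the raw series y. The point to notice is the orthonormal basis property from above: the estimates for the c5 and s5 terms are identical whether or not c20 and s20 are included.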
sreg = linear_model.LinearRegression()
## regression on all four basis elements, fit to the raw series y
x = np.c_[c5, c20, s5, s20]
fit = sreg.fit(x, y)
fit.coef_ * np.sqrt(n / 2)
array([ 1222376.2412703 , -1286448.11961955, -4881386.48874237,
        1827887.03044428])
## drop the c20 and s20 terms; the c5 and s5 estimates are unchanged
x = np.c_[c5, s5]
fit = sreg.fit(x, y)
fit.coef_ * np.sqrt(n / 2)
array([ 1222376.2412703 , -4881386.48874237])
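
If we instead fit to the residual e, which is what the manual inner products above used, the rescaled coefficients should essentially reproduce those values, since the basis columns are orthonormal and mean zero (a sketch; agreement is up to numerical precision and the fitted intercept):

## refit on the residual rather than the raw counts
x = np.c_[c5, c20, s5, s20]
fit = sreg.fit(x, e)
fit.coef_ * np.sqrt(n / 2)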
## Search for FFT entries whose real part matches the rescaled cosine inner product with y;
## w[test] gives the frequency of any match, shown alongside 5 / n for comparison
## (none match here, likely because f was computed from the residual e rather than y)
test = np.where(np.abs(f.real / np.sum(c5 * y) / np.sqrt(n / 2) - 1) < 1e-5)
[test, f.real[test], w[test], 5 / n]
[(array([], dtype=int64),),
 array([], dtype=float64),
 array([], dtype=float64),
 0.004409171075837742]
f.imag[test]
array([], dtype=float64)