Numpy#
Numpy is an essential core component of doing statistics in python. Numpy basically contains all of the basic mathematical functions that you need. Let’s load in some data and work with it a bit. Here we’re going to be downloading the cumulative daily case counts of Covid-19 for Italy. (Modify the code in the obvious way to pick another country. The data is from the JHU Covid-19 dashboard, a fantastic data science project from the Center for Systems Science Engineering at JHU.)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dat = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
## Get Italy, drop everyrthing except dates, convert to long (unstack converts to tuple)
Italy = dat[dat['Country/Region'] == 'Italy'].drop(["Province/State", "Country/Region", "Lat", "Long"], axis=1).unstack()
Let’s create a numpy array from the counts and work with it a bit. First we’ll take our data from Italy and convert it from the cumulative case counts to the daily case counts.
## convert from tuple to array
X = np.asarray(Italy)
## get case counts instead of cumulative counts
X = X[1 : X.size] - X[0 : (X.size - 1)]
Let’s get some basic statistical summaries. Note the default is that the standard deviations uses the formula
rather than
To get the latter (the unbiased version), set ddof=1
. Personally, I prefer \(N\) as a divisor, though that’s a minority opinion. (Between bias or variance of the standard deviation estimate, I’d rather rather have lower variance.) To described the code below:
X.mean()
gives the mean; sinceX
is a numpy object, it has statistical methods defined in the classX.std()
gives the (biased version of the) standard deviationnp.round(A, 2)
rounds the numpy object to two decimal places (useful for printing)str
is the python string converter (just to print)
print("Mean : " + str(np.round(X.mean(), 2)))
print("Std (biased) : " + str(np.round(X.std() , 2)))
Mean : 22419.89
Std (biased) : 36890.56
Numpy has a linear algebra library. Let’s calculate a distributed lag model using numpy (typically you would use this with regression software). A distributed lag model is of the form:
First, let’s create the lagged matrix considering 3 lags.
## Create a matrix of three lagged versions
X = np.array([ Italy.shift(1), Italy.shift(2), Italy.shift(3)]).transpose()
## Add a vector of ones
itc = np.ones( (X.shape[0], 1) )
X = np.concatenate( (itc, X), axis = 1)
## Visualize the results
X[0 : 10,:]
array([[ 1., nan, nan, nan],
[ 1., 0., nan, nan],
[ 1., 0., 0., nan],
[ 1., 0., 0., 0.],
[ 1., 0., 0., 0.],
[ 1., 0., 0., 0.],
[ 1., 0., 0., 0.],
[ 1., 0., 0., 0.],
[ 1., 0., 0., 0.],
[ 1., 0., 0., 0.]])
Let’s get rid of the three NA rows.
X = X[ 3 : X.shape[0], :]
np.any(np.isnan(X))
False
## Create the Y vector
Y = np.array(Italy[ 3 : Italy.shape[0]])
[Y.shape, X.shape]
[(1140,), (1140, 4)]
The matrix formula for minimizing the least squares regression model,
is given by
Let’s do this in numpy. Let’s find the estimated regression coefficients using the formula above. We’ll use the following functions
matmul(A,B)
is the matrix multiplication ofA
andB
A.T
is the transpose ofA
, labeled above as \(A'\)inv(A)
is the matrix inverse ofA
, labeled above as \(A^{-1}\)
np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T, X)), X.T), Y)
array([ 2.83671744e+03, 1.43359771e+00, -6.92820220e-02, -3.64121781e-01])
Of course, we don’t tend to do things this way. If needed, we’d use lstsq.
np.linalg.lstsq(X, Y, rcond = None)[0]
array([ 2.83671680e+03, 1.43359771e+00, -6.92820217e-02, -3.64121781e-01])
Typically, we wouldn’t do any of this for this problem, since high level regression models exist already. For example, sklearn’s linear_model module)
from sklearn import linear_model
model = linear_model.LinearRegression(fit_intercept = False)
fit = model.fit(X, Y)
fit.coef_
array([ 2.83671680e+03, 1.43359771e+00, -6.92820217e-02, -3.64121781e-01])