Regression using arrays#
import numpy as np
numpy arrays are the fundamental building block of the Python SciPy stack. Scientific computing in Python nearly always makes use of numpy.ndarray at some level.
In this example we will load an array from a file and conduct a simple linear regression. The method of Ordinary Least Squares (OLS) is used to fit a linear model (think \(y = \beta_1 x + \beta_0 + \epsilon\)).
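To make the method concrete, here is a minimal sketch of an OLS fit on synthetic data using plain numpy (the variable names, seed, and true coefficients below are illustrative, not part of the hospital example):
# fit y = beta_1 * x + beta_0 by least squares on made-up data
rng = np.random.default_rng(42)
x = np.arange(20, dtype=float)
y = 0.5 * x + 2.0 + rng.normal(scale=1.0, size=x.shape)
# design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [2.0, 0.5], i.e. (beta_0, beta_1)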
The file data/lysis.csv contains the monthly proportion of a hospital's stroke patients who were treated with a potentially life-saving drug that removes a blood clot from the brain. There are a total of 54 months in the dataset.
In month 42 the hospital introduced a new process for fast-tracking patients to treatment. Here we will quantify the difference in treatment rates before and after the fast-track intervention and determine whether the result is statistically significant.
To do this we will need to:

1. Import the OLS class from the statsmodels.api namespace.
2. Read the lysis.csv file into a numpy.ndarray variable (watching out for headers in the file).
3. Create a numpy array that is the same size as the lysis array to represent our intervention variable. It should contain 0 where the month (index) is less than the month the intervention was introduced (42), and 1 for all months from the intervention onwards. The end array should look like dummy == [0, 0, 0, ..., 1, 1, 1].
Once we have conducted the analysis we will print the results summary.
Import statsmodels#
The code will use the OLS class that can be found in the statsmodels.api namespace.
# this contains the analysis function we will use
import statsmodels.api as sm
Read in the data#
We will use the genfromtxt function, but in this case you could equally use loadtxt, as there is no missing data.
lysis = np.genfromtxt('data/lysis.csv', skip_header=1)
print(lysis.shape)
# peek at first five data points.
print(lysis[:5])
(54,)
[0.01886793 0.03030303 0.01886793 0.01886793 0.06060606]
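For reference, the loadtxt alternative mentioned above would look something like this (a sketch assuming the same single header row; lysis_alt is an illustrative name):
# equivalent read with loadtxt; skiprows=1 plays the role of skip_header=1
lysis_alt = np.loadtxt('data/lysis.csv', skiprows=1)
print(np.allclose(lysis, lysis_alt))  # expect True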
Set up variables#
The intervention takes place in month 42, so we'll include a dummy variable that contains 0 before this point and 1 afterwards. The coefficient of this variable is the effect size of the intervention.
By default, OLS does not include an intercept in the model. To add an intercept, the OLS class requires a constant term (a variable with no variation, where all values are 1). We can use the add_constant function to add the constant.
# create intervention variables
intervention_month = 42
intervention = np.zeros(lysis.shape[0], dtype=int)
intervention[intervention_month:] = 1
# add a constant term - exog is a matrix.
exog = sm.add_constant(intervention)
exog.shape
(54, 2)
# peek at both columns and 3 rows
exog[:3, :]
array([[1., 0.],
[1., 0.],
[1., 0.]])
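For intuition, the same design matrix could also be assembled by hand with plain numpy; this sketch mirrors what add_constant produced above (exog_manual is an illustrative name):
# manual equivalent of sm.add_constant: prepend a column of ones
exog_manual = np.column_stack([np.ones(lysis.shape[0]), intervention])
print(np.array_equal(exog_manual, exog))  # expect True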
Regression code#
The next code block illustrates how to run the model. statsmodels refers to the X and y variables as exogenous and endogenous, respectively.
In the output, take a look at the coefficients and confidence intervals for const (the intercept) and x1 (the intervention variable). These can be interpreted as the mean treatment rate before the intervention and the mean increase in treatment rate due to the intervention, respectively.
# create an instance of the OLS class
model = sm.OLS(endog=lysis, exog=exog)
print(model)
<statsmodels.regression.linear_model.OLS object at 0x7fb0b7502790>
# fit the model
results = model.fit()
# show the results summary
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.538
Model: OLS Adj. R-squared: 0.529
Method: Least Squares F-statistic: 60.63
Date: Fri, 20 Sep 2024 Prob (F-statistic): 2.77e-10
Time: 15:53:28 Log-Likelihood: 106.92
No. Observations: 54 AIC: -209.8
Df Residuals: 52 BIC: -205.9
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0420 0.005 7.998 0.000 0.031 0.053
x1 0.0868 0.011 7.786 0.000 0.064 0.109
==============================================================================
Omnibus: 12.833 Durbin-Watson: 2.146
Prob(Omnibus): 0.002 Jarque-Bera (JB): 22.219
Skew: 0.678 Prob(JB): 1.50e-05
Kurtosis: 5.835 Cond. No. 2.55
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
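Rather than reading values off the printed summary, you can also pull them out programmatically; the fitted results object exposes the coefficients and their 95% confidence intervals directly:
# coefficients in the order [const, x1]
print(results.params)
# 95% confidence intervals, one row per coefficient
print(results.conf_int())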
Check regression results#
As a quick cross-check, we can compute the before and after means directly from the data; the difference should match the x1 coefficient from the regression.
mean_before = lysis[:intervention_month].mean()
mean_after = lysis[intervention_month:].mean()
diff = mean_after - mean_before
print(f'Before: {mean_before:.4f}')
print(f'After: {mean_after:.4f}')
print(f'Diff: {diff:.4f}')
Before: 0.0420
After: 0.1288
Diff: 0.0868
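Finally, as a programmatic sanity check, the regression coefficients should agree with these group means (this sketch reuses the results object fitted above):
# const should equal the pre-intervention mean; x1 the difference
print(np.isclose(results.params[0], mean_before))  # expect True
print(np.isclose(results.params[1], diff))         # expect True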