Regression using arrays#

import numpy as np

numpy arrays are the fundamental building block of the Python SciPy stack. Scientific computing in Python nearly always makes use of numpy.ndarray at some level.

In this example we will load an array from file and conduct a simple linear regression. The method of Ordinary Least Squares is used to fit a linear model (think \(y = \beta_1 x + \beta_0 + \epsilon \) ).

The file data/lysis.csv contains a monthly proportion of a hospital’s stroke patients that have been treated with a potentially life saving drug to remove a blood clot from their brain.

There are a total of 54 months in the dataset.

In month 42 the hospital introduced a new process for fast tracking patients to treatment. Here we will quantify the difference in treatment rates before and after the fast track intervention and to determine if the result is statistically significant.

To do this we will need to:

  • Import the OLS class from the statsmodels.api namespace.

  • Read the lysis.csv file into a numpy.ndarray variable (watching out for headers in the file)

  • Create a numpy array that is the same size as the lysis array to represent our intervention variable. It should contain zero when it the month (index) is less than when the intervention was introduced (42) and 1 for all months after the intervention was introduced. The end array should look like

dummy == [0,0,0, ... ,1,1,1] 

Once we have conducted the analysis we will print the results summary.

Import statsmodels#

The code will use the OLS class that cab be found in statsmodel.api

# this contains the analysis function we will use
import statsmodels.api as sm  

Read in the data#

We will use the genfromtxt function, but in this case you could equally use loadtxt as there is no missing data.

lysis = np.genfromtxt('data/lysis.csv', skip_header=1) 
print(lysis.shape)
# peek at first five data points.
print(lysis[:5])
(54,)
[0.01886793 0.03030303 0.01886793 0.01886793 0.06060606]

Setup variables#

The intervention takes place in month 42. So we’ll include a dummy variable that contains 0 before this and 1 afterwards. The coefficient of the variable is the effect size of the intervention.

By default the OLS does not include an intercept in the model. To add an intercept the OLS class requires a constant term (a variable with no variation where all values are 1). We can use the add_constant function to add the constant.

# create intervention variables
intervention_month = 42
intervention = np.zeros(lysis.shape[0], dtype=int)
intervention[intervention_month:] = 1

# add a constant term - exog is a matrix.
exog = sm.add_constant(intervention)
exog.shape
(54, 2)
# peek at both columns and 3 rows
exog[:3, :]
array([[1., 0.],
       [1., 0.],
       [1., 0.]])

Regression code#

The next code block illustrates how to run the model. statsmodels refers to the X and y variables as the exogenous and endogenous, respectively.

In the output take a look at the coefficients and CIs for the const (the intercept) and ``x1` (the intervention variable). These can be interpreted as the mean treatment rate before the intervention and the mean increase in treatment rate due to the intervention.

# create an instance of the OLS class
model = sm.OLS(endog=lysis, exog=exog)
print(model)
<statsmodels.regression.linear_model.OLS object at 0x7fb0b7502790>
# fit the model
results = model.fit()

# show the results summary
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.538
Model:                            OLS   Adj. R-squared:                  0.529
Method:                 Least Squares   F-statistic:                     60.63
Date:                Fri, 20 Sep 2024   Prob (F-statistic):           2.77e-10
Time:                        15:53:28   Log-Likelihood:                 106.92
No. Observations:                  54   AIC:                            -209.8
Df Residuals:                      52   BIC:                            -205.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0420      0.005      7.998      0.000       0.031       0.053
x1             0.0868      0.011      7.786      0.000       0.064       0.109
==============================================================================
Omnibus:                       12.833   Durbin-Watson:                   2.146
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               22.219
Skew:                           0.678   Prob(JB):                     1.50e-05
Kurtosis:                       5.835   Cond. No.                         2.55
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Check regression results#

mean_before = lysis[:intervention_month].mean()
mean_after = lysis[intervention_month:].mean()
diff = mean_after - mean_before

print(f'Before: {mean_before:.4f}')
print(f'After: {mean_after:.4f}')
print(f'Diff: {diff:.4f}')
Before: 0.0420
After: 0.1288
Diff: 0.0868