Statistical procedures in numpy#
The following exercises test some of your new numpy skills on some basic statistical problems. In some cases there is more than one solution. Feel free to experiment with different methods.
Imports#
import numpy as np
# you will need requests and io to download the data and load into a numpy array
import requests
import io
Exercise 1#
In this exercise, you will use numpy to generate a random variable following a standard normal distribution. You will then use the statistical functions built into numpy to analyse the distribution of the variable.
Task:
Take 10,000 samples from the standard normal distribution
Create a function called basic_descriptives that returns the mean, stdev, and 1st/99th percentiles of a numpy array parameter. Call your function and print out the results.
Hints:
You can assume the numpy array passed to the function contains float data, or you could check and raise an exception.
# your code here...
# example solution
SAMPLE_SIZE = 10_000
# create a random number generator
rng = np.random.default_rng(42)
# generate numpy array of size SAMPLE_SIZE using standard normal
samples = rng.normal(size=SAMPLE_SIZE)
print(type(samples))
print(samples.shape)
<class 'numpy.ndarray'>
(10000,)
def basic_descriptives(data):
    """
    Returns mean, stdev, and 1st and 99th percentile of a 1D numpy.ndarray

    Assumes `data` is a numpy array of floats.

    Parameters:
    ------------
    data: numpy.ndarray
        numeric data to analyse

    Returns:
    --------
    (float, float, float, float)
    """
    mean = data.mean()
    std = data.std()
    per_1st = np.percentile(data, 1)
    per_99th = np.percentile(data, 99)
    return mean, std, per_1st, per_99th
results = basic_descriptives(samples)
print(results)
(np.float64(-0.01024987541401165), np.float64(1.006285768537041), np.float64(-2.3738111979173713), np.float64(2.3558409670159173))
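If you wanted to follow the hint and validate the input rather than assume floats, a minimal sketch might look like the following (the type checks and the name basic_descriptives_checked are illustrative additions, not part of the exercise solution):

def basic_descriptives_checked(data):
    '''
    As basic_descriptives, but raises a TypeError if `data` is not
    a numpy array of floating point values.
    '''
    if not isinstance(data, np.ndarray):
        raise TypeError('data must be a numpy.ndarray')
    if not np.issubdtype(data.dtype, np.floating):
        raise TypeError('data must contain floats')
    return data.mean(), data.std(), np.percentile(data, 1), np.percentile(data, 99)

print(basic_descriptives_checked(samples))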
Exercise 2#
Reuse the data generated in exercise 1. You are going to analyse the tails of the distribution you generated.
Task:
Select only the samples that have a value greater than or equal to +1.96 or less than or equal to -1.96
Determine the proportion of data that falls into these tails.
Are these proportions what you expect for the standard normal distribution?
Hints:
You may want to create one or two general functions that you can use to vary the cut-offs.
# your code here ...
Example solution:
It is very simple to work with numpy arrays containing numeric data. For example, if we wanted to find all of our samples that are greater than or equal to +1.96 we use:
# result is an array of bools
result = samples >= 1.96
print(result.shape)
print(type(result))
print(result)
(10000,)
<class 'numpy.ndarray'>
[False False False ... False False False]
The code returns a new numpy.ndarray that contains boolean (True/False) values. The value at array index i is True if the corresponding value at index i in samples is >= 1.96 and False otherwise. If we had used a Python list we would have needed to loop through all of the list items and perform the check ourselves.
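For comparison, here is a minimal sketch of the equivalent check written against a plain Python list (samples_list is just an illustrative copy of the data; it is not used elsewhere):

# the same tail check on a Python list needs an explicit loop/comprehension
samples_list = samples.tolist()
result_list = [value >= 1.96 for value in samples_list]
print(result_list[:3])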
Let’s create some generalised functions to return the probabilities that a value is greater or less than a user-specified value in our data set.
To do that we need to know that:
False == 0
True == 1
Therefore we can take the sum of our boolean array to find out how many array elements are greater or less than a user-specified value, i.e.
(samples >= 1.96).sum()
np.int64(257)
def prob_greater_than_or_equal_to(data, x):
    '''
    Return the proportion of the dataset that is greater than or equal to x

    Parameters:
    -----------
    data: numpy.ndarray
        numeric data to analyse

    x: float
        Lower cut-off.

    Returns:
    --------
    float
    '''
    return (data >= x).sum() / data.shape[0]
def prob_less_than_or_equal_to(data, x):
    '''
    Return the proportion of the dataset that is less than or equal to x

    Parameters:
    -----------
    data: numpy.ndarray
        numeric data to analyse

    x: float
        Upper cut-off.

    Returns:
    --------
    float
    '''
    return (data <= x).sum() / data.shape[0]
p1 = prob_greater_than_or_equal_to(samples, 1.96)
p2 = prob_less_than_or_equal_to(samples, -1.96)
print(p1, p2)
print(1 - (p1+p2))
0.0257 0.0257
0.9486
Our test of these functions shows us that around 95% of the data lie between -1.96 and +1.96 (which again is what we would expect for the standard normal).
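As a rough check against theory, you can compute the standard normal CDF with the error function from the Python standard library (this is an optional extra, not part of the exercise):

from math import erf, sqrt

def std_normal_cdf(x):
    # standard normal cumulative distribution function via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

# theoretical lower tail (~0.025) and central probability (~0.95)
print(std_normal_cdf(-1.96))
print(std_normal_cdf(1.96) - std_normal_cdf(-1.96))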
Exercise 3#
Assume you have the data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
The 20% Winsorized mean of data is calculated on a modified dataset where the top and bottom 10% of values are replaced by the 10th and 90th percentiles. In this case the 10th percentile = 2 and the 90th = 10. Therefore the winsorized dataset is:
win_data = np.array([2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])
win_mean = win_data.mean()
Task:
Write a function winsorize(data, cut_off=0.1)
The function must modify data (type np.ndarray) so that it is winsorized. A cut_off of 0.1 specifies that the function uses the 10th and 90th percentiles as cut-offs.
Hints:
There are multiple ways to solve this problem
You could use a for loop
You could create a function that is vectorized using np.vectorize()
You could use np.where() and fancy indexing
# your code here ...
# =============================================================================
# Exercise 3
# Solution 1: Using a `for` loop
# =============================================================================
def winsorize_loop(data, cut_off=0.1):
    '''
    Winsorize `data` in place using an explicit for loop.
    '''
    # cut_off=0.1 -> use the 10th and 90th percentiles as limits
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1 - cut_off) * 100)

    for i in range(data.shape[0]):
        if data[i] > high:
            data[i] = high
        elif data[i] < low:
            data[i] = low
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
winsorize_loop(data)
print(data)
[ 2 2 3 4 5 6 7 8 9 10 10]
# =============================================================================
# Exercise 3
# Solution 2: Using a vectorized function
# =============================================================================
def winsorize_to_vectorize(to_test, low, high):
    '''
    Clamp a single value to the winsorizing limits.
    '''
    if to_test > high:
        return high
    elif to_test < low:
        return low
    else:
        return to_test

def winsorize_limits(data, cut_off):
    '''
    Return the lower and upper percentile cut-offs for winsorizing.
    '''
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1 - cut_off) * 100)
    return low, high
v_winsorise = np.vectorize(winsorize_to_vectorize)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
low, high = winsorize_limits(data, 0.1)
w_data = v_winsorise(data, low, high)
print(w_data)
[ 2. 2. 3. 4. 5. 6. 7. 8. 9. 10. 10.]
# =============================================================================
# Exercise 3
# Solution 3: Using np.where()
# =============================================================================
def winsorize(data, cut_off=0.1):
    '''
    Winsorize `data` in place using np.where and fancy indexing.
    '''
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1 - cut_off) * 100)
    data[np.where(data < low)] = low
    data[np.where(data > high)] = high
#data = np.random.normal(10, 1, 100)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
winsorize(data)
print(data)
[ 2 2 3 4 5 6 7 8 9 10 10]
# performance testing of solutions
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
%timeit winsorize_loop(data)
124 μs ± 1.48 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# reset data
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
%%timeit
low, high = winsorize_limits(data, 0.1)
w_data = v_winsorise(data, low, high)
157 μs ± 17.5 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
%timeit winsorize(data)
130 μs ± 4.7 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
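A fourth option, not listed in the hints, is np.clip, which returns a winsorized copy rather than modifying the array in place (a sketch under that assumption; winsorize_clip is not one of the solutions timed above):

def winsorize_clip(data, cut_off=0.1):
    '''
    Return a winsorized copy of `data` using np.clip.
    '''
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1 - cut_off) * 100)
    return np.clip(data, low, high)

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(winsorize_clip(data))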
Exercise 4#
Simple linear regression using data in numpy arrays
In this example we will load two numpy arrays from file and conduct a simple linear regression. The method of Ordinary Least Squares (OLS) is used to fit a linear model (think \(y = \beta_1 x + \beta_0 + \epsilon\)) to some data stored in numpy arrays.
We have two datasets:
breach.csv: monthly totals of people waiting four or more hours in English emergency departments
dtocs.csv: monthly totals of the number of people waiting to be discharged (leave) hospital and go home or transfer to another form of healthcare.
You are going to (VERY naively) assess the relationship between these two variables. For the purposes of this exercise we are going to ignore that these two datasets are time-series data.
The library you will use to conduct the linear regression is statsmodels.api. You will use the class OLS, which takes the dependent variable (y) and the independent variables (x) as its first two arguments. Once you have conducted the analysis you will print the results summary.
Task:
Import statsmodels.api
Load the dtoc and breaches datasets
Treating breaches as the dependent variable and dtocs as the independent variable, perform the regression.
What is the \(R^2\) of your (naive) model?
Hints:
The columns of data in the two files have a header (the first line is a string). You will need to remove that before you can create a numpy array. Take a look at the options in np.genfromtxt().
Your model will need an intercept. This is not automatically added in statsmodels. To do this you need to call statsmodels.api.add_constant(X). There is a worked example in the statsmodels documentation, and a small sketch of the pattern is shown below.
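As a hedged illustration only, the general OLS pattern from these hints looks like the following (the x and y arrays here are made-up placeholders, not the exercise data, which is loaded in the solution further down):

# illustrative pattern only: x and y are placeholder arrays
import numpy as np
import statsmodels.api as sm

x = np.arange(25, dtype=float)
y = 3.0 * x + 2.0 + np.random.default_rng(0).normal(scale=0.5, size=25)

X = sm.add_constant(x)         # statsmodels does not add an intercept automatically
results = sm.OLS(y, X).fit()   # dependent variable first, then the design matrix
print(results.summary())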
# statsmodels contains the OLS class you will use
import statsmodels.api as sm
RESPONSE_SUCCESS = 200
DTOC_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/dtocs.csv'
BREACH_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/breach.csv'
def download_datasets():
    '''
    Downloads the dtoc and breach datasets to your local machine.
    '''
    # dtocs.csv
    response = requests.get(DTOC_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("dtocs.csv", "wb") as f:
            f.write(response.content)
        print('success: dtocs.csv downloaded.')
    else:
        print('connection error for dtocs')

    # breach.csv
    response = requests.get(BREACH_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("breach.csv", "wb") as f:
            f.write(response.content)
        print('success: breach.csv downloaded.')
    else:
        print('connection error for breach')

download_datasets()
success: dtocs.csv downloaded.
success: breach.csv downloaded.
# your code here ...
#OLS functionality is in statsmodels
import statsmodels.api as sm
def load_dtoc_dataset():
    '''
    Loads the breach and dtoc datasets into memory.

    Returns a tuple of numpy.ndarrays representing the
    breach and dtoc datasets respectively.
    '''
    # note we use skip_header because the datasets have column headers
    dtoc = np.genfromtxt('dtocs.csv', skip_header=1)
    breach = np.genfromtxt('breach.csv', skip_header=1)
    return breach, dtoc
breach, dtoc = load_dtoc_dataset()
###### regression code ########
# add an intercept term to the model
dtoc = sm.add_constant(dtoc)
model = sm.OLS(breach, dtoc)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.714
Model: OLS Adj. R-squared: 0.710
Method: Least Squares F-statistic: 194.6
Date: Fri, 20 Sep 2024 Prob (F-statistic): 6.80e-23
Time: 15:55:50 Log-Likelihood: -945.02
No. Observations: 80 AIC: 1894.
Df Residuals: 78 BIC: 1899.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.633e+05 2e+04 -8.178 0.000 -2.03e+05 -1.24e+05
x1 57.7282 4.138 13.950 0.000 49.489 65.967
==============================================================================
Omnibus: 11.472 Durbin-Watson: 0.888
Prob(Omnibus): 0.003 Jarque-Bera (JB): 16.361
Skew: 0.591 Prob(JB): 0.000280
Kurtosis: 4.874 Cond. No. 2.61e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
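If you only want the \(R^2\) value asked for in the task, rather than the full summary table, it is also available as an attribute of the fitted results object:

# R-squared of the naive model (matches the summary above)
print(round(results.rsquared, 3))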
Exercise 5#
Preprocessing data to detrend a time series
In exercise 4, we conducted a simple linear regression to assess the relationship between two variables. Our initial analysis is problematic because both variables are time series and contain autocorrelation and trend. This means that we violate some of the assumptions of ordinary least squares regression and our estimates of the relationship are likely to be incorrect.
In practice, we would pre-process the data in the numpy arrays before conducting the regression. A simple way to do this is to take the first difference of the time series. If \(Y_t\) represents the value of the time series at time \(t\), then the first difference is:
\(\Delta Y_t = Y_t - Y_{t-1}\)
Tip: if an array \(a\) has length \(n\), then the array of first differences of \(a\) has length \(n-1\).
Task:
Copy your solution code for exercise 4
Modify the code to take the first difference of the breach and dtoc arrays
Conduct the regression analysis using the first differences (detrended data)
Look at the new \(R^2\). Is there still strong evidence of a relationship?
Hints:
If you have an array data, then you can take the first difference of the array using:
differenced = np.diff(data)
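For example, on a short made-up array you can see both the differencing and the reduction in length (the values here are purely illustrative):

example = np.array([1, 3, 6, 10])
print(np.diff(example))                       # [2 3 4]
print(example.shape, np.diff(example).shape)  # (4,) (3,)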
# Example solution: Use numpy built-in function np.diff()
def load_dtoc_dataset():
    '''
    Loads the breach and dtoc datasets into memory.

    Returns a tuple of numpy.ndarrays representing the
    breach and dtoc datasets respectively.
    '''
    # note we use skip_header because the datasets have column headers
    dtoc = np.genfromtxt('dtocs.csv', skip_header=1)
    breach = np.genfromtxt('breach.csv', skip_header=1)
    return breach, dtoc
def dtoc_regression(breach, dtoc):
    '''
    Regression analysis for dtocs and breaches.

    Parameters:
    -----------
    breach - dependent variable
    dtoc - independent variable
    '''
    dtoc = sm.add_constant(dtoc)  # add an intercept term to the model
    model = sm.OLS(breach, dtoc)
    results = model.fit()
    print(results.summary())
breach, dtoc = load_dtoc_dataset()
# difference the data using np.diff
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html
d_breach = np.diff(breach)
d_dtoc = np.diff(dtoc)
dtoc_regression(d_breach, d_dtoc)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.035
Model: OLS Adj. R-squared: 0.022
Method: Least Squares F-statistic: 2.775
Date: Fri, 20 Sep 2024 Prob (F-statistic): 0.0998
Time: 15:55:50 Log-Likelihood: -901.09
No. Observations: 79 AIC: 1806.
Df Residuals: 77 BIC: 1811.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2417.0761 2484.505 0.973 0.334 -2530.206 7364.358
x1 -13.3144 7.992 -1.666 0.100 -29.229 2.600
==============================================================================
Omnibus: 16.746 Durbin-Watson: 2.174
Prob(Omnibus): 0.000 Jarque-Bera (JB): 39.038
Skew: -0.640 Prob(JB): 3.33e-09
Kurtosis: 6.197 Cond. No. 312.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.