Statistical procedures in numpy#
The following exercises test some of your new numpy
skills on some basic statistical problems. In some cases there is more than one solution. Feel free to experiment with different methods.
Imports#
import numpy as np
# you will need requests and io to download the data
# and load into a numpy array
import requests
import io
Exercise 1#
In this exercise, you will use numpy
to generate a random variable following a standard normal distribution. You will then use the statistical functions built into numpy to analyse the distribution of the variable.
Task:
Take 10,000 samples from the standard normal distribution
Create a function called
basic_descriptives
that returns the mean, standard deviation, and 1st/99th percentiles of a numpy array parameter. Call your function and print out the results.
Hints:
You can assume the numpy array passed to the function contains float data or you could check and raise an exception.
# your code here...
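One possible solution sketch (the function signature follows the task; the use of numpy's default random generator and the seed are illustrative choices):

```python
import numpy as np

def basic_descriptives(data):
    '''
    Return the mean, standard deviation and the
    1st/99th percentiles of a numpy array of floats.
    '''
    mean = data.mean()
    stdev = data.std()
    per_1st = np.percentile(data, 1)
    per_99th = np.percentile(data, 99)
    return mean, stdev, per_1st, per_99th

# 10,000 samples from the standard normal distribution
rng = np.random.default_rng(42)
samples = rng.normal(size=10_000)
print(basic_descriptives(samples))
```

With 10,000 samples you should see a mean close to 0, a standard deviation close to 1, and percentiles close to ±2.33.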
Exercise 2#
Reuse the data generated in exercise 1. You are going to analyse the tails of the distribution you generated.
Task:
Select only the samples that have a value greater than or equal to +1.96 or less than or equal to -1.96
Determine the proportion of data that falls into these tails.
Are these proportions what you expect for the standard normal distribution?
Hints:
You may want to create one or two general functions so that you can vary the cut-offs.
# your code here ...
Exercise 3:#
Assume you have the data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
The 20% Winsorized mean of
data
is calculated on a modified data set where the top and bottom 10% of values are replaced by the 10th and 90th percentiles. In this case the 10th percentile = 2 and the 90th = 10. Therefore the Winsorized dataset is:
win_data = np.array([2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])
win_mean = win_data.mean()
Task:
Write a function
winsorize(data, cut_off=0.1)
The function must modify
data
(type np.ndarray) so that it is winsorized.
A cut_off = 0.1 specifies that the function uses the 10th and 90th percentiles as cut-offs
Hints:
There are multiple ways to solve this problem
You could use a
for
loop
You could create a function that is vectorized using
np.vectorize()
You could use
np.where()
and fancy indexing
# your code here ...
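A sketch of the boolean-mask approach (one of the several options listed in the hints), which modifies the array in place:

```python
import numpy as np

def winsorize(data, cut_off=0.1):
    '''
    Winsorize data (np.ndarray of floats) in place: values
    below the lower percentile are replaced by it, and values
    above the upper percentile are replaced by it.
    '''
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1 - cut_off) * 100)
    data[data < low] = low
    data[data > high] = high
    return data

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
print(winsorize(data))  # [2, 2, 3, ..., 10, 10]
print(data.mean())
```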
Exercise 4:#
Simple linear regression using data in numpy arrays
In this example we will load two numpy arrays from file and conduct a simple linear regression. The method of Ordinary Least Squares is used to fit a linear model (think \(y = \beta_1 x + \beta_0 + \epsilon \)) to some data stored in numpy arrays.
We have two datasets:
breach.csv
: monthly totals of people waiting FOUR or more hours in English emergency departments
dtocs.csv
: monthly total of the number of people waiting to be discharged (leave) hospital and go home or transfer to another form of healthcare.
You are going to (VERY naively) assess the relationship between these two variables. For the purposes of this exercise we are going to ignore that these two datasets are time-series data.
The library you will use to conduct linear regression is statsmodels.api
You will use the class OLS
which accepts the dependent (endog) and independent (exog) data as its first two arguments
Once you have conducted the analysis you will print the results summary.
Task:
Import
statsmodels.api
Load the dtoc and breaches datasets
Treating breaches as the dependent variable and dtocs as the independent variable perform the regression.
What is the \(R^2\) of your (naive) model?
Hints:
The columns of data in the two files have a header (the first line is a string). You will need to remove that before you can create a numpy array. Take a look at the options in
np.genfromtxt()
Your model will need an intercept. This is not automatically added in
statsmodels
. To do this you need to call
statsmodels.api.add_constant(X)
. There is a worked example in the statsmodels documentation.
# statsmodels contains the OLS class you will use
import statsmodels.api as sm
RESPONSE_SUCCESS = 200
DTOC_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/dtocs.csv'
BREACH_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/breach.csv'
def download_datasets():
    '''
    Downloads the dtoc and breaches datasets to your local machine
    '''
    response = requests.get(DTOC_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("dtocs.csv", "wb") as f:
            f.write(response.content)
        print('success: dtocs.csv downloaded.')
    else:
        print('connection error for dtocs')

    response = requests.get(BREACH_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("breach.csv", "wb") as f:
            f.write(response.content)
        print('success: breach.csv downloaded.')
    else:
        print('connection error for breach')

download_datasets()
success: dtocs.csv downloaded.
success: breach.csv downloaded.
# your code here ...
Exercise 5:#
Preprocessing data to detrend a timeseries
In exercise 4, we conducted a simple linear regression to assess the relationship between two variables. Our initial analysis is problematic because both variables are time series and contain autocorrelation and trend. This means that we violate some of the assumptions of ordinary least squares regression and our estimates of the relationship are likely to be incorrect.
In practice, we would pre-process the data in the numpy
arrays before conducting the regression. A simple way to do this is to take the first difference of the time series. If \(Y_t\) represents the value of the time series at time \(t\) then the first difference is equal to:
\(\Delta Y_t = Y_t - Y_{t-1}\)
Tip: If an array \(a\) has length \(n\) then the array of first differences of \(a\) has length \(n-1\)
Task:
Copy your solution code for exercise 4
Modify the code to take the first difference of the
breach
and
dtoc
arrays
Conduct the regression analysis using the first differences (detrended data)
Look at the new \(R^2\). Is there still strong evidence of a relationship?
Hints:
If you have an array
data
then you can take the first difference of the array using:
differenced = np.diff(data)
# your code here ...
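A minimal illustration of np.diff and the length tip above (the values here are made up):

```python
import numpy as np

# an illustrative trending series
y = np.array([100.0, 110.0, 125.0, 150.0, 160.0])

# differences between consecutive values
differenced = np.diff(y)
print(differenced)        # [10. 15. 25. 10.]
print(differenced.shape)  # one element shorter than y: (4,)
```

In the exercise itself you would difference both the breach and dtoc arrays before passing them to OLS.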