Statistical procedures#
A substantial proportion of real-world applications in computational modelling require statistical procedures. numpy provides a wide variety of efficient statistical functions that you can apply to an array. This section will explore the simple and commonly used functions that you can embed into your models and use for practical statistical analysis.
import numpy as np
Simple data analysis example#
ED attendance data#
We will first use data held in minor_illness_ed_attends.csv. This is a synthetic time series dataset reporting the number of patients registered at a GP surgery who attend an ED each week. The data are standardised per 10,000 registered patients.
Loading the dataset#
Let’s first open the data and then construct some summary statistics.
file_name = 'data/minor_illness_ed_attends.csv'
ed_data = np.loadtxt(file_name, skiprows=1, delimiter=',')
print(ed_data.shape)
(74,)
Here’s a peek at the first 5 elements in ed_data.
ed_data[:5]
array([2.11927795, 3.49057545, 3.98922908, 2.36860477, 3.24124863])
Calculate summary statistics#
numpy makes it easy to calculate the mean, standard deviation and other summary statistics of an ndarray. These typically have intuitive names which you can often guess. For example:
print(ed_data.mean())
print(ed_data.std())
2.919482262743243
0.7060975938853894
But it is always worth reading the docs. For example, to calculate an unbiased estimate of the sample standard deviation of a dataset of length \(n\) we should divide by \(n - 1\). In numpy we do this using the ddof (degrees of freedom) parameter of std.
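For example, we can compare the default (population) calculation with the unbiased sample estimate:
print(ed_data.std())        # ddof=0 (default): divides by n
print(ed_data.std(ddof=1))  # divides by n - 1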
Here we will create a class to act as a convenient container for a dataset. For convenience, we will implement the __repr__ method so that we can easily print a summary of the dataset to the screen when calling print (print falls back to __repr__ when __str__ is not defined).
The class will make use of the following numpy statistical procedures:

- min() and max() to return the minimum and maximum values in the array, respectively.
- np.percentile() to calculate a percentile - here the median and the upper/lower quartiles.
- np.histogram() that bins data points and reports the frequencies.
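Before wrapping these in a class, here is a minimal sketch of the raw calls on ed_data:
print(ed_data.min(), ed_data.max())
# np.percentile accepts a sequence of percentiles
print(np.percentile(ed_data, [25, 50, 75]))
# np.histogram returns the bin frequencies and the bin edges
freq, bins = np.histogram(ed_data, density=False)
print(freq)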
class AttendanceSummary:
    '''
    Simple container class. Holds the mean, stdev, quartiles and a
    frequency histogram of the ED attendance data.
    '''
    def __init__(self, data, name=None, decimal_places=2):
        """
        Constructor method.

        Parameters:
        ----------
        data: numpy.ndarray
            Vector containing data to analyse.

        name: str, optional (default=None)
            Name of the dataset.

        decimal_places: int, optional (default=2)
            Decimal places to use when displaying results.
        """
        self.name = name
        self.dp = decimal_places

        self.n = len(data)
        self.mean = data.mean()
        self.std = data.std(ddof=1)
        self.min_attends = data.min()
        self.max_attends = data.max()

        # percentiles: note that this is a np. call
        self.lower_quartile = np.percentile(data, 25)
        self.median = np.percentile(data, 50)
        self.upper_quartile = np.percentile(data, 75)
        self.iqr = self.upper_quartile - self.lower_quartile

        # frequency histogram (automatically binned)
        self.freq, self.bins = np.histogram(data, density=False)

    def __repr__(self):
        to_print = f'\nDataset: {self.name}' \
            + f'\nMean:\t{self.mean:.{self.dp}f}' \
            + f'\nStdev:\t{self.std:.{self.dp}f}' \
            + f'\nMin:\t{self.min_attends:.{self.dp}f}' \
            + f'\nMax:\t{self.max_attends:.{self.dp}f}' \
            + f'\nMedian:\t{self.median:.{self.dp}f}' \
            + f'\nIQR:\t{self.iqr:.{self.dp}f}' \
            + f'\nn:\t{self.n}'
        return to_print

    def frequency_histogram(self):
        # bins holds the edges; zip pairs each frequency
        # with the lower edge of its bin.
        print('x\tf(x)')
        for f, b in zip(self.freq, self.bins):
            print(f'{b:.{self.dp}f}\t{f}')
x = AttendanceSummary(ed_data, name="ED")
x
Dataset: ED
Mean: 2.92
Stdev: 0.71
Min: 1.62
Max: 5.11
Median: 2.87
IQR: 1.09
n: 74
x.frequency_histogram()
x f(x)
1.62 5
1.97 9
2.32 14
2.67 16
3.02 11
3.37 6
3.71 10
4.06 2
4.41 0
4.76 1
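By default np.histogram chooses 10 equal-width bins. If you want a different granularity you can pass the bins parameter; a quick sketch:
# e.g. 5 equal-width bins instead of the default 10
freq, bin_edges = np.histogram(ed_data, bins=5)
print(freq)
print(bin_edges.round(2))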
It’s not always necessary to use a class as I have here, but it is useful if, for example, you want to compare samples, or periods in a time series.
In our ED data let’s assume an intervention was put in place in week 10: GP surgeries were opened to patients over weekends. Let’s summarise the results using descriptive statistics.
week = 10 # the week the intervention begins
before = AttendanceSummary(ed_data[:week], 'before')
after = AttendanceSummary(ed_data[week:], 'after')
print(before)
print(after)
Dataset: before
Mean: 3.12
Stdev: 0.59
Min: 2.12
Max: 3.99
Median: 3.18
IQR: 0.81
n: 10
Dataset: after
Mean: 2.89
Stdev: 0.73
Min: 1.62
Max: 5.11
Median: 2.87
IQR: 0.90
n: 64
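Because the statistics are stored as attributes, we can also compare the two periods directly; for example:
# report the shift in central tendency after the intervention
print(f'change in mean:   {after.mean - before.mean:.2f}')
print(f'change in median: {after.median - before.median:.2f}')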
Statistical procedures and n-dimensional arrays#
Generating statistics for matrices and n-dimensional arrays works in a similar manner to vectors. We can also go a bit further and calculate statistics across axes in the array; for example, rows and columns in a matrix. To do this you need to specify an integer axis in the call to the statistical function, e.g. mean(axis=0).
Example 1: mean of matrix columns and rows#
Given the matrix

\[
\begin{bmatrix}
7.7 & 4.3 & 8.5 \\
6.9 & 0.9 & 9.7 \\
7.6 & 7.8 & 1.2
\end{bmatrix}
\]

calculate the mean of the columns and rows. We will first generate these manually using slices so that you can see it is working correctly.
matrix = np.array([[7.7, 4.3, 8.5],
                   [6.9, 0.9, 9.7],
                   [7.6, 7.8, 1.2]])
To calculate the mean of the first column we need to slice all rows and the column at index 0. As a reminder we would do the following:
Try changing the value of 0 to 1 or 2 to get the other columns.
matrix[:,0]
array([7.7, 6.9, 7.6])
This is a vector, and hence calculating the mean works as we have already seen. Here I also call .round(1) to limit the displayed result to 1 decimal place. You could also use a format string.
matrix[:,0].mean().round(1)
np.float64(7.4)
print(matrix[:,1].mean().round(1))
print(matrix[:,2].mean().round(1))
4.3
6.5
This is a common operation so we can instead specify an axis. In a matrix we use axis 0 for columns and 1 for rows.
matrix.mean(axis=0).round(1)
array([7.4, 4.3, 6.5])
matrix.mean(axis=1).round(1)
array([6.8, 5.8, 5.5])
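The axis argument works in the same way for the other statistical functions we have met. For example:
# sample standard deviation of each column
print(matrix.std(axis=0, ddof=1).round(1))
# maximum of each row
print(matrix.max(axis=1))
# median of each column via np.percentile
print(np.percentile(matrix, 50, axis=0).round(1))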
Example 2: Statistical operations on a larger dataset#
Let’s open a more realistic tabular dataset of the kind you would use in a real health data science study. For this we’ll download the Wisconsin breast cancer dataset from GitHub, create a np.ndarray, and summarise the features.
To download the file we will use the requests library:
import requests

# HTTP status code for a successful download
RESPONSE_SUCCESS = 200
FILE_NAME = 'data/wisconsin.csv'

# note: github raw url needed for download
DATA_URL = "https://raw.githubusercontent.com/health-data-science-OR/" \
    + "hpdm139-datasets/main/wisconsin.csv"

# download the file...(only needs to be done once!).
print('downloading =>', end=' ')
response = requests.get(DATA_URL)

if response.status_code == RESPONSE_SUCCESS:
    print('successful')

    # response.content is in bytes;
    # write to file to read into numpy.
    with open(FILE_NAME, 'wb') as f:
        f.write(response.content)
else:
    print('file not found.')
downloading =>
successful
Before we attempt to load the numpy array from file, let’s do some due diligence and check for headers and any non-numeric column data types. We need to do this to avoid any issues when loading the data. We will go a bit old school here and use the csv library to help us with the inspection.
This isn’t a big dataset so we could just inspect it in a CSV viewer in JupyterLab (or in a free and open spreadsheet package like LibreOffice Calc). But I’d encourage you to keep your complete workflow documented in health data science. In some cases the dataset will also be far too big for memory, and standard viewers won’t be able to cope.
import csv

with open(FILE_NAME, 'r') as f:
    reader = csv.reader(f)
    header_row = next(reader)
    field_count = len(header_row)

    # peek at the header_row
    print(header_row[:5])

    # record the field names
    header_names = list(header_row)
    print(f'No. fields: {field_count}')

    # peek at first data row
    first_data_row = next(reader)
    non_numeric_fields = []

    # catch any fields that contain non-numeric data
    for field_index in range(len(first_data_row)):
        try:
            # attempt to parse the field as a float
            float(first_data_row[field_index])
        except ValueError:
            non_numeric_fields.append(field_index)

# print out non-numeric fields
print(f'Fields with non-numeric data: {non_numeric_fields}')
['', 'id', 'diagnosis', 'radius_mean', 'texture_mean']
No. fields: 33
Fields with non-numeric data: [2]
We can conclude that the first row in the file is indeed a header row - we’ll skip that on the array import. We will also drop the first field (which has an empty header name) and the 'id' field. The field at index 2, diagnosis, contains non-numeric data. This is actually the target field in the dataset ('M' for malignant and 'B' for benign). We will deal with the target separately, as sketched below.
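As an aside, here is a minimal sketch of one way the target could be read in separately (the names target_labels and target are my own, not from the dataset):
# read only the diagnosis column as strings
target_labels = np.genfromtxt(FILE_NAME, skip_header=1, usecols=[2],
                              delimiter=',', dtype=str)
# encode 'M' (malignant) as 1 and 'B' (benign) as 0
target = (target_labels == 'M').astype(np.int64)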
The first thing we will do is create a list of the field indexes containing the numeric data to read in, using boolean indexing.
# all candidate field indexes
field_indexes = np.arange(field_count)

# exclude the non-numeric fields using boolean indexing
fields_included = field_indexes[~np.isin(field_indexes, non_numeric_fields)]

# also drop the first two fields: the unnamed index and 'id'
fields_included = fields_included[2:]

# keep only the matching header names
header_names = np.array(header_names)[fields_included]
print(fields_included)
[ 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32]
Then it’s just a case of loading the correct rows and columns into an array. Here we will use np.genfromtxt:
wisconsin = np.genfromtxt(FILE_NAME, skip_header=1, usecols=fields_included, delimiter=',')
print(wisconsin.shape)
(569, 30)
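One quick check before we move on: np.genfromtxt substitutes nan for any value it cannot parse, so we can confirm that nothing was missed during the import:
# True if any value failed to parse during the import
print(np.isnan(wisconsin).any())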
We now have a clean dataset that we can summarise. All fields are quantitative so this is straightforward. The class FeatureSummary below helps us visualise the results for each feature (column). Call the .show() method to see a summary.
class FeatureSummary:
    '''
    Simple container class to hold and display a descriptive summary
    of the features in a dataset.
    '''
    def __init__(self, data, headers, name=None, decimal_places=2):
        """
        Constructor method.

        Parameters:
        ----------
        data: numpy.ndarray
            Matrix containing the features to analyse.

        headers: array-like
            Names of the feature fields.

        name: str, optional (default=None)
            Name of the dataset.

        decimal_places: int, optional (default=2)
            Decimal places to use when displaying results.
        """
        self.name = name
        self.headers = headers

        # these help with the summary layout.
        self.max_header_padding_len = max(len(x) for x in headers)
        self.header_padding = ''.rjust(self.max_header_padding_len)

        self.dp = decimal_places
        self.n = len(data)

        # all statistics are calculated per feature (axis=0)
        self.mean = data.mean(axis=0)
        self.std = data.std(ddof=1, axis=0)
        self.min = data.min(axis=0)
        self.max = data.max(axis=0)
        self.range = self.max - self.min

        # percentiles: note that this is a np. call
        self.lower_quartile = np.percentile(data, 25, axis=0)
        self.median = np.percentile(data, 50, axis=0)
        self.upper_quartile = np.percentile(data, 75, axis=0)
        self.iqr = self.upper_quartile - self.lower_quartile

    def show(self):
        print(self.name)
        header_row = f"{self.header_padding}\tMean\tStd\tMedian\tIQR\tRange"
        print(header_row)
        line = ['-'] * int(len(header_row) * 1.3)
        print(''.join(line))
        for i in range(len(self.headers)):
            # pad each feature name so the columns line up
            row_padding_len = self.max_header_padding_len - len(self.headers[i])
            row_padding = ''.rjust(row_padding_len)
            field = f'{self.headers[i]}{row_padding}\t{self.mean[i]:.{self.dp}f}' \
                + f'\t{self.std[i]:.{self.dp}f}' \
                + f'\t{self.median[i]:.{self.dp}f}' \
                + f'\t{self.iqr[i]:.{self.dp}f}' \
                + f'\t{self.range[i]:.{self.dp}f}'
            print(field)
x = FeatureSummary(wisconsin, header_names, name='Wisconsin Breast Cancer Dataset.')
x.show()
Wisconsin Breast Cancer Dataset.
Mean Std Median IQR Range
---------------------------------------------------------------
radius_mean 14.13 3.52 13.37 4.08 21.13
texture_mean 19.29 4.30 18.84 5.63 29.57
perimeter_mean 91.97 24.30 86.24 28.93 144.71
area_mean 654.89 351.91 551.10 362.40 2357.50
smoothness_mean 0.10 0.01 0.10 0.02 0.11
compactness_mean 0.10 0.05 0.09 0.07 0.33
concavity_mean 0.09 0.08 0.06 0.10 0.43
concave points_mean 0.05 0.04 0.03 0.05 0.20
symmetry_mean 0.18 0.03 0.18 0.03 0.20
fractal_dimension_mean 0.06 0.01 0.06 0.01 0.05
radius_se 0.41 0.28 0.32 0.25 2.76
texture_se 1.22 0.55 1.11 0.64 4.52
perimeter_se 2.87 2.02 2.29 1.75 21.22
area_se 40.34 45.49 24.53 27.34 535.40
smoothness_se 0.01 0.00 0.01 0.00 0.03
compactness_se 0.03 0.02 0.02 0.02 0.13
concavity_se 0.03 0.03 0.03 0.03 0.40
concave points_se 0.01 0.01 0.01 0.01 0.05
symmetry_se 0.02 0.01 0.02 0.01 0.07
fractal_dimension_se 0.00 0.00 0.00 0.00 0.03
radius_worst 16.27 4.83 14.97 5.78 28.11
texture_worst 25.68 6.15 25.41 8.64 37.52
perimeter_worst 107.26 33.60 97.66 41.29 200.79
area_worst 880.58 569.36 686.50 568.70 4068.80
smoothness_worst 0.13 0.02 0.13 0.03 0.15
compactness_worst 0.25 0.16 0.21 0.19 1.03
concavity_worst 0.27 0.21 0.23 0.27 1.25
concave points_worst 0.11 0.07 0.10 0.10 0.29
symmetry_worst 0.29 0.06 0.28 0.07 0.51
fractal_dimension_worst 0.08 0.02 0.08 0.02 0.15
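As the summary statistics are held as arrays, we can also query them directly. For example, a quick sketch to find the feature with the largest interquartile range:
# index of the feature with the largest IQR
print(x.headers[np.argmax(x.iqr)])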