Stroke data wrangling#
Thrombolysis is a clot-busting treatment for people suffering from an ischaemic stroke: a blood clot is preventing blood flow to the brain, causing tissue to die. In England, national data on thrombolysis treatment of stroke patients at individual hospitals is collected and stored centrally. This exercise will make use of a synthetic dataset based on the real data used in:
Allen M, Pearn K, Monks T, Bray BD, Everson R, Salmon A, James M, Stein K (2019). Can clinical audits be enhanced by pathway simulation and machine learning? an example from the acute stroke pathway. BMJ Open, 9(9).
The data you are presented with in a data science or machine learning study nearly always requires a preprocessing step. This may include wrangling the data into a format suitable for machine learning, understanding (and perhaps imputing) missing values, and cleaning/creating features. In this exercise you will wrangle the stroke thrombolysis dataset.
Exercise 1: Read and initial look#
The dataset is held in synth_lysis.csv.
Task:
- Read the stroke thrombolysis dataset into a pandas.DataFrame.
- Use appropriate pandas and DataFrame methods and functions to gain an overview of the dataset and the features it contains.
Hints:
- You might look at: the size of the dataset, feature (field/variable) naming, data types, missing data, etc.
import pandas as pd
DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/synth_lysis.csv'
# your code here ...
# example solution
lysis = pd.read_csv(DATA_URL)
# take a look at basic info - size of dataset, features, datatypes, missing data
lysis.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   hospital              2000 non-null   object
 1   Male                  2000 non-null   object
 2   Age                   2000 non-null   int64
 3   severity              1753 non-null   object
 4   stroke type           1801 non-null   object
 5   Comorbidity           915 non-null    object
 6   S2RankinBeforeStroke  2000 non-null   int64
 7   label                 2000 non-null   object
dtypes: int64(2), object(6)
memory usage: 125.1+ KB
lysis.head()
|   | hospital | Male | Age | severity | stroke type | Comorbidity | S2RankinBeforeStroke | label |
|---|---|---|---|---|---|---|---|---|
| 0 | Hosp_7 | Y | 65 | Minor | clot | NaN | 0 | N |
| 1 | Hosp_8 | N | 99 | Moderate to severe | clot | NaN | 3 | N |
| 2 | Hosp_8 | N | 49 | NaN | clot | NaN | 0 | N |
| 3 | Hosp_1 | N | 77 | Moderate | clot | Hypertension | 0 | Y |
| 4 | Hosp_8 | N | 86 | Minor | clot | Hypertension | 0 | N |
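The info() summary already reveals missing data in severity, stroke type and Comorbidity. To dig a little deeper, a minimal sketch (not part of the solution above) using standard pandas methods:
# count of missing values per feature
lysis.isna().sum()
# summary statistics for the numeric features
lysis.describe()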
Exercise 2: Clean up the feature names#
The naming of features in this dataset is inconsistent. There is mixed capitalisation and use of spaces in variable names. One feature is called label, which is not particularly descriptive: it is the label indicating whether or not a patient received thrombolysis.
Task:
- convert all feature names to lower case
- replace all spaces in names with underscores
- rename label to treated.
Hints:
- Assuming your DataFrame is called df, you can get and set the column names using df.columns.
# your code here...
# example solution
feature_names = list(lysis.columns)
# lower case and replace spaces with underscores
feature_names = [s.lower().replace(' ', '_') for s in feature_names]
# the last column is 'label': rename it to 'treated'
feature_names[-1] = 'treated'
lysis.columns = feature_names
lysis.head()
|   | hospital | male | age | severity | stroke_type | comorbidity | s2rankinbeforestroke | treated |
|---|---|---|---|---|---|---|---|---|
| 0 | Hosp_7 | Y | 65 | Minor | clot | NaN | 0 | N |
| 1 | Hosp_8 | N | 99 | Moderate to severe | clot | NaN | 3 | N |
| 2 | Hosp_8 | N | 49 | NaN | clot | NaN | 0 | N |
| 3 | Hosp_1 | N | 77 | Moderate | clot | Hypertension | 0 | Y |
| 4 | Hosp_8 | N | 86 | Minor | clot | Hypertension | 0 | N |
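An equivalent, more vectorised alternative (a sketch only, not the solution above) uses the str accessor on the columns Index together with rename, which names the label column explicitly rather than relying on its position:
# lower case and replace spaces in one pass over the column Index
lysis.columns = lysis.columns.str.lower().str.replace(' ', '_')
# rename 'label' explicitly
lysis = lysis.rename(columns={'label': 'treated'})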
Exercise 3: Create a pre-processing function#
It is useful to cleanly organise your data wrangling code. Let's create a skeleton now, before we get into any detailed wrangling.
There are a number of ways we might do this, ranging from functions and classes to specialist libraries such as pyjanitor. Here we will prefer our own simple functions.
Task:
- Create a function wrangle_lysis.
- The function should accept a str parameter specifying the data URL or directory path and filename of our thrombolysis dataset. For now, set the function up to read in the data (from exercise 1) and clean the feature headers (exercise 2).
- The function should return a pd.DataFrame containing the stroke thrombolysis data.
Hints:
- Get into the habit of writing a simple docstring for your functions.
- This function will come in handy for the later exercises, where you may make mistakes and muck up your nicely cleaned dataset! After this exercise you can reload and preprocess the data with a single command.
# your code here ...
# example solution
def wrangle_lysis(path):
    '''
    Preprocess and clean the stroke thrombolysis dataset.

    Params:
    -------
    path: str
        URL or directory path and filename

    Returns:
    --------
    pd.DataFrame
        Preprocessed stroke thrombolysis data
    '''
    lysis = pd.read_csv(path)
    lysis.columns = clean_feature_names(list(lysis.columns))
    return lysis


def clean_feature_names(current_feature_names):
    '''
    Clean the stroke lysis feature names.

    1. All lower case
    2. Replace spaces with '_'
    3. Rename 'label' column to 'treated'

    Params:
    ------
    current_feature_names: list
        List of the feature names

    Returns:
    -------
    list
        A modified list of feature names
    '''
    feature_names = [s.lower().replace(' ', '_')
                     for s in current_feature_names]
    feature_names[-1] = 'treated'
    return feature_names
# reuse the remote DATA_URL defined in exercise 1
lysis = wrangle_lysis(DATA_URL)
lysis.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   hospital              2000 non-null   object
 1   male                  2000 non-null   object
 2   age                   2000 non-null   int64
 3   severity              1753 non-null   object
 4   stroke_type           1801 non-null   object
 5   comorbidity           915 non-null    object
 6   s2rankinbeforestroke  2000 non-null   int64
 7   treated               2000 non-null   object
dtypes: int64(2), object(6)
memory usage: 125.1+ KB
Exercise 4: Explore categorical features#
A number of features are categorical. For example, male contains two values: Y (the patient is male) and N (the patient is not male).
Task:
List the categories contained in the following fields:
fields = ['hospital', 'male', 'severity', 'stroke_type', 'treated']
# your code here ...
# example solution
# one option is to list the unique fields individually e.g.
# note you could also sort the array - but watch out for NaN's
import numpy as np
lysis['hospital'].unique()
array(['Hosp_7', 'Hosp_8', 'Hosp_1', 'Hosp_6', 'Hosp_2', 'Hosp_4',
'Hosp_3', 'Hosp_5'], dtype=object)
# option 2 - do this in a loop
categorical_features = ['hospital', 'male', 'severity', 'stroke_type', 'treated']
for feature in categorical_features:
print(f'{feature}: {lysis[feature].unique()}')
hospital: ['Hosp_7' 'Hosp_8' 'Hosp_1' 'Hosp_6' 'Hosp_2' 'Hosp_4' 'Hosp_3' 'Hosp_5']
male: ['Y' 'N']
severity: ['Minor' 'Moderate to severe' nan 'Moderate' 'Severe' 'No stroke symtpoms']
stroke_type: ['clot' 'bleed' nan]
treated: ['N' 'Y']
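Where a feature contains missing data, it can also help to see how often each level occurs, including the NaNs. A brief sketch using value_counts:
# frequency of each level per feature, including missing values
for feature in categorical_features:
    print(lysis[feature].value_counts(dropna=False))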
Exercise 5: Encode categorical fields with 2 levels#
In exercise 4, you should find that the male and treated columns have two levels (yes and no). If we take male as an example, we can encode it as a single binary feature as follows:
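| male | male (encoded) |
|---|---|
| Y | 1 |
| N | 0 |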
Note: we will deal with stroke_type, which has two levels plus missing data, in exercise 6.
Task:
- Encode male and treated to be binary 0/1 fields.
- Update wrangle_lysis to include this code.
Hints:
- Try the pd.get_dummies function.
- Remember that you only need one variable when a categorical field has two values. You can use drop_first=True to keep only one variable (just make sure it's the right one!).
# your code here ...
# example solution
# drop_first=True keeps the Y column
male = pd.get_dummies(lysis['male'], drop_first=True)
male.columns = ['_male']
# we will insert to just double check
lysis.insert(1,'_male', male)
# check - looks okay 1's and 0's match with Y and N.
lysis.head()
|   | hospital | _male | male | age | severity | stroke_type | comorbidity | s2rankinbeforestroke | treated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Hosp_7 | 1 | Y | 65 | Minor | clot | NaN | 0 | N |
| 1 | Hosp_8 | 0 | N | 99 | Moderate to severe | clot | NaN | 3 | N |
| 2 | Hosp_8 | 0 | N | 49 | NaN | clot | NaN | 0 | N |
| 3 | Hosp_1 | 0 | N | 77 | Moderate | clot | Hypertension | 0 | Y |
| 4 | Hosp_8 | 0 | N | 86 | Minor | clot | Hypertension | 0 | N |
treated = pd.get_dummies(lysis['treated'], drop_first=True)
lysis.insert(len(lysis.columns)-1,'_treated', treated)
# check - looks okay 1's and 0's match with Y and N.
lysis.head()
|   | hospital | _male | male | age | severity | stroke_type | comorbidity | s2rankinbeforestroke | _treated | treated |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hosp_7 | 1 | Y | 65 | Minor | clot | NaN | 0 | 0 | N |
| 1 | Hosp_8 | 0 | N | 99 | Moderate to severe | clot | NaN | 3 | 0 | N |
| 2 | Hosp_8 | 0 | N | 49 | NaN | clot | NaN | 0 | 0 | N |
| 3 | Hosp_1 | 0 | N | 77 | Moderate | clot | Hypertension | 0 | 1 | Y |
| 4 | Hosp_8 | 0 | N | 86 | Minor | clot | Hypertension | 0 | 0 | N |
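For a field with exactly two levels, an alternative to pd.get_dummies (a sketch only; this comparison approach is not used in the solution) is to test against the positive label and cast the boolean result to an integer:
# 1 where male == 'Y', else 0
(lysis['male'] == 'Y').astype(int)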
# update preprocessing function
# example solution
def wrangle_lysis(path):
    '''
    Preprocess and clean the stroke thrombolysis dataset.

    Params:
    -------
    path: str
        URL or directory path and filename

    Returns:
    --------
    pd.DataFrame
        Preprocessed stroke thrombolysis data
    '''
    lysis = pd.read_csv(path)
    lysis.columns = clean_feature_names(list(lysis.columns))
    encode_features(lysis)
    return lysis


def encode_features(df):
    '''
    Encode the two-level features in the dataset (modifies df in place).

    Params:
    ------
    df: pd.DataFrame
        lysis dataframe
    '''
    # note: use the df parameter, not the global lysis DataFrame
    df['male'] = pd.get_dummies(df['male'], drop_first=True)
    df['treated'] = pd.get_dummies(df['treated'], drop_first=True)
# DATA_URL still points to the remote synth_lysis.csv from exercise 1
lysis = wrangle_lysis(DATA_URL)
lysis.head()
|   | hospital | male | age | severity | stroke_type | comorbidity | s2rankinbeforestroke | treated |
|---|---|---|---|---|---|---|---|---|
| 0 | Hosp_7 | 1 | 65 | Minor | clot | NaN | 0 | 0 |
| 1 | Hosp_8 | 0 | 99 | Moderate to severe | clot | NaN | 3 | 0 |
| 2 | Hosp_8 | 0 | 49 | NaN | clot | NaN | 0 | 0 |
| 3 | Hosp_1 | 0 | 77 | Moderate | clot | Hypertension | 0 | 1 |
| 4 | Hosp_8 | 0 | 86 | Minor | clot | Hypertension | 0 | 0 |
Exercise 6: Encoding fields with > 2 categories#
The process to encode features with more than two categories is almost identical to that used in exercise 5. For example, the hospital field contains 8 unique values. There are now two options:
- encode as 7 dummy variables, where all 0's indicates hospital 1.
- use a one-hot encoding and include all 8 variables. The additional degree of freedom allows you to encode missing data as all zeros.
Note that some methods, such as ordinary least squares regression, require you to take approach one. More flexible machine learning approaches can handle approach two. Here we will make use of approach two; both options are sketched below.
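As a quick illustration (a sketch, not part of the exercise solution), the two options differ only in the drop_first argument passed to pd.get_dummies:
# approach 1: 7 dummies - Hosp_1 becomes the all-zeros reference category
pd.get_dummies(lysis['hospital'], drop_first=True)
# approach 2: one-hot encoding with 8 dummies - an all-zeros row is then
# free to represent missing data
pd.get_dummies(lysis['hospital'], drop_first=False)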
Task:
- Use a one-hot encoding on the hospital column.
- Use a one-hot encoding on the stroke_type column. You should prefix the new encoded columns with stroke_type_, i.e. you will have two columns: stroke_type_clot and stroke_type_bleed.
Hints:
- One-hot encoding is just the same as calling pd.get_dummies, but we set drop_first=False.
# your code here ...
# example solution
def wrangle_lysis(path):
    '''
    Preprocess and clean the stroke thrombolysis dataset.

    Params:
    -------
    path: str
        URL or directory path and filename

    Returns:
    --------
    pd.DataFrame
        Preprocessed stroke thrombolysis data
    '''
    lysis = pd.read_csv(path)
    lysis.columns = clean_feature_names(list(lysis.columns))
    ## MODIFICATION ###########################################
    # encode_features now uses pd.concat to create a new
    # DataFrame, which must be returned
    lysis = encode_features(lysis)
    ###########################################################
    return lysis


def encode_features(df):
    '''
    Modified function to encode the features in the dataset.

    Params:
    ------
    df: pd.DataFrame
        lysis dataframe

    Returns:
    -------
    pd.DataFrame
    '''
    df['male'] = pd.get_dummies(df['male'], drop_first=True)
    df['treated'] = pd.get_dummies(df['treated'], drop_first=True)

    ###### MODIFICATION ###############################################
    # hospital and stroke type encoding.
    # note that the function must now return a new dataframe.

    # one-hot encode hospitals
    hospitals = pd.get_dummies(df['hospital'], drop_first=False)
    # concat the dummies with the original DataFrame
    df_encoded = pd.concat([hospitals, df], axis=1)
    # drop the old 'hospital' feature
    df_encoded.drop(['hospital'], inplace=True, axis=1)

    # encode stroke type: prefix gives columns stroke_type_bleed/_clot.
    # dummy_na=False leaves rows with missing stroke type as all zeros.
    stroke_type = pd.get_dummies(df_encoded['stroke_type'], drop_first=False,
                                 dummy_na=False, prefix='stroke_type')

    # rebuild the frame, dropping the original 'stroke_type' column
    # (position 11 in df_encoded) via slicing
    INSERT_INDEX = 11
    return pd.concat([df_encoded[df_encoded.columns[:INSERT_INDEX]],
                      stroke_type,
                      df_encoded[df_encoded.columns[INSERT_INDEX+1:]]],
                     axis=1)
    #######################################################################
lysis = wrangle_lysis(DATA_URL)
lysis.head(20)
|   | Hosp_1 | Hosp_2 | Hosp_3 | Hosp_4 | Hosp_5 | Hosp_6 | Hosp_7 | Hosp_8 | male | age | severity | stroke_type_bleed | stroke_type_clot | comorbidity | s2rankinbeforestroke | treated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 65 | Minor | 0 | 1 | NaN | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 99 | Moderate to severe | 0 | 1 | NaN | 3 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 49 | NaN | 0 | 1 | NaN | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 77 | Moderate | 0 | 1 | Hypertension | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 86 | Minor | 0 | 1 | Hypertension | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 79 | NaN | 0 | 1 | Hypertension | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 47 | Severe | 1 | 0 | NaN | 2 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 65 | Minor | 0 | 1 | Hypertension | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 72 | Moderate | 0 | 0 | Hypertension;Atrial Fib | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 84 | Moderate | 0 | 1 | Atrial Fib | 0 | 1 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 81 | Moderate | 0 | 1 | NaN | 0 | 1 |
| 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 72 | Moderate | 0 | 1 | Diabetes;TIA | 2 | 1 |
| 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 40 | Minor | 0 | 1 | NaN | 0 | 1 |
| 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 64 | Minor | 0 | 1 | NaN | 0 | 0 |
| 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 80 | Severe | 0 | 1 | NaN | 0 | 1 |
| 15 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 76 | NaN | 0 | 1 | NaN | 0 | 0 |
| 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 73 | Minor | 0 | 1 | NaN | 0 | 0 |
| 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 72 | Minor | 0 | 1 | Hypertension;Diabetes | 0 | 0 |
| 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 94 | Moderate | 0 | 1 | NaN | 1 | 0 |
| 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 67 | Moderate to severe | 0 | 1 | Hypertension | 0 | 1 |
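As a final sanity check (a sketch, not part of the exercise), rows where stroke_type was missing should now be all zeros across the two encoded columns. Given the 1,801 non-null stroke type values reported in exercise 1, we would expect a row sum of 1 for 1,801 rows and 0 for the remaining 199:
# row sums across the stroke type dummies: 1 = encoded, 0 = was missing
lysis[['stroke_type_bleed', 'stroke_type_clot']].sum(axis=1).value_counts()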