ED data wrangling#
Emergency departments around the world must deal with high attendance numbers on a daily basis. The following exercises work with multiple ED datasets. You will wrangle the datasets into a useful format using pandas and then visualise the time series using matplotlib.
The data sets used in these exercises are synthetic, but have been generated to reflect real emergency department demand in the United Kingdom.
Imports#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Datasets#
The dataset syn_ts_ed_long.csv contains data from 4 emergency departments in 2014. The data are stored in long (sometimes called tidy) format. You are provided with three columns: date (non-unique, datetime formatted), hosp (int, 1-4) and attends (int, daily number of attends at hosp \(i\)).
The dataset syn_ts_ed_wide.csv contains the same data in wide format. Each row now represents a unique date and each hospital ED has its own column.
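To make the difference between the two layouts concrete, here is a minimal sketch using made-up numbers for two hospitals over two days. The attendance values and the wide column names hosp_1 and hosp_2 are assumptions for illustration and may not match the real files.

```python
import pandas as pd

# Made-up numbers for two hospitals over two days (not the real dataset).
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-01",
                            "2014-01-02", "2014-01-02"]),
    "hosp": [1, 2, 1, 2],
    "attends": [331, 287, 340, 301],
})

# Long format: one row per (date, hospital) pair, so dates repeat.
print(long_df)

# Wide format: one row per date and one column per hospital.
# The column names hosp_1 and hosp_2 are hypothetical.
wide_df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-02"]),
    "hosp_1": [331, 340],
    "hosp_2": [287, 301],
})
print(wide_df)
```

Both frames hold the same four observations. Only the arrangement of rows and columns differs.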
Exercise 1#
Task 1:
Read each of the two datasets into its own pandas dataframe and inspect the columns and data so that you understand the dataset description above.
Hints:
The URLs for the datasets are provided below.
# your code here ...
LONG_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/syn_ts_ed_long.csv'
WIDE_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/syn_ts_ed_wide.csv'
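As a rough sketch of the reading and inspection step, the example below parses a few made-up rows held in memory so that it runs without a network connection. The sample values are assumptions. The column names date, hosp and attends come from the dataset description above.

```python
import io
import pandas as pd

# Stand-in for the remote file: a few made-up rows in the long layout.
sample_csv = io.StringIO(
    "date,hosp,attends\n"
    "2014-01-01,1,331\n"
    "2014-01-01,2,287\n"
    "2014-01-02,1,340\n"
)

# read_csv accepts a URL, a file path or a file-like object, so the same
# call should work with LONG_URL in place of sample_csv.
ed_long = pd.read_csv(sample_csv, parse_dates=["date"])

ed_long.info()          # column names, dtypes and non-null counts
print(ed_long.head())   # first few rows
```

Passing parse_dates asks pandas to convert the date column to datetime64 at read time instead of leaving it as text.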
Exercise 2#
Assume you have only been provided with syn_ts_ed_long.csv.
Task:
Convert the data into wide format.
The output of your code should be a pd.DataFrame equivalent to syn_ts_ed_wide.csv.
Make a decision about the appropriate data types for each of the series. For example, by default the attendance column is an int64. Is this sensible? What other type of integer could the hospital columns be stored as?
Advanced Task:
Your data wrangling code should make use of chained commands in pandas.
Hints:
There are various ways to complete this task. You may want to make use of pivot_table.
One complication with a pivot is that you end up with a MultiIndex column for the hospital and number of attends. This is not always particularly clear for labelling. An option is to remove the MultiIndex during wrangling. You could explore transposing the pd.DataFrame using .T and then .reset_index() to drop the index.
You may want to build up your code command by command to help debug as you go along.
Don’t forget about data types.
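One way the hints above could be combined is sketched below on a small made-up frame, not the exercise data. It is only one of several possible approaches. The hosp_ column prefix and the choice of int16 are assumptions for illustration. Passing values as a single string (not a list) should give ordinary single-level columns, which sidesteps the MultiIndex.

```python
import numpy as np
import pandas as pd

# Made-up long-format data (two hospitals, two days).
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-01",
                            "2014-01-02", "2014-01-02"]),
    "hosp": [1, 2, 1, 2],
    "attends": [331, 287, 340, 301],
})

wide_df = (
    long_df
    # One row per date, one column per hospital. The default aggregation
    # is the mean, which returns floats even for unique (date, hosp) pairs.
    .pivot_table(index="date", columns="hosp", values="attends")
    # Hypothetical labels: 1 -> "hosp_1", 2 -> "hosp_2".
    .rename(columns=lambda h: f"hosp_{h}")
    # Drop the leftover "hosp" name on the column axis.
    .rename_axis(columns=None)
    # Assumes daily counts fit comfortably in a 16-bit integer.
    .astype(np.int16)
)

print(wide_df)
print(wide_df.dtypes)
```

Building the chain one method at a time and printing the intermediate result after each step makes it easier to see where the shape or the dtypes change.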
# your code here ...
Exercise 3#
Now assume that you have been provided with the data in syn_ts_ed_wide.csv.
Task:
Convert the dataset from wide format to long (tidy) format.
Advanced Task:
Your data wrangling code should make use of chained commands in pandas.
Hints:
Investigate the pandas function wide_to_long() or the function melt().
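A minimal sketch of the melt() route is shown below on a small made-up wide frame. The column names hosp_1 and hosp_2 are assumptions, so the prefix stripped in the assign step may need adjusting for the real file.

```python
import pandas as pd

# Made-up wide-format data with hypothetical column names.
wide_df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-02"]),
    "hosp_1": [331, 340],
    "hosp_2": [287, 301],
})

long_df = (
    wide_df
    # Stack the hospital columns into (hosp, attends) pairs per date.
    .melt(id_vars="date", var_name="hosp", value_name="attends")
    # Turn labels such as "hosp_1" back into small integers.
    .assign(hosp=lambda df: df["hosp"]
            .str.replace("hosp_", "", regex=False)
            .astype("int8"))
    # Order rows by date then hospital and renumber the index.
    .sort_values(["date", "hosp"])
    .reset_index(drop=True)
)

print(long_df)
```

melt() leaves the rows grouped by the original column, so the sort is only there to restore a date-first ordering.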
# your code here...
Exercise 4#
We will now move on to visualising the dataset using matplotlib.
Task:
Using the wide format data, create a line plot of the data for the ED located at hospital 1.
Label the y axis ‘Attendances’
Label the x axis ‘Date’
Use a fontsize of 12
Provide a background grid for the plot.
Save the plot as a .png file with dpi of 300.
Hints:
Feel free to adapt the plot to improve its appearance using whatever matplotlib options you prefer.
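The plotting steps in the task could look roughly like the sketch below. It uses randomly generated counts in place of the real data. The column name hosp_1, the figure size and the output filename are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily counts for one hospital, indexed by date.
dates = pd.date_range("2014-01-01", periods=30, freq="D")
rng = np.random.default_rng(42)
wide_df = pd.DataFrame({"hosp_1": rng.poisson(300, size=30)}, index=dates)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(wide_df.index, wide_df["hosp_1"])

# Axis labels and tick labels at font size 12.
ax.set_xlabel("Date", fontsize=12)
ax.set_ylabel("Attendances", fontsize=12)
ax.tick_params(labelsize=12)

# Background grid.
ax.grid(True)

# Hypothetical filename; dpi=300 gives a print-quality image.
fig.savefig("hosp_1_attendances.png", dpi=300)
```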
# your code here ...
Exercise 5#
Task:
Create a grid of subplots with 1 column and 4 rows. Each subplot should display one of the hospital EDs.
Label each subplot with the appropriate hospital.
Provide an overall figure y axis label of ‘ED Attendances’
Give the figure an appropriate size.
Hints:
There are several ways to create a grid of subplots. The easiest for this problem is to use the factory function plt.subplots(). Refer back to the matplotlib sections in the book for help.
If you are using matplotlib version 3.4 or above you can use fig.supylabel() and fig.supxlabel() to set an overall axis label.
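A sketch of the plt.subplots() approach described in the hints is given below, again on randomly generated counts instead of the real data. The column names, figure size and shared x axis are assumptions. fig.supylabel() and fig.supxlabel() require matplotlib 3.4 or above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily counts for four hospitals with hypothetical column names.
dates = pd.date_range("2014-01-01", periods=30, freq="D")
rng = np.random.default_rng(42)
wide_df = pd.DataFrame(
    rng.poisson(300, size=(30, 4)),
    index=dates,
    columns=["hosp_1", "hosp_2", "hosp_3", "hosp_4"],
)

# Four rows, one column; the subplots share the date axis.
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(12, 8), sharex=True)

# One hospital per subplot, titled with its column name.
for ax, col in zip(axes, wide_df.columns):
    ax.plot(wide_df.index, wide_df[col])
    ax.set_title(col)
    ax.grid(True)

# Figure-level axis labels (matplotlib >= 3.4).
fig.supylabel("ED Attendances", fontsize=12)
fig.supxlabel("Date", fontsize=12)
fig.tight_layout()
```

With nrows=4 and ncols=1, plt.subplots() returns a one-dimensional array of axes, which is why it can be zipped directly with the column names.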
# your solution here ...