Working with a pandas.Dataframe
#
These exercises and drills aim to build your experience of using an pandas.DataFrame
. Try to work through them all without looking at the answers first.
Remember for any code where you wish to use pandas
you need to import it. We will also import numpy
to help us generate some synthetic data to use.
import pandas as pd
import numpy as np
Exercise 1#
Task:
Create a
pandas.Series
named “number” with numbers from 1 to 10,000 inclusive.The datatype of the series should be
np.uint32
Check the length of the
Series
View the head and tail of the series to quickly validate your code has worked.
# your code here ...
# example solution
column1 = pd.Series(np.arange(1, 10_000+1), name='number',
dtype=np.uint32)
column1.head(2)
0 1
1 2
Name: number, dtype: uint32
column1.tail(2)
9998 9999
9999 10000
Name: number, dtype: uint32
len(column1)
10000
Exercise 2#
Task:
Create a
pandas.DataFrame
with 5 rows and 5 columns.The data contained in each columns should all be of type int64.
Columns should be titled “col_1”, “col_2” … “col_5”
Check the datatype, shape and column names using
.info()
View the
DataFrame
View only the first 2 rows in the
DataFrame
Create a new variable
col_4
of typepandas.Series
that references the data incol_4
Hints:
Data and can take any valid int64 value.
One option is to generate a random matrix using a
numpy.random.Generator
# your code here...
# example solution - data could be anything you want.
rng = np.random.default_rng(42)
matrix = rng.integers(0, 50_0000, size=(5,5))
df = pd.DataFrame(matrix, columns=[f'col_{i}' for i in range(1, 6)])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_1 5 non-null int64
1 col_2 5 non-null int64
2 col_3 5 non-null int64
3 col_4 5 non-null int64
4 col_5 5 non-null int64
dtypes: int64(5)
memory usage: 332.0 bytes
df.shape
(5, 5)
df
col_1 | col_2 | col_3 | col_4 | col_5 | |
---|---|---|---|---|---|
0 | 44625 | 386978 | 327285 | 219439 | 216507 |
1 | 429298 | 42972 | 348684 | 100734 | 47088 |
2 | 263239 | 487811 | 367876 | 380569 | 358738 |
3 | 393032 | 256613 | 64056 | 419874 | 225192 |
4 | 250175 | 185399 | 91274 | 463382 | 390783 |
df.head(2)
col_1 | col_2 | col_3 | col_4 | col_5 | |
---|---|---|---|---|---|
0 | 44625 | 386978 | 327285 | 219439 | 216507 |
1 | 429298 | 42972 | 348684 | 100734 | 47088 |
col_4 = df['col_4']
print(type(col_4))
print(col_4)
<class 'pandas.core.series.Series'>
0 219439
1 100734
2 380569
3 419874
4 463382
Name: col_4, dtype: int64
Exercise 3:#
The data below represents outpatient bookings for a hospital.
unique_id |
surname |
age |
female |
first_appoint |
---|---|---|---|---|
het1m |
Hetfield |
50 |
False |
False |
ulr33f |
Ulrich |
89 |
True |
True |
ham1f |
Hammett |
32 |
True |
True |
tru4m |
Trujillo |
65 |
False |
True |
Task:
Using a method of your choice create a
pandas.DataFrame
that represents the patient bookings.You should set the
DataFrame
index to beunique_id
(assume it is unique)Filter the dataset for just the
ham1f
bookingFiler the dataset for the ‘tru4m’ and ‘ulr33f’ bookings.
# your code here ...
# example solution where each column is defined as a seperate list
unique_patient_id = ['het1m', 'ulr33f', 'ham1f', 'tru4m']
patient_surname = ['Hetfield', 'Ulrich', 'Hammett', 'Trujillo']
age = [50, 89, 32, 65]
female = [0, 1, 1, 0]
first_appointment = [0, 1, 1, 1]
patient_bookings = pd.DataFrame()
patient_bookings['unique_id'] = pd.Series(unique_patient_id, dtype=str)
patient_bookings['surname'] = pd.Series(patient_surname, dtype=str)
patient_bookings['age'] = pd.Series(age, dtype=np.uint8)
patient_bookings['female'] = pd.Series(female, dtype=bool)
patient_bookings['first_appoint'] = pd.Series(first_appointment, dtype=bool)
# set index
patient_bookings = patient_bookings.set_index('unique_id')
# print
patient_bookings
surname | age | female | first_appoint | |
---|---|---|---|---|
unique_id | ||||
het1m | Hetfield | 50 | False | False |
ulr33f | Ulrich | 89 | True | True |
ham1f | Hammett | 32 | True | True |
tru4m | Trujillo | 65 | False | True |
# filter for index ham1f
patient_bookings.loc['ham1f']
surname Hammett
age 32
female True
first_appoint True
Name: ham1f, dtype: object
# filter for both records with index tru4m and ulr33df
to_find = ['tru4m', 'ulr33f']
patient_bookings.loc[to_find]
surname | age | female | first_appoint | |
---|---|---|---|---|
unique_id | ||||
tru4m | Trujillo | 65 | False | True |
ulr33f | Ulrich | 89 | True | True |
Exercise 4:#
The data contained in sw_imaging.csv
is based on data taken from NHS England’s open diagnostic imaging dataset from 2019/20..
This subset contains information on the number of patients referred for imaging, the median waiting time for the imaging and the median waiting time between the imaging and the report. Results are broken down by provider and imaging type.
Task:
Read the CSV data into an appropriately named
DataFrame
(data URL provided below)View the information about the
DataFrame
View the shape of the
DataFrame
View the first two rows in the
DataFrame
# your code here...
DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/sw_imaging.csv'
# example solution
sw_imaging = pd.read_csv(DATA_URL)
sw_imaging.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 region 31 non-null object
1 org_code 31 non-null object
2 provider 31 non-null object
3 imaging_type 31 non-null object
4 n_referrals 31 non-null int64
5 mdn_days_rtt 31 non-null float64
6 mdn_days_ttr 31 non-null float64
dtypes: float64(2), int64(1), object(4)
memory usage: 1.8+ KB
sw_imaging.shape
(31, 7)
sw_imaging.head(2)
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
0 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Computerized Axial Tomography | 46160 | 3.0 | 0.0 |
1 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Diagnostic Ultrasonography | 72985 | 14.0 | 0.0 |
Exercise 5:#
Continue to work with the South West imaging dataset from exercise 4.
Task:
Filter the
DataFrame
to only show results for “Magnetic Resonance Imaging”Filter the
DataFrame
to only show results for “Computerized Axial Tomography”Filter the
DataFrame
to only show results for the “Royal Devon and Exeter NHS Foundation Trust”Filer the
DataFrame
to only show results for “Magnetic Resonance Imaging” at the “Royal Devon and Exeter NHS Foundation Trust”
Hints:
The type of imaging is a
str
held in the columnimaging_type
The provider of the imaging is a
str
held in the columnprovider
The final filter is a bit more tricky to get right first time. Let’s assume you need to find “Diagnostic Ultrasonography” figures for
org_code
“RK9”. This would be solved as follows:
# assume your `DataFrame` is called sw_imaging
sw_imaging[(sw_imaging['org_code'] == 'RK9') &
(sw_imaging['imaging_type'] == 'Diagnostic Ultrasonography')]
# example solution
sw_imaging[sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging']
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
3 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Magnetic Resonance Imaging | 27535 | 40.0 | 5.0 |
10 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Magnetic Resonance Imaging | 22380 | 23.0 | 7.0 |
18 | Y58 | RA9 | Torbay and South Devon NHS Foundation Trust | Magnetic Resonance Imaging | 14730 | 14.0 | 2.0 |
26 | Y58 | RK9 | University Hospitals Plymouth NHS Trust | Magnetic Resonance Imaging | 27210 | 30.0 | 3.0 |
sw_imaging[sw_imaging['imaging_type'] == 'Computerized Axial Tomography']
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
0 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Computerized Axial Tomography | 46160 | 3.0 | 0.0 |
7 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Computerized Axial Tomography | 40870 | 3.0 | 0.0 |
15 | Y58 | RA9 | Torbay and South Devon NHS Foundation Trust | Computerized Axial Tomography | 31165 | 3.0 | 0.0 |
23 | Y58 | RK9 | University Hospitals Plymouth NHS Trust | Computerized Axial Tomography | 54575 | 2.0 | 0.0 |
# could also use org_code of 'RH8'
sw_imaging[sw_imaging['provider'] ==
'Royal Devon and Exeter NHS Foundation Trust']
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
7 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Computerized Axial Tomography | 40870 | 3.0 | 0.0 |
8 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Diagnostic Ultrasonography | 40070 | 14.0 | 0.0 |
9 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Fluoroscopy | 4745 | 16.0 | 0.0 |
10 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Magnetic Resonance Imaging | 22380 | 23.0 | 7.0 |
11 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Nuclear Medicine Procedure | 3465 | 14.0 | 1.0 |
12 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Plain Radiography | 161850 | 0.0 | 1.0 |
13 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Positron Emission Tomography | 1000 | 9.0 | 4.0 |
14 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Single Photon Emission Computerized Tomography | 545 | 17.0 | 2.0 |
# remember to use () around the criteria
sw_imaging[(sw_imaging['provider'] ==
'Royal Devon and Exeter NHS Foundation Trust') &
(sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging')]
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
10 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Magnetic Resonance Imaging | 22380 | 23.0 | 7.0 |
Exercise 6:#
Continue to work with the South West imaging dataset from exercise 4.
We will now look at combining both a logical and with a logical or when selecting rows from a DataFrame
.
Task:
Filter the
DataFrame
to return both Magnetic Resonance Imaging and Computerized Axial Tomography undertaken atorg_code
=’REF’ (cornwall)
Hints:
Think carefully about how to express the above task logically.
You need results when the
(org_code == <code>) AND ((imaging_type == <type_1>) OR (imaging_type == <type_2>))
In
pandas
logical AND is represented by&
and OR by|
# your code here ...
# example answer
# make sure you get the () in the correct place
sw_imaging[(sw_imaging['org_code'] == 'REF') &
((sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging') |
(sw_imaging['imaging_type'] == 'Computerized Axial Tomography'))]
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
0 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Computerized Axial Tomography | 46160 | 3.0 | 0.0 |
3 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Magnetic Resonance Imaging | 27535 | 40.0 | 5.0 |
Exercise 7#
Continue to work with the South West imaging dataset from exercise 4.
Task:
Filter the data set for all records where
n_referrals
is greater than 100,000Filter the results for all records where
mdn_days_rtt
is greater than 30
sw_imaging[sw_imaging['n_referrals'] > 100_000]
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
5 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Plain Radiography | 198295 | 0.0 | 1.0 |
12 | Y58 | RH8 | Royal Devon and Exeter NHS Foundation Trust | Plain Radiography | 161850 | 0.0 | 1.0 |
20 | Y58 | RA9 | Torbay and South Devon NHS Foundation Trust | Plain Radiography | 110505 | 0.0 | 0.0 |
28 | Y58 | RK9 | University Hospitals Plymouth NHS Trust | Plain Radiography | 188830 | 0.0 | 3.0 |
sw_imaging[sw_imaging['mdn_days_rtt'] > 30]
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
3 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Magnetic Resonance Imaging | 27535 | 40.0 | 5.0 |
22 | Y58 | RA9 | Torbay and South Devon NHS Foundation Trust | Single Photon Emission Computerized Tomography | 90 | 33.0 | 5.5 |
Exercise 8#
Continue to work with the South West imaging dataset from exercise 4.
Task:
What is the total number of imaging referrals made to each trust?
Save the total number of referrals to a file called
total_referrals.csv
Write code to return the
provider
with the highest number of referrals.
Hints:
To solve the first part of this exercise you can make use of a DataFrame
method called groupby()
. This will allow you to group results by provider
. For example:
sw_imaging.groupby(by='provider')
This code returns DataFrameGroupby
object. You can then limit it to a specific column like so.
# This is a `SeriesGroupBy` object
sw_imaging.groupby(by='provider')['n_referrals']
You can then use on of the standard pandas
statistical operations on group by object to return a Series
. For example, .sum()
, or .mean()
# returns a `Series` containing your group by results summed.
sw_imaging.groupby(by='provider')['n_referrals'].sum()
Groupby is a very powerful feature of
pandas
. Full documentation is here
# your code here ...
# example solution
# just to provide this is a Series
results = sw_imaging.groupby(by='provider')['n_referrals'].sum()
type(results)
pandas.core.series.Series
# view the results
results = sw_imaging.groupby(by='provider')['n_referrals'].sum()
results
provider
Royal Cornwall Hospitals NHS Trust 361965
Royal Devon and Exeter NHS Foundation Trust 274925
Torbay and South Devon NHS Foundation Trust 210195
University Hospitals Plymouth NHS Trust 377860
Name: n_referrals, dtype: int64
# save the results
results.to_csv('total_referrals.csv', index=True)
# the maximum number of referrals.
results.max()
np.int64(377860)
# boolean results
results == results.max()
provider
Royal Cornwall Hospitals NHS Trust False
Royal Devon and Exeter NHS Foundation Trust False
Torbay and South Devon NHS Foundation Trust False
University Hospitals Plymouth NHS Trust True
Name: n_referrals, dtype: bool
# trust with the max number of results.
# solution via boolean indexing.
results[results == results.max()]
provider
University Hospitals Plymouth NHS Trust 377860
Name: n_referrals, dtype: int64
# alternative syntax
results.loc[results == results.max()]
provider
University Hospitals Plymouth NHS Trust 377860
Name: n_referrals, dtype: int64