Working with a pandas.Dataframe

Working with a `pandas.Dataframe`#

These exercises and drills aim to build your experience of using an pandas.DataFrame. Try to work through them all without looking at the answers first.

Tip: Undoutedly in practice you will work with more complex examples and data than those found here. These exercises are a very gentle introduction to the syntax of `pandas`. But don't worry we have some more complicated data wrangling examples coming up!

Remember for any code where you wish to use pandas you need to import it. We will also import numpy to help us generate some synthetic data to use.

import pandas as pd
import numpy as np

Exercise 1#

Task:

Create a pandas.Series named “number” with numbers from 1 to 10,000 inclusive.
The datatype of the series should be np.uint32
Check the length of the Series
View the head and tail of the series to quickly validate your code has worked.

# your code here ...

# example solution
column1 = pd.Series(np.arange(1, 10_000+1), name='number', 
                    dtype=np.uint32)
column1.head(2)

0    1
1    2
Name: number, dtype: uint32

column1.tail(2)

9998     9999
9999    10000
Name: number, dtype: uint32

len(column1)

Exercise 2#

Task:

Create a pandas.DataFrame with 5 rows and 5 columns.
The data contained in each columns should all be of type int64.
Columns should be titled “col_1”, “col_2” … “col_5”
Check the datatype, shape and column names using .info()
View the DataFrame
View only the first 2 rows in the DataFrame
Create a new variable col_4 of type pandas.Series that references the data in col_4

Hints:

Data and can take any valid int64 value.
One option is to generate a random matrix using a numpy.random.Generator

# your code here...

# example solution - data could be anything you want.
rng = np.random.default_rng(42)
matrix = rng.integers(0, 50_0000, size=(5,5))
df = pd.DataFrame(matrix, columns=[f'col_{i}' for i in range(1, 6)])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_1   5 non-null      int64
 1   col_2   5 non-null      int64
 2   col_3   5 non-null      int64
 3   col_4   5 non-null      int64
 4   col_5   5 non-null      int64
dtypes: int64(5)
memory usage: 332.0 bytes

df.shape

(5, 5)

df

	col_1	col_2	col_3	col_4	col_5
0	44625	386978	327285	219439	216507
1	429298	42972	348684	100734	47088
2	263239	487811	367876	380569	358738
3	393032	256613	64056	419874	225192
4	250175	185399	91274	463382	390783

df.head(2)

	col_1	col_2	col_3	col_4	col_5
0	44625	386978	327285	219439	216507
1	429298	42972	348684	100734	47088

col_4 = df['col_4']
print(type(col_4))
print(col_4)

<class 'pandas.core.series.Series'>
0    219439
1    100734
2    380569
3    419874
4    463382
Name: col_4, dtype: int64

Exercise 3:#

The data below represents outpatient bookings for a hospital.

unique_id	surname	age	female	first_appoint
het1m	Hetfield	50	False	False
ulr33f	Ulrich	89	True	True
ham1f	Hammett	32	True	True
tru4m	Trujillo	65	False	True

Task:

Using a method of your choice create a pandas.DataFrame that represents the patient bookings.
You should set the DataFrame index to be unique_id (assume it is unique)
Filter the dataset for just the ham1f booking
Filer the dataset for the ‘tru4m’ and ‘ulr33f’ bookings.

# your code here ...

# example solution where each column is defined as a seperate list

unique_patient_id = ['het1m', 'ulr33f', 'ham1f', 'tru4m']
patient_surname = ['Hetfield', 'Ulrich', 'Hammett', 'Trujillo']
age = [50, 89, 32, 65]
female = [0, 1, 1, 0]
first_appointment = [0, 1, 1, 1]

patient_bookings = pd.DataFrame()
patient_bookings['unique_id'] = pd.Series(unique_patient_id, dtype=str)
patient_bookings['surname'] = pd.Series(patient_surname, dtype=str)
patient_bookings['age'] = pd.Series(age, dtype=np.uint8)
patient_bookings['female'] = pd.Series(female, dtype=bool)
patient_bookings['first_appoint'] = pd.Series(first_appointment, dtype=bool)

# set index
patient_bookings = patient_bookings.set_index('unique_id')

# print
patient_bookings

	surname	age	female	first_appoint
unique_id
het1m	Hetfield	50	False	False
ulr33f	Ulrich	89	True	True
ham1f	Hammett	32	True	True
tru4m	Trujillo	65	False	True

# filter for index ham1f
patient_bookings.loc['ham1f']

surname          Hammett
age                   32
female              True
first_appoint       True
Name: ham1f, dtype: object

# filter for both records with index tru4m and ulr33df
to_find = ['tru4m', 'ulr33f']
patient_bookings.loc[to_find]

	surname	age	female	first_appoint
unique_id
tru4m	Trujillo	65	False	True
ulr33f	Ulrich	89	True	True

Exercise 4:#

The data contained in sw_imaging.csv is based on data taken from NHS England’s open diagnostic imaging dataset from 2019/20..

This subset contains information on the number of patients referred for imaging, the median waiting time for the imaging and the median waiting time between the imaging and the report. Results are broken down by provider and imaging type.

Task:

Read the CSV data into an appropriately named DataFrame(data URL provided below)
View the information about the DataFrame
View the shape of the DataFrame
View the first two rows in the DataFrame

# your code here...

DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
            + 'hpdm139-datasets/main/sw_imaging.csv'

# example solution

sw_imaging = pd.read_csv(DATA_URL)
sw_imaging.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   region        31 non-null     object 
 1   org_code      31 non-null     object 
 2   provider      31 non-null     object 
 3   imaging_type  31 non-null     object 
 4   n_referrals   31 non-null     int64  
 5   mdn_days_rtt  31 non-null     float64
 6   mdn_days_ttr  31 non-null     float64
dtypes: float64(2), int64(1), object(4)
memory usage: 1.8+ KB

sw_imaging.shape

(31, 7)

sw_imaging.head(2)

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt	mdn_days_ttr
0	Y58	REF	Royal Cornwall Hospitals NHS Trust	Computerized Axial Tomography	46160	3.0	0.0
1	Y58	REF	Royal Cornwall Hospitals NHS Trust	Diagnostic Ultrasonography	72985	14.0	0.0

Exercise 5:#

Continue to work with the South West imaging dataset from exercise 4.

Task:

Filter the DataFrame to only show results for “Magnetic Resonance Imaging”
Filter the DataFrame to only show results for “Computerized Axial Tomography”
Filter the DataFrame to only show results for the “Royal Devon and Exeter NHS Foundation Trust”
Filer the DataFrame to only show results for “Magnetic Resonance Imaging” at the “Royal Devon and Exeter NHS Foundation Trust”

Hints:

The type of imaging is a str held in the column imaging_type
The provider of the imaging is a str held in the column provider
The final filter is a bit more tricky to get right first time. Let’s assume you need to find “Diagnostic Ultrasonography” figures for org_code “RK9”. This would be solved as follows:

# assume your `DataFrame` is called sw_imaging
sw_imaging[(sw_imaging['org_code'] == 'RK9') &
           (sw_imaging['imaging_type'] == 'Diagnostic Ultrasonography')]

# example solution
sw_imaging[sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging']

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt	mdn_days_ttr
3	Y58	REF	Royal Cornwall Hospitals NHS Trust	Magnetic Resonance Imaging	27535	40.0	5.0
10	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Magnetic Resonance Imaging	22380	23.0	7.0
18	Y58	RA9	Torbay and South Devon NHS Foundation Trust	Magnetic Resonance Imaging	14730	14.0	2.0
26	Y58	RK9	University Hospitals Plymouth NHS Trust	Magnetic Resonance Imaging	27210	30.0	3.0

sw_imaging[sw_imaging['imaging_type'] == 'Computerized Axial Tomography']

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt
0	Y58	REF	Royal Cornwall Hospitals NHS Trust	Computerized Axial Tomography	46160	3.0
7	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Computerized Axial Tomography	40870	3.0
15	Y58	RA9	Torbay and South Devon NHS Foundation Trust	Computerized Axial Tomography	31165	3.0
23	Y58	RK9	University Hospitals Plymouth NHS Trust	Computerized Axial Tomography	54575	2.0

# could also use org_code of 'RH8'
sw_imaging[sw_imaging['provider'] == 
           'Royal Devon and Exeter NHS Foundation Trust']

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt	mdn_days_ttr
7	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Computerized Axial Tomography	40870	3.0	0.0
8	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Diagnostic Ultrasonography	40070	14.0	0.0
9	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Fluoroscopy	4745	16.0	0.0
10	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Magnetic Resonance Imaging	22380	23.0	7.0
11	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Nuclear Medicine Procedure	3465	14.0	1.0
12	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Plain Radiography	161850	0.0	1.0
13	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Positron Emission Tomography	1000	9.0	4.0
14	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Single Photon Emission Computerized Tomography	545	17.0	2.0

# remember to use () around the criteria
sw_imaging[(sw_imaging['provider'] == 
           'Royal Devon and Exeter NHS Foundation Trust') &
           (sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging')]

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt	mdn_days_ttr
10	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Magnetic Resonance Imaging	22380	23.0	7.0

Exercise 6:#

Continue to work with the South West imaging dataset from exercise 4.

We will now look at combining both a logical and with a logical or when selecting rows from a DataFrame.

Task:

Filter the DataFrame to return both Magnetic Resonance Imaging and Computerized Axial Tomography undertaken at org_code=’REF’ (cornwall)

Hints:

Think carefully about how to express the above task logically.
- You need results when the (org_code == <code>) AND ((imaging_type == <type_1>) OR (imaging_type == <type_2>))
In pandas logical AND is represented by & and OR by |

# your code here ...

# example answer
# make sure you get the () in the correct place
sw_imaging[(sw_imaging['org_code'] == 'REF') &
           ((sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging') |
            (sw_imaging['imaging_type'] == 'Computerized Axial Tomography'))]

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt	mdn_days_ttr
0	Y58	REF	Royal Cornwall Hospitals NHS Trust	Computerized Axial Tomography	46160	3.0	0.0
3	Y58	REF	Royal Cornwall Hospitals NHS Trust	Magnetic Resonance Imaging	27535	40.0	5.0

Exercise 7#

Continue to work with the South West imaging dataset from exercise 4.

Task:

Filter the data set for all records where n_referrals is greater than 100,000
Filter the results for all records where mdn_days_rtt is greater than 30

sw_imaging[sw_imaging['n_referrals'] > 100_000]

	region	org_code	provider	imaging_type	n_referrals	mdn_days_ttr
5	Y58	REF	Royal Cornwall Hospitals NHS Trust	Plain Radiography	198295	1.0
12	Y58	RH8	Royal Devon and Exeter NHS Foundation Trust	Plain Radiography	161850	1.0
20	Y58	RA9	Torbay and South Devon NHS Foundation Trust	Plain Radiography	110505	0.0
28	Y58	RK9	University Hospitals Plymouth NHS Trust	Plain Radiography	188830	3.0

sw_imaging[sw_imaging['mdn_days_rtt'] > 30]

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt	mdn_days_ttr
3	Y58	REF	Royal Cornwall Hospitals NHS Trust	Magnetic Resonance Imaging	27535	40.0	5.0
22	Y58	RA9	Torbay and South Devon NHS Foundation Trust	Single Photon Emission Computerized Tomography	90	33.0	5.5

Exercise 8#

Continue to work with the South West imaging dataset from exercise 4.

Task:

What is the total number of imaging referrals made to each trust?
Save the total number of referrals to a file called total_referrals.csv
Write code to return the provider with the highest number of referrals.

Hints:

To solve the first part of this exercise you can make use of a DataFrame method called groupby(). This will allow you to group results by provider. For example:

sw_imaging.groupby(by='provider')

This code returns DataFrameGroupby object. You can then limit it to a specific column like so.

# This is a `SeriesGroupBy` object
sw_imaging.groupby(by='provider')['n_referrals']

You can then use on of the standard pandas statistical operations on group by object to return a Series. For example, .sum(), or .mean()

# returns a `Series` containing your group by results summed.
sw_imaging.groupby(by='provider')['n_referrals'].sum()

Groupby is a very powerful feature of pandas. Full documentation is here

# your code here ...

# example solution
# just to provide this is a Series
results = sw_imaging.groupby(by='provider')['n_referrals'].sum()
type(results)

pandas.core.series.Series

# view the results
results = sw_imaging.groupby(by='provider')['n_referrals'].sum()
results

provider
Royal Cornwall Hospitals NHS Trust             361965
Royal Devon and Exeter NHS Foundation Trust    274925
Torbay and South Devon NHS Foundation Trust    210195
University Hospitals Plymouth NHS Trust        377860
Name: n_referrals, dtype: int64

# save the results
results.to_csv('total_referrals.csv', index=True)

# the maximum number of referrals.
results.max()

np.int64(377860)

# boolean results 
results == results.max()

provider
Royal Cornwall Hospitals NHS Trust             False
Royal Devon and Exeter NHS Foundation Trust    False
Torbay and South Devon NHS Foundation Trust    False
University Hospitals Plymouth NHS Trust         True
Name: n_referrals, dtype: bool

# trust with the max number of results.
# solution via boolean indexing.
results[results == results.max()]

provider
University Hospitals Plymouth NHS Trust    377860
Name: n_referrals, dtype: int64

# alternative syntax
results.loc[results == results.max()]

provider
University Hospitals Plymouth NHS Trust    377860
Name: n_referrals, dtype: int64

Working with a pandas.Dataframe

Contents

Working with a pandas.Dataframe#

Exercise 1#

Exercise 2#

Exercise 3:#

Exercise 4:#

Exercise 5:#

Exercise 6:#

Exercise 7#

Exercise 8#

Working with a `pandas.Dataframe`#