Working with a pandas.Dataframe#

These exercises and drills aim to build your experience of using an pandas.DataFrame. Try to work through them all without looking at the answers first.

Tip: Undoutedly in practice you will work with more complex examples and data than those found here. These exercises are a very gentle introduction to the syntax of `pandas`. But don't worry we have some more complicated data wrangling examples coming up!

Remember for any code where you wish to use pandas you need to import it. We will also import numpy to help us generate some synthetic data to use.

import pandas as pd
import numpy as np

Exercise 1#

Task:

  • Create a pandas.Series named “number” with numbers from 1 to 10,000 inclusive.

  • The datatype of the series should be np.uint32

  • Check the length of the Series

  • View the head and tail of the series to quickly validate your code has worked.

# your code here ...
# example solution
column1 = pd.Series(np.arange(1, 10_000+1), name='number', 
                    dtype=np.uint32)
column1.head(2)
0    1
1    2
Name: number, dtype: uint32
column1.tail(2)
9998     9999
9999    10000
Name: number, dtype: uint32
len(column1)
10000

Exercise 2#

Task:

  • Create a pandas.DataFrame with 5 rows and 5 columns.

  • The data contained in each columns should all be of type int64.

  • Columns should be titled “col_1”, “col_2” … “col_5”

  • Check the datatype, shape and column names using .info()

  • View the DataFrame

  • View only the first 2 rows in the DataFrame

  • Create a new variable col_4 of type pandas.Series that references the data in col_4

Hints:

  • Data and can take any valid int64 value.

  • One option is to generate a random matrix using a numpy.random.Generator

# your code here...
# example solution - data could be anything you want.
rng = np.random.default_rng(42)
matrix = rng.integers(0, 50_0000, size=(5,5))
df = pd.DataFrame(matrix, columns=[f'col_{i}' for i in range(1, 6)])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_1   5 non-null      int64
 1   col_2   5 non-null      int64
 2   col_3   5 non-null      int64
 3   col_4   5 non-null      int64
 4   col_5   5 non-null      int64
dtypes: int64(5)
memory usage: 332.0 bytes
df.shape
(5, 5)
df
col_1 col_2 col_3 col_4 col_5
0 44625 386978 327285 219439 216507
1 429298 42972 348684 100734 47088
2 263239 487811 367876 380569 358738
3 393032 256613 64056 419874 225192
4 250175 185399 91274 463382 390783
df.head(2)
col_1 col_2 col_3 col_4 col_5
0 44625 386978 327285 219439 216507
1 429298 42972 348684 100734 47088
col_4 = df['col_4']
print(type(col_4))
print(col_4)
<class 'pandas.core.series.Series'>
0    219439
1    100734
2    380569
3    419874
4    463382
Name: col_4, dtype: int64

Exercise 3:#

The data below represents outpatient bookings for a hospital.

unique_id

surname

age

female

first_appoint

het1m

Hetfield

50

False

False

ulr33f

Ulrich

89

True

True

ham1f

Hammett

32

True

True

tru4m

Trujillo

65

False

True

Task:

  • Using a method of your choice create a pandas.DataFrame that represents the patient bookings.

  • You should set the DataFrame index to be unique_id (assume it is unique)

  • Filter the dataset for just the ham1f booking

  • Filer the dataset for the ‘tru4m’ and ‘ulr33f’ bookings.

# your code here ...
# example solution where each column is defined as a seperate list

unique_patient_id = ['het1m', 'ulr33f', 'ham1f', 'tru4m']
patient_surname = ['Hetfield', 'Ulrich', 'Hammett', 'Trujillo']
age = [50, 89, 32, 65]
female = [0, 1, 1, 0]
first_appointment = [0, 1, 1, 1]

patient_bookings = pd.DataFrame()
patient_bookings['unique_id'] = pd.Series(unique_patient_id, dtype=str)
patient_bookings['surname'] = pd.Series(patient_surname, dtype=str)
patient_bookings['age'] = pd.Series(age, dtype=np.uint8)
patient_bookings['female'] = pd.Series(female, dtype=bool)
patient_bookings['first_appoint'] = pd.Series(first_appointment, dtype=bool)

# set index
patient_bookings = patient_bookings.set_index('unique_id')

# print
patient_bookings
surname age female first_appoint
unique_id
het1m Hetfield 50 False False
ulr33f Ulrich 89 True True
ham1f Hammett 32 True True
tru4m Trujillo 65 False True
# filter for index ham1f
patient_bookings.loc['ham1f']
surname          Hammett
age                   32
female              True
first_appoint       True
Name: ham1f, dtype: object
# filter for both records with index tru4m and ulr33df
to_find = ['tru4m', 'ulr33f']
patient_bookings.loc[to_find]
surname age female first_appoint
unique_id
tru4m Trujillo 65 False True
ulr33f Ulrich 89 True True

Exercise 4:#

The data contained in sw_imaging.csv is based on data taken from NHS England’s open diagnostic imaging dataset from 2019/20..

This subset contains information on the number of patients referred for imaging, the median waiting time for the imaging and the median waiting time between the imaging and the report. Results are broken down by provider and imaging type.

Task:

  • Read the CSV data into an appropriately named DataFrame(data URL provided below)

  • View the information about the DataFrame

  • View the shape of the DataFrame

  • View the first two rows in the DataFrame

# your code here...

DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
            + 'hpdm139-datasets/main/sw_imaging.csv'
# example solution

sw_imaging = pd.read_csv(DATA_URL)
sw_imaging.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   region        31 non-null     object 
 1   org_code      31 non-null     object 
 2   provider      31 non-null     object 
 3   imaging_type  31 non-null     object 
 4   n_referrals   31 non-null     int64  
 5   mdn_days_rtt  31 non-null     float64
 6   mdn_days_ttr  31 non-null     float64
dtypes: float64(2), int64(1), object(4)
memory usage: 1.8+ KB
sw_imaging.shape
(31, 7)
sw_imaging.head(2)
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
0 Y58 REF Royal Cornwall Hospitals NHS Trust Computerized Axial Tomography 46160 3.0 0.0
1 Y58 REF Royal Cornwall Hospitals NHS Trust Diagnostic Ultrasonography 72985 14.0 0.0

Exercise 5:#

Continue to work with the South West imaging dataset from exercise 4.

Task:

  • Filter the DataFrame to only show results for “Magnetic Resonance Imaging”

  • Filter the DataFrame to only show results for “Computerized Axial Tomography”

  • Filter the DataFrame to only show results for the “Royal Devon and Exeter NHS Foundation Trust”

  • Filer the DataFrame to only show results for “Magnetic Resonance Imaging” at the “Royal Devon and Exeter NHS Foundation Trust”

Hints:

  • The type of imaging is a str held in the column imaging_type

  • The provider of the imaging is a str held in the column provider

  • The final filter is a bit more tricky to get right first time. Let’s assume you need to find “Diagnostic Ultrasonography” figures for org_code “RK9”. This would be solved as follows:

# assume your `DataFrame` is called sw_imaging
sw_imaging[(sw_imaging['org_code'] == 'RK9') &
           (sw_imaging['imaging_type'] == 'Diagnostic Ultrasonography')]
# example solution
sw_imaging[sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging']
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
3 Y58 REF Royal Cornwall Hospitals NHS Trust Magnetic Resonance Imaging 27535 40.0 5.0
10 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Magnetic Resonance Imaging 22380 23.0 7.0
18 Y58 RA9 Torbay and South Devon NHS Foundation Trust Magnetic Resonance Imaging 14730 14.0 2.0
26 Y58 RK9 University Hospitals Plymouth NHS Trust Magnetic Resonance Imaging 27210 30.0 3.0
sw_imaging[sw_imaging['imaging_type'] == 'Computerized Axial Tomography']
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
0 Y58 REF Royal Cornwall Hospitals NHS Trust Computerized Axial Tomography 46160 3.0 0.0
7 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Computerized Axial Tomography 40870 3.0 0.0
15 Y58 RA9 Torbay and South Devon NHS Foundation Trust Computerized Axial Tomography 31165 3.0 0.0
23 Y58 RK9 University Hospitals Plymouth NHS Trust Computerized Axial Tomography 54575 2.0 0.0
# could also use org_code of 'RH8'
sw_imaging[sw_imaging['provider'] == 
           'Royal Devon and Exeter NHS Foundation Trust']
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
7 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Computerized Axial Tomography 40870 3.0 0.0
8 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Diagnostic Ultrasonography 40070 14.0 0.0
9 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Fluoroscopy 4745 16.0 0.0
10 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Magnetic Resonance Imaging 22380 23.0 7.0
11 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Nuclear Medicine Procedure 3465 14.0 1.0
12 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Plain Radiography 161850 0.0 1.0
13 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Positron Emission Tomography 1000 9.0 4.0
14 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Single Photon Emission Computerized Tomography 545 17.0 2.0
# remember to use () around the criteria
sw_imaging[(sw_imaging['provider'] == 
           'Royal Devon and Exeter NHS Foundation Trust') &
           (sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging')]
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
10 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Magnetic Resonance Imaging 22380 23.0 7.0

Exercise 6:#

Continue to work with the South West imaging dataset from exercise 4.

We will now look at combining both a logical and with a logical or when selecting rows from a DataFrame.

Task:

  • Filter the DataFrame to return both Magnetic Resonance Imaging and Computerized Axial Tomography undertaken at org_code=’REF’ (cornwall)

Hints:

  • Think carefully about how to express the above task logically.

    • You need results when the (org_code == <code>) AND ((imaging_type == <type_1>) OR (imaging_type == <type_2>))

  • In pandas logical AND is represented by & and OR by |

# your code here ...
# example answer
# make sure you get the () in the correct place
sw_imaging[(sw_imaging['org_code'] == 'REF') &
           ((sw_imaging['imaging_type'] == 'Magnetic Resonance Imaging') |
            (sw_imaging['imaging_type'] == 'Computerized Axial Tomography'))]
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
0 Y58 REF Royal Cornwall Hospitals NHS Trust Computerized Axial Tomography 46160 3.0 0.0
3 Y58 REF Royal Cornwall Hospitals NHS Trust Magnetic Resonance Imaging 27535 40.0 5.0

Exercise 7#

Continue to work with the South West imaging dataset from exercise 4.

Task:

  • Filter the data set for all records where n_referrals is greater than 100,000

  • Filter the results for all records where mdn_days_rtt is greater than 30

sw_imaging[sw_imaging['n_referrals'] > 100_000]
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
5 Y58 REF Royal Cornwall Hospitals NHS Trust Plain Radiography 198295 0.0 1.0
12 Y58 RH8 Royal Devon and Exeter NHS Foundation Trust Plain Radiography 161850 0.0 1.0
20 Y58 RA9 Torbay and South Devon NHS Foundation Trust Plain Radiography 110505 0.0 0.0
28 Y58 RK9 University Hospitals Plymouth NHS Trust Plain Radiography 188830 0.0 3.0
sw_imaging[sw_imaging['mdn_days_rtt'] > 30]
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
3 Y58 REF Royal Cornwall Hospitals NHS Trust Magnetic Resonance Imaging 27535 40.0 5.0
22 Y58 RA9 Torbay and South Devon NHS Foundation Trust Single Photon Emission Computerized Tomography 90 33.0 5.5

Exercise 8#

Continue to work with the South West imaging dataset from exercise 4.

Task:

  • What is the total number of imaging referrals made to each trust?

  • Save the total number of referrals to a file called total_referrals.csv

  • Write code to return the provider with the highest number of referrals.

Hints:

To solve the first part of this exercise you can make use of a DataFrame method called groupby(). This will allow you to group results by provider. For example:

sw_imaging.groupby(by='provider')

This code returns DataFrameGroupby object. You can then limit it to a specific column like so.

# This is a `SeriesGroupBy` object
sw_imaging.groupby(by='provider')['n_referrals']

You can then use on of the standard pandas statistical operations on group by object to return a Series. For example, .sum(), or .mean()

# returns a `Series` containing your group by results summed.
sw_imaging.groupby(by='provider')['n_referrals'].sum()

Groupby is a very powerful feature of pandas. Full documentation is here

# your code here ...
# example solution
# just to provide this is a Series
results = sw_imaging.groupby(by='provider')['n_referrals'].sum()
type(results)
pandas.core.series.Series
# view the results
results = sw_imaging.groupby(by='provider')['n_referrals'].sum()
results
provider
Royal Cornwall Hospitals NHS Trust             361965
Royal Devon and Exeter NHS Foundation Trust    274925
Torbay and South Devon NHS Foundation Trust    210195
University Hospitals Plymouth NHS Trust        377860
Name: n_referrals, dtype: int64
# save the results
results.to_csv('total_referrals.csv', index=True)
# the maximum number of referrals.
results.max()
np.int64(377860)
# boolean results 
results == results.max()
provider
Royal Cornwall Hospitals NHS Trust             False
Royal Devon and Exeter NHS Foundation Trust    False
Torbay and South Devon NHS Foundation Trust    False
University Hospitals Plymouth NHS Trust         True
Name: n_referrals, dtype: bool
# trust with the max number of results.
# solution via boolean indexing.
results[results == results.max()]
provider
University Hospitals Plymouth NHS Trust    377860
Name: n_referrals, dtype: int64
# alternative syntax
results.loc[results == results.max()]
provider
University Hospitals Plymouth NHS Trust    377860
Name: n_referrals, dtype: int64