Working with a pandas.Dataframe
#
These exercises and drills aim to build your experience of using an pandas.DataFrame
. Try to work through them all without looking at the answers first.
Remember for any code where you wish to use pandas
you need to import it. We will also import numpy
to help us generate some synthetic data to use.
import pandas as pd
import numpy as np
Exercise 1#
Task:
Create a
pandas.Series
named “number” with numbers from 1 to 10,000 inclusive.The datatype of the series should be
np.uint32
Check the length of the
Series
View the head and tail of the series to quickly validate your code has worked.
# your code here ...
Exercise 2#
Task:
Create a
pandas.DataFrame
with 5 rows and 5 columns.The data contained in each columns should all be of type int64.
Columns should be titled “col_1”, “col_2” … “col_5”
Check the datatype, shape and column names using
.info()
View the
DataFrame
View only the first 2 rows in the
DataFrame
Create a new variable
col_4
of typepandas.Series
that references the data incol_4
Hints:
Data and can take any valid int64 value.
One option is to generate a random matrix using a
numpy.random.Generator
# your code here...
Exercise 3:#
The data below represents outpatient bookings for a hospital.
unique_id |
surname |
age |
female |
first_appoint |
---|---|---|---|---|
het1m |
Hetfield |
50 |
False |
False |
ulr33f |
Ulrich |
89 |
True |
True |
ham1f |
Hammett |
32 |
True |
True |
tru4m |
Trujillo |
65 |
False |
True |
Task:
Using a method of your choice create a
pandas.DataFrame
that represents the patient bookings.You should set the
DataFrame
index to beunique_id
(assume it is unique)Filter the dataset for just the
ham1f
bookingFiler the dataset for the ‘tru4m’ and ‘ulr33f’ bookings.
# your code here ...
Exercise 4:#
The data contained in sw_imaging.csv
is based on data taken from NHS England’s open diagnostic imaging dataset from 2019/20..
This subset contains information on the number of patients referred for imaging, the median waiting time for the imaging and the median waiting time between the imaging and the report. Results are broken down by provider and imaging type.
Task:
Read the CSV data into an appropriately named
DataFrame
(data URL provided below)View the information about the
DataFrame
View the shape of the
DataFrame
View the first two rows in the
DataFrame
# your code here...
DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/sw_imaging.csv'
Exercise 5:#
Continue to work with the South West imaging dataset from exercise 4.
Task:
Filter the
DataFrame
to only show results for “Magnetic Resonance Imaging”Filter the
DataFrame
to only show results for “Computerized Axial Tomography”Filter the
DataFrame
to only show results for the “Royal Devon and Exeter NHS Foundation Trust”Filer the
DataFrame
to only show results for “Magnetic Resonance Imaging” at the “Royal Devon and Exeter NHS Foundation Trust”
Hints:
The type of imaging is a
str
held in the columnimaging_type
The provider of the imaging is a
str
held in the columnprovider
The final filter is a bit more tricky to get right first time. Let’s assume you need to find “Diagnostic Ultrasonography” figures for
org_code
“RK9”. This would be solved as follows:
# assume your `DataFrame` is called sw_imaging
sw_imaging[(sw_imaging['org_code'] == 'RK9') &
(sw_imaging['imaging_type'] == 'Diagnostic Ultrasonography')]
# your code here ...
Exercise 6:#
Continue to work with the South West imaging dataset from exercise 4.
We will now look at combining both a logical and with a logical or when selecting rows from a DataFrame
.
Task:
Filter the
DataFrame
to return both Magnetic Resonance Imaging and Computerized Axial Tomography undertaken atorg_code
=’REF’ (cornwall)
Hints:
Think carefully about how to express the above task logically.
You need results when the
(org_code == <code>) AND ((imaging_type == <type_1>) OR (imaging_type == <type_2>))
In
pandas
logical AND is represented by&
and OR by|
# your code here ...
Exercise 7#
Continue to work with the South West imaging dataset from exercise 4.
Task:
Filter the data set for all records where
n_referrals
is greater than 100,000Filter the results for all records where
mdn_days_rtt
is greater than 30
# your code here ...
Exercise 8#
Continue to work with the South West imaging dataset from exercise 4.
Task:
What is the total number of imaging referrals made to each trust?
Save the total number of referrals to a file called
total_referrals.csv
Write code to return the
provider
with the highest number of referrals.
Hints:
To solve the first part of this exercise you can make use of a DataFrame
method called groupby()
. This will allow you to group results by provider
. For example:
sw_imaging.groupby(by='provider')
This code returns DataFrameGroupby
object. You can then limit it to a specific column like so.
# This is a `SeriesGroupBy` object
sw_imaging.groupby(by='provider')['n_referrals']
You can then use on of the standard pandas
statistical operations on group by object to return a Series
. For example, .sum()
, or .mean()
# returns a `Series` containing your group by results summed.
sw_imaging.groupby(by='provider')['n_referrals'].sum()
Groupby is a very powerful feature of
pandas
. Full documentation is here
# your code here ...