Working with a pandas.Dataframe

Working with a `pandas.Dataframe`#

These exercises and drills aim to build your experience of using an pandas.DataFrame. Try to work through them all without looking at the answers first.

Tip: Undoutedly in practice you will work with more complex examples and data than those found here. These exercises are a very gentle introduction to the syntax of `pandas`. But don't worry we have some more complicated data wrangling examples coming up!

Remember for any code where you wish to use pandas you need to import it. We will also import numpy to help us generate some synthetic data to use.

import pandas as pd
import numpy as np

Exercise 1#

Task:

Create a pandas.Series named “number” with numbers from 1 to 10,000 inclusive.
The datatype of the series should be np.uint32
Check the length of the Series
View the head and tail of the series to quickly validate your code has worked.

# your code here ...

Exercise 2#

Task:

Create a pandas.DataFrame with 5 rows and 5 columns.
The data contained in each columns should all be of type int64.
Columns should be titled “col_1”, “col_2” … “col_5”
Check the datatype, shape and column names using .info()
View the DataFrame
View only the first 2 rows in the DataFrame
Create a new variable col_4 of type pandas.Series that references the data in col_4

Hints:

Data and can take any valid int64 value.
One option is to generate a random matrix using a numpy.random.Generator

# your code here...

Exercise 3:#

The data below represents outpatient bookings for a hospital.

unique_id	surname	age	female	first_appoint
het1m	Hetfield	50	False	False
ulr33f	Ulrich	89	True	True
ham1f	Hammett	32	True	True
tru4m	Trujillo	65	False	True

Task:

Using a method of your choice create a pandas.DataFrame that represents the patient bookings.
You should set the DataFrame index to be unique_id (assume it is unique)
Filter the dataset for just the ham1f booking
Filer the dataset for the ‘tru4m’ and ‘ulr33f’ bookings.

# your code here ...

Exercise 4:#

The data contained in sw_imaging.csv is based on data taken from NHS England’s open diagnostic imaging dataset from 2019/20..

This subset contains information on the number of patients referred for imaging, the median waiting time for the imaging and the median waiting time between the imaging and the report. Results are broken down by provider and imaging type.

Task:

Read the CSV data into an appropriately named DataFrame(data URL provided below)
View the information about the DataFrame
View the shape of the DataFrame
View the first two rows in the DataFrame

# your code here...

DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
            + 'hpdm139-datasets/main/sw_imaging.csv'

Exercise 5:#

Continue to work with the South West imaging dataset from exercise 4.

Task:

Filter the DataFrame to only show results for “Magnetic Resonance Imaging”
Filter the DataFrame to only show results for “Computerized Axial Tomography”
Filter the DataFrame to only show results for the “Royal Devon and Exeter NHS Foundation Trust”
Filer the DataFrame to only show results for “Magnetic Resonance Imaging” at the “Royal Devon and Exeter NHS Foundation Trust”

Hints:

The type of imaging is a str held in the column imaging_type
The provider of the imaging is a str held in the column provider
The final filter is a bit more tricky to get right first time. Let’s assume you need to find “Diagnostic Ultrasonography” figures for org_code “RK9”. This would be solved as follows:

# assume your `DataFrame` is called sw_imaging
sw_imaging[(sw_imaging['org_code'] == 'RK9') &
           (sw_imaging['imaging_type'] == 'Diagnostic Ultrasonography')]

# your code here ...

Exercise 6:#

Continue to work with the South West imaging dataset from exercise 4.

We will now look at combining both a logical and with a logical or when selecting rows from a DataFrame.

Task:

Filter the DataFrame to return both Magnetic Resonance Imaging and Computerized Axial Tomography undertaken at org_code=’REF’ (cornwall)

Hints:

Think carefully about how to express the above task logically.
- You need results when the (org_code == <code>) AND ((imaging_type == <type_1>) OR (imaging_type == <type_2>))
In pandas logical AND is represented by & and OR by |

# your code here ...

Exercise 7#

Continue to work with the South West imaging dataset from exercise 4.

Task:

Filter the data set for all records where n_referrals is greater than 100,000
Filter the results for all records where mdn_days_rtt is greater than 30

# your code here ...

Exercise 8#

Continue to work with the South West imaging dataset from exercise 4.

Task:

What is the total number of imaging referrals made to each trust?
Save the total number of referrals to a file called total_referrals.csv
Write code to return the provider with the highest number of referrals.

Hints:

To solve the first part of this exercise you can make use of a DataFrame method called groupby(). This will allow you to group results by provider. For example:

sw_imaging.groupby(by='provider')

This code returns DataFrameGroupby object. You can then limit it to a specific column like so.

# This is a `SeriesGroupBy` object
sw_imaging.groupby(by='provider')['n_referrals']

You can then use on of the standard pandas statistical operations on group by object to return a Series. For example, .sum(), or .mean()

# returns a `Series` containing your group by results summed.
sw_imaging.groupby(by='provider')['n_referrals'].sum()

Groupby is a very powerful feature of pandas. Full documentation is here

# your code here ...

Working with a pandas.Dataframe

Contents

Working with a pandas.Dataframe#

Exercise 1#

Exercise 2#

Exercise 3:#

Exercise 4:#

Exercise 5:#

Exercise 6:#

Exercise 7#

Exercise 8#

Working with a `pandas.Dataframe`#