Working with a pandas.Dataframe#

These exercises and drills aim to build your experience of using an pandas.DataFrame. Try to work through them all without looking at the answers first.

Tip: Undoutedly in practice you will work with more complex examples and data than those found here. These exercises are a very gentle introduction to the syntax of `pandas`. But don't worry we have some more complicated data wrangling examples coming up!

Remember for any code where you wish to use pandas you need to import it. We will also import numpy to help us generate some synthetic data to use.

import pandas as pd
import numpy as np

Exercise 1#

Task:

  • Create a pandas.Series named “number” with numbers from 1 to 10,000 inclusive.

  • The datatype of the series should be np.uint32

  • Check the length of the Series

  • View the head and tail of the series to quickly validate your code has worked.

# your code here ...

Exercise 2#

Task:

  • Create a pandas.DataFrame with 5 rows and 5 columns.

  • The data contained in each columns should all be of type int64.

  • Columns should be titled “col_1”, “col_2” … “col_5”

  • Check the datatype, shape and column names using .info()

  • View the DataFrame

  • View only the first 2 rows in the DataFrame

  • Create a new variable col_4 of type pandas.Series that references the data in col_4

Hints:

  • Data and can take any valid int64 value.

  • One option is to generate a random matrix using a numpy.random.Generator

# your code here...

Exercise 3:#

The data below represents outpatient bookings for a hospital.

unique_id

surname

age

female

first_appoint

het1m

Hetfield

50

False

False

ulr33f

Ulrich

89

True

True

ham1f

Hammett

32

True

True

tru4m

Trujillo

65

False

True

Task:

  • Using a method of your choice create a pandas.DataFrame that represents the patient bookings.

  • You should set the DataFrame index to be unique_id (assume it is unique)

  • Filter the dataset for just the ham1f booking

  • Filer the dataset for the ‘tru4m’ and ‘ulr33f’ bookings.

# your code here ...

Exercise 4:#

The data contained in sw_imaging.csv is based on data taken from NHS England’s open diagnostic imaging dataset from 2019/20..

This subset contains information on the number of patients referred for imaging, the median waiting time for the imaging and the median waiting time between the imaging and the report. Results are broken down by provider and imaging type.

Task:

  • Read the CSV data into an appropriately named DataFrame(data URL provided below)

  • View the information about the DataFrame

  • View the shape of the DataFrame

  • View the first two rows in the DataFrame

# your code here...

DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
            + 'hpdm139-datasets/main/sw_imaging.csv'

Exercise 5:#

Continue to work with the South West imaging dataset from exercise 4.

Task:

  • Filter the DataFrame to only show results for “Magnetic Resonance Imaging”

  • Filter the DataFrame to only show results for “Computerized Axial Tomography”

  • Filter the DataFrame to only show results for the “Royal Devon and Exeter NHS Foundation Trust”

  • Filer the DataFrame to only show results for “Magnetic Resonance Imaging” at the “Royal Devon and Exeter NHS Foundation Trust”

Hints:

  • The type of imaging is a str held in the column imaging_type

  • The provider of the imaging is a str held in the column provider

  • The final filter is a bit more tricky to get right first time. Let’s assume you need to find “Diagnostic Ultrasonography” figures for org_code “RK9”. This would be solved as follows:

# assume your `DataFrame` is called sw_imaging
sw_imaging[(sw_imaging['org_code'] == 'RK9') &
           (sw_imaging['imaging_type'] == 'Diagnostic Ultrasonography')]
# your code here ...

Exercise 6:#

Continue to work with the South West imaging dataset from exercise 4.

We will now look at combining both a logical and with a logical or when selecting rows from a DataFrame.

Task:

  • Filter the DataFrame to return both Magnetic Resonance Imaging and Computerized Axial Tomography undertaken at org_code=’REF’ (cornwall)

Hints:

  • Think carefully about how to express the above task logically.

    • You need results when the (org_code == <code>) AND ((imaging_type == <type_1>) OR (imaging_type == <type_2>))

  • In pandas logical AND is represented by & and OR by |

# your code here ...

Exercise 7#

Continue to work with the South West imaging dataset from exercise 4.

Task:

  • Filter the data set for all records where n_referrals is greater than 100,000

  • Filter the results for all records where mdn_days_rtt is greater than 30

# your code here ...

Exercise 8#

Continue to work with the South West imaging dataset from exercise 4.

Task:

  • What is the total number of imaging referrals made to each trust?

  • Save the total number of referrals to a file called total_referrals.csv

  • Write code to return the provider with the highest number of referrals.

Hints:

To solve the first part of this exercise you can make use of a DataFrame method called groupby(). This will allow you to group results by provider. For example:

sw_imaging.groupby(by='provider')

This code returns DataFrameGroupby object. You can then limit it to a specific column like so.

# This is a `SeriesGroupBy` object
sw_imaging.groupby(by='provider')['n_referrals']

You can then use on of the standard pandas statistical operations on group by object to return a Series. For example, .sum(), or .mean()

# returns a `Series` containing your group by results summed.
sw_imaging.groupby(by='provider')['n_referrals'].sum()

Groupby is a very powerful feature of pandas. Full documentation is here

# your code here ...