Case study: a large remote dataset#
Here we will download and extract a remote NHS synthetic dataset that uses a non-standard compression format.
NHS England publishes a fantastic synthetic dataset for Accident and Emergency departments in England. This is a relatively large, clean dataset that contains over 65M rows.
In this worked example you will learn:
How to download a remote data source.
How to use third party libraries to decompress a .7z archive.
How to reduce the size of a large pandas.DataFrame in memory.
Complications in using the full dataset#
Even though this dataset is sanitised and pre-cleaned, there are two complications if you are new to data science.
Compression format#
The public URL for this dataset provides access to the data in 7zip format (.7z). The .7z format offers a good compression ratio (4.5GB compresses to 630MB), but at the cost of using a non-standard (and, to be honest, Microsoft Windows centric) format.
Ideally, all machine learning studies have a reproducible pipeline from the raw data at source through to the final analysis and results. At the time of writing, pandas does not natively handle .7z, so either a third party dependency is needed or the archive must be extracted outside of the Python environment. pandas can handle other compression formats natively; for example, .zip and .tar.
Here we will focus on how to decompress a .7z archive within Python using a third party dependency.
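For example (the file names below are hypothetical, purely to illustrate the difference), pandas will infer and handle .zip compression directly from the file extension, whereas a .7z archive must be extracted first using a library such as py7zr:

import pandas as pd
import py7zr

# pandas infers supported compression (e.g. .zip, .gz) from the extension,
# so a zipped CSV can be read directly (hypothetical file name)
df_zip = pd.read_csv('data.csv.zip')

# .7z is not supported natively: extract the (hypothetical) archive first,
# then read the extracted CSV as normal
with py7zr.SevenZipFile('data.7z', mode='r') as archive:
    archive.extractall()
df_7z = pd.read_csv('data.csv')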
Memory requirements#
This notebook demonstrates how to download the NHS England hosted dataset. This is a large dataset to download: 630MB compressed and 4.5GB uncompressed.
If read directly into pandas without considering the data type of each column, the size in memory is 8.8GB. Depending on your machine this may be extremely problematic (it is even problematic using the generous memory provided by Google Colab in the cloud). One solution is to make use of a third party scheduling tool such as Dask. Alternatively, if pandas data types are specified for each column, the size in memory is reduced to ~2.3GB.
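As a minimal sketch of why data types matter (the column name and values below are illustrative, not the real A&E schema), compare the memory used by one million text codes stored as object versus category:

import numpy as np
import pandas as pd

# one million rows of a repeating text code (illustrative data only)
rng = np.random.default_rng(42)
codes = rng.choice(['Type 1', 'Type 2', 'Type 3'], size=1_000_000)
df_demo = pd.DataFrame({'department_type': codes})

# memory used by the default object dtype vs the category dtype
as_object = df_demo['department_type'].memory_usage(deep=True)
as_category = df_demo['department_type'].astype('category').memory_usage(deep=True)
print(f'object: {as_object / 1e6:.1f} MB, category: {as_category / 1e6:.1f} MB')

The same idea, applied column by column across all 65M rows, is what takes the full dataset from 8.8GB down to ~2.3GB in memory.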
pip installs#
This notebook can be run using the hds_code environment. Outside of this environment you will need to install py7zr using the code below.
import sys

if 'google.colab' in sys.modules:
    # This is the package that can extract .7z
    !pip install py7zr==0.14.1
Imports#
import pandas as pd
import os
import time
# used for extracting .7z
import py7zr
# pythonic cross platform approach for HTTPS request
import requests
Helper functions#
We define the following helper functions:
def as_dtype_dict(df):
    '''
    Converts a `DataFrame` of column names and data types
    to a dict suitable for importing data.

    Params:
    -------
    df: pandas.DataFrame
        Two columns: column names and pandas data types.

    Returns:
    --------
    dict
        {column_name: data_type}
    '''
    as_list = df.to_dict('split')['data']
    return {i[0]: i[1] for i in as_list}
def read_ed_data(data_types, archive, nhs_fname, clean_fname,
                 verbose=True):
    '''
    Reads the synthetic ED data in from the compressed archive.

    Params:
    -------
    data_types: dict
        dict linking column names to data types
    archive: str
        name of the archive file in .7z format
    nhs_fname: str
        the NHS file name of the extracted CSV
    clean_fname: str
        a cleaner, more usable file name for the CSV
    verbose: bool, optional (default=True)
        print progress messages while extracting and reading

    Returns:
    --------
    pandas.DataFrame
    '''
    # extract the synthetic data from the .7z archive
    if verbose: print('extracting =>', end=' ')
    with py7zr.SevenZipFile(archive, mode='r') as z:
        z.extractall()
    if verbose: print('done')

    # rename the extracted CSV to the cleaner file name
    os.rename(nhs_fname, clean_fname)

    # read into pandas, using the specified column data types
    if verbose: print('reading CSV =>', end=' ')
    df = pd.read_csv(clean_fname, dtype=data_types)
    if verbose: print('done')

    # clean up the uncompressed file as it is > 4.5GB in size
    os.remove(clean_fname)

    return df
def file_size(url):
    '''
    Prints the remote file size in MB, taken from the
    Content-Length header of a HTTP HEAD request.
    '''
    response = requests.head(url)
    conversion = 1_000_000
    fsize = float(response.headers['Content-Length']) / conversion
    print(f'File size: {fsize:.1f} MB')
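For example, as_dtype_dict converts a two column DataFrame of field names and data types into a dict that can be passed to the dtype argument of pandas.read_csv. The field names below are made up for illustration; the real meta data file is read later in the notebook:

meta_demo = pd.DataFrame({'field': ['AGE', 'SEX'], 'dtype': ['uint8', 'category']})
print(as_dtype_dict(meta_demo))
# {'AGE': 'uint8', 'SEX': 'category'}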
Parameters#
All of the data used in this notebook is held remotely.
The first URL, DATA_URL, is the location of the compressed NHS Synthetic A&E dataset. META_URL is a link to a Comma Separated Value (CSV) file on GitHub that contains user defined meta data about the NHS dataset. There are two columns: field names and pandas data types.
We also have some parameters related to cleaning up files:
NHS_FNAME: when decompressed, the NHS file name contains both whitespace and the character '&'. This is bad practice, so we rename the file to:
CLEAN_FNAME: a more standard file name for the CSV.
CLEAN_ARCHIVE_NAME: when downloaded, the archive has an unusual file name and extension. We rename it to make it easier to work with.
CLEAN_UP_WHEN_DONE: if True, the downloaded .7z file is deleted from the local machine's hard drive when the analysis is complete.
RESPONSE_SUCCESS: a constant. The HTTP status code 200 is returned if the dataset downloads successfully.
# NHS data url
DATA_URL = 'https://nhsengland-direct-uploads.s3-eu-west-1.amazonaws.com/' \
+ 'A%26E+Synthetic+Data.7z?versionId=null'
# meta data (created by me - contains pandas data types for columns)
META_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/synthetic_ed_meta_data.csv'
# renaming info for file names
NHS_FNAME = 'A&E Synthetic Data.csv'
CLEAN_FNAME = 'ed_synthetic.csv'
CLEAN_ARCHIVE_NAME = 'ed_synthetic.7z'
# delete the downloaded file when done.
CLEAN_UP_WHEN_DONE = True
# download successful
RESPONSE_SUCCESS = 200
Download code#
There are multiple ways to download the dataset. Here we will request the file in Python using the requests library. This file is 630MB in size and download time will vary depending on your connection.
Let's test and see this in action.
# download file headers and report file size
file_size(DATA_URL)
# download the file...(only needs to be done once).
print('downloading =>', end=' ')
response = requests.get(DATA_URL)
if response.status_code == RESPONSE_SUCCESS:
    print('successful')
    # write to file
    with open(CLEAN_ARCHIVE_NAME, 'wb') as f:
        f.write(response.content)
else:
    print('file not found.')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[5], line 2
1 # download file headers and report file size
----> 2 file_size(DATA_URL)
4 # download the file...(only needs to be done once).
5 print('downloading =>', end=' ')
Cell In[3], line 66, in file_size(url)
64 response = requests.head(url)
65 conversion = 1_000_000
---> 66 fsize = float(response.headers['Content-Length']) / conversion
67 print(f'File size: {fsize:.1f} MB')
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/requests/structures.py:52, in CaseInsensitiveDict.__getitem__(self, key)
51 def __getitem__(self, key):
---> 52 return self._store[key.lower()][1]
KeyError: 'content-length'
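Note that requests.get as used above holds the full ~630MB response in memory before it is written to disk. As an optional alternative (a sketch only, not part of the original pipeline), the download can be streamed to disk in chunks:

# stream the archive to disk in 1MB chunks rather than holding it all in memory
response = requests.get(DATA_URL, stream=True)
if response.status_code == RESPONSE_SUCCESS:
    with open(CLEAN_ARCHIVE_NAME, 'wb') as f:
        for chunk in response.iter_content(chunk_size=1_048_576):
            f.write(chunk)
else:
    print('file not found.')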
Decompress the .7z archive and read into pandas#
The synthetic A&E dataset is a big file when extracted (> 4.5GB) and will take 5-7 minutes to read into pandas.
The function read_ed_data will first decompress the .7z file and then read it into a DataFrame. The code automatically cleans up (deletes) the decompressed file afterwards. This is an optional step; if you need to re-read the data multiple times in your analysis code, you may wish to keep it in its decompressed form until the end.
def main():
    '''
    Read in the data, with specified column data types.

    Returns:
    --------
    pd.DataFrame
    '''
    # read data types for each column of the synthetic ED data
    data_types = pd.read_csv(META_URL)

    # extract and read
    df = read_ed_data(as_dtype_dict(data_types),
                      CLEAN_ARCHIVE_NAME,
                      NHS_FNAME, CLEAN_FNAME)
    return df
Test#
# time a full read of the data, with column data types specified
start_time = time.time()
df = main()
print(df.info())
print(f'{time.time() - start_time}s')
df.head(5)
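If you want the exact in-memory footprint (by default df.info() gives a shallow estimate that understates object columns), memory_usage with deep=True sums the true size of every column:

# exact in-memory size in GB, including the contents of object/category columns
print(f'{df.memory_usage(deep=True).sum() / 1e9:.2f} GB')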
# cleans up
if CLEAN_UP_WHEN_DONE:
    os.remove(CLEAN_ARCHIVE_NAME)