Case study: a large remote dataset#

Here we will download and extract a remote NHS synthetic dataset that uses a non-standard compression format.

NHS England publishes a fantastic synthetic dataset for Accident and Emergency (A&E) departments in England. This is a relatively large, clean dataset that contains over 65M rows.

In this worked example you will learn:

  • How to download a remote data source.

  • How to use a third-party library to decompress a .7z archive.

  • How to reduce the size of a large pandas.DataFrame in memory.


Complications in using the full dataset#

Even though this dataset is sanitised and pre-cleaned, there are two complications if you are new to data science.

Compression format#

  • The public URL for this dataset provides access to the data in 7zip format (.7z). The .7z format offers a good compression ratio (4.5GB compresses to 630MB), but at the cost of using a non-standard (and, to be honest, Microsoft Windows centric) format.

  • Ideally, all machine learning studies have a reproducible pipeline from raw data at source through to final analysis and results. At the time of writing, pandas does not natively handle .7z: a third-party dependency is needed, or the archive must be extracted outside of the Python environment.

  • pandas can handle other formats without an extra dependency; for example, .zip and .tar (see the sketch after this list).
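
For comparison, here is a minimal sketch of reading a natively supported archive, assuming a hypothetical data.zip that contains a single CSV file:

import pandas as pd

# pandas infers the compression from the file extension
# and decompresses on the fly - no extra library needed
df = pd.read_csv('data.zip')

# the compression can also be stated explicitly
df = pd.read_csv('data.zip', compression='zip')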

Here we will focus on how to decompress .7z within Python using a third-party dependency.

Memory requirements#

  • This notebook demonstrates how to download the NHS England hosted dataset. This is a large dataset to download: 630MB compressed and 4.5GB uncompressed.

  • If read directly into pandas without considering the datatype of each column, the size in memory is 8.8GB. Depending on your machine this may be extremely problematic (it is even problematic using the generous memory provided by Google Colab in the cloud). One solution is to use an out-of-core computing library such as Dask. If pandas datatypes are specified, the size is reduced to ~2.3GB in memory, as illustrated below.
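
The saving from explicit datatypes is easy to demonstrate at a small scale. A minimal sketch using a made-up column of small integers:

import numpy as np
import pandas as pd

# one million small integers, stored as int64 by default
df = pd.DataFrame({'age': np.random.randint(0, 100, size=1_000_000)})
print(df['age'].memory_usage(deep=True))  # ~8,000,000 bytes

# an int8 holds the range 0-99 comfortably in 1 byte per value
df['age'] = df['age'].astype('int8')
print(df['age'].memory_usage(deep=True))  # ~1,000,000 bytes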


pip installs#

This notebook can be run using the hds_code environment. Outside of this environment you will need to install py7zr; for example, the code below installs it when running in Google Colab.

import sys

if 'google.colab' in sys.modules:
    # py7zr is the package that can extract .7z archives
    !pip install py7zr==0.14.1

Imports#

import pandas as pd
import os
import time

# used for extracting .7z
import py7zr

# pythonic cross platform approach for HTTPS request
import requests

Helper functions#

We define the following helper functions:

def as_dtype_dict(df):
    '''
    Converts a DataFrame of column names and data types
    to a dict suitable for use with pd.read_csv.

    Params:
    -------
    df: pandas.DataFrame
        two columns: column name and pandas data type

    Returns:
    --------
    dict
        {column_name: data_type}
    '''
    # 'split' orientation yields a list of [name, dtype] rows
    as_list = df.to_dict('split')['data']
    return {name: dtype for name, dtype in as_list}
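
A quick illustration of what as_dtype_dict produces, using a made-up two-column metadata frame in the same layout as the remote file:

meta = pd.DataFrame({'field': ['age', 'gender'],
                     'dtype': ['int8', 'category']})
as_dtype_dict(meta)
# {'age': 'int8', 'gender': 'category'}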


def read_ed_data(data_types, archive, nhs_fname, clean_fname,
                 verbose=True):
    '''
    Extracts the compressed archive and reads the data into pandas.

    Params:
    -------
    data_types: dict
        dict linking column names to data types

    archive: str
        name of the archive file in .7z format

    nhs_fname: str
        the NHS file name of the extracted file

    clean_fname: str
        a more usable, cleaned-up file name

    verbose: bool, optional (default=True)
        print progress messages to stdout

    Returns:
    --------
    pandas.DataFrame
    '''

    # extract the synthetic data before reading into pandas
    if verbose: print('extracting =>', end=' ')
    with py7zr.SevenZipFile(archive, mode='r') as z:
        z.extractall()
    if verbose: print('done')

    # rename the extracted CSV to the cleaner file name
    os.rename(nhs_fname, clean_fname)

    # read into pandas with the specified column data types
    if verbose: print('reading CSV =>', end=' ')
    df = pd.read_csv(clean_fname, dtype=data_types)
    if verbose: print('done')

    # clean up the uncompressed file as it is > 4.5GB in size
    os.remove(clean_fname)
    
    return df
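
If you want to inspect an archive's contents before committing to a lengthy extraction, py7zr can list the file names it holds. A short sketch, assuming the archive has already been downloaded as ed_synthetic.7z:

with py7zr.SevenZipFile('ed_synthetic.7z', mode='r') as z:
    print(z.getnames())
# e.g. ['A&E Synthetic Data.csv']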


def file_size(url):
    '''
    Prints the remote file size in MB
    (if the server reports a Content-Length header).
    '''
    response = requests.head(url)
    conversion = 1_000_000

    # some servers omit the Content-Length header, so guard against it
    fsize = response.headers.get('Content-Length')
    if fsize is None:
        print('File size: unknown (no Content-Length header)')
    else:
        print(f'File size: {float(fsize) / conversion:.1f} MB')

Parameters#

All of the data used in this notebook is held remotely.

  • The first URL, DATA_URL, is the location of the compressed NHS Synthetic A&E dataset.

  • META_URL is a link to a user-defined Comma Separated Value (CSV) file on GitHub that contains metadata about the NHS dataset. There are two columns: field names and pandas data types.

We also have some parameters to clean up:

  • NHS_FNAME: when decompressed, the NHS file name contains both whitespace and the character ‘&’. This is bad practice, so we rename the file to:

  • CLEAN_FNAME: a more standard file name for the CSV.

  • CLEAN_ARCHIVE_NAME: when downloaded, the archive has an unusual file name and extension. We rename this to make it easier to work with.

  • CLEAN_UP_WHEN_DONE: if True, the downloaded .7z file is deleted from the local machine’s hard drive when the analysis is complete.

  • RESPONSE_SUCCESS: a constant holding the HTTP status code 200, which is returned when the dataset downloads successfully.

# NHS data url
DATA_URL = 'https://nhsengland-direct-uploads.s3-eu-west-1.amazonaws.com/' \
             + 'A%26E+Synthetic+Data.7z?versionId=null'

# metadata (created by me - contains pandas data types for columns)
META_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
                + 'hpdm139-datasets/main/synthetic_ed_meta_data.csv'

# renaming info for file names
NHS_FNAME = 'A&E Synthetic Data.csv'
CLEAN_FNAME = 'ed_synthetic.csv'
CLEAN_ARCHIVE_NAME = 'ed_synthetic.7z'

# delete the downloaded file when done.
CLEAN_UP_WHEN_DONE = True

# download successful
RESPONSE_SUCCESS = 200

Download code#

There are multiple ways to download the dataset. Here we will request the file in Python using the requests library. The file is 630MB in size and download time will vary depending on your connection.

Let’s test and see this in action.

# download file headers and report file size
file_size(DATA_URL)

# download the file...(only needs to be done once).
print('downloading =>', end=' ')
response = requests.get(DATA_URL)

if response.status_code == RESPONSE_SUCCESS:
    print('successful')

    # write to file
    with open(CLEAN_ARCHIVE_NAME, 'wb') as f:
        f.write(response.content)
else:
    print(f'download failed (status code: {response.status_code})')
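Note that requests.get() above buffers the full 630MB response in memory before it is written to disk. If memory is tight, a streamed download writes the archive to disk in chunks instead; a minimal sketch using the same parameters:

# stream the archive to disk in 1MB chunks rather than
# holding the whole 630MB response in memory
with requests.get(DATA_URL, stream=True) as response:
    if response.status_code == RESPONSE_SUCCESS:
        with open(CLEAN_ARCHIVE_NAME, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1_048_576):
                f.write(chunk)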

Decompress the .7z archive and read into pandas#

The synthetic A&E dataset is a big file when extracted (> 4.5GB) and will take 5-7 minutes to read into pandas.

The function read_ed_data first decompresses the .7z file and then reads the CSV into a DataFrame. The code automatically deletes the decompressed file afterwards. This step is optional: if you need to re-read the data multiple times in your analysis, you may wish to keep the decompressed copy until the end.
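If even the reduced ~2.3GB frame is too large for your machine, pandas can instead read the CSV in pieces that you filter or aggregate before combining. A minimal sketch, assuming the extracted file ed_synthetic.csv is still on disk and data_types is the dict produced by as_dtype_dict:

# read the extracted CSV in 1M-row chunks and recombine;
# each chunk could also be filtered or aggregated to save memory
reader = pd.read_csv('ed_synthetic.csv', dtype=data_types,
                     chunksize=1_000_000)
df = pd.concat(reader)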

def main():
    '''
    Read in the data, with specified column data types.

    Returns:
    --------
    pd.DataFrame
    '''
    # read data types for each column of the synthetic ED data
    data_types = pd.read_csv(META_URL)

    # extract and read
    df = read_ed_data(as_dtype_dict(data_types),
                      CLEAN_ARCHIVE_NAME,
                      NHS_FNAME, CLEAN_FNAME)
    return df

Test#

# read the data, specifying column data types
start_time = time.time()
df = main()
df.info()
print(f'{time.time() - start_time:.2f}s')
df.head(5)
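
To verify the in-memory size reported above, you can also ask pandas for the frame’s total memory usage directly:

# deep=True measures object/string columns accurately
print(f'{df.memory_usage(deep=True).sum() / 1e9:.2f} GB')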
# clean up the downloaded archive when the analysis is complete
if CLEAN_UP_WHEN_DONE:
    os.remove(CLEAN_ARCHIVE_NAME)

End#