import pandas as pd

Working with files#

For most of your health data science work you will be working with larger datasets containing tens or hundreds of features. Unless your dataset is very large, the storage medium is likely to be a file. Here we will take a look at how to read in a file from local storage, from a URL, and from a compressed archive. We’ll then take a short look at how to summarise a whole DataFrame or a subset of it.

Reading data into pandas from a flat file#

A common task is to work with data stored in one or more files. For example, a Comma Separated Value (CSV) file or another type of delimited (e.g. tab or pipe) file.

There are a number of scenarios you may encounter:

  1. The data file is held locally on your machine (or a network drive)

  2. The data file is accessed via a URL (e.g. it is hosted on GitHub or a third party website)

  3. The data file is compressed, e.g. in .zip format

The good news is that reading from a local directory and reading from a remote URL are identical in pandas. In both cases we can use the pd.read_csv() function, specifying either the local path to the file or the URL.
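To illustrate the local-file case without depending on a particular dataset, here is a minimal sketch that writes a small CSV to a temporary directory and reads it back with a local path; the file name and contents are invented for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# write a tiny, invented CSV to a temporary directory
tmp_dir = Path(tempfile.mkdtemp())
local_path = tmp_dir / 'example.csv'
local_path.write_text('id,diagnosis\n1,M\n2,B\n')

# reading from a local path uses exactly the same call as a URL
df_local = pd.read_csv(local_path, index_col='id')
print(df_local.shape)
```

The only thing that changes between the local and remote cases is the string (or path) passed as the first argument.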

As an example, let’s read in the famous Wisconsin Breast Cancer dataset from GitHub.

# Wisconsin breast cancer dataset URL
url = 'https://raw.githubusercontent.com/health-data-science-OR/' \
      + 'hpdm139-datasets/main/wisconsin.csv'

# read into dataframe
df = pd.read_csv(url, index_col='id')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 146.7+ KB
df.shape
(569, 32)

Now let’s read in the same file, but this time it is compressed in .zip format.

url = 'https://raw.githubusercontent.com/health-data-science-OR/' \
      + 'hpdm139-datasets/main/wisconsin.zip'
df = pd.read_csv(url, index_col='id')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 146.7+ KB
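Behind the scenes, pd.read_csv() infers the compression from the file extension (its compression parameter defaults to 'infer'), which is why no extra arguments were needed above. Here is a self-contained sketch using an invented two-row CSV zipped into a temporary file, with the compression set explicitly.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# create a zip archive containing a single, invented CSV file
tmp = Path(tempfile.mkdtemp()) / 'example.zip'
with zipfile.ZipFile(tmp, 'w') as zf:
    zf.writestr('example.csv', 'id,diagnosis\n1,M\n2,B\n')

# compression='zip' can be stated explicitly; 'infer' (the default)
# would deduce it from the .zip extension
df_zip = pd.read_csv(tmp, index_col='id', compression='zip')
print(df_zip.shape)
```

Note that this works because the archive contains exactly one file; pandas cannot choose between multiple members of a zip.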

Describing features and making selections#

Now that you are working with a larger dataset you will want to summarise it. The good news is that pandas makes this very easy! You can use DataFrame.describe() to quickly summarise data. The result of the .describe() method is a DataFrame that contains summary statistics for each variable. By default, the index of the resulting DataFrame is the summary statistic descriptor.

Before we do that, we will drop the 'Unnamed: 0' column as it is not useful.

# drop the 'unnamed' column; axis = 1 for columns.
df = df.drop(labels=['Unnamed: 0'], axis=1)
df.describe()
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns
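The quantiles reported by describe() are also configurable via its percentiles parameter. A short sketch on synthetic data (the column and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# invented data: 100 draws loosely resembling the area_mean scale
rng = np.random.default_rng(42)
demo = pd.DataFrame({'area_mean': rng.normal(650.0, 350.0, size=100)})

# request the 10th, 50th and 90th percentiles instead of quartiles
summary = demo.describe(percentiles=[0.1, 0.5, 0.9])
print(summary.index.tolist())
```

The count, mean, std, min and max rows are always included; only the quantile rows change.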

When there is a large number of features pandas might not be able to show you the full summary. There are a few ways to obtain sub-summaries if needed. The first is to transpose the result of describe().

df.describe().T
count mean std min 25% 50% 75% max
radius_mean 569.0 14.127292 3.524049 6.981000 11.700000 13.370000 15.780000 28.11000
texture_mean 569.0 19.289649 4.301036 9.710000 16.170000 18.840000 21.800000 39.28000
perimeter_mean 569.0 91.969033 24.298981 43.790000 75.170000 86.240000 104.100000 188.50000
area_mean 569.0 654.889104 351.914129 143.500000 420.300000 551.100000 782.700000 2501.00000
smoothness_mean 569.0 0.096360 0.014064 0.052630 0.086370 0.095870 0.105300 0.16340
compactness_mean 569.0 0.104341 0.052813 0.019380 0.064920 0.092630 0.130400 0.34540
concavity_mean 569.0 0.088799 0.079720 0.000000 0.029560 0.061540 0.130700 0.42680
concave points_mean 569.0 0.048919 0.038803 0.000000 0.020310 0.033500 0.074000 0.20120
symmetry_mean 569.0 0.181162 0.027414 0.106000 0.161900 0.179200 0.195700 0.30400
fractal_dimension_mean 569.0 0.062798 0.007060 0.049960 0.057700 0.061540 0.066120 0.09744
radius_se 569.0 0.405172 0.277313 0.111500 0.232400 0.324200 0.478900 2.87300
texture_se 569.0 1.216853 0.551648 0.360200 0.833900 1.108000 1.474000 4.88500
perimeter_se 569.0 2.866059 2.021855 0.757000 1.606000 2.287000 3.357000 21.98000
area_se 569.0 40.337079 45.491006 6.802000 17.850000 24.530000 45.190000 542.20000
smoothness_se 569.0 0.007041 0.003003 0.001713 0.005169 0.006380 0.008146 0.03113
compactness_se 569.0 0.025478 0.017908 0.002252 0.013080 0.020450 0.032450 0.13540
concavity_se 569.0 0.031894 0.030186 0.000000 0.015090 0.025890 0.042050 0.39600
concave points_se 569.0 0.011796 0.006170 0.000000 0.007638 0.010930 0.014710 0.05279
symmetry_se 569.0 0.020542 0.008266 0.007882 0.015160 0.018730 0.023480 0.07895
fractal_dimension_se 569.0 0.003795 0.002646 0.000895 0.002248 0.003187 0.004558 0.02984
radius_worst 569.0 16.269190 4.833242 7.930000 13.010000 14.970000 18.790000 36.04000
texture_worst 569.0 25.677223 6.146258 12.020000 21.080000 25.410000 29.720000 49.54000
perimeter_worst 569.0 107.261213 33.602542 50.410000 84.110000 97.660000 125.400000 251.20000
area_worst 569.0 880.583128 569.356993 185.200000 515.300000 686.500000 1084.000000 4254.00000
smoothness_worst 569.0 0.132369 0.022832 0.071170 0.116600 0.131300 0.146000 0.22260
compactness_worst 569.0 0.254265 0.157336 0.027290 0.147200 0.211900 0.339100 1.05800
concavity_worst 569.0 0.272188 0.208624 0.000000 0.114500 0.226700 0.382900 1.25200
concave points_worst 569.0 0.114606 0.065732 0.000000 0.064930 0.099930 0.161400 0.29100
symmetry_worst 569.0 0.290076 0.061867 0.156500 0.250400 0.282200 0.317900 0.66380
fractal_dimension_worst 569.0 0.083946 0.018061 0.055040 0.071460 0.080040 0.092080 0.20750

If the column labels share a common substring, you can use the filter() method to restrict your summary.

df.filter(like='worst').describe().T.round(2)
count mean std min 25% 50% 75% max
radius_worst 569.0 16.27 4.83 7.93 13.01 14.97 18.79 36.04
texture_worst 569.0 25.68 6.15 12.02 21.08 25.41 29.72 49.54
perimeter_worst 569.0 107.26 33.60 50.41 84.11 97.66 125.40 251.20
area_worst 569.0 880.58 569.36 185.20 515.30 686.50 1084.00 4254.00
smoothness_worst 569.0 0.13 0.02 0.07 0.12 0.13 0.15 0.22
compactness_worst 569.0 0.25 0.16 0.03 0.15 0.21 0.34 1.06
concavity_worst 569.0 0.27 0.21 0.00 0.11 0.23 0.38 1.25
concave points_worst 569.0 0.11 0.07 0.00 0.06 0.10 0.16 0.29
symmetry_worst 569.0 0.29 0.06 0.16 0.25 0.28 0.32 0.66
fractal_dimension_worst 569.0 0.08 0.02 0.06 0.07 0.08 0.09 0.21
df.filter(like='mean').describe().T.round(2)
count mean std min 25% 50% 75% max
radius_mean 569.0 14.13 3.52 6.98 11.70 13.37 15.78 28.11
texture_mean 569.0 19.29 4.30 9.71 16.17 18.84 21.80 39.28
perimeter_mean 569.0 91.97 24.30 43.79 75.17 86.24 104.10 188.50
area_mean 569.0 654.89 351.91 143.50 420.30 551.10 782.70 2501.00
smoothness_mean 569.0 0.10 0.01 0.05 0.09 0.10 0.11 0.16
compactness_mean 569.0 0.10 0.05 0.02 0.06 0.09 0.13 0.35
concavity_mean 569.0 0.09 0.08 0.00 0.03 0.06 0.13 0.43
concave points_mean 569.0 0.05 0.04 0.00 0.02 0.03 0.07 0.20
symmetry_mean 569.0 0.18 0.03 0.11 0.16 0.18 0.20 0.30
fractal_dimension_mean 569.0 0.06 0.01 0.05 0.06 0.06 0.07 0.10
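filter() is not limited to substring matching with like: its regex parameter accepts a regular expression, which is handy when the pattern must be anchored. A minimal sketch on an invented DataFrame, selecting only columns that end in '_se':

```python
import pandas as pd

# invented columns mimicking the dataset's naming convention
demo = pd.DataFrame({'radius_mean': [1.0], 'radius_se': [0.1],
                     'radius_worst': [2.0], 'texture_se': [0.2]})

# '$' anchors the match to the end of the label, so 'radius_worst'
# and 'radius_mean' are excluded
se_cols = demo.filter(regex='_se$')
print(se_cols.columns.tolist())
```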

Selecting a subset of the DataFrame#

Let’s assume, for example, that you need to take a closer look at the smoothness statistics for samples where the area_mean is above the 3rd quartile (782.7).

The first step is to create a DataFrame (here called extremes) that contains the subset of data where area_mean exceeds the 3rd quartile.

# get the threshold (could be hard coded as 782 as an alternative)
threshold = df.describe().T.loc['area_mean']['75%']

# get the extremes of the dataset
extremes = df.loc[df['area_mean'] > threshold]
extremes.shape
(142, 31)
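The boolean-mask selection above can also be written with DataFrame.query(), which some find more readable; local variables are referenced with an @ prefix inside the query string. A minimal sketch on invented data:

```python
import pandas as pd

# invented data on roughly the same scale as area_mean
demo = pd.DataFrame({'area_mean': [400.0, 800.0, 1200.0],
                     'smoothness_mean': [0.09, 0.10, 0.11]})

threshold = 782.0
# '@threshold' refers to the Python variable defined above
extremes_demo = demo.query('area_mean > @threshold')
print(extremes_demo.shape)
```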

Given the naming convention in the dataset, we can use .filter() to select columns containing the string 'smoothness' and then call the describe method.

extremes.filter(like='smoothness').describe()
smoothness_mean smoothness_se smoothness_worst
count 142.000000 142.000000 142.000000
mean 0.100433 0.006653 0.138356
std 0.011829 0.003180 0.018561
min 0.073710 0.002667 0.087740
25% 0.092673 0.004951 0.124750
50% 0.099985 0.006106 0.139950
75% 0.107775 0.007547 0.150150
max 0.144700 0.031130 0.187300

An alternative is to specify the list of columns you want to summarise.

to_select = ['smoothness_mean', 'smoothness_se', 'smoothness_worst']
extremes[to_select].describe()
smoothness_mean smoothness_se smoothness_worst
count 142.000000 142.000000 142.000000
mean 0.100433 0.006653 0.138356
std 0.011829 0.003180 0.018561
min 0.073710 0.002667 0.087740
25% 0.092673 0.004951 0.124750
50% 0.099985 0.006106 0.139950
75% 0.107775 0.007547 0.150150
max 0.144700 0.031130 0.187300
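As a final note, the row condition and the column list can also be combined in a single .loc selection, avoiding the intermediate DataFrame. A minimal sketch on invented data:

```python
import pandas as pd

# invented data with the same naming convention as the dataset
demo = pd.DataFrame({'area_mean': [400.0, 800.0, 1200.0],
                     'smoothness_mean': [0.09, 0.10, 0.11],
                     'smoothness_se': [0.005, 0.006, 0.007]})

# rows where area_mean exceeds the threshold, restricted to
# the smoothness columns, in one .loc call
subset = demo.loc[demo['area_mean'] > 782.0,
                  ['smoothness_mean', 'smoothness_se']]
print(subset.shape)
```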