Working with files#
import pandas as pd
For most of your health data science work you will be working with larger datasets containing tens or hundreds of features. Unless your dataset is very large, the storage medium is likely to be a file. Here we will take a look at how to read in a file from local storage, from a URL, and from a compressed archive. We’ll then take a short look at how to summarise a whole DataFrame or a subset of one.
Reading data into pandas from a flat file#
A common task is to work with data stored in one or more files, for example in a Comma Separated Value (CSV) file or another type of delimited (e.g. tab or pipe) file.
There are a number of scenarios you may encounter:
The data file is held locally on your machine (or network drive)
The data file is accessed via a URL (e.g. it is located in GitHub or hosted on a third party website.)
The data file is compressed, e.g. in .zip format
The good news is that reading from a local directory and reading from a remote URL are identical in pandas. In both cases we can use the pd.read_csv() function, specifying either the local path to the file or the URL.
As an example, let’s read in the famous Wisconsin Breast Cancer dataset from GitHub.
# Wisconsin breast cancer dataset URL
url = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/wisconsin.csv'
# read into dataframe
df = pd.read_csv(url, index_col='id')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 146.7+ KB
df.shape
(569, 32)
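If the same file were stored locally, only the path would change. A minimal sketch, assuming a hypothetical local copy saved at data/wisconsin.csv:

# hypothetical local copy of the file (the path is an assumption)
df_local = pd.read_csv('data/wisconsin.csv', index_col='id')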
Now let’s read in the same file, but this time compressed in .zip format. Note that our code is identical apart from the file extension in the URL.
url = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/wisconsin.zip'
df = pd.read_csv(url, index_col='id')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 146.7+ KB
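By default pandas infers the compression from the file extension (compression='infer'). If the extension were missing or non-standard, you could pass the compression argument explicitly. A minimal sketch:

# explicitly tell pandas that the file is zip-compressed
df = pd.read_csv(url, index_col='id', compression='zip')

Note that for .zip the archive must contain a single data file for pd.read_csv() to read it directly.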
Describing features and making selections#
Now that you are working with a larger dataset you will want to summarise it. The good news is that pandas makes this very easy! You can use DataFrame.describe() to quickly summarise data. The result of the .describe() method is a DataFrame that contains summary statistics for each variable. By default the index of the results DataFrame is the summary statistic descriptor (count, mean, std and so on).
Before we do that we will drop the ‘Unnamed: 0’ column as it is not useful.
# drop the 'Unnamed: 0' column; axis=1 means drop a column, not a row.
df = df.drop(labels=['Unnamed: 0'], axis=1)
df.describe()
|  | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 30 columns
When there is a large number of features pandas may not be able to show you the full summary. There are a few ways to get sub-summaries if needed. The first is to transpose the result of describe().
df.describe().T
|  | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
radius_mean | 569.0 | 14.127292 | 3.524049 | 6.981000 | 11.700000 | 13.370000 | 15.780000 | 28.11000 |
texture_mean | 569.0 | 19.289649 | 4.301036 | 9.710000 | 16.170000 | 18.840000 | 21.800000 | 39.28000 |
perimeter_mean | 569.0 | 91.969033 | 24.298981 | 43.790000 | 75.170000 | 86.240000 | 104.100000 | 188.50000 |
area_mean | 569.0 | 654.889104 | 351.914129 | 143.500000 | 420.300000 | 551.100000 | 782.700000 | 2501.00000 |
smoothness_mean | 569.0 | 0.096360 | 0.014064 | 0.052630 | 0.086370 | 0.095870 | 0.105300 | 0.16340 |
compactness_mean | 569.0 | 0.104341 | 0.052813 | 0.019380 | 0.064920 | 0.092630 | 0.130400 | 0.34540 |
concavity_mean | 569.0 | 0.088799 | 0.079720 | 0.000000 | 0.029560 | 0.061540 | 0.130700 | 0.42680 |
concave points_mean | 569.0 | 0.048919 | 0.038803 | 0.000000 | 0.020310 | 0.033500 | 0.074000 | 0.20120 |
symmetry_mean | 569.0 | 0.181162 | 0.027414 | 0.106000 | 0.161900 | 0.179200 | 0.195700 | 0.30400 |
fractal_dimension_mean | 569.0 | 0.062798 | 0.007060 | 0.049960 | 0.057700 | 0.061540 | 0.066120 | 0.09744 |
radius_se | 569.0 | 0.405172 | 0.277313 | 0.111500 | 0.232400 | 0.324200 | 0.478900 | 2.87300 |
texture_se | 569.0 | 1.216853 | 0.551648 | 0.360200 | 0.833900 | 1.108000 | 1.474000 | 4.88500 |
perimeter_se | 569.0 | 2.866059 | 2.021855 | 0.757000 | 1.606000 | 2.287000 | 3.357000 | 21.98000 |
area_se | 569.0 | 40.337079 | 45.491006 | 6.802000 | 17.850000 | 24.530000 | 45.190000 | 542.20000 |
smoothness_se | 569.0 | 0.007041 | 0.003003 | 0.001713 | 0.005169 | 0.006380 | 0.008146 | 0.03113 |
compactness_se | 569.0 | 0.025478 | 0.017908 | 0.002252 | 0.013080 | 0.020450 | 0.032450 | 0.13540 |
concavity_se | 569.0 | 0.031894 | 0.030186 | 0.000000 | 0.015090 | 0.025890 | 0.042050 | 0.39600 |
concave points_se | 569.0 | 0.011796 | 0.006170 | 0.000000 | 0.007638 | 0.010930 | 0.014710 | 0.05279 |
symmetry_se | 569.0 | 0.020542 | 0.008266 | 0.007882 | 0.015160 | 0.018730 | 0.023480 | 0.07895 |
fractal_dimension_se | 569.0 | 0.003795 | 0.002646 | 0.000895 | 0.002248 | 0.003187 | 0.004558 | 0.02984 |
radius_worst | 569.0 | 16.269190 | 4.833242 | 7.930000 | 13.010000 | 14.970000 | 18.790000 | 36.04000 |
texture_worst | 569.0 | 25.677223 | 6.146258 | 12.020000 | 21.080000 | 25.410000 | 29.720000 | 49.54000 |
perimeter_worst | 569.0 | 107.261213 | 33.602542 | 50.410000 | 84.110000 | 97.660000 | 125.400000 | 251.20000 |
area_worst | 569.0 | 880.583128 | 569.356993 | 185.200000 | 515.300000 | 686.500000 | 1084.000000 | 4254.00000 |
smoothness_worst | 569.0 | 0.132369 | 0.022832 | 0.071170 | 0.116600 | 0.131300 | 0.146000 | 0.22260 |
compactness_worst | 569.0 | 0.254265 | 0.157336 | 0.027290 | 0.147200 | 0.211900 | 0.339100 | 1.05800 |
concavity_worst | 569.0 | 0.272188 | 0.208624 | 0.000000 | 0.114500 | 0.226700 | 0.382900 | 1.25200 |
concave points_worst | 569.0 | 0.114606 | 0.065732 | 0.000000 | 0.064930 | 0.099930 | 0.161400 | 0.29100 |
symmetry_worst | 569.0 | 0.290076 | 0.061867 | 0.156500 | 0.250400 | 0.282200 | 0.317900 | 0.66380 |
fractal_dimension_worst | 569.0 | 0.083946 | 0.018061 | 0.055040 | 0.071460 | 0.080040 | 0.092080 | 0.20750 |
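As the transposed summary is itself a DataFrame, another way to get a sub-summary is to select just the statistics you need. For example, to view only the mean and standard deviation of each feature:

# select a subset of summary statistics from the transposed results
df.describe().T[['mean', 'std']]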
If there are common strings in the column labels, you can use the filter() method to restrict your results summary.
df.filter(like='worst').describe().T.round(2)
|  | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
radius_worst | 569.0 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 |
texture_worst | 569.0 | 25.68 | 6.15 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 |
perimeter_worst | 569.0 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 |
area_worst | 569.0 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 |
smoothness_worst | 569.0 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 |
compactness_worst | 569.0 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 |
concavity_worst | 569.0 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 |
concave points_worst | 569.0 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 |
symmetry_worst | 569.0 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 |
fractal_dimension_worst | 569.0 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 |
df.filter(like='mean').describe().T.round(2)
|  | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
radius_mean | 569.0 | 14.13 | 3.52 | 6.98 | 11.70 | 13.37 | 15.78 | 28.11 |
texture_mean | 569.0 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 |
perimeter_mean | 569.0 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 |
area_mean | 569.0 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 |
smoothness_mean | 569.0 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 |
compactness_mean | 569.0 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 |
concavity_mean | 569.0 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 |
concave points_mean | 569.0 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 |
symmetry_mean | 569.0 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 |
fractal_dimension_mean | 569.0 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 |
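filter() also accepts a regular expression via its regex parameter, which is useful when the pattern is more complex than a simple substring. For example, to select the columns ending in either _se or _worst:

# regex alternative to like: columns ending in '_se' or '_worst'
df.filter(regex='_se$|_worst$').describe().T.round(2)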
Selecting a subset of the DataFrame#
Let’s assume, for example, that you need to take a closer look at the smoothness statistics for samples where area_mean is above the 3rd quartile (782.7).
The first step is to create a DataFrame (here called extremes) that contains the subset of data where area_mean is greater than the 3rd quartile.
# get the threshold (could be hard coded as 782.7 as an alternative)
threshold = df.describe().T.loc['area_mean', '75%']
# get the extremes of the dataset
extremes = df.loc[df['area_mean'] > threshold]
extremes.shape
(142, 31)
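An equivalent way to express this row selection is the DataFrame.query() method, which takes the condition as a string (the @ prefix references a Python variable in scope). This returns the same (142, 31) subset:

# equivalent selection using query(); @threshold refers to the local variable
extremes = df.query('area_mean > @threshold')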
Given the naming convention in the dataset, we can use .filter() to select the columns containing the string ‘smoothness’ and then call the describe() method.
extremes.filter(like='smoothness').describe()
|  | smoothness_mean | smoothness_se | smoothness_worst |
---|---|---|---|
count | 142.000000 | 142.000000 | 142.000000 |
mean | 0.100433 | 0.006653 | 0.138356 |
std | 0.011829 | 0.003180 | 0.018561 |
min | 0.073710 | 0.002667 | 0.087740 |
25% | 0.092673 | 0.004951 | 0.124750 |
50% | 0.099985 | 0.006106 | 0.139950 |
75% | 0.107775 | 0.007547 | 0.150150 |
max | 0.144700 | 0.031130 | 0.187300 |
An alternative is to specify the list of columns you want to summarise.
to_select = ['smoothness_mean', 'smoothness_se', 'smoothness_worst']
extremes[to_select].describe()
|  | smoothness_mean | smoothness_se | smoothness_worst |
---|---|---|---|
count | 142.000000 | 142.000000 | 142.000000 |
mean | 0.100433 | 0.006653 | 0.138356 |
std | 0.011829 | 0.003180 | 0.018561 |
min | 0.073710 | 0.002667 | 0.087740 |
25% | 0.092673 | 0.004951 | 0.124750 |
50% | 0.099985 | 0.006106 | 0.139950 |
75% | 0.107775 | 0.007547 | 0.150150 |
max | 0.144700 | 0.031130 | 0.187300 |
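If you need quantiles other than the defaults, describe() accepts a percentiles parameter. For example, to include the 10th and 90th percentiles in the summary:

# request additional percentiles (10th and 90th)
extremes[to_select].describe(percentiles=[0.1, 0.9])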