Introduction to pandas
#
pandas
is a data science package orignally developed by Wes McKinney. It builds on top of numpy
to provide a higher level API for wrangling, analysing and visualising data. It is also closely coupled to matplotlib
with a number of shorthand methods to create plots of data.
In practice, I often use both numpy
and pandas
for data wrangling: switching between them (which is straightforward) when needed as each offers different benefits.
For your Machine Learning and Data Science career you will find pandas
very useful indeed. It is very easy to import data from a variety of sources e.g. CSV or a database. (It also works with Excel, but I’d encourage you to not use Excel formatted files - save it as a CSV instead).
Read the full docs here: https://pandas.pydata.org/
The higher level API offered by
pandas
comes at the cost of efficency i.e. execution speed. This statement is relative tonumpy
which approaches the speed ofC
. I emphasise thatpandas
is still fast especially compared Microsoft Excel. However, in generalpandas
is not recommended for implementing high performance computational algorithms over and abovenumpy
.
Imports#
We usually use the alias pd
when import pandas
import numpy as np
import pandas as pd
print(pd.__version__)
print(np.__version__)
2.2.2
2.0.1
The pd.Series
and pd.DataFrame
classes#
The central contribution of the pandas
package to data science in python are the Series
and DataFrame
class. These provide the high level abstraction of data.
pd.Series
#
If you are familiar with analysing tabular data in a software package like stata, spss or Excel then you can think of a series as a column of data.
The example below:
Create a variable called
column1
of typepd.Series
column1
has 10 rows (a sequence of 1 to 10).The data in column 1 has a data type of 8-bit unsigned integer (0 - 255)
# create a series
column1 = pd.Series(np.arange(1, 10), name='unique_id',
dtype=np.uint8)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
Name: unique_id, dtype: uint8
# data type
type(column1)
pandas.core.series.Series
# shape
column1.shape
(9,)
Another way to do this would have been to ignore the data type and name parameters. Notice now that pandas
has defaulted to int64
for the data.
Depending on your application, this may or may not be a big deal. But note that it uses more memory.
int64
is equivalent to Python’sint
and C’slong
type:-9_223_372_036_854_775_808
to9_223_372_036_854_775_807
.
# create a series just passing data.
column2 = pd.Series(np.arange(1, 10))
column2
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64
In a data science application its likely you will have a very large column of data. You can take a look at the head or tail of the Series
by using the .head()
and .tail()
methods
# 10,000 rows. uint16 = 0 - 65_535
column3 = pd.Series(np.arange(1, 10_000), name='longer',
dtype=np.uint16)
column3.head()
0 1
1 2
2 3
3 4
4 5
Name: longer, dtype: uint16
# To view more or less rows
column3.head(2)
0 1
1 2
Name: longer, dtype: uint16
column3.tail()
9994 9995
9995 9996
9996 9997
9997 9998
9998 9999
Name: longer, dtype: uint16
# side note: to drop to numpy
column3.to_numpy()
array([ 1, 2, 3, ..., 9997, 9998, 9999], dtype=uint16)
### pd.DataFrame
For most machine learning applications you will be working with a full DataFrame
.
The data you will use is likely to be imported from an external data source, such as a Comma Seperated Value (CSV) file or large scale database such as PostgreSQL. But while you build familiarity with DataFrames we will look at building them manually.
Building
DataFrames
manually is quite handy for practice, as I often use dataframes to summarise the results of computational procedures and simulations.DataFrames
can also generate LaTeX which is handy for quickly producing tables of results for a report/paper I am writing in LaTeX.
# create 5 x 4 matrix
raw_data = np.arange(20, dtype=np.uint8).reshape(5, -1)
raw_data.shape
(5, 4)
raw_data
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]], dtype=uint8)
df = pd.DataFrame(raw_data)
df.columns = (['col_' + str(i) for i in range(df.shape[1])])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_0 5 non-null uint8
1 col_1 5 non-null uint8
2 col_2 5 non-null uint8
3 col_3 5 non-null uint8
dtypes: uint8(4)
memory usage: 152.0 bytes
# its a small matrix so lets view it all
df
col_0 | col_1 | col_2 | col_3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
#for bigger `DataFrames` use .head()/.tail()
df.head(2)
col_0 | col_1 | col_2 | col_3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
# access a named column
df['col_3']
0 3
1 7
2 11
3 15
4 19
Name: col_3, dtype: uint8
# alternative approach for accessing
df.col_3
0 3
1 7
2 11
3 15
4 19
Name: col_3, dtype: uint8
# side bar 1 - let's ignore the datatype.
# the size in memory is almost double.
raw_data = np.arange(20).reshape(5, -1)
df = pd.DataFrame(raw_data)
df.columns = (['col_' + str(i) for i in range(df.shape[1])])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_0 5 non-null int64
1 col_1 5 non-null int64
2 col_2 5 non-null int64
3 col_3 5 non-null int64
dtypes: int64(4)
memory usage: 292.0 bytes
# side bar 2: drop to numpy
df.to_numpy()
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]])
Creating a pd.DataFrame
from python lists#
In the previous example we created a DataFrame
from a numpy.ndarray
. But a DataFrame
or Series
can be created from anything array-like. So for example, we could work with one or more python lists.
# creating an individual series
thrash_metal_bands = pd.Series(['pantera', 'metallica', 'megadeth',
'anthrax'])
thrash_metal_bands
0 pantera
1 metallica
2 megadeth
3 anthrax
dtype: object
# create a full data frame
# each column is defined as a seperate list
band_name = ['pantera', 'metallica', 'megadeth', 'anthrax']
n_albums = [9, 10, 15, 11]
formed = [1981, 1981, 1983, 1981]
still_active = [0, 1, 1, 1]
# empty dataframe
thrash_metal_bands = pd.DataFrame()
# create new columns from lists
thrash_metal_bands['band'] = band_name
thrash_metal_bands['n_albums'] = n_albums
thrash_metal_bands['yr_formed'] = formed
thrash_metal_bands['active'] = still_active
thrash_metal_bands
band | n_albums | yr_formed | active | |
---|---|---|---|---|
0 | pantera | 9 | 1981 | 0 |
1 | metallica | 10 | 1981 | 1 |
2 | megadeth | 15 | 1983 | 1 |
3 | anthrax | 11 | 1981 | 1 |
# take a look at the df summary information.
thrash_metal_bands.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 band 4 non-null object
1 n_albums 4 non-null int64
2 yr_formed 4 non-null int64
3 active 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 260.0+ bytes
# could also be specific about datatype using pd.Series
thrash_metal_bands = pd.DataFrame()
thrash_metal_bands['band'] = pd.Series(band_name, dtype=str)
thrash_metal_bands['n_albums'] = pd.Series(n_albums, dtype=np.uint8)
thrash_metal_bands['yr_formed'] = pd.Series(formed, dtype=np.uint16)
thrash_metal_bands['active'] = pd.Series(still_active, dtype=bool)
thrash_metal_bands.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 band 4 non-null object
1 n_albums 4 non-null uint8
2 yr_formed 4 non-null uint16
3 active 4 non-null bool
dtypes: bool(1), object(1), uint16(1), uint8(1)
memory usage: 180.0+ bytes
Note that in practice its also useful to know that you can create a DataFrame
from a dict. I often forget exactly what format my data need to be in, but luckily you can call the .to_dict()
method of a DataFrame
to see what is required.
This functionality has proved useful in practice as it is often useful to use a simple
dict
to track results of an algorithm. At the end of the experiment the dict can quickly be cast to aDataFrame
and easily viewed in a notebook.
bands_dict = thrash_metal_bands.to_dict()
bands_dict
{'band': {0: 'pantera', 1: 'metallica', 2: 'megadeth', 3: 'anthrax'},
'n_albums': {0: 9, 1: 10, 2: 15, 3: 11},
'yr_formed': {0: 1981, 1: 1981, 2: 1983, 3: 1981},
'active': {0: False, 1: True, 2: True, 3: True}}
# the code to use a dict looks like pandas code.
bands_dict['band']
{0: 'pantera', 1: 'metallica', 2: 'megadeth', 3: 'anthrax'}
# reverse engineer
new_df = pd.DataFrame(bands_dict)
new_df
band | n_albums | yr_formed | active | |
---|---|---|---|---|
0 | pantera | 9 | 1981 | False |
1 | metallica | 10 | 1981 | True |
2 | megadeth | 15 | 1983 | True |
3 | anthrax | 11 | 1981 | True |
DataFrame
Indexes#
In each of the examples so far you will notice that the first (unlabelled) a column is a sequence of arbitrary numbers. This is the DataFrame
index. When we create a DataFrame
manually an additional step is often to set the index to a column we have created.
thrash_metal_bands.index
RangeIndex(start=0, stop=4, step=1)
# set the index to 'band' column
# note that this method returns a 'copy' unless we set 'inplace=True'
thrash_metal_bands.set_index('band', inplace=True)
thrash_metal_bands
n_albums | yr_formed | active | |
---|---|---|---|
band | |||
pantera | 9 | 1981 | False |
metallica | 10 | 1981 | True |
megadeth | 15 | 1983 | True |
anthrax | 11 | 1981 | True |
Accessing elements in a DataFrame
#
In nearly all data wrangling and analysis projects, you will want to explore subsets of your data. This might be, for example, to eyeball the data close up or to calculate summary statistics for particular populations. To do that in pandas
we need to understand how we access individual and subsets or rows and columns in our DataFrame
.
For simplicity, we will again use thrash_metal_bands
.
Accessing Rows#
To access an individal row we can use the .loc
and .iloc
accessors. The former of these references by name while the latter references by numeric index.
# return the records for panteria
thrash_metal_bands.loc['pantera']
n_albums 9
yr_formed 1981
active False
Name: pantera, dtype: object
# return the records the row at index 2
thrash_metal_bands.iloc[2]
n_albums 15
yr_formed 1983
active True
Name: megadeth, dtype: object
Note that we can return multiple rows if we supply a list of indexes. For example to find records for both pantera and megadeth:
to_find = ['pantera', 'megadeth']
thrash_metal_bands.loc[to_find]
n_albums | yr_formed | active | |
---|---|---|---|
band | |||
pantera | 9 | 1981 | False |
megadeth | 15 | 1983 | True |
Indexes can be sliced in a similar fashion to arrays
# All rows from index 2 onwards
thrash_metal_bands.iloc[2:]
n_albums | yr_formed | active | |
---|---|---|---|
band | |||
megadeth | 15 | 1983 | True |
anthrax | 11 | 1981 | True |
# Rows 0 and 1
thrash_metal_bands.iloc[:2]
n_albums | yr_formed | active | |
---|---|---|---|
band | |||
pantera | 9 | 1981 | False |
metallica | 10 | 1981 | True |
# slicing by name
thrash_metal_bands.loc['pantera': 'megadeth']
n_albums | yr_formed | active | |
---|---|---|---|
band | |||
pantera | 9 | 1981 | False |
metallica | 10 | 1981 | True |
megadeth | 15 | 1983 | True |
Accessing columns#
We have already seen that accessing a column is done as so
df['column_name']
We can also select multiple columns.
# select yr_formed and active columns only
mask = ['yr_formed', 'active']
thrash_metal_bands[mask]
yr_formed | active | |
---|---|---|
band | ||
pantera | 1981 | False |
metallica | 1981 | True |
megadeth | 1983 | True |
anthrax | 1981 | True |
or restrict both columns and rows
columns = ['yr_formed', 'active']
rows = ['pantera', 'anthrax']
thrash_metal_bands.loc[rows, mask]
yr_formed | active | |
---|---|---|
band | ||
pantera | 1981 | False |
anthrax | 1981 | True |
Selecting individual cells#
To access an individual cell within a DataFrame
use the .at
and .iat
accessors. The former uses the index row and column names while the latter is an integer based lookup.
# lookup the year pantera was formed.
thrash_metal_bands.at['pantera', 'yr_formed']
np.uint16(1981)
# look up the cell value in position (1, 1)
thrash_metal_bands.iat[1, 1]
np.uint16(1981)
# to update the individual value
# lookup the year pantera was formed.
thrash_metal_bands.at['pantera', 'yr_formed'] = 9999
thrash_metal_bands
n_albums | yr_formed | active | |
---|---|---|---|
band | |||
pantera | 9 | 9999 | False |
metallica | 10 | 1981 | True |
megadeth | 15 | 1983 | True |
anthrax | 11 | 1981 | True |