Introduction to pandas#

pandas is a data science package orignally developed by Wes McKinney. It builds on top of numpy to provide a higher level API for wrangling, analysing and visualising data. It is also closely coupled to matplotlib with a number of shorthand methods to create plots of data.

In practice, I often use both numpy and pandas for data wrangling: switching between them (which is straightforward) when needed as each offers different benefits.

For your Machine Learning and Data Science career you will find pandas very useful indeed. It is very easy to import data from a variety of sources e.g. CSV or a database. (It also works with Excel, but I’d encourage you to not use Excel formatted files - save it as a CSV instead).

Read the full docs here: https://pandas.pydata.org/

The higher level API offered by pandas comes at the cost of efficency i.e. execution speed. This statement is relative to numpy which approaches the speed of C. I emphasise that pandas is still fast especially compared Microsoft Excel. However, in general pandas is not recommended for implementing high performance computational algorithms over and above numpy.


Imports#

We usually use the alias pd when import pandas

import numpy as np
import pandas as pd
print(pd.__version__)
print(np.__version__)
2.2.2
2.0.1

The pd.Series and pd.DataFrame classes#

The central contribution of the pandas package to data science in python are the Series and DataFrame class. These provide the high level abstraction of data.

pd.Series#

If you are familiar with analysing tabular data in a software package like stata, spss or Excel then you can think of a series as a column of data.

The example below:

  • Create a variable called column1 of type pd.Series

  • column1 has 10 rows (a sequence of 1 to 10).

  • The data in column 1 has a data type of 8-bit unsigned integer (0 - 255)

# create a series
column1 = pd.Series(np.arange(1, 10), name='unique_id', 
                    dtype=np.uint8)
column1
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: unique_id, dtype: uint8
# data type
type(column1)
pandas.core.series.Series
# shape
column1.shape
(9,)

Another way to do this would have been to ignore the data type and name parameters. Notice now that pandas has defaulted to int64 for the data.

Depending on your application, this may or may not be a big deal. But note that it uses more memory. int64 is equivalent to Python’s int and C’s long type: -9_223_372_036_854_775_808 to 9_223_372_036_854_775_807.

# create a series just passing data.
column2 = pd.Series(np.arange(1, 10))
column2
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64

In a data science application its likely you will have a very large column of data. You can take a look at the head or tail of the Series by using the .head() and .tail() methods

# 10,000 rows. uint16 = 0 - 65_535
column3 = pd.Series(np.arange(1, 10_000), name='longer', 
                    dtype=np.uint16)
column3.head()
0    1
1    2
2    3
3    4
4    5
Name: longer, dtype: uint16
# To view more or less rows
column3.head(2)
0    1
1    2
Name: longer, dtype: uint16
column3.tail()
9994    9995
9995    9996
9996    9997
9997    9998
9998    9999
Name: longer, dtype: uint16
# side note: to drop to numpy
column3.to_numpy()
array([   1,    2,    3, ..., 9997, 9998, 9999], dtype=uint16)

### pd.DataFrame

For most machine learning applications you will be working with a full DataFrame.

The data you will use is likely to be imported from an external data source, such as a Comma Seperated Value (CSV) file or large scale database such as PostgreSQL. But while you build familiarity with DataFrames we will look at building them manually.

Building DataFrames manually is quite handy for practice, as I often use dataframes to summarise the results of computational procedures and simulations. DataFrames can also generate LaTeX which is handy for quickly producing tables of results for a report/paper I am writing in LaTeX.

# create 5 x 4 matrix
raw_data = np.arange(20, dtype=np.uint8).reshape(5, -1)
raw_data.shape
(5, 4)
raw_data
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]], dtype=uint8)
df = pd.DataFrame(raw_data)
df.columns = (['col_' + str(i) for i in range(df.shape[1])])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_0   5 non-null      uint8
 1   col_1   5 non-null      uint8
 2   col_2   5 non-null      uint8
 3   col_3   5 non-null      uint8
dtypes: uint8(4)
memory usage: 152.0 bytes
# its a small matrix so lets view it all
df
col_0 col_1 col_2 col_3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
#for bigger `DataFrames` use .head()/.tail()
df.head(2)
col_0 col_1 col_2 col_3
0 0 1 2 3
1 4 5 6 7
# access a named column
df['col_3']
0     3
1     7
2    11
3    15
4    19
Name: col_3, dtype: uint8
# alternative approach for accessing
df.col_3
0     3
1     7
2    11
3    15
4    19
Name: col_3, dtype: uint8
# side bar 1 - let's ignore the datatype. 
# the size in memory is almost double.
raw_data = np.arange(20).reshape(5, -1)
df = pd.DataFrame(raw_data)
df.columns = (['col_' + str(i) for i in range(df.shape[1])])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_0   5 non-null      int64
 1   col_1   5 non-null      int64
 2   col_2   5 non-null      int64
 3   col_3   5 non-null      int64
dtypes: int64(4)
memory usage: 292.0 bytes
# side bar 2: drop to numpy
df.to_numpy()
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

Creating a pd.DataFrame from python lists#

In the previous example we created a DataFrame from a numpy.ndarray. But a DataFrame or Seriescan be created from anything array-like. So for example, we could work with one or more python lists.

# creating an individual series
thrash_metal_bands = pd.Series(['pantera', 'metallica', 'megadeth',
                                'anthrax'])
thrash_metal_bands
0      pantera
1    metallica
2     megadeth
3      anthrax
dtype: object
# create a full data frame

# each column is defined as a seperate list
band_name = ['pantera', 'metallica', 'megadeth', 'anthrax']
n_albums = [9, 10, 15, 11]
formed = [1981, 1981, 1983, 1981]
still_active = [0, 1, 1, 1]

# empty dataframe
thrash_metal_bands = pd.DataFrame()

# create new columns from lists
thrash_metal_bands['band'] = band_name
thrash_metal_bands['n_albums'] = n_albums
thrash_metal_bands['yr_formed'] = formed
thrash_metal_bands['active'] = still_active

thrash_metal_bands
band n_albums yr_formed active
0 pantera 9 1981 0
1 metallica 10 1981 1
2 megadeth 15 1983 1
3 anthrax 11 1981 1
# take a look at the df summary information.
thrash_metal_bands.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   band       4 non-null      object
 1   n_albums   4 non-null      int64 
 2   yr_formed  4 non-null      int64 
 3   active     4 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 260.0+ bytes
# could also be specific about datatype using pd.Series 
thrash_metal_bands = pd.DataFrame()

thrash_metal_bands['band'] = pd.Series(band_name, dtype=str)
thrash_metal_bands['n_albums'] = pd.Series(n_albums, dtype=np.uint8)
thrash_metal_bands['yr_formed'] = pd.Series(formed, dtype=np.uint16)
thrash_metal_bands['active'] = pd.Series(still_active, dtype=bool)

thrash_metal_bands.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   band       4 non-null      object
 1   n_albums   4 non-null      uint8 
 2   yr_formed  4 non-null      uint16
 3   active     4 non-null      bool  
dtypes: bool(1), object(1), uint16(1), uint8(1)
memory usage: 180.0+ bytes

Note that in practice its also useful to know that you can create a DataFrame from a dict. I often forget exactly what format my data need to be in, but luckily you can call the .to_dict() method of a DataFrame to see what is required.

This functionality has proved useful in practice as it is often useful to use a simple dict to track results of an algorithm. At the end of the experiment the dict can quickly be cast to a DataFrame and easily viewed in a notebook.

bands_dict = thrash_metal_bands.to_dict()
bands_dict
{'band': {0: 'pantera', 1: 'metallica', 2: 'megadeth', 3: 'anthrax'},
 'n_albums': {0: 9, 1: 10, 2: 15, 3: 11},
 'yr_formed': {0: 1981, 1: 1981, 2: 1983, 3: 1981},
 'active': {0: False, 1: True, 2: True, 3: True}}
# the code to use a dict looks like pandas code.
bands_dict['band']
{0: 'pantera', 1: 'metallica', 2: 'megadeth', 3: 'anthrax'}
# reverse engineer
new_df = pd.DataFrame(bands_dict)
new_df
band n_albums yr_formed active
0 pantera 9 1981 False
1 metallica 10 1981 True
2 megadeth 15 1983 True
3 anthrax 11 1981 True

DataFrame Indexes#

In each of the examples so far you will notice that the first (unlabelled) a column is a sequence of arbitrary numbers. This is the DataFrame index. When we create a DataFrame manually an additional step is often to set the index to a column we have created.

thrash_metal_bands.index
RangeIndex(start=0, stop=4, step=1)
# set the index to 'band' column
# note that this method returns a 'copy' unless we set 'inplace=True'
thrash_metal_bands.set_index('band', inplace=True)
thrash_metal_bands
n_albums yr_formed active
band
pantera 9 1981 False
metallica 10 1981 True
megadeth 15 1983 True
anthrax 11 1981 True

Accessing elements in a DataFrame#

In nearly all data wrangling and analysis projects, you will want to explore subsets of your data. This might be, for example, to eyeball the data close up or to calculate summary statistics for particular populations. To do that in pandas we need to understand how we access individual and subsets or rows and columns in our DataFrame.

For simplicity, we will again use thrash_metal_bands.

Accessing Rows#

To access an individal row we can use the .loc and .iloc accessors. The former of these references by name while the latter references by numeric index.

# return the records for panteria
thrash_metal_bands.loc['pantera']
n_albums         9
yr_formed     1981
active       False
Name: pantera, dtype: object
# return the records the row at index 2
thrash_metal_bands.iloc[2]
n_albums       15
yr_formed    1983
active       True
Name: megadeth, dtype: object

Note that we can return multiple rows if we supply a list of indexes. For example to find records for both pantera and megadeth:

to_find = ['pantera', 'megadeth']
thrash_metal_bands.loc[to_find]
n_albums yr_formed active
band
pantera 9 1981 False
megadeth 15 1983 True

Indexes can be sliced in a similar fashion to arrays

# All rows from index 2 onwards
thrash_metal_bands.iloc[2:]
n_albums yr_formed active
band
megadeth 15 1983 True
anthrax 11 1981 True
# Rows 0 and 1
thrash_metal_bands.iloc[:2]
n_albums yr_formed active
band
pantera 9 1981 False
metallica 10 1981 True
# slicing by name
thrash_metal_bands.loc['pantera': 'megadeth']
n_albums yr_formed active
band
pantera 9 1981 False
metallica 10 1981 True
megadeth 15 1983 True

Accessing columns#

We have already seen that accessing a column is done as so

df['column_name']

We can also select multiple columns.

# select yr_formed and active columns only
mask = ['yr_formed', 'active']
thrash_metal_bands[mask]
yr_formed active
band
pantera 1981 False
metallica 1981 True
megadeth 1983 True
anthrax 1981 True

or restrict both columns and rows

columns = ['yr_formed', 'active']
rows = ['pantera', 'anthrax']
thrash_metal_bands.loc[rows, mask]
yr_formed active
band
pantera 1981 False
anthrax 1981 True

Selecting individual cells#

To access an individual cell within a DataFrame use the .at and .iat accessors. The former uses the index row and column names while the latter is an integer based lookup.

# lookup the year pantera was formed.
thrash_metal_bands.at['pantera', 'yr_formed']
np.uint16(1981)
# look up the cell value in position (1, 1)
thrash_metal_bands.iat[1, 1]
np.uint16(1981)
# to update the individual value
# lookup the year pantera was formed.
thrash_metal_bands.at['pantera', 'yr_formed'] = 9999
thrash_metal_bands
n_albums yr_formed active
band
pantera 9 9999 False
metallica 10 1981 True
megadeth 15 1983 True
anthrax 11 1981 True

End#