Case study 1: a bootstrap routine#
Bootstrapping is a computational approach to resample with replacement \(B\) bootstrap datasets from an original sample. For example, let’s assume you want to construct the distribution of the statistic from a given (sometimes small sample).
From a coding perspective bootstrap routines are extremely simple. In its simplest form bootstrapping is essentially two nested for loops along with a function to compute the statistic of interest. This is quite an interesting case study for numpy
i.e. how do we eliminate nested loops efficiently?
Imports#
from random import randint
import numpy as np
Dataset#
We will test our code using a simple sequence of 10 integer values. To avoid any overhead due to type conversion we will have one stored as a np.ndarray
and the other as a list
.
sample_np = np.arange(1, 11)
sample_lst = sample_np.tolist()
print(sample_np)
[ 1 2 3 4 5 6 7 8 9 10]
Benchmark: Standard python implementation#
Before we get into the numpy
code here is some standard python code. We will benchmark its performance at generating 1000 boostrap datasets of the mean using the %timeit
.
def bootstrap_python(data, boots):
"""
Create bootstrap datasets that represent the distribution of the mean.
Returns a list containing the bootstrap datasets
Parameters:
-----------
data: array-like
vector of original data
boots:int
number of bootstrap samples to generate
Returns:
-------
list
"""
to_return = []
# outer loop - create bootstrap datasets
for b in range(boots):
# resample with replacement and sum sample values
# inner loop sample with replacement and calculate mean
total = 0
for _ in range(len(data)):
# resample with replacement and increment running total.
total += data[randint(0, len(data)-1)]
# mean of the bootstrap dataset
to_return.append(total / len(data))
return to_return
%timeit bootstrap_python(sample_lst, 1000)
3.77 ms ± 189 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The exact performance might vary a bit on your machine, but you should get performance in milliseconds (ms). On my machine this was around 6.2 to 6.5ms on average.
Converting to numpy
#
In theory, execution of loops in python is slow relative to a low level language like C. To speed up our code we need to eliminate them. This can sometimes be simple to do and sometimes require you to be much more creative!
Our first attempt will eliminate the inner loop (a list comprehension) in python
Attempt 1: the inner loop#
The function bootstrap_numpy
below eliminates one of the python for loops - the inner list comprehension. Instead of iterating in python we make use numpy
’s random number generator function integers
. This has a size
parameter that allows us to specify the number of loops that will occur ‘closer to the metal’.
Note that in
bootstrap
numpy we create and specify the size of our arrayto_return
upfront.
def bootstrap_numpy(data, boots):
"""
Create bootstrap datasets that represent the distribution of the mean.
Makes better use of built in numpy methods for more efficient sampling
Returns a numpy array containing the bootstrap datasets
Keyword arguments:
data -- numpy array of systems to boostrap
boots -- number of bootstrap
"""
# numpy random number generator
rng = np.random.default_rng()
# empty vector to return to caller. Its len=boots
to_return = np.empty(boots)
for b in range(boots):
# we get numpy's randint function to do the iteration work for us.
total = data[rng.integers(0, data.shape[0], size=data.shape[0])].sum()
to_return[b] = total / data.shape[0]
return to_return
%timeit bootstrap_numpy(sample_np, 1000)
6.98 ms ± 88.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
An unexpected result! Again the exact results will vary, but on my machine the numpy implementation is actually marginally slower than the pure python implementation. Coming in a around 8ms per loop¬ There must be some overhead (hidden runtime cost) that we are missing. A hypothesis is that there may be some overhead in calling the rng.integers
method from python. We do that boot
times during the run.
We know that there is still one python loop in our code. Can we do better than the pure python code if we eliminate it and only call a numpy method once?
Attempt 2: eliminating the nested loops#
Bootstrapping is simply a sample with replacement. In most implementations we would use a nested loop structure as we want to create \(B\) bootstrap datasets (loop one) of size \(n\) (loop 2). But in general we are just taken \(B \times n\) samples with replacement.
One option to reimplement our algorithm in numpy
is
create vector of length \(B \times n\).
Fill it will data sampled with replacement
Split it into \(B\) vectors of length \(n\)
Calculate a statistic of interest, e.g. the mean, for each of the \(B\) datasets.
We will implement steps 3 and 4 here by reshaping a 1D array into a \(B \times n\) matrix and take the mean of the rows (axis=1)
def bootstrap_better_numpy(data, boots):
"""
Create bootstrap datasets that represent the distribution of the mean.
Makes better use of built in numpy methods for more efficient sampling
Returns a numpy array containing the bootstrap datasets
Parameters:
----------
data: array-like
original sample data
boots: int
number of bootstrap datasets to create.
Returns:
-------
np.ndarray
"""
# create random number generator
rng = np.random.default_rng()
# get all the bootstrap data in one go boots * len(data)
boot_data = data[rng.integers(0, data.shape[0], size=data.shape[0]*boots)]
# reshape array into (boots, len(data)) and calculate mean
return boot_data.reshape(-1, len(data)).sum(axis=1) / len(data)
%timeit bootstrap_better_numpy(sample_np, 1000)
79.3 μs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
That’s improvement! We are now seeing an order of magnitude speed up in the code. On my machine the average runtime of the function is less than the variation between runs of the pure python code.
Step by step breakdown#
Our new numpy
implementation executes a lot quicker. But it is harder to understand how it works. Let’s break it down step by step.
Random number generator#
To do our sampling we create a random number generator. This has a function called integers.
# create random number generator
rng = np.random.default_rng()
rng.integers(0, 10, size=5)
array([6, 6, 8, 0, 0])
Sampling all data in one go#
The next line of code does all of the sampling with replacement that we need.
boot_data = data[rng.integers(0, data.shape[0], size=data.shape[0]*boots)]
We have set the size parameter to the length of our data vector multipled by the number of bootstraps needed. These figures are 10 and 1000 respectively. So we end up with a vector of length 10,000
data = sample_np
boots = 1000
boot_data = data[rng.integers(0, data.shape[0], size=data.shape[0]*boots)]
print(boot_data.shape)
(10000,)
Reshaping into a \(B \times n\) matrix#
The next line of code combines steps 3 and 4.
return boot_data.reshape(-1, len(data)).sum(axis=1) / len(data)
Its easier to understand if we look at the result of the reshaping first.
matrix = boot_data.reshape(-1, len(data))
print(matrix)
[[ 2 8 6 ... 10 4 10]
[ 1 4 2 ... 7 8 5]
[ 8 2 9 ... 3 4 5]
...
[ 9 7 2 ... 6 6 1]
[ 7 1 6 ... 8 9 8]
[ 9 1 7 ... 6 5 3]]
In the reshaping we have specified that there needs to be len(data)
columns while the number of rows is automatically inferred by number (-1). So the result is a matrix with the same dimensions as we would get using nested for loops (but with the performance bonus there there are no native python for loops!).
From here it just a case of summing over the columns .sum(axis=1)
and dividing by the number of samples in each bootstrap dataset i.e. len(data)
.
Summing up.#
We saw a dramatic performance boost from numpy here. However, to get there is not always intuitive (and to be frank not as easy as this example). The result is also not as readable as the original code. The trade-off is that this may be the only way to make an algorithm feasible to execute in python.