Case study: Cross-validation#
In supervised learning studies we distinguish between the training and test error rates. The test error rate refers to the average error of a model when predicting new observations (data that were not used to train the model). A useful model is one that can accurately predict new observations!
An issue in real ML studies is that data are expensive to collect, and a health data scientist often doesn’t have as much validation data as they would ideally like for estimating the test error rate. Cross-validation is a general term for a set of statistical methods that split a dataset efficiently in order to estimate the test error of a model.
Cross-validation methods are a nice example of where python classes can be used to design a framework for your ML. In this case study we will explore creating two classes for cross-validation in machine learning: Leave-One-Out Cross-Validation and K-Fold Cross-Validation.
If you are interested in reading further on cross-validation, I recommend the excellent An Introduction to Statistical Learning (with Applications in R) by James, Witten, Hastie and Tibshirani.
In a real machine learning study in python there is no doubt that you would implement these classes using a numpy array. In practice I would therefore implement these classes slightly differently. We will cover numpy in the next chapter. Another bullet-proof option, that requires no implementation, is to use the sklearn.model_selection namespace. I’ve based the naming on sklearn’s interface here. After you have completed the case study do check out the sklearn source code that’s available via its documentation.
We’ll begin by creating each class independently. Then we’ll work on our OOP design credentials by extracting what can be encapsulated into a common base class. Finally we will take a short look at what is meant by an abstract class and how this is implemented in Python.
import random
import numpy as np
Synthetic data#
The function below generates synthetic data that can be used to test the classes.
def synthetic_classification(n_samples=10, n_features=1, shuffle=False,
random_seed=None):
'''
Generates a simple random synthetic dataset in a given shape.
Used for testing of generator classes.
    X: each feature i is the sequence i * n_samples to (i * n_samples) + n_samples - 1,
    where i is the feature number.
y data is 0 or 1 weighted very roughly 50/50.
These sequences are randomised if shuffle is set to True.
No error checking. Assumes all inputs are valid.
Params:
------
n_samples: int, optional (default=10)
The number of samples
n_features: int, optional (default=1)
The number of features in the classification problem
shuffle: bool, optional (default=False)
If true then sequences are randomly shuffled
random_seed: int or None, optional (default=None)
If shuffle then controls the ordering of the sequences generated.
Returns:
--------
X, y
Where X and y are python lists and X will be a list of lists sized
(n_samples, n_features)
'''
X = [[(col * (n_samples)) + row for col in range(n_features)]
for row in range(n_samples)]
y = ([1] * (n_samples // 2)) + ([0] * ((n_samples // 2) + (n_samples % 2)))
if shuffle:
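        # re-seeding before each shuffle means X and y receive the same
        # permutation (when a random_seed is provided)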
for lst in [X, y]:
random.seed(random_seed)
random.shuffle(lst)
return X, y
X, y = synthetic_classification(n_samples=3)
print(X)
print(y)
[[0], [1], [2]]
[1, 0, 0]
X, y = synthetic_classification(n_samples=6, n_features=2)
print(X)
print(y)
[[0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11]]
[1, 1, 1, 0, 0, 0]
X, y = synthetic_classification(n_samples=6, n_features=2, shuffle=True,
random_seed=42)
print(X)
print(y)
[[3, 9], [1, 7], [2, 8], [4, 10], [0, 6], [5, 11]]
[0, 1, 1, 0, 1, 0]
Leave one out cross validation#
As its name suggests, leave-one-out cross-validation (LOOCV) incrementally leaves one data point out of the training of a model. It then predicts the single held-out sample. This is repeated for every sample in the dataset.
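Using the notation from An Introduction to Statistical Learning, the LOOCV estimate of the test error is simply the average of the \(n\) single held-out errors:
\[ CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}_i \]
where \(\mathrm{Err}_i\) is the error made when predicting the \(i\)th held-out sample (for classification, \(\mathrm{Err}_i = I(y_i \neq \hat{y}_i)\)).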
I’ve rarely seen this approach used in practice and it’s not a method I would consider using myself. The reason is that it requires a lot of training cycles. Combine LOOCV with parameter optimisation, a moderately sized dataset and a computationally intensive training routine and you’ll need to leave your code running overnight! But it’s a nice simple example that we can use in our learning.
Rather than perform both the generation of the splits and the cross-validation in the same class, we will separate these responsibilities. We will have one class called LeaveOneOut and a function called cross_validate. The function will accept an instance of LeaveOneOut and use it to generate the data needed for cross-validation.
class LeaveOneOut:
'''
Leave one out dataset generator for cross validation.
'''
def __init__(self):
pass
def __repr__(self):
return 'LeaveOneOut()'
def get_n_splits(self, X):
'''
The number of splits returned by the cross validation
method.
'''
return len(X)
def split(self, X, y):
'''
Generator method. Split the dataset
'''
for test_index in range(len(X)):
            # training data: all samples except the held-out test sample
train_X = X[:test_index] + X[test_index + 1:]
train_y = y[:test_index] + y[test_index + 1:]
# test data
test_X, test_y = X[test_index], y[test_index]
yield train_X, train_y, test_X, test_y
I will use the synthetic_classification function to generate a test dataset for LeaveOneOut. I will keep the inputs as an ordered (unshuffled) sequence so you can easily see it working.
Remember that
LeaveOneOut.split()
uses the yield keyword. This means it is a generator method and we need to call it repeatedly in a loop in order to generate new splits for validation.
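If you want to see the generator behaviour in isolation before running a full loop, here is a quick sketch: calling split() does no work up front, and each call to next() produces the next fold on demand.

```python
# quick sketch of the generator behaviour (uses the classes defined above)
X, y = synthetic_classification(n_samples=3)
folds = LeaveOneOut().split(X, y)   # nothing is split yet
print(next(folds))  # first fold: (train_X, train_y, test_X, test_y)
print(next(folds))  # second fold
```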
# generate test dataset
X, y = synthetic_classification(n_samples=6, n_features=2, shuffle=False)
# create an instance of LeaveOneOut
cv = LeaveOneOut()
# basic cross validation loop.
# I've zipped together a range and the splits in order to get the fold number.
for i, split_data in zip(range(cv.get_n_splits(X)), cv.split(X, y)):
train_X, train_y, test_X, test_y = split_data
print(f'Fold {i+1}:\nTrain:\tX:{train_X}, y:{train_y}')
print(f'Test:\tX:{test_X}, y:{test_y}')
Fold 1:
Train: X:[[1, 7], [2, 8], [3, 9], [4, 10], [5, 11]], y:[1, 1, 0, 0, 0]
Test: X:[0, 6], y:1
Fold 2:
Train: X:[[0, 6], [2, 8], [3, 9], [4, 10], [5, 11]], y:[1, 1, 0, 0, 0]
Test: X:[1, 7], y:1
Fold 3:
Train: X:[[0, 6], [1, 7], [3, 9], [4, 10], [5, 11]], y:[1, 1, 0, 0, 0]
Test: X:[2, 8], y:1
Fold 4:
Train: X:[[0, 6], [1, 7], [2, 8], [4, 10], [5, 11]], y:[1, 1, 1, 0, 0]
Test: X:[3, 9], y:0
Fold 5:
Train: X:[[0, 6], [1, 7], [2, 8], [3, 9], [5, 11]], y:[1, 1, 1, 0, 0]
Test: X:[4, 10], y:0
Fold 6:
Train: X:[[0, 6], [1, 7], [2, 8], [3, 9], [4, 10]], y:[1, 1, 1, 0, 0]
Test: X:[5, 11], y:0
Now let’s create a function to perform the cross validation.
This could easily be another class. But for simplicity I’ve used a function.
def cross_validate(X, y, cv):
scores = []
print('split=> ', end='')
for i, split_data in zip(range(cv.get_n_splits(X)), cv.split(X, y)):
print(f'{i+1},', end=' ')
train_X, train_y, test_X, test_y = split_data
        # model fitting, prediction and accuracy assessment is
# just simulated here.
scores.append(random.uniform(0.8, 1))
print('end')
return scores
X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, LeaveOneOut())
# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
split=> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, end
cv score: 0.90
What if we wanted to use a different cross validation method?#
The second cross validation method on the menu is k-fold cross validation, where \(k\) is typically set to 5 or 10. K-fold cross validation is therefore much less computationally demanding than LOOCV. Here we will shuffle the dataset and then split it into \(k\) segments. In each loop of the cross validation a model is trained on \(k-1\) segments while the remaining segment is held out for testing.
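Again borrowing the notation of An Introduction to Statistical Learning, the k-fold estimate of the test error is the average of the \(k\) fold errors:
\[ CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{Err}_i \]
where \(\mathrm{Err}_i\) is the error measured on the \(i\)th held-out segment. LOOCV is just the special case where \(k = n\).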
We will implement KFold so that it has the same interface as LeaveOneOut. This means it has a set of method signatures and properties in common with LeaveOneOut. It will have the following method signatures in common:
get_n_splits(self, X) - returns an int
split(self, X, y) - this is a generator
Implementing a KFold class#
The implementation of KFold
is slightly more involved than LOOCV, but it is perfectly possible in standard python using either for
loops or list comprehensions. I’ve opted for the latter as it allows a proportion of the logic to be expressed in a single (I would argue readable) line of python.
First take a look at the class and then I’ll spend a bit of time explaining how it works.
class KFold:
'''
A generator class for k-fold cross validation.
'''
def __init__(self, k=5, shuffle=False, random_seed=None):
self.k = k
self.random_seed = random_seed
self.shuffle = shuffle
def __repr__(self):
return f'KFold({self.k=}, {self.shuffle=}, {self.random_seed=})'
def get_n_splits(self, X):
'''
Return an integer representing the number of splits that
will be generated.
'''
return self.k
def split(self, X, y):
'''
Generator method. Returns incremental splits of the dataset
on each call.
Params:
------
X: list
python list containing X data. For multiple features
shape should be (n_samples, n_features)
y: list
python list containing y target data. For multiple
targets shape should be (n_samples, n_targets)
Returns:
--------
        train_X, train_y, test_X, test_y
            Where each is a python list
'''
# store the indexes of each element
idx = [i for i in range(len(X))]
if self.shuffle:
random.seed(self.random_seed)
            random.shuffle(idx)
        # length of each fold (assumes len(X) is divisible by k)
split_len = int(len(X) / (self.k))
for test_idx in range(0, len(X), split_len):
# create k - 1 training folds for X
train_X = self._fold_training_data(X, idx, test_idx, split_len)
# X test data for fold
test_X = [X[idx[i]] for i in range(test_idx, test_idx + split_len)]
# create k - 1 training segments for y
train_y = self._fold_training_data(y, idx, test_idx, split_len)
# y test data fold
test_y = [y[idx[i]] for i in range(test_idx, test_idx + split_len)]
            yield train_X, train_y, test_X, test_y
def _fold_training_data(self, data, idx, test_idx, split_len):
'''
create training segments for X or y
'''
train_seg1 = [data[idx[i]] for i in range(test_idx)]
train_seg2 = [data[idx[i]] for i in range((test_idx + split_len),
len(data))]
return train_seg1 + train_seg2
KFold.split()#
In the split() method you should note that I do not directly randomise the X and y list parameters (containing the training data). I avoid modifying these lists directly because doing so would also modify the variables pointing to the same data outside of the class (it’s the same data because of the way python passes a list to a function).
Here’s a simple function modify_list
that illustrates what happens when you modify a list passed as a parameter.
def modify_list(to_modify):
to_modify.append('shrubbery')
to_modify.append('another shrubbery')
print(f'inside function {to_modify=}')
lst = ['knights', 'who', 'say', 'ni']
print(f'Outside function BEFORE call {lst=}')
modify_list(lst)
print(f'Outside function AFTER call {lst=}')
Outside function BEFORE call lst=['knights', 'who', 'say', 'ni']
inside function to_modify=['knights', 'who', 'say', 'ni', 'shrubbery', 'another shrubbery']
Outside function AFTER call lst=['knights', 'who', 'say', 'ni', 'shrubbery', 'another shrubbery']
In my implementation you will see that I first create idx, which is a simple integer list from 0 to len(X) - 1. I then shuffle these indexes (when shuffling is enabled) and use them to create new lists via comprehensions. For example, to create the test data:
test_X = [X[idx[i]] for i in range(test_idx, test_idx + split_len)]
A key bit of code in the comprehension is X[idx[i]]. Let’s say that we have
idx = [5, 3, 0, 2, 1, 4]
X = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
If i = 2 then idx[2] evaluates to 0, and X[idx[2]] therefore evaluates to 10. You can think of this technique as an indirect lookup of your data.
In the above example the reordered data is
reordered_X = [1_000_000, 10_000, 10, 1_000, 100, 100_000]
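If you want to check this yourself, the whole indirect lookup can be reproduced outside the class with a single comprehension (a small sketch mirroring the line above):

```python
idx = [5, 3, 0, 2, 1, 4]
X = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
# position i of the reordered list takes element idx[i] of X
reordered_X = [X[idx[i]] for i in range(len(X))]
print(reordered_X)  # [1000000, 10000, 10, 1000, 100, 100000]
```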
I chose to use a list of indexes, but you could also import copy and use internal_X = copy.copy(X) to achieve the same result - that might even be clearer to read and understand. You may also not care about modifying the data assigned to your external variable. Personally, I think that modifying data in place is bad practice and may lead to unintended silent side-effects and bugs in your code at a later date.
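For completeness, here is a rough sketch of that copy-based alternative. The helper name shuffled_copy is illustrative only and is not part of the KFold implementation above.

```python
import copy
import random

def shuffled_copy(data, random_seed=None):
    '''Return a shuffled shallow copy; the caller's list is left untouched.'''
    internal_data = copy.copy(data)
    random.seed(random_seed)
    random.shuffle(internal_data)
    return internal_data

lst = ['knights', 'who', 'say', 'ni']
print(shuffled_copy(lst, random_seed=42))
print(lst)  # original order preserved
```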
Using KFold#
As KFold implements the same interface as LeaveOneOut, we can reuse cross_validate.
X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, KFold(k=5))
# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
split=> 1, 2, 3, 4, 5, end
cv score: 0.85
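As with LeaveOneOut, you can also loop over the generator directly to inspect the folds themselves. A quick sketch using the small synthetic dataset:

```python
# inspect the k-fold splits of the small synthetic dataset
X, y = synthetic_classification(n_samples=6, n_features=2)
cv = KFold(k=3)
for i, (train_X, train_y, test_X, test_y) in enumerate(cv.split(X, y), start=1):
    print(f'Fold {i}:\nTrain:\tX:{train_X}, y:{train_y}')
    print(f'Test:\tX:{test_X}, y:{test_y}')
```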
Abstract classes#
The two classes we have put together are the start of a small framework for cross validation. It’s important to recognise that the moment you develop a framework of classes you hit problems - understanding, maintenance, and extension! Here are a few scenarios where you might hit problems:
When you have a large number of classes to generate splits for cross validation it becomes difficult to keep these all in your head! A far simpler concept is the idea of a SplitGenerator.
Other data scientists, in your team or elsewhere, might want to add their own custom classes to use in place of yours. It’s likely that you want to control this to a certain extent to avoid crashes and improve the experience of your fellow data scientists.
In python, to an extent, you can provide this control via a concept called Abstract base classes. These are strange beasts! Here’s one for generating our CV splits.
from abc import ABC, abstractmethod
class AbstractSplitGenerator(ABC):
'''
Abstract base class for generating splits of X, y data
for machine learning cross validation.
'''
@abstractmethod
def get_n_splits(self, X):
pass
@abstractmethod
    def split(self, X, y):
pass
The first line of code imports ABC and abstractmethod from the built-in module abc. This gives us the ingredients we need for creating the base class. The first step is to tell python that this is an abstract class by subclassing ABC:
class AbstractSplitGenerator(ABC):
pass
It is not possible to create an instance of AbstractSplitGenerator
. Try it! Python won’t let you.
cv = AbstractSplitGenerator()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 cv = AbstractSplitGenerator()
TypeError: Can't instantiate abstract class AbstractSplitGenerator with abstract methods get_n_splits, split
In the class body we have defined two abstract methods using the @abstractmethod
decorator. This decorator forces any subclass of AbstractSplitGenerator
to implement these methods. Otherwise we get a runtime error.
class DodgyImplementation(AbstractSplitGenerator):
pass
cv = DodgyImplementation()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_433104/1287983807.py in <module>
2 pass
3
----> 4 cv = DodgyImplementation()
TypeError: Can't instantiate abstract class DodgyImplementation with abstract methods get_n_splits, split
The style of abstract classes I have shown you here is akin to interface-based inheritance from other languages. It’s a neat way to use inheritance without introducing logic bugs from higher up the class tree (as there is no logic code). You can think of it as a weak “contract”. If you want to write a class in this framework then you guarantee that you provide the methods that other code expects.
I specifically called this a weak contract above. This is because in python almost anything goes! If you want to enforce your contract you need to modify cross_validate to check the type of cv. For example:
def cross_validate(X, y, cv):
# enforce interface
if not isinstance(cv, AbstractSplitGenerator):
raise TypeError(f'Expected cv to be AbstractSplitGenerator, '
+ f'but found {type(cv)}')
scores = []
print('split=> ', end='')
for i, split_data in zip(range(cv.get_n_splits(X)), cv.split(X, y)):
print(f'{i+1},', end=' ')
train_X, train_y, test_X, test_y = split_data
        # model fitting, prediction and accuracy assessment is
# just simulated here.
scores.append(random.uniform(0.8, 1))
print('end')
return scores
X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, LeaveOneOut())
# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_433104/696997810.py in <module>
      1 X, y = synthetic_classification(n_samples=160)
----> 2 scores = cross_validate(X, y, LeaveOneOut())
3
4 # print out fake accuracy!
5 print(f'cv score: {sum(scores) / len(scores):.2f}')
/tmp/ipykernel_433104/2306781539.py in cross_validate(X, y, cv)
3 # enforce interface
4 if not isinstance(cv, AbstractSplitGenerator):
----> 5 raise TypeError(f'Expected cv to be AbstractSplitGenerator, '
6 + f'but found {type(cv)}')
7
TypeError: Expected cv to be AbstractSplitGenerator, but found <class '__main__.LeaveOneOut'>
You then simply create concrete classes by subclassing AbstractSplitGenerator. For example:
class LeaveOneOut(AbstractSplitGenerator):
'''
Leave one out dataset generator for cross validation.
'''
def __init__(self):
pass
def __repr__(self):
return 'LeaveOneOut()'
def get_n_splits(self, X):
'''
The number of splits returned by the cross validation
method.
'''
return len(X)
def split(self, X, y):
'''
Generator method. Split the dataset
'''
for test_index in range(len(X)):
            # training data: all samples except the held-out test sample
train_X = X[:test_index] + X[test_index + 1:]
train_y = y[:test_index] + y[test_index + 1:]
# test data
test_X, test_y = X[test_index], y[test_index]
yield train_X, train_y, test_X, test_y
X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, LeaveOneOut())
# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
split=> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, end
cv score: 0.90
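To see the contract helping rather than hindering, here is a hypothetical extension of the framework: a simple repeated holdout splitter. The class name and behaviour are illustrative only (they are not part of sklearn or this case study), but because it subclasses AbstractSplitGenerator and implements both abstract methods it can be instantiated and passed straight to the stricter cross_validate.

```python
class RepeatedHoldout(AbstractSplitGenerator):
    '''
    Hypothetical splitter: repeats a shuffled train/test holdout split
    n_repeats times. Illustrative only.
    '''
    def __init__(self, n_repeats=5, train_prop=0.8, random_seed=None):
        self.n_repeats = n_repeats
        self.train_prop = train_prop
        self.random_seed = random_seed

    def __repr__(self):
        return f'RepeatedHoldout({self.n_repeats=}, {self.train_prop=})'

    def get_n_splits(self, X):
        return self.n_repeats

    def split(self, X, y):
        random.seed(self.random_seed)
        cut = int(len(X) * self.train_prop)
        for _ in range(self.n_repeats):
            # shuffle indexes rather than the caller's data
            idx = list(range(len(X)))
            random.shuffle(idx)
            train_X = [X[i] for i in idx[:cut]]
            train_y = [y[i] for i in idx[:cut]]
            test_X = [X[i] for i in idx[cut:]]
            test_y = [y[i] for i in idx[cut:]]
            yield train_X, train_y, test_X, test_y

X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, RepeatedHoldout(n_repeats=3))
print(f'cv score: {sum(scores) / len(scores):.2f}')
```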