Case study: Cross-validation#

In supervised learning studies we distinguish between the training and test error rates. The test error rate refers to the average performance of a model when predicting a new observation (data that was not used in the training of the model). A useful model is one that can accurately predict new observations!

An issue in real ML studies is that data are expensive to collect, and often a health data scientist doesn’t have as much validation data as they would ideally like for estimating the test error rate. Cross-validation is a general term for a set of statistical methods that split a dataset efficiently in order to estimate the test error of a model.

Cross validation methods are a nice example of where python classes can be used to design a framework for your ML. In this case study we will explore creating two classes for cross validation in machine learning: Leave-One-Out Cross-Validation and K-Fold Cross-Validation.

If you are interested in reading further on cross validation I recommend the excellent An Introduction to Statistical Learning (with Applications in R) by James, Witten, Hastie and Tibshirani.

In a real machine learning study in python you would no doubt implement these classes using numpy arrays, so in practice I would implement them slightly differently. We will cover numpy in the next chapter. Another bullet-proof option, which requires no implementation at all, is to use the sklearn.model_selection namespace; I’ve based the naming here on sklearn’s interface. After you have completed the case study do check out the sklearn source code that’s available via its documentation.
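
For comparison, here is a minimal sketch of how the equivalent sklearn objects are used (assuming scikit-learn is installed). Note one difference: sklearn's split() yields arrays of train and test indices rather than the data itself.

from sklearn.model_selection import KFold, LeaveOneOut

# sketch only: sklearn splitters yield index arrays, not the data itself
X = [[0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11]]
y = [1, 1, 1, 0, 0, 0]

cv = KFold(n_splits=3, shuffle=True, random_state=42)
print(cv.get_n_splits(X))  # 3
for train_idx, test_idx in cv.split(X):
    print(train_idx, test_idx)

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 6: one split per sample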

We’ll begin by creating each class independently. Then we’ll work on our OOP design credentials by extracting what can be encapsulated into a common base class. Finally we will take a short look at what is meant by an abstract class and how this is implemented in Python.

import random
import numpy as np

Synthetic data#

The function below generates synthetic data that can be used to test the classes.

def synthetic_classification(n_samples=10, n_features=1, shuffle=False, 
                             random_seed=None):
    '''
    Generates a simple random synthetic dataset in a given shape.
    Used for testing of generator classes.
    
    X: feature i is the sequence i*n_samples to i*n_samples + (n_samples - 1), 
    where i is the feature no.
    y data is 0 or 1 weighted very roughly 50/50.
    
    These sequences are randomised if shuffle is set to True.
    
    No error checking. Assumes all inputs are valid.
    
    Params:
    ------
    n_samples: int, optional (default=10)
        The number of samples
        
    n_features: int, optional (default=1)
        The number of features in the classification problem
        
    shuffle: bool, optional (default=False)
        If true then sequences are randomly shuffled
        
    random_seed: int or None, optional (default=None)
        If shuffle then controls the ordering of the sequences generated.
    
    Returns:
    --------
    X, y
    Where X and y are python lists and X will be a list of lists sized 
    (n_samples, n_features) 
        
    '''
    X = [[(col * (n_samples)) + row for col in range(n_features)] 
                                          for row in range(n_samples)]
    y = ([1] * (n_samples // 2)) + ([0] * ((n_samples // 2) + (n_samples % 2)))
    
    if shuffle: 
        # re-seeding before each shuffle applies the same permutation to X 
        # and y, so feature/target pairs stay aligned (use an int seed).
        for lst in [X, y]:
            random.seed(random_seed)
            random.shuffle(lst)
    return X, y
X, y = synthetic_classification(n_samples=3)
print(X)
print(y)
[[0], [1], [2]]
[1, 0, 0]
X, y = synthetic_classification(n_samples=6, n_features=2)
print(X)
print(y)
[[0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11]]
[1, 1, 1, 0, 0, 0]
X, y = synthetic_classification(n_samples=6, n_features=2, shuffle=True, 
                                random_seed=42)
print(X)
print(y)
[[3, 9], [1, 7], [2, 8], [4, 10], [0, 6], [5, 11]]
[0, 1, 1, 0, 1, 0]

Leave one out cross validation#

As its name suggests, leave-one-out cross validation (LOOCV) leaves one data point out of the training of a model and then predicts that single holdout sample. This is repeated for every sample in the dataset.

I’ve rarely seen this approach used in practice and it’s not a method I would consider using myself. The reason is that it requires a lot of training cycles: one model fit per sample. For example, with 1,000 samples and a grid of 10 parameter values, LOOCV needs 10,000 model fits versus just 50 for 5-fold cross validation. Combine LOOCV with parameter optimisation, a moderately sized dataset and a computationally intensive training routine and you’ll need to leave your code running overnight! But it’s a nice simple example that we can use in our learning.

Rather than perform both the generation of the splits and the cross validation in the same class, we will separate these responsibilities. We will have one class called LeaveOneOut and a function called cross_validate. The function will accept an instance of LeaveOneOut and use it to generate the data needed for cross validation.

class LeaveOneOut:
    '''
    Leave one out dataset generator for cross validation.
    '''
    def __init__(self):
        pass
    
    def __repr__(self):
        return 'LeaveOneOut()'

    def get_n_splits(self, X):
        '''
        The number of splits returned by the cross validation
        method.
        '''
        return len(X)
    
    def split(self, X, y):
        '''
        Generator method.  Split the dataset
        '''
        for test_index in range(len(X)):
        
            # training data indexes
            train_X = X[:test_index] + X[test_index + 1:]
            train_y = y[:test_index] + y[test_index + 1:]
            
            # test data
            test_X, test_y = X[test_index], y[test_index]
            
            yield train_X, train_y, test_X, test_y

I will use synthetic_classification to generate a test dataset for LeaveOneOut. I will keep the inputs small and unshuffled so you can easily see it working.

Remember that LeaveOneOut.split() uses the yield keyword. This means it is a generator method, and we need to iterate over it in a loop to produce each new split for validation.

# generate test dataset
X, y = synthetic_classification(n_samples=6, n_features=2, shuffle=False)

# create an instance of LeaveOneOut
cv = LeaveOneOut()

# basic cross validation loop.
# I've zipped together a range and the splits in order to get the fold number.
for i, split_data in zip(range(cv.get_n_splits(X)), cv.split(X, y)):
    train_X, train_y, test_X, test_y = split_data
    print(f'Fold {i+1}:\nTrain:\tX:{train_X}, y:{train_y}')
    print(f'Test:\tX:{test_X}, y:{test_y}')
Fold 1:
Train:	X:[[1, 7], [2, 8], [3, 9], [4, 10], [5, 11]], y:[1, 1, 0, 0, 0]
Test:	X:[0, 6], y:1
Fold 2:
Train:	X:[[0, 6], [2, 8], [3, 9], [4, 10], [5, 11]], y:[1, 1, 0, 0, 0]
Test:	X:[1, 7], y:1
Fold 3:
Train:	X:[[0, 6], [1, 7], [3, 9], [4, 10], [5, 11]], y:[1, 1, 0, 0, 0]
Test:	X:[2, 8], y:1
Fold 4:
Train:	X:[[0, 6], [1, 7], [2, 8], [4, 10], [5, 11]], y:[1, 1, 1, 0, 0]
Test:	X:[3, 9], y:0
Fold 5:
Train:	X:[[0, 6], [1, 7], [2, 8], [3, 9], [5, 11]], y:[1, 1, 1, 0, 0]
Test:	X:[4, 10], y:0
Fold 6:
Train:	X:[[0, 6], [1, 7], [2, 8], [3, 9], [4, 10]], y:[1, 1, 1, 0, 0]
Test:	X:[5, 11], y:0
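
As mentioned above, split() is a generator method; calling it does not run the method body immediately but returns a generator object that produces one fold each time it is advanced. A quick check, reusing the cv, X and y created above:

# split() returns a generator; next() advances it one fold at a time
gen = cv.split(X, y)
print(type(gen))  # <class 'generator'>

# the first fold: (train_X, train_y, test_X, test_y)
train_X, train_y, test_X, test_y = next(gen)
print(test_X, test_y)  # the first sample is held out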

Now let’s create a function to perform the cross validation.

This could easily be another class, but for simplicity I’ve used a function (a sketch of a class based alternative follows the example below).

def cross_validate(X, y, cv):
    scores = []
    print('split=> ', end='')
    for i, split_data in zip(range(cv.get_n_splits(X)), cv.split(X, y)):
        print(f'{i+1},', end=' ')
        train_X, train_y, test_X, test_y = split_data
        # model fitting, prediction and accuracy assessment is
        # just simulated here.
        scores.append(random.uniform(0.8, 1))
    print('end')
    return scores
X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, LeaveOneOut())

# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
split=> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, end
cv score: 0.90
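
As noted above, cross_validate could just as easily have been written as a class. Purely for illustration, here is a minimal sketch of one possible class based design; the CrossValidator name and the scores_ attribute are my own invention rather than part of any library.

class CrossValidator:
    '''
    Illustrative class based alternative to the cross_validate function.
    '''
    def __init__(self, cv):
        self.cv = cv
        self.scores_ = None
    
    def fit(self, X, y):
        '''Run the cross validation, storing the (simulated) scores.'''
        self.scores_ = []
        for train_X, train_y, test_X, test_y in self.cv.split(X, y):
            # model fitting and scoring are simulated, as before
            self.scores_.append(random.uniform(0.8, 1))
        return self

validator = CrossValidator(LeaveOneOut())
validator.fit(X, y)
print(f'cv score: {sum(validator.scores_) / len(validator.scores_):.2f}')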

What if we wanted to use a different cross validation method?#

The second cross validation method on the menu is k-fold cross validation, where \(k\) is typically set to 5 or 10. Because it requires only \(k\) training cycles rather than one per sample, k-fold cross validation is less computationally demanding than LOOCV. Here we will (optionally) shuffle the dataset and then split it into \(k\) segments. In each loop of the cross validation a model is trained on \(k-1\) of the segments while the remaining segment is held out for testing.

We will implement KFold so that it has the same interface as LeaveOneOut. This means it has a set of method signatures and properties in common with LeaveOneOut. It will have the following method signatures in common:

  • get_n_splits(self, X) - returns an int: the number of splits

  • split(self, X, y) - a generator method that yields the splits

Implementing a KFold class#

The implementation of KFold is slightly more involved than LOOCV, but it is perfectly possible in standard python using either for loops or list comprehensions. I’ve opted for the latter as it allows a proportion of the logic to be expressed in a single (I would argue readable) line of python.

First take a look at the class and then I’ll spend a bit of time explaining how it works.

class KFold:
    '''
    A generator class for k-fold cross validation.
    '''
    def __init__(self, k=5, shuffle=False, random_seed=None):
        self.k = k
        self.random_seed = random_seed
        self.shuffle = shuffle
    
    def __repr__(self):
        return f'KFold({self.k=}, {self.shuffle=}, {self.random_seed=})'
    
    
    def get_n_splits(self, X):
        '''
        Return an integer representing the number of splits that 
        will be generated.
    
        '''
        return self.k
    
    def split(self, X, y):
        '''
        Generator method.  Returns incremental splits of the dataset
        on each call.
        
        Params:
        ------
        X: list
            python list containing X data. For multiple features
            shape should be (n_samples, n_features)
        
        y: list
            python list containing y target data. For multiple 
            targets shape should be (n_samples, n_targets)
        
        Returns:
        --------
        train_X, train_y, test_X, test_y 
        
        Where each is a python list
        '''
        
        # store the indexes of each element 
        idx = [i for i in range(len(X))]
        if self.shuffle:
            random.seed(self.random_seed)
            random.shuffle(idx)
        
        # length of each test fold (assumes len(X) is divisible by k)
        split_len = int(len(X) / self.k)

        for test_idx in range(0, len(X), split_len):
        
            # create k - 1 training folds for X 
            train_X = self._fold_training_data(X, idx, test_idx, split_len)
            # X test data for fold
            test_X = [X[idx[i]] for i in range(test_idx, test_idx + split_len)]
            
            # create k - 1 training segments for y
            train_y = self._fold_training_data(y, idx, test_idx, split_len)
            # y test data fold
            test_y = [y[idx[i]] for i in range(test_idx, test_idx + split_len)]
            
            yield train_X, train_y, test_X, test_y
            
        
    def _fold_training_data(self, data, idx, test_idx, split_len):
        '''
        create training segments for X or y
        '''
        train_seg1 = [data[idx[i]] for i in range(test_idx)]
        train_seg2 = [data[idx[i]] for i in range((test_idx + split_len), 
                                                 len(data))]                                
        return train_seg1 + train_seg2
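
Before unpicking how split() works, here is a quick check of the folds it generates. This reuses synthetic_classification with a small, unshuffled dataset so the fold boundaries are easy to see.

# quick check: 3-fold split of a small, unshuffled dataset
X, y = synthetic_classification(n_samples=6, n_features=2)
cv = KFold(k=3)
for i, (train_X, train_y, test_X, test_y) in enumerate(cv.split(X, y)):
    print(f'Fold {i+1}: test_X={test_X}, test_y={test_y}')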

KFold.split()#

In the split() method you should note that I do not directly randomise the X and y list parameters (containing the training data). I avoid modifying these lists because doing so would also modify the data seen by any variables pointing to the same lists outside of the class (it’s the same data because python passes the list object itself to the function, not a copy).

Here’s a simple function modify_list that illustrates what happens when you modify a list passed as a parameter.

def modify_list(to_modify):
    to_modify.append('shrubbery')
    to_modify.append('another shrubbery')
    print(f'inside function {to_modify=}')
    
lst = ['knights', 'who', 'say', 'ni']
print(f'Outside function BEFORE call {lst=}')
modify_list(lst)
print(f'Outside function AFTER call {lst=}')
Outside function BEFORE call lst=['knights', 'who', 'say', 'ni']
inside function to_modify=['knights', 'who', 'say', 'ni', 'shrubbery', 'another shrubbery']
Outside function AFTER call lst=['knights', 'who', 'say', 'ni', 'shrubbery', 'another shrubbery']

In my implementation you will see that I first create idx, which is a simple integer list 0 to len(X) - 1. I then shuffle these indexes and use them to build new lists via comprehensions. For example, to create the test data:

test_X = [X[idx[i]] for i in range(test_idx, test_idx + split_len)]

A key bit of code in the comprehension is X[idx[i]]. Let’s say that we have

idx = [5, 3, 0, 2, 1, 4]
X = [10, 100, 1_000, 10_000, 100_000, 1_000_000]

If i = 2 then idx[2] evaluates to 0 and X[idx[2]] evaluates to 10. You can think of this technique as an indirect lookup of your data.

In the above example the reordered data is

reordered_X = [1_000_000, 10_000, 10, 1_000, 100, 100_000]
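
You can reproduce this reordering directly by looping over the whole of idx in a comprehension. A quick check:

# indirect lookup: rebuild the data in the order given by idx
idx = [5, 3, 0, 2, 1, 4]
X = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
reordered_X = [X[i] for i in idx]
print(reordered_X)  # [1000000, 10000, 10, 1000, 100, 100000]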

I chose to use a list of indexes, but you could also import copy and use internal_X = copy.copy(X) to achieve the same result - that might even be clearer to read and understand. You may also not care about modifying the data assigned to your external variables. Personally I think that modifying arguments in place is bad practice and may lead to unintended silent side-effects and bugs in your code at a later date.
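
If you prefer the copying approach, the sketch below shows the idea: the shuffle operates on the copy, so the caller's original list is untouched.

import copy

X = [10, 100, 1_000, 10_000, 100_000, 1_000_000]

# shuffle a shallow copy so the caller's list is not modified in place
internal_X = copy.copy(X)
random.seed(42)
random.shuffle(internal_X)

print(X)           # original order is preserved
print(internal_X)  # the shuffled copy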

Using KFold#

As KFold implements the same interface as LeaveOneOut we can reuse cross_validate.

X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, KFold(k=5))

# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
split=> 1, 2, 3, 4, 5, end
cv score: 0.85
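
The shuffle and random_seed parameters of KFold work in exactly the same way when used via cross_validate, for example (the scores are still just simulated):

# k-fold with the dataset shuffled (reproducibly) before folding
scores = cross_validate(X, y, KFold(k=10, shuffle=True, random_seed=42))
print(f'cv score: {sum(scores) / len(scores):.2f}')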

Abstract classes#

The two classes we have put together are the start of a small framework for cross validation. It’s important to recognise that the moment you develop a framework of classes you hit new problems of understanding, maintenance, and extension! Here are a few scenarios where you might hit problems:

  • When you have a large number of classes to generate splits for cross validation it becomes difficult to keep these all in your head! A far simpler concept is the idea of a SplitGenerator.

  • Other data scientists, in your team or elsewhere, might want to add their own custom classes to use in place of yours. It’s likely that you want to control this to a certain extent to avoid crashes and improve the experience of your fellow data scientists.

In python, to an extent, you can provide this control via a concept called Abstract base classes. These are strange beasts! Here’s one for generating our CV splits.

from abc import ABC, abstractmethod

class AbstractSplitGenerator(ABC):
    '''
    Abstract base class for generating splits of X, y data
    for machine learning cross validation. 
    '''
    @abstractmethod
    def get_n_splits(self, X):
        pass
    
    @abstractmethod
    def split(self, X, y):
        pass

The first line of code imports ABC and abstractmethod from the built-in module abc. This gives us the ingredients we need for creating the base class. The first step is to tell python that this is an abstract class by subclassing ABC:

class AbstractSplitGenerator(ABC):
    pass

It is not possible to create an instance of AbstractSplitGenerator. Try it! Python won’t let you.

cv = AbstractSplitGenerator()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 cv = AbstractSplitGenerator()

TypeError: Can't instantiate abstract class AbstractSplitGenerator with abstract methods get_n_splits, split

In the class body we have defined two abstract methods using the @abstractmethod decorator. This decorator forces any subclass of AbstractSplitGenerator to implement these methods. Otherwise we get a runtime error as soon as we try to create an instance of the subclass.

class DodgyImplementation(AbstractSplitGenerator):
    pass

cv = DodgyImplementation()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_433104/1287983807.py in <module>
      2     pass
      3 
----> 4 cv = DodgyImplementation()

TypeError: Can't instantiate abstract class DodgyImplementation with abstract methods get_n_splits, split

The style of abstract classes I have shown you here is akin to interface-based inheritance from other languages. It’s a neat way to use inheritance without introducing logic bugs from higher up the class tree (as the base class contains no logic code). You can think of it as a weak “contract”: if you want to write a class for this framework then you guarantee that you provide the methods that other code expects.

I specifically called this a weak contract above. This is because in python almost anything goes! If you want to enforce your contract you need to modify cross_validate to check the type of cv. For example:

def cross_validate(X, y, cv):
    
    # enforce interface
    if not isinstance(cv, AbstractSplitGenerator):
        raise TypeError(f'Expected cv to be AbstractSplitGenerator, '
                        + f'but found {type(cv)}')
    
    scores = []
    print('split=> ', end='')
    for i, split_data in zip(range(cv.get_n_splits(X)), cv.split(X, y)):
        print(f'{i+1},', end=' ')
        train_X, train_y, test_X, test_y = split_data
        # model fitting, prediction and accuracy assessment is
        # just simulated here.
        scores.append(random.uniform(0.8, 1))
    print('end')
    return scores
X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, LeaveOneOut())

# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_433104/696997810.py in <module>
      1 X, y = synthetic_classification(n_samples=160)
----> 2 scores = cross_validate(X, y, LeaveOneOut())
      3 
      4 # print out fake accuracy!
      5 print(f'cv score: {sum(scores) / len(scores):.2f}')

/tmp/ipykernel_433104/2306781539.py in cross_validate(X, y, cv)
      3     # enforce interface
      4     if not isinstance(cv, AbstractSplitGenerator):
----> 5         raise TypeError(f'Expected cv to be AbstractSplitGenerator, '
      6                         + f'but found {type(cv)}')
      7 

TypeError: Expected cv to be AbstractSplitGenerator, but found <class '__main__.LeaveOneOut'>

You then simply create concrete classes by subclassing AbstractSplitGenerator. For example:

class LeaveOneOut(AbstractSplitGenerator):
    '''
    Leave one out dataset generator for cross validation.
    '''
    def __init__(self):
        pass
    
    def __repr__(self):
        return 'LeaveOneOut()'

    def get_n_splits(self, X):
        '''
        The number of splits returned by the cross validation
        method.
        '''
        return len(X)
    
    def split(self, X, y):
        '''
        Generator method.  Split the dataset
        '''
        for test_index in range(len(X)):
        
            # training data indexes
            train_X = X[:test_index] + X[test_index + 1:]
            train_y = y[:test_index] + y[test_index + 1:]
            
            # test data
            test_X, test_y = X[test_index], y[test_index]
            
            yield train_X, train_y, test_X, test_y
X, y = synthetic_classification(n_samples=160)
scores = cross_validate(X, y, LeaveOneOut())

# print out fake accuracy!
print(f'cv score: {sum(scores) / len(scores):.2f}')
split=> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, end
cv score: 0.90
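
As a final check, the new LeaveOneOut now satisfies the isinstance test that cross_validate enforces. KFold would need the same one line change to its class statement (subclassing AbstractSplitGenerator) before it could be used with the stricter cross_validate.

# the concrete class now passes the interface check
cv = LeaveOneOut()
print(isinstance(cv, AbstractSplitGenerator))           # True
print(issubclass(LeaveOneOut, AbstractSplitGenerator))  # True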