Making Custom Data Generators in Keras
Introduction
If you have ever tried to do deep learning on a task that requires data in an unconventional format, that is, not the usual (X, y) format but something else, then you have surely felt the need for custom data generators in Keras.
An example of where you would need a custom data generator is when training a Convolutional Auto-encoder: You train it by passing your training images as both X (training data) and as y (ground truth).
In situations like these, where a normal tf.keras.preprocessing.image.ImageDataGenerator() just wouldn’t do the job, the obvious choice is to load all the data into one huge NumPy array. But this consumes a lot of memory, and it doesn’t work at all when your dataset is too large to fit in RAM.
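To get a feel for the memory cost of the single-array approach, here is a rough back-of-the-envelope sketch. The dataset size and image shape are made-up numbers chosen only for illustration, not figures from this tutorial.

```python
# Hypothetical dataset: 100,000 RGB images of 224x224 pixels stored as float32
num_images = 100_000
height, width, channels = 224, 224, 3
bytes_per_value = 4  # one float32 value takes 4 bytes

# Total size of a single array holding every image at once
total_bytes = num_images * height * width * channels * bytes_per_value
total_gib = total_bytes / 1024**3

print(f"{total_gib:.1f} GiB")  # prints "56.1 GiB"
```

An array of that size would not fit in the RAM of most machines, which is why loading one batch at a time is attractive.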
So here’s a 2nd way of doing it: making your very own custom data generator in Keras!
So how do you do it?
The code below is the custom data generator we’ll be making.
import math

import pandas as pd
import tensorflow as tf


class CustomDataGenerator(tf.keras.utils.Sequence):
    """
    Custom data generator class for the MNIST dataset
    """

    def __init__(self, data: pd.DataFrame, batch_size: int = 64):
        super().__init__()
        # One-hot encode the digit labels
        self.labels = tf.keras.utils.to_categorical(data['label'].values)
        # Every remaining column is a pixel; reshape each row into a 28x28 image
        self.images = data.drop(['label'], axis=1).values.reshape(-1, 28, 28)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, rounded up to include the last partial batch
        return math.ceil(len(self.images) / self.batch_size)

    def __getitem__(self, index):
        """
        Returns a batch of data
        """
        batch_images = self.images[index * self.batch_size : (index + 1) * self.batch_size]
        batch_labels = self.labels[index * self.batch_size : (index + 1) * self.batch_size]
        return batch_images, batch_labels
Explanation
- tf.keras.utils.Sequence is the Keras base class that we need to subclass if we want to make our own data generators.
- Every data generator class needs to have at least the following 3 functions:
  - The __init__() function:
    - This is the constructor function that will be called when we make objects of this class.
    - You can do data splitting, reshaping and any other preparation in here.
    - You should always take the batch size as an argument when the object is instantiated (it’ll be useful later).
  - The __len__() function:
    - This function will return the number of batches in the data set.
    - It is calculated using the formula: batches = ceil(trainingSamples / batchSize). Rounding up makes sure the last, possibly smaller, batch is counted too.
  - The __getitem__() function:
    - This function is the main driver in this code.
    - It takes index as an argument (the batch number, used the same way as the indexes we use in lists, etc).
    - This function does the following:
      - It calculates which slice of the data belongs to the current batch.
      - For example: batch_images = self.images[index * self.batch_size : (index + 1) * self.batch_size] takes the array of all the images and slices out the current batch (which starts at index * batch_size and goes up to, but does not include, (index + 1) * batch_size).
      - Since you want to return data in batches, you have to do this for every array you want to return.
      - In this case, I am just returning the batched images and batched labels, which are already NumPy arrays.
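The batch count and the slicing described above can be checked with a small stand-alone sketch. The sample count of 150 is a made-up number, and the data array is only a stand-in for self.images.

```python
import math

import numpy as np

num_samples = 150  # hypothetical dataset size
batch_size = 64

# Number of batches, rounded up so the short final batch is counted
num_batches = math.ceil(num_samples / batch_size)

# Stand-in for self.images: one integer per sample
data = np.arange(num_samples)

batch_sizes = []
for index in range(num_batches):
    # Same slice expression as in __getitem__()
    batch = data[index * batch_size : (index + 1) * batch_size]
    batch_sizes.append(len(batch))

print(num_batches)  # prints 3
print(batch_sizes)  # prints [64, 64, 22]
```

The last slice asks for samples 128 to 191, but slicing simply stops at the end of the array, so the final batch holds the remaining 22 samples.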
And this is basically it! This is how you create a custom data generator!
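Coming back to the auto-encoder case from the introduction, the same pattern could be adapted so that each batch is returned as both the input and the target. This is only a minimal sketch under some assumptions: the class name AutoencoderDataGenerator, the pixel scaling, and the random stand-in images are all made up for illustration and are not taken from the Kaggle notebook.

```python
import math

import numpy as np
import tensorflow as tf


class AutoencoderDataGenerator(tf.keras.utils.Sequence):
    """Hypothetical generator that returns (images, images) batches."""

    def __init__(self, images: np.ndarray, batch_size: int = 64):
        super().__init__()
        # Assumption: scale pixels to [0, 1] and add a channel axis,
        # turning (N, 28, 28) into (N, 28, 28, 1)
        self.images = (images.astype("float32") / 255.0)[..., np.newaxis]
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, rounded up
        return math.ceil(len(self.images) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        batch = self.images[start : start + self.batch_size]
        # The same batch serves as X (input) and y (ground truth)
        return batch, batch


# Stand-in data: 150 random 28x28 "images" instead of a real dataset
fake_images = np.random.randint(0, 256, size=(150, 28, 28), dtype=np.uint8)
generator = AutoencoderDataGenerator(fake_images, batch_size=64)

first_x, first_y = generator[0]
last_x, _ = generator[len(generator) - 1]
print(len(generator), first_x.shape, last_x.shape)
```

A generator like this can be passed directly to a compiled model's fit() method in place of separate X and y arrays, so only one batch needs to be assembled at a time.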
Conclusion
Thanks for reading this brief tutorial! Here’s the link to the Kaggle notebook where I’ve implemented Custom Data Generators (and also explained them).
–Tanay Mehta