Preprocessing
Preprocessing is getting data ready for analysis. We’ve gotten ahold of our data, looked at it to confirm its approximate properties and condition, and thought about what our output should look like. Sometimes, there is extensive preprocessing that has to be done; other times this step could almost be skipped over. Here, our goal is to get data ready so that when we tell our model to train, it has the data cleaned, in the right format, right shape, etc so it can train correctly.
This guide is set up to be step by step, but as always, you might find you don’t need to do some steps.
Data Cleaning
Often there is some cleaning that has to be done. Removing values which don’t make sense in the context of the data, removing dimensions that we think are irrelevant, or focusing on specific aspects of our dataset and removing the pieces we don’t want.
Data Formats
Did we discover that our data format might cause some problems for our output? Now is the time to fix any data format issues.
Normalization
Normalization is the process of setting all your numerical values on a similar scale, often 0 to 1 (or less commonly, -1 to 1). This improves the performance and stability of training a model. As an example, with image data, often you will see all the numbers in the image array divided by 255.0. This effectively puts all the numerical values in the image array between 0 and 1, and hence is normalization.
This Google article has a list of normalization techniques. Most often, scaling to a range is used.
Train, Valid and Test Splits
Once the data is cleaned and in the right format, we can split it up into training, validation, and testing splits.
Data Shapes
Here, we make sure the data is in the right shape for model building. The shape of your data here will depend on what kind of model building you are doing.
Flattening
Commonly, reshaping means going from a higher dimension to a lower dimension, not just expanding or contracting the same shape. For instance, this can mean going from 2 dimensions to 1. You can see a demonstration of this in the neural network introduction notebook; the shape required for the neural network training at the start is (784, ), which is a 1 dimensional array of 784 numbers. We reshaped the data from 28*28 pixel images instead to a 1 dimensional sequence of 28*28=784 pixels.
Data Labeling
If you have data that needs labels applied, this is where you’d do it. Sometimes you’ll see this called "encoding", "categorical encoding", or "categorical labeling". There is an example of this in the neural network introduction notebook.