Chapter 3: Data Preprocessing

Carefully preparing your data for a deep neural network can make a significant difference in your model’s accuracy. This chapter looks at why preprocessing matters so much for deep learning in particular, and at considerations for different networks and data types.


Why Preprocessing Is Necessary

Data preprocessing is a broad term. It covers anything you do to the raw data before feeding it into your machine learning operations, and it’s important for at least two reasons:

  • It can help reduce the dimensionality of your data and make patterns more obvious.
  • It can transform the data into a form that is suitable for the network architecture.

Reducing the Dimensionality of Your Data

Deep learning trains a network to recognize patterns in data. Therefore, any information that is not needed to recognize the patterns you’re looking for can be removed without impacting the overall classification. 


Not only is that extraneous data not needed, but removing it helps make the remaining pattern more obvious. In general, if the patterns are more obvious to a human, they are going to be more obvious to the deep learning algorithms as well, which will help the training process.

The other reason to reduce the dimensionality of your data is the so-called “curse of dimensionality.” Data with higher dimensions has more features and more variations of each feature, and therefore more training data is needed to cover the possible combinations in the solution space. So, not only is the data itself larger with higher dimensions, but you also need more of it to train the network. Overall, that means more network complexity, more data storage, and more training time.
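As a rough illustration of how quickly the solution space grows, the sketch below counts the combinations you would need to cover if every feature were discretized into the same number of values. The function name and the choice of 10 values per feature are assumptions made for this example only.

```python
def grid_cells(num_features: int, values_per_feature: int) -> int:
    """Count the distinct feature combinations on a regular grid."""
    return values_per_feature ** num_features


# Assumed for illustration: each feature takes one of 10 values.
for num_features in (1, 2, 3, 10):
    print(num_features, "features ->", grid_cells(num_features, 10), "combinations")
```

Each added feature multiplies the number of combinations, which is why removing features you don’t need can reduce the amount of training data required.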

A downside to dimensionality reduction is that to do it successfully you have to understand your data. You need to know that you can reduce the dimensions effectively without accidentally removing critical information from your data set. This is part of the reason why specific domain knowledge is still extremely important for deep learning applications.

For example, if you are training a network that can visually identify manufacturing defects in hex nuts, you need to understand what flaws you are looking for and how they manifest themselves in the data. In the images below, it wouldn’t be a good idea to reduce the size of the images by just scaling them down. The flaws, or the patterns you are looking for, are quite small, and you would lose the detail that distinguishes them. In this case, a better dimensionality reduction approach might be to crop the images instead.
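The difference between scaling and cropping can be sketched with a tiny made-up image. Everything here is hypothetical: the 8-by-8 image, the single bright pixel standing in for a flaw, and the helper names. Downscaling is done by block averaging, which is only one of several ways to shrink an image.

```python
def downscale_by_averaging(image, factor):
    """Shrink a 2D image by averaging non-overlapping factor x factor blocks."""
    rows = len(image) // factor
    cols = len(image[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def crop(image, top, left, height, width):
    """Keep only a height x width region starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Hypothetical 8x8 image: dark background with a one-pixel "flaw" at row 5, column 6.
image = [[0.0] * 8 for _ in range(8)]
image[5][6] = 1.0

small = downscale_by_averaging(image, 4)  # 2x2 result
patch = crop(image, 4, 4, 4, 4)           # 4x4 region around the flaw

print(max(max(row) for row in small))  # flaw is diluted to 1/16 of its value
print(max(max(row) for row in patch))  # flaw keeps its full value
```

Both results are smaller than the original, but only the cropped patch preserves the flaw at full contrast. Cropping works here only because you know where in the image the flaw can appear, which is the domain knowledge the text refers to.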


Prepping the Data for the Network Architecture

Raw data often needs to be modified so that it is suitable for the network architecture. This means ensuring that the data is what the network expects in terms of the size, the units, and the type of signals. The following few examples are meant to give you a sense of different networks and data types and highlight the different kinds of data preprocessing that they might need.

Two common examples of network architectures are convolutional neural networks (CNNs) and long short-term memory (LSTM) networks.


The core building block of a CNN is the convolutional layer. It works by sliding a filter, or a kernel, across the input volume and looking for which regions are activated. From these activated regions, the CNN learns which features are present and in which areas of the data. By combining these features, the network can determine the most probable classification.
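The sliding-filter operation can be sketched in a few lines. This is a minimal, unoptimized illustration for a single-channel input with no padding and a stride of 1; the function name, the example image, and the edge-detecting kernel are assumptions for this sketch. As in most deep learning frameworks, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical image: dark on the left, bright on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# Hypothetical kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The output is largest where the kernel lines up with the edge, which is what “activated” means here. In a trained CNN, the kernel values are learned rather than chosen by hand.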


An LSTM network is a type of recurrent neural network (RNN) that can learn long-term dependencies between time steps of sequence data. The core components of an LSTM network are a sequence input layer and an LSTM layer. The sequence input layer feeds a sequence, such as text or time series data, into the network. The LSTM layer is the part that learns the long-term dependencies between time steps of that sequence.

For both of these networks (and for most others as well), the input is fixed in terms of the number of elements you feed into it and what the data represents. This means that if the data that you collect isn’t a consistent size or data type, it needs to be preprocessed into a form that the network is expecting.

Data types

How you preprocess your data depends on the type of data that you are working with. Here are some examples.

Tabular data: While tabular data is not too common for deep learning applications, it still has some use. You may need to convert a list of tabular data to a sparse matrix with one-hot encoding or to a denser matrix with entity embedding.
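One-hot encoding can be sketched as follows. This is a minimal illustration, assuming a single categorical column; the function name and the example values are made up, and sorting the categories is just one way to fix the column order.

```python
def one_hot_encode(values):
    """Turn a list of category labels into rows with a single 1 per row."""
    categories = sorted(set(values))  # fixed, repeatable column order
    column = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical column from a table.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row is mostly zeros, which is why the result is described as sparse. An entity embedding would instead map each category to a short vector of learned values.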

Images and video: Each image that you feed into the network needs to be the same size in terms of width, height, and color layers. This might mean that part of preprocessing is to crop, pad out, or resize images that don’t have the correct dimensions.
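Cropping and padding to a fixed size can be sketched like this. It is a minimal illustration for a single-channel image stored as a list of rows; the helper names, the choice to center the content, and the fill value of 0 are assumptions. A color image would need the same treatment for each color layer.

```python
def fit_length(items, target, filler):
    """Center-crop or pad a list so it has exactly `target` items."""
    n = len(items)
    if n >= target:
        start = (n - target) // 2
        return list(items[start:start + target])
    before = (target - n) // 2
    after = target - n - before
    return [filler] * before + list(items) + [filler] * after


def fit_image(image, target_h, target_w, fill=0):
    """Center-crop or pad a 2D image to target_h rows by target_w columns."""
    rows = [fit_length(row, target_w, fill) for row in image]
    rows = fit_length(rows, target_h, [fill] * target_w)
    return [list(row) for row in rows]  # copy so padded rows are independent


# Hypothetical 2x5 image that the network expects as 4x3.
image = [[1, 2, 3, 4, 5],
         [6, 7, 8, 9, 10]]
print(fit_image(image, 4, 3))
# [[0, 0, 0], [2, 3, 4], [7, 8, 9], [0, 0, 0]]
```

Here the width is cropped and the height is padded in the same call. Whether to crop, pad, or resize depends on where the important detail sits in your images.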


Signals: The length and sample rate of signals need to be consistent, or cropping, padding, and resampling are again required.


Example: Preprocessing Audio Signals Using Short-Time Fourier Transform

This is an audio waveform of a person saying the word “allow.” It’s recorded at 44.1 kHz and is about 0.8 seconds long. The classification network is expecting audio signals that are 1 second long, so the first preprocessing step in this case is to pad the beginning and end of the signal with zeros.
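The padding step can be sketched as follows. The sample rate and durations come from the example above; the function name, the constant placeholder samples, and the choice to split the padding evenly between the beginning and end are assumptions.

```python
def pad_to_length(signal, target_len):
    """Zero-pad a signal at both ends so it has exactly target_len samples."""
    missing = target_len - len(signal)
    if missing < 0:
        raise ValueError("signal is longer than the target length")
    before = missing // 2
    after = missing - before
    return [0.0] * before + list(signal) + [0.0] * after


sample_rate = 44100                          # 44.1 kHz recording
recorded = [0.1] * round(0.8 * sample_rate)  # placeholder for 0.8 s of audio
padded = pad_to_length(recorded, 1 * sample_rate)

print(len(recorded))  # 35280
print(len(padded))    # 44100
```

The 8,820 missing samples are split into 4,410 zeros at each end, so the spoken word stays centered in the 1-second clip.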


Most of the important audio content in the human voice is generated at frequencies lower than 8 kHz. A signal sampled at 16 kHz can represent frequencies up to half the sample rate, or 8 kHz, so this audio signal can be resampled at 16 kHz without losing that important content.
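The arithmetic behind that choice can be checked with a short sketch. It only computes the highest representable frequency and the resulting sample count; it does not perform the resampling itself, which in practice also involves a low-pass filter. The function names are assumptions for this example.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def resampled_length(num_samples, old_rate_hz, new_rate_hz):
    """Number of samples after changing the sample rate of a signal."""
    return round(num_samples * new_rate_hz / old_rate_hz)


print(nyquist_frequency(16000))               # 8000.0
print(resampled_length(44100, 44100, 16000))  # 16000
```

The 1-second clip shrinks from 44,100 samples to 16,000, a reduction of almost two thirds before any frequency analysis is done.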

A short-time Fourier transform can be used to visualize how frequency content in the audio signal changes over time. This is done by selecting a window size that is smaller than the full signal, and then running a fast Fourier transform (FFT) on that subset of data as you jump that window across the entire signal.
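The windowed transform can be sketched with the standard library alone. This is a minimal illustration, not an efficient implementation: it uses a direct discrete Fourier transform in place of an FFT (the magnitudes are the same, only slower to compute), and the function names, window size, and test signal are assumptions.

```python
import cmath
import math


def split_into_windows(signal, window_size, hop):
    """Cut the signal into windows of window_size samples, hop samples apart."""
    last_start = len(signal) - window_size
    return [signal[s:s + window_size] for s in range(0, last_start + 1, hop)]


def dft_magnitudes(window):
    """Magnitude of each non-negative frequency bin (direct DFT, not an FFT)."""
    n = len(window)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        magnitudes.append(abs(total))
    return magnitudes


def stft_magnitudes(signal, window_size, hop):
    """One magnitude spectrum per window."""
    return [dft_magnitudes(w)
            for w in split_into_windows(signal, window_size, hop)]


# Hypothetical signal whose frequency doubles halfway through.
window_size = 16
signal = [math.sin(2 * math.pi * 2 * t / window_size) for t in range(32)]
signal += [math.sin(2 * math.pi * 4 * t / window_size) for t in range(32, 64)]

spectra = stft_magnitudes(signal, window_size, hop=window_size)
peak_bins = [m.index(max(m)) for m in spectra]
print(peak_bins)  # [2, 2, 4, 4]
```

The strongest frequency bin moves from 2 to 4 as the window jumps across the signal, which is the change over time that a single transform of the whole signal would not show.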


A windowing function is multiplied with the windowed data to ensure that it starts and ends at zero. The FFT treats each window as if it repeats, so a window that starts and ends at different values looks discontinuous at the seam. Tapering the data to zero reduces the artificial high-frequency information that this discontinuity would otherwise introduce.
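As a sketch of this step, the code below builds a Hann window and multiplies it with one window of data. The Hann window is an assumption here, chosen because it is a common windowing function that is exactly zero at both ends; the function names and sample values are made up for the example.

```python
import math


def hann_window(n):
    """Symmetric Hann window: 0 at both ends, rising to 1 in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(data, window):
    """Multiply the data by the window, sample by sample."""
    return [x * w for x, w in zip(data, window)]


# Hypothetical window of data that starts and ends away from zero.
data = [1.0] * 8
tapered = apply_window(data, hann_window(8))
print([round(x, 3) for x in tapered])
```

After the multiplication, the data starts and ends at zero and keeps most of its amplitude in the middle of the window.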


Each FFT produces thousands of values across the entire spectrum. This level of granularity is not needed to recognize individual words in audio data. One common way to reduce the amount of frequency information is to divide the spectrum into a number of bins and then scale and sum the frequencies in each bin with a Mel filter bank. A Mel filter bank is a set of triangular bandpass filters that are spaced closer together at the lower frequencies and gradually get wider and further apart as frequency increases. This type of filter bank mimics the sensitivity of human hearing, in which the ear is more sensitive to lower frequencies than higher frequencies.
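A Mel filter bank can be sketched as follows. This is a minimal illustration under several assumptions: it uses one commonly used formula for converting between hertz and the Mel scale, it places the filter edges at equal Mel spacing from 0 Hz to half the sample rate, and the function names, filter count, and bin count are made up for the example. Library implementations differ in details such as normalization.

```python
import math


def hz_to_mel(hz):
    """Convert hertz to Mel (one commonly used formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Build triangular filters, each a list of num_bins weights between 0 and 1.

    num_bins is the number of FFT bins from 0 Hz up to sample_rate / 2.
    """
    max_mel = hz_to_mel(sample_rate / 2)
    # num_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges = [mel_to_hz(max_mel * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        low, center, high = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for f in bin_hz:
            if low < f <= center:
                weights.append((f - low) / (center - low))   # rising side
            elif center < f < high:
                weights.append((high - f) / (high - center))  # falling side
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def apply_filter_bank(bank, spectrum):
    """Scale and sum the spectrum under each triangle: one value per filter."""
    return [sum(w * s for w, s in zip(weights, spectrum)) for weights in bank]


# Hypothetical setup: 8 filters over 257 FFT bins at a 16 kHz sample rate.
bank = mel_filter_bank(8, 257, 16000)
widths = [sum(1 for w in weights if w > 0) for weights in bank]
print(widths)  # number of FFT bins covered by each filter

flat_spectrum = [1.0] * 257
print(apply_filter_bank(bank, flat_spectrum))  # 8 values instead of 257
```

The printed widths grow from the first filter to the last, matching the description of narrow filters at low frequencies and wider ones at high frequencies.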

When the Mel filter bank is applied to each of the window spectra, the result is a single value per triangular bin that represents the amount of frequency content in that bin. In the image below, the bin frequency content is plotted with a square marker that is colored based on the value.


All of this information by frequency bins and time windows is combined into a single image: the spectrogram. The spectrogram is one way to preprocess time-series data into an image that can be used as the input into a convolutional neural network.
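Combining the per-window values into an image can be sketched like this. It is a minimal illustration, assuming the input is a list with one entry per time window, each holding one value per frequency bin. The function name, the layout (rows are frequency bins, columns are time windows), and the scaling of values to the range 0 to 1 are assumptions for this example.

```python
def to_spectrogram_image(per_window_bins):
    """Arrange values as image[bin][window], scaled to the range 0..1."""
    flat = [value for window in per_window_bins for value in window]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0  # avoid dividing by zero for constant input
    num_bins = len(per_window_bins[0])
    return [[(window[b] - low) / span for window in per_window_bins]
            for b in range(num_bins)]


# Hypothetical values: 3 time windows, 2 frequency bins each.
per_window_bins = [[0.0, 2.0],
                   [1.0, 4.0],
                   [0.0, 2.0]]
image = to_spectrogram_image(per_window_bins)
print(image)  # [[0.0, 0.25, 0.0], [0.5, 1.0, 0.5]]
```

The result has a fixed height (the number of frequency bins) and a fixed width (the number of time windows), which is the consistent input size a CNN expects.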

