Convolutional Neural Networks (CNNs) are neural networks that are mainly used for image recognition and image classification.
In this post, we’ll break down how a CNN works under the hood.
Background
If we used a traditional neural network without any of the prior convolution steps, the network would not scale well at all.
A 28 x 28 pixel MNIST image in a fully connected model gives us 784 input weights. Obviously, most pictures are a lot larger than 28 x 28. A 200 x 200 pixel color image (three channels) would result in 120,000 input weights.
To minimize the number of input parameters, we need to produce lower-dimensional representations of the image that capture as much information as possible.
CNNs were inspired by the visual cortex: in the human brain, parts of the visual cortex fire when detecting edges. Furthermore, studies have shown that the visual cortex works in layers; a given layer works on the features detected in the previous layer, from lines, to contours, to shapes, to entire objects.
Vector Representation of Images
As we know, models only take in numerical inputs to perform their computations. Non-numerical data such as text, and in this case images, must first be converted to numerical vectors.

An image with multiple colors can be converted into a grayscale image, where each pixel is represented by its intensity in the range 0-255.
This gives us a numerical vector representation of the image.
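As a quick sketch of this step, here is one way to load an image and turn it into a grayscale intensity matrix using Pillow and NumPy (the file name photo.jpg is just a placeholder):

```python
import numpy as np
from PIL import Image

# Convert the image to grayscale ("L" mode in Pillow), so each
# pixel becomes a single intensity value in the range 0-255.
img = Image.open("photo.jpg").convert("L")  # placeholder file name

pixels = np.array(img)    # 2D array of intensities
print(pixels.shape)       # (height, width)
print(pixels.min(), pixels.max())
```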
Convolution?
Before we talk about convolutional neural networks, we first need to understand what convolution means.
In mathematical terms, a convolution is the combination of two functions to produce a third function, which expresses how the shape of one is modified by the other.
The term convolution refers both to the resulting third function and to the process of computing it.

By sliding function g(t) over f(t), we produce a third function (f*g)(t). We say that (f*g)(t) is the convolution of f(t) and g(t).
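For reference, the standard definition of the continuous convolution of f and g is:

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
```

At each offset t, g is flipped and slid across f, and the overlap between the two gives the value of the third function.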
Now that we have a rough idea of what convolution is, we can go back and see how convolution works in the context of image processing.
Convolution of Images
To begin, we have converted the image to an n x n matrix of numbers from 0 to 255 indicating pixel intensity.
Next, we take a smaller matrix of size m x m, where m < n, and slide it over the original matrix. This smaller matrix is called a filter.

Image that has been converted to a matrix of numbers. For simplicity, we’ll just use 0 and 1.

A smaller matrix, called a filter, that we’ll use to slide over the original matrix

As we slide the filter over the matrix, we multiply the filter element-wise with the region of the matrix it covers, sum up the products, and take that sum as one entry of our convolution matrix.
The convolution here can be seen as combining the original matrix and the filter to produce a third matrix, which is our convolved feature matrix.
The intuition behind this is that we are using the filter to extract features from the image. Different filter values will extract out different features from the image.
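Here is a minimal NumPy sketch of this sliding-window operation, using stride 1 and no padding (the 0/1 values mirror the toy example above; the filter values are arbitrary illustrations):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding).

    At each position, multiply element-wise and sum the products;
    the sums form the convolved feature map.
    """
    n, m = image.shape[0], kernel.shape[0]
    out = np.zeros((n - m + 1, n - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + m] * kernel)
    return out

# 5 x 5 image of 0s and 1s, as in the example above
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)

# 3 x 3 filter (values chosen arbitrarily for illustration)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

print(convolve2d(image, kernel))  # 3 x 3 convolved feature map
```

Note that a 5 x 5 input with a 3 x 3 filter gives a 3 x 3 feature map, since the filter fits in only 3 positions along each axis.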
We can also use multiple filters to produce multiple convolved feature maps; the number of feature maps produced is called the “depth”.

When building a CNN, the model learns the values of the filters on its own, while we have to specify hyperparameters like the number of filters, filter size, stride, and zero-padding.
For a given input, convolution (applying a set of filters) generates a new set of values. The depth of the output corresponds to the number of filters, as each filter generates its own feature map. With an n x n input, an m x m filter, a stride of 1, and no padding, each feature map is (n - m + 1) x (n - m + 1).
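Continuing the convolve2d sketch from above, stacking the outputs of several filters makes the depth concrete (the extra filters here are arbitrary examples):

```python
# Each filter produces its own feature map; stacking them gives an
# output volume whose depth equals the number of filters.
filters = [kernel, np.flipud(kernel), np.eye(3)]
feature_maps = np.stack([convolve2d(image, f) for f in filters])
print(feature_maps.shape)  # (3, 3, 3): depth 3, each map 3 x 3
```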
Removing Negative Values from Convolved Features
After we produce a Convolved feature map from the original image, we perform another operation called ReLU (Rectified Linear Unit) on each element.
What ReLU does is replace all negative values with 0.
We apply ReLU to a convolved feature map because the convolution step is a linear operation. To model non-linear relationships, we need to introduce a non-linear function such as ReLU.
The resulting feature map after applying ReLU is called a Rectified feature map.

This process changes all negative values to 0.
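As a sketch, ReLU is a one-liner in NumPy:

```python
import numpy as np

def relu(feature_map):
    """Replace every negative value with 0, element-wise."""
    return np.maximum(feature_map, 0)

fmap = np.array([[ 2.0, -1.5],
                 [-0.3,  4.0]])
print(relu(fmap))  # [[2. 0.] [0. 4.]]
```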
Dimensionality Reduction through Pooling
After we have extracted the convolved feature map and passed it through our ReLU function to produce a rectified feature map, we can reduce the feature map through a process called pooling.
There are 3 types of pooling: max pooling, sum pooling, and average pooling. We’ll talk about max pooling, because it works better in practice, and once you understand max pooling, sum pooling and average pooling work the same way.
In max pooling, we define yet another window of size k x k, but in this case, we do not slide the window across the rectified feature map one position at a time. Instead, we divide the feature map into non-overlapping k x k windows and take the maximum value from each.

After we pass the convolved feature map through ReLU, we get a rectified feature map.
We take the maximum value within each window to get the reduced matrix.
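Here is a small sketch of max pooling with non-overlapping windows (any leftover rows or columns that don’t fill a full window are trimmed):

```python
import numpy as np

def max_pool(feature_map, k=2):
    """Divide the map into non-overlapping k x k windows and keep
    the maximum of each; leftover rows/columns are trimmed."""
    n, m = feature_map.shape
    trimmed = feature_map[:n - n % k, :m - m % k]
    windows = trimmed.reshape(n // k, k, m // k, k)
    return windows.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 1],
                 [4, 2, 0, 1],
                 [1, 0, 5, 2],
                 [2, 1, 1, 3]], dtype=float)
print(max_pool(fmap))  # [[4. 2.] [2. 5.]]
```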
The Fully Connected Layer
After we have broken down the image through the iterative process of convolution, ReLU, and pooling, we get a set of matrices that represent the important features of the original image.
We then line up the values of the pooled matrices into a single vector, and feed it into a fully connected neural network.
When the neural network does its learning via gradient descent or some other optimization algorithm, only the weights in the neural network and the values in the filters change. The filter size and stride do not change.
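To make the whole pipeline concrete, here is a minimal sketch in Keras; the layer counts and sizes are illustrative choices, not something prescribed by this post:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",   # convolution + ReLU; filter values are learned
                  input_shape=(28, 28, 1)),        # e.g. grayscale MNIST images
    layers.MaxPooling2D((2, 2)),                   # pooling reduces each feature map
    layers.Conv2D(64, (3, 3), activation="relu"),  # deeper layer extracts more complex features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # line the pooled values up as one vector
    layers.Dense(10, activation="softmax"),        # fully connected layer for prediction
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

Note that only the Conv2D filter values and the Dense weights are updated during training; the filter size, stride, and pooling window are fixed when the model is defined.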
Features at each Layer
We now have the 3 basic steps of a CNN: Convolution, ReLU and Pooling.
We can repeat these steps numerous times to reduce the image and extract the important features.

The more layers we have, the more complicated the features we can extract from the image. At each layer, we combine the simple features from the previous layer to form more complex ones.

In the first layer, we pick out simple features like edges and lines.
In the second layer, we’re able to form parts of the face such as eyes and ears.
In the last layer, we can form the full face from all the previous layers.
In another example, we can visually see how the CNN breaks down an image using Convolution + ReLU and pooling to extract important features, and make a classification at the end.

The intuition here is that we are making predictions based on several feature maps. If we have feature maps telling us there are two eyes, a nose, and a mouth, we can predict that the image is a face.
Conclusion
We’ve seen in this post how to do the following steps in a CNN:
- Transform an image into a numerical vector
- Apply a filter to extract a convolved feature map
- Apply ReLU to change negative values to 0, producing a rectified feature map
- Apply pooling to reduce the rectified feature map
- Repeat until the important features have been extracted
- Pass them into a fully connected layer to perform prediction
- Remember that learning only changes the weights of the fully connected layer and the filter values