A deep learning model consists mainly of nodes and the connections between them. Most of the time, every node in one layer is connected to every node in the next layer, an arrangement we call a Dense layer.
Within each node is a mathematical equation that decides, based on the input values and their weights, what value to output to the next layer. These mathematical equations are called Activation Functions.
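To make the idea of a node concrete, here is a minimal sketch of a single node in plain Python. The name `node_output`, the example numbers, and the bias term are assumptions for illustration (the bias is not mentioned above, but it is commonly included in practice).

```python
import math

def node_output(inputs, weights, bias, activation):
    """Compute the weighted sum z of the inputs, then apply the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

# Here the node uses TanH as its activation function.
a = node_output(inputs, weights, bias, math.tanh)
print(a)
```

Swapping `math.tanh` for another function changes how the node responds to the same weighted sum.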
Different Activation Functions
There are several kinds of Activation Functions, or in other words, different kinds of mathematical operations that a node can perform. Four common ones are:
- Sigmoid Function
- Tanh Function
- ReLU Function
- Leaky ReLU Function
These activation functions take in the input z, computed from the outputs of the previous layer, and feed it into their equations to produce an output.
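The four functions listed above can be written in a few lines each. This is a sketch using only Python's `math` module; the default slope `a=0.01` for Leaky ReLU is an assumed, commonly used value, not something fixed by this post.

```python
import math

def sigmoid(z):
    """Squashes z into the range (0, 1). Written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    """Squashes z into the range (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive z through unchanged and outputs 0 otherwise."""
    return max(0.0, z)

def leaky_relu(z, a=0.01):
    """Like ReLU, but scales negative z by a small slope a (assumed default)."""
    return z if z > 0 else a * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```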
Sigmoid vs TanH
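The TanH function is almost always a better choice than the Sigmoid function for hidden nodes, because its output is centered at 0 (it ranges from -1 to 1, whereas Sigmoid ranges from 0 to 1). TanH also has a larger derivative around z = 0 (a maximum of 1, compared with 0.25 for Sigmoid), which generally leads to faster learning. In addition, a mean of 0 avoids pushing the gradients of the next layer toward a single sign.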
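A quick numerical check of the comparison between Sigmoid and TanH, as a sketch: it averages each function over a symmetric grid of z values (the grid itself is an arbitrary choice) and evaluates both derivatives at z = 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of Sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of TanH: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# Symmetric grid of inputs from -3.0 to 3.0 (arbitrary, for illustration).
zs = [i / 10.0 for i in range(-30, 31)]

mean_sigmoid = sum(sigmoid(z) for z in zs) / len(zs)
mean_tanh = sum(math.tanh(z) for z in zs) / len(zs)

print("mean output:", mean_sigmoid, mean_tanh)
print("gradient at z = 0:", sigmoid_grad(0.0), tanh_grad(0.0))
```

On inputs that are symmetric around 0, the TanH outputs average out to about 0 while the Sigmoid outputs average out to about 0.5, and the TanH gradient at z = 0 is four times the Sigmoid gradient.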
ReLU vs (Sigmoid + TanH)
The drawback of both Sigmoid and TanH is that their curves flatten out at both ends: if the value of z is either very large or very negative, the gradient of the curve becomes extremely small. Such a small gradient slows down learning when performing Gradient Descent.
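To see how quickly these gradients shrink, the sketch below evaluates both derivatives at a few values of z. The sample points are arbitrary.

```python
import math

def sigmoid_grad(z):
    """Derivative of Sigmoid: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of TanH: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 2.0, 5.0, 10.0, -10.0):
    print(f"z = {z:6.1f}  sigmoid' = {sigmoid_grad(z):.2e}  tanh' = {tanh_grad(z):.2e}")
```

At z = 10 the Sigmoid gradient is roughly 4.5e-5, so a Gradient Descent update that is proportional to it barely moves the weights.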
The solution to this is ReLU (Rectified Linear Unit), which has a constant gradient of 1 for every positive value of z. But for ReLU, a negative value of z results in an activation of 0, and a gradient of 0 as well, so the node stops learning for those inputs. The solution for that is a Leaky ReLU, which uses a small slope a for negative values of z, so that both the output and the gradient stay small but non-zero.
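The gradients of ReLU and Leaky ReLU are simple enough to write out directly. In this sketch the slope `a=0.01` is an assumed default, and treating the gradient at exactly z = 0 as the negative-side value is a convention chosen here, since the derivative is not defined at that point.

```python
def relu_grad(z):
    """Gradient of ReLU: 1 for positive z, 0 otherwise (convention at z = 0)."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, a=0.01):
    """Gradient of Leaky ReLU: 1 for positive z, the small slope a otherwise."""
    return 1.0 if z > 0 else a

for z in (-100.0, -1.0, 0.5, 100.0):
    print(z, relu_grad(z), leaky_relu_grad(z))
```

Unlike Sigmoid and TanH, the gradient does not shrink as z grows, and with Leaky ReLU it never drops all the way to 0 for negative z.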
Must they Always be Non-Linear?
Yes, the Activation Functions of hidden nodes must always be non-linear. A stack of layers with linear activation functions can be condensed into a single linear layer, effectively negating the benefit of having any hidden layers or hidden nodes.
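The collapse of linear layers can be checked with a tiny example. This sketch assumes the linear activation is the identity and leaves out biases for brevity; the weight matrices `W1` and `W2` are hypothetical.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Hypothetical weights: a hidden layer with 3 nodes, then 1 output node.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 3 x 2
W2 = [[2.0, 1.0, 0.0]]                       # 1 x 3
x = [1.0, 2.0]

# Two layers with linear (identity) activations.
two_layers = matvec(W2, matvec(W1, x))

# The same mapping as one layer whose weights are the product W2 * W1.
W_combined = matmul(W2, W1)
one_layer = matvec(W_combined, x)

print(two_layers, one_layer)
```

Both paths give the same output for any input, so the hidden layer adds nothing that a single layer could not already represent.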
In this post, we talked very briefly about the different kinds of Activation Functions, and compared their pros and cons.
A recommendation for building a neural network model is to use either TanH or ReLU for all the hidden nodes, and never Sigmoid.
The only place to use a Sigmoid is the output layer, if your problem is a binary classification problem.
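As a sketch of that recommendation, here is a forward pass through a tiny network with a TanH hidden layer and a single Sigmoid output node. The helper `dense`, the layer sizes, and all weights are hypothetical and untrained, so the output only illustrates the shape of the computation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One Dense layer: each row of weights belongs to one node."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical, untrained parameters.
x = [0.5, -1.2]
W_hidden = [[0.3, -0.8], [1.1, 0.4], [-0.5, 0.9]]
b_hidden = [0.0, 0.1, -0.2]
W_out = [[0.7, -1.3, 0.5]]
b_out = [0.05]

hidden = dense(x, W_hidden, b_hidden, math.tanh)   # TanH in the hidden layer
output = dense(hidden, W_out, b_out, sigmoid)      # Sigmoid only at the output
probability = output[0]

print(hidden, probability)
```

Because the Sigmoid output always lies between 0 and 1, it can be read as the predicted probability of the positive class.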