If you accidentally touch a hot object, you automatically pull your hand away without even thinking. This is a reflex action. But what causes it? The sensory neurons in your hand sense the hot object, become activated, and send a signal that makes you pull your hand away. Importantly, only the neurons in the area that touched the hot object are activated; the others remain inactive.
Similarly, activation functions are the decision-making units of a neural network in deep learning. When the input features are fed into the input layer, they are multiplied by weights and a bias is added. The value of this processed input can be anything from -∞ to +∞, so an activation function is applied to it to transform the net input into a useful, typically bounded, range. Finally, the output of the activation function moves on to the next hidden layer, where the same process is repeated.
Why Activation Functions?
It is important to use activation functions; otherwise, the neural network would be nothing but a linear regression model. An activation function applies a non-linear transformation to the input before it is sent to the next layer of neurons, which is what allows the model to learn and perform more complex tasks.
Mathematically, we can write it as:
y = (w1x1 + w2x2 + w3x3) + bias
z = Activation(y)
Z = z * w4
output = Activation(Z)
In general,
y = weights * input features + bias
z = Activation(y)
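To make this forward computation concrete, here is a minimal NumPy sketch of a single neuron; the input values and weights are made up for illustration, and sigmoid is used as the activation:

import numpy as np

# hypothetical input features and parameters for one neuron
x = np.array([0.5, -1.2, 3.0])      # x1, x2, x3
w = np.array([0.4, 0.7, -0.2])      # w1, w2, w3
bias = 0.1

y = np.dot(w, x) + bias             # weighted sum plus bias: can be any value in (-inf, +inf)
z = 1.0 / (1.0 + np.exp(-y))        # activation (sigmoid) squashes it into (0, 1)
print(y, z)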
Activation functions can be divided into two types:
Linear Activation Function
Non-Linear Activation Functions
Linear or Identity Activation Function
This function is a straight line: the output is directly proportional to the input, so it ranges from -∞ to +∞. It is useful when the data can be separated by a straight line, but it does not help the network handle complex, non-linear relationships in the input.
Mathematical definition:
f(x) = x
f’ (x) = 1
Non-Linear Activation Functions
In neural networks, non-linear activation functions are used almost exclusively, because they allow the network to model complex, non-linear relationships in the data.
Non-linear activation functions are often distinguished by the range and shape of their curves. Several different types are used in deep learning; some of them are discussed below:
Sigmoid or Logistic Activation function
This function squashes every input value into the range 0 to 1. Its threshold lies at an input of 0: inputs greater than 0 produce outputs above 0.5, pushing the neuron towards activation, while inputs below 0 produce outputs below 0.5, leaving it largely inactive. However, a problem arises during backpropagation when we use the sigmoid function.
In backpropagation, weights are updated using the rule Wnew = Wold − η ∂L/∂Wold, where Wnew is the updated weight, Wold is the old weight, η is the learning rate, and ∂L/∂Wold is the derivative of the loss with respect to the weight. The derivative of the sigmoid function always lies between 0 and 0.25, so as the number of layers increases, the chained derivatives become smaller and smaller. The updated weight then becomes approximately equal to the old weight, and the network stops learning. This is called the vanishing gradient problem.
Mathematical definition:
f(x) = 1 / (1 + e^-x)
f'(x) = f(x)(1 - f(x))
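The vanishing gradient effect can be seen numerically. The short NumPy sketch below (the 10-layer chain is just an illustrative assumption) shows that the sigmoid derivative never exceeds 0.25, so multiplying such derivatives across layers shrinks the gradient towards zero:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # f'(x) = f(x)(1 - f(x)), at most 0.25 (at x = 0)

print(sigmoid_derivative(0.0))      # 0.25, the largest value the derivative can take

grad = 1.0
for _ in range(10):                 # chain rule across a hypothetical 10-layer stack
    grad *= sigmoid_derivative(0.0)
print(grad)                         # 0.25**10 ≈ 9.5e-07: the gradient has almost vanished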
Tanh (Hyperbolic Tangent) Activation function
The tanh activation function is similar to the sigmoid activation function, but it squashes all values into the range -1 to 1. It is also prone to the vanishing gradient problem. Unlike sigmoid, it is a zero-centered function, which often helps optimization, but it takes slightly more time to compute than sigmoid.
Mathematical definition:
f(x) = (e^x - e^-x) / (e^x + e^-x)
f'(x) = 1 - f(x)^2
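A quick numerical comparison (a small NumPy sketch) illustrates the zero-centered range of tanh versus sigmoid on the same inputs:

import numpy as np

x = np.linspace(-3, 3, 7)           # symmetric inputs around 0
print(np.tanh(x))                   # outputs lie in (-1, 1) and are centered around 0
print(1 / (1 + np.exp(-x)))         # sigmoid outputs lie in (0, 1) and are never negative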
ReLU (Rectified Linear Unit) Activation function
It is the most commonly used activation function. It returns the maximum of 0 and the input value: positive inputs pass through unchanged, while negative inputs are mapped to 0, because the maximum of 0 and a negative number is always 0. Using ReLU largely avoids the vanishing gradient problem, since its derivative is 1 for all positive inputs.
However, ReLU has a problem of its own. During backpropagation, both the output and the derivative are 0 for negative inputs, so the corresponding weights receive no gradient and stop updating (Wnew = Wold). Such a neuron is called a dead neuron, and the phenomenon is known as the dying ReLU problem. To fix it, another activation function, Leaky ReLU, is used.
Mathematical definition:
f(x) = max(0, x)
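A small NumPy sketch of ReLU and its derivative makes the dead-neuron issue visible: negative inputs get both a zero output and a zero gradient.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    return (x > 0).astype(float)    # 1 for positive inputs, 0 for negative inputs

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))                      # [0.  0.  0.  1.5 3. ] - negative inputs are clipped to 0
print(relu_derivative(x))           # [0. 0. 0. 1. 1.] - no gradient flows for negative inputs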
Leaky ReLU Activation function
Leaky ReLU is used to fix the dead-neuron problem. Instead of mapping negative inputs to 0, it multiplies them by a small slope (typically 0.01), so their output and gradient never become exactly 0. But how do we know when to use Leaky ReLU?
A practical rule of thumb: if a large fraction of neurons stay inactive during training (say, 50 out of 100), it is worth switching to Leaky ReLU.
Mathematical definition:
f(x) = max(0.01x, x)
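A minimal NumPy sketch of Leaky ReLU, using the common slope of 0.01 for negative inputs:

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope keeps negative inputs (and gradients) alive

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))                       # [-0.02  -0.005  0.     1.5  ]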
ELU (Exponential Linear Unit) Activation function
This function deals with negative values more gracefully: instead of cutting them to zero, it returns a smooth negative value, which pushes the mean activation closer to zero. It solves the dying ReLU problem, but it is computationally more expensive because of the exponential.
Mathematical definition:
f(x) = x,            x ≥ 0
       α(e^x - 1),   x < 0
f'(x) = 1,           x ≥ 0
        f(x) + α,    x < 0
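A short NumPy sketch of ELU (with α = 1.0, a typical default) shows how negative inputs saturate smoothly towards -α instead of being cut to a hard zero:

import numpy as np

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))                       # roughly [-0.95 -0.63  0.    2.  ]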
PReLU (Parametric ReLU) Activation function
Like ELU, PReLU is also a variant of ReLU, but here α is a learnable parameter that is trained along with the network's weights. If α is fixed at 0.01 it becomes Leaky ReLU, and if α = 0 it becomes ReLU.
Mathematical definition:
f(x) = x,    x ≥ 0
       αx,   x < 0
f'(x) = 1,   x ≥ 0
        α,   x < 0
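In Keras, PReLU is available as a layer whose α is learned during training. A minimal sketch (the layer sizes here are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(8, input_shape=(4,)),
        tf.keras.layers.PReLU(),    # alpha starts at 0 by default and is updated by backpropagation
        tf.keras.layers.Dense(1),
    ]
)
model.summary()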
Swish Activation function
Swish is about as computationally efficient as ReLU but has been reported to perform better than ReLU on deeper models, typically networks with more than about 40 layers. It is also sometimes used in LSTM-based architectures.
Mathematical definition:
f(x) = x · sigmoid(x) = x / (1 + e^-x)
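A minimal NumPy sketch of swish; note that unlike ReLU, small negative inputs still produce small non-zero outputs:

import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

x = np.array([-4.0, -1.0, 0.0, 2.0])
print(swish(x))                     # roughly [-0.07 -0.27  0.    1.76]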
Softmax Activation function
It can be thought of as a combination of multiple sigmoids, generalized to several outputs. It is mainly used for multiclass classification problems and is typically the last activation function in the network, normalizing its outputs. It converts the raw scores into a probability for each class, and these probabilities always sum to 1; the class with the highest probability is the predicted class.
Mathematical definition:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
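A small NumPy sketch of softmax on hypothetical raw scores for three classes, confirming that the outputs sum to 1:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))       # subtracting the max improves numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # made-up raw scores for 3 classes
probs = softmax(logits)
print(probs)                        # roughly [0.659 0.242 0.099]
print(probs.sum())                  # 1.0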
Softplus Activation function
It is a smooth version of ReLU: it behaves like ReLU away from 0, but near 0 it is smooth and differentiable. It takes more time to compute than ReLU because it involves log and exp. Interestingly, the derivative of the softplus function is the sigmoid function.
Mathematical definition:
f(x) = ln(1 + e^x)
f'(x) = 1 / (1 + e^-x) = sigmoid(x)
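The relationship between softplus and sigmoid can be checked numerically; the sketch below compares a finite-difference derivative of softplus with the sigmoid at an arbitrary point:

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))      # ln(1 + e^x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 1.3, 1e-6
numerical = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(numerical, sigmoid(x))        # both are approximately 0.786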
Now we have understood activation functions theoretically, but that is not enough to build a deep learning model; understanding the implementation is also necessary.
Activation functions can be used either through a dedicated Activation layer or through the activation argument supported by most layers.
Here I have used a Keras Sequential model to show the implementation through the activation argument of the Dense layer.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential(
    [
        Dense(6, activation="relu"),
        Dense(15, activation="sigmoid"),
        Dense(20),
    ]
)

x = tf.ones((6, 15))  # a tensor of shape (6, 15) with all elements set to one
y = model(x)  # calling the model on an input builds it and runs a forward pass
print("No. of weights =", len(model.weights))  # number of weight tensors after calling the model
model.summary()  # summary of the model's layers and parameters
Output: No. of weights = 6
Here, I have used only two activation functions just to show the implementation, but we can add other activations according to the requirements of our model.
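For completeness, the same model can also be written with explicit Activation layers instead of the activation argument; a minimal sketch:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

model = Sequential(
    [
        Dense(6),
        Activation("relu"),         # separate Activation layer instead of activation="relu"
        Dense(15),
        Activation("sigmoid"),
        Dense(20),
    ]
)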
Below are the signatures of some activation functions available in tf.keras.activations; any of them can also be passed via the activation argument:
tf.keras.activations.relu(x, alpha=0.0, max_value=None, threshold=0)
tf.keras.activations.sigmoid(x)
tf.keras.activations.softmax(x, axis=-1)
tf.keras.activations.softplus(x)
tf.keras.activations.tanh(x)
tf.keras.activations.elu(x, alpha=1.0)
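These functions can also be called directly on a tensor; for example:

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 2.0])
print(tf.keras.activations.relu(x).numpy())     # [0. 0. 0. 2.]
print(tf.keras.activations.sigmoid(x).numpy())  # values squashed into (0, 1)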
Hope you have enjoyed the blog. Feel free to provide your feedback and ask your queries in the comment box.