Activation Functions in Neural Networks

A concise analysis of the roles and types of activation functions in neural networks, with detailed pros and cons of 5 popularly used ones in an artificial neural network.

Sambit Mahapatra

The activation function, as the name suggests, decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus a bias. It is therefore a very significant component of deep learning, since it largely determines the output of a model. The activation function also has to be efficient to compute, so that the model can scale as the number of neurons increases.

To be precise, the activation function decides how much of the input's information is relevant for the next stage.

For example, suppose x1 and x2 are two inputs, with w1 and w2 their respective weights into the neuron. The output is Y = activation_function(y).

Here, y = x1·w1 + x2·w2 + b, i.e. the weighted sum of the inputs plus the bias.
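To make this concrete, here is a minimal sketch with made-up numbers (x1, x2, w1, w2, and b are arbitrary values chosen just for illustration, and sigmoid is used as one possible activation):

import tensorflow as tf

# hypothetical inputs, weights and bias, purely for illustration
x1, x2 = 0.5, -1.2
w1, w2 = 0.8, 0.3
b = 0.1

y = x1*w1 + x2*w2 + b    # weighted sum of inputs plus bias
Y = tf.nn.sigmoid(y)     # activation function applied to y (sigmoid as an example)
print(float(Y))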

Activation functions are mainly of 3 types. We will analyze the curve, pros, and cons of each here. The input we work with will be an arithmetic progression in [-10, 10] with a constant step of 0.1:

x = tf.Variable(tf.range(-10, 10, 0.1), dtype=tf.float32)
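The snippets in this post also rely on a small plotting helper, do_plot, whose definition is not shown here; a minimal sketch of the imports and helper assumed throughout:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# assumed helper: plot y against x with a title
def do_plot(x, y, title):
    plt.plot(x, y)
    plt.title(title)
    plt.grid(True)
    plt.show()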

Binary step

A binary step function is a threshold-based activation function. If the input is above a certain threshold, the neuron is activated and sends a fixed signal (1) to the next layer; otherwise it sends 0.

#Binary Step Activation
def binary_step(x):
    return np.array([1 if each > 0 else 0 for each in x.numpy()])

do_plot(x.numpy(), binary_step(x), 'Binary Step')

The binary step is mostly not used, for two reasons. Firstly, it allows only 2 outputs, which does not work for multi-class problems. Secondly, its derivative is zero everywhere except at the threshold (where it is undefined), so gradient-based training cannot use it.
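To see the derivative problem numerically, a finite-difference check (a sketch using numpy.gradient) shows the slope is zero everywhere except right at the threshold:

xs = x.numpy()
ys = binary_step(x).astype(np.float32)
slope = np.gradient(ys, xs)      # finite-difference slope of the step function
print(np.count_nonzero(slope))   # only the few points straddling the threshold are non-zero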

Linear

As the name suggests, the output is a linear function of the input, i.e. y = cx.

#Linear Activation
def linear_activation(x):
    c = 0.1
    return c * x.numpy()

do_plot(x.numpy(), linear_activation(x), 'Linear Activation')

The linear activation function is also not used in neural networks, for two main reasons.

Firstly, with this activation at multiple layers, the final output is still just a linear function of the input, since a composition of linear functions is itself linear. That defeats the purpose of having multiple neurons and layers, as the sketch below verifies numerically.

Secondly, as y = cx, the derivative is dy/dx = c, a constant that does not depend on the input. Backpropagation therefore cannot extract any useful, input-dependent error signal from it to train the network.
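The collapse of stacked linear layers can be verified directly; a small sketch with made-up weight matrices:

import numpy as np

rng = np.random.default_rng(0)
x_in = rng.normal(size=(4, 3))                 # a hypothetical batch of 4 inputs with 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# two layers with linear activation ...
two_layers = (x_in @ W1 + b1) @ W2 + b2
# ... equal exactly one linear layer with W = W1 @ W2 and b = b1 @ W2 + b2
one_layer = x_in @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_layers, one_layer))      # True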

Non-linear

Non-linear activation functions are used everywhere in neural networks: their non-linearity lets the network capture complex patterns in the data, and they support back-propagation because they have usable derivatives.

Here we discuss 5 popularly used activation functions.

1. Sigmoid

The sigmoid function squashes the output between 0 and 1, and the function looks like sigmoid(x) = 1 / (1 + e^(-x))

y = tf.nn.sigmoid(x)
do_plot(x.numpy(), y.numpy(), 'Sigmoid Activation')

The major advantage of this function is that its gradient is smooth and its output always lies between 0 and 1.

It has a few cons: the output always lies between 0 and 1, which on its own is not suitable for multi-class problems, and the exponential makes it computationally expensive, so with multiple layers and neurons the training gets slower.

with tf.GradientTape() as t:
    y = tf.nn.sigmoid(x)

do_plot(x.numpy(), t.gradient(y, x).numpy(), 'Grad of Sigmoid')

Also, as seen in the gradient graph, it suffers from the vanishing gradient problem: as the input moves from -10 to -5 or from 5 to 10, the gradient stays close to zero and barely changes, so the corresponding weights receive almost no update.
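The shape of that gradient curve follows from the closed form sigmoid'(x) = sigmoid(x)·(1 - sigmoid(x)), which peaks at 0.25 at x = 0 and decays towards zero on both sides; a quick check against the tape gradient:

with tf.GradientTape() as t:
    y = tf.nn.sigmoid(x)
grad = t.gradient(y, x)
print(np.allclose(grad.numpy(), (y * (1 - y)).numpy(), atol=1e-6))  # True
print(grad.numpy().max())                                           # ~0.25, reached near x = 0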

2. Tanh

The tanh activation function is much like sigmoid, but it squashes the output between -1 and 1. The function looks like tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x))

y = tf.nn.tanh(x)
do_plot(x.numpy(), y.numpy(), 'Tanh Activation')

In addition to the advantages of the sigmoid function, its output is zero-centered.

Like sigmoid, it is computationally expensive, so with multiple layers and neurons the training gets slower.

with tf.GradientTape() as t:
    y = tf.nn.tanh(x)

do_plot(x.numpy(), t.gradient(y, x).numpy(), 'Grad of Tanh')

As seen in the gradient graph, it also suffers from the vanishing gradient problem. Here, once the input leaves [-2.5, 2.5], the gradient stays essentially zero no matter how much the input changes.
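Similarly, the tanh gradient has the closed form tanh'(x) = 1 - tanh(x)^2, which equals 1 at x = 0 and falls off quickly on both sides; a quick check:

with tf.GradientTape() as t:
    y = tf.nn.tanh(x)
grad = t.gradient(y, x)
print(np.allclose(grad.numpy(), (1 - y**2).numpy(), atol=1e-6))  # True
print(grad.numpy().max())                                        # ~1.0, reached near x = 0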

3. ReLU

ReLU, or Rectified Linear Unit, either passes the information further or completely blocks it. The function looks like relu(x) = max(0, x)

y = tf.nn.relu(x)
do_plot(x.numpy(), y.numpy(), 'ReLU Activation')

It is the most popular activation due to its simplicity and non-linearity. Its derivatives are particularly well behaved: either they vanish or they just let the argument through.

with tf.GradientTape() as t:
    y = tf.nn.relu(x)

do_plot(x.numpy(), t.gradient(y, x).numpy(), 'Grad of ReLU')

One disadvantage is that it maps every negative input to zero, so the gradient for negative inputs is also zero and those neurons stop learning. This is called the dying ReLU problem.
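The problem is visible directly in the tape gradient: every negative input receives exactly zero gradient, so no error signal flows back through those neurons; a quick check:

with tf.GradientTape() as t:
    y = tf.nn.relu(x)
grad = t.gradient(y, x).numpy()
print(np.all(grad[x.numpy() < 0] == 0))   # True: negative inputs get no gradient at all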

4. Softmax

The softmax activation function gives its outputs in terms of probabilities, and the number of outputs is equal to the number of inputs. The function looks like,

softmax(xi) = e^(xi) / sum_j(e^(xj))

x1 = tf.Variable(tf.range(-1, 1, .5), dtype=tf.float32)
y = tf.nn.softmax(x1)
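A quick check that the four outputs form a probability distribution, i.e. they are non-negative and sum to 1:

print(y.numpy())                 # four probabilities, one per input
print(float(tf.reduce_sum(y)))   # ≈ 1.0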

The major advantage of this activation function is that it turns multiple outputs into a probability distribution, which is why it is popularly used in the output layer of a neural network. It makes it easier to classify multiple categories.

The main limitation, when softmax is used on its own as a classifier (softmax regression), is that it can only separate classes that are linearly separable. Another limitation is that it does not support null rejection, so you need to train the model with an explicit null class if you need one.

5. Swish

The Google Brain team proposed this activation function, named Swish. The function looks like swish(x) = x · sigmoid(x)

According to their paper, it performs better than ReLU with a similar level of computational efficiency.

y = tf.nn.swish(x)
do_plot(x.numpy(), y.numpy(), 'Swish Activation')
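Since swish(x) = x·sigmoid(x), the built-in can be checked against that definition directly:

manual_swish = x * tf.nn.sigmoid(x)
print(np.allclose(y.numpy(), manual_swish.numpy(), atol=1e-6))  # True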

One reason Swish might perform better than ReLU is that it addresses the dying ReLU issue: as the gradient graph below shows, its gradient stays non-zero for small negative inputs.

with tf.GradientTape() as t:
    y = tf.nn.swish(x)

do_plot(x.numpy(), t.gradient(y, x).numpy(), 'Grad of Swish')

As an additional note, ReLU also has other quite popular variants, such as Leaky ReLU, Parametric ReLU, etc.
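For completeness, here is a minimal sketch of one such variant, Leaky ReLU, which keeps a small slope for negative inputs instead of zeroing them out (alpha=0.2 is TensorFlow's default, used here just for illustration):

y = tf.nn.leaky_relu(x, alpha=0.2)   # leaky_relu(x) = x if x > 0 else alpha * x
do_plot(x.numpy(), y.numpy(), 'Leaky ReLU Activation')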
