

4.2 Multilayer Neural Networks (Part 1-3)


What we covered in this video lecture

In this lecture, we discussed the limitations of the models we covered earlier in this course: the perceptron and logistic regression. Multilayer networks help us overcome these limitations. (If you are wondering which limitations we are talking about, you get to answer this in the quiz!)

We then discussed the advantages and disadvantages of designing wide versus deep neural networks. Here, width refers to the number of hidden units per hidden layer, and depth refers to the number of layers.

Lastly, we also discussed different architecture design considerations, for example, using different (or no) nonlinear activation functions and the importance of random weight initialization.
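As a sketch of these design choices, here is a minimal PyTorch multilayer perceptron (the class and argument names are illustrative, not from the lecture) in which width and depth are explicit hyperparameters and each hidden layer uses a ReLU nonlinearity:

```python
import torch

class MLP(torch.nn.Module):
    """Multilayer perceptron; num_hidden sets the width, num_layers the depth."""

    def __init__(self, num_features, num_hidden, num_layers, num_classes):
        super().__init__()
        layers = []
        in_dim = num_features
        for _ in range(num_layers):
            layers.append(torch.nn.Linear(in_dim, num_hidden))
            layers.append(torch.nn.ReLU())  # nonlinear activation between layers
            in_dim = num_hidden
        layers.append(torch.nn.Linear(in_dim, num_classes))  # output layer (logits)
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

# e.g., a network that is 32 units wide and 2 hidden layers deep
model = MLP(num_features=4, num_hidden=32, num_layers=2, num_classes=3)
logits = model(torch.randn(8, 4))  # 8 examples with 4 features each
```

Making the width and depth constructor arguments like this makes it easy to experiment with the wide-versus-deep trade-off discussed in the lecture.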

Additional resources if you want to learn more

If you are interested in learning more about the different activation functions, as teased in 4.2 Part 2, I recommend the survey paper A Comprehensive Survey and Performance Analysis of Activation Functions in Deep Learning.

Note that it is possible to override PyTorch’s default weight initialization scheme using the following code:

def weights_init(m):
    if isinstance(m, torch.nn.Linear):
        *

The * above is a placeholder for a weight initialization function in PyTorch. Which weight initialization function should be used depends on the activation function. For example, a common choice for ReLU activations is Kaiming initialization:

nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
nn.init.constant_(m.bias, 0)
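Putting the pieces together, the initialization function can be applied recursively to every submodule via Module.apply; a minimal sketch (the example model below is made up just for illustration):

```python
import torch
from torch import nn

def weights_init(m):
    # Kaiming (He) initialization for all Linear layers; biases set to zero
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.constant_(m.bias, 0)

# hypothetical example model, just for illustration
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
model.apply(weights_init)  # .apply() visits every submodule recursively
```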

You can find out more about Kaiming initialization in the following paper:

  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification


Quiz: 4.2 Multilayer Neural Networks and Why We Need Them (PART 1)

What is one of the advantages of softmax regression over logistic regression?

Incorrect. Both softmax and logistic regression should always converge (unlike the perceptron).

Correct. Softmax regression works with an arbitrary number of classes.

Incorrect. Unfortunately, softmax regression is restricted to linear decision boundaries. We need to add hidden layers (aka multilayer neural networks) to get nonlinear boundaries.


Quiz: 4.2 Multilayer Neural Networks and Why We Need Them (PART 2)

What changes do we need to make to the softmax regression model to convert it into a multilayer perceptron?

Correct. Yes, that’s the only change we need to make (assuming the hidden layer comes with a nonlinear activation function).

Incorrect. We may use a ReLU function for the hidden layer, but we don’t need to change the output layer.

Incorrect. We don’t need to change the loss function, and there is no such thing as multilayer cross-entropy.


Quiz: 4.2 Multilayer Neural Networks and Why We Need Them (PART 3)

Which of the following is crucial for producing nonlinear decision boundaries?

Incorrect. A single hidden layer is already sufficient for producing nonlinear decision boundaries.

Correct. Without nonlinear activation functions, the model is a generalized linear model that can’t learn nonlinear decision boundaries.
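This can be checked numerically: a small sketch (with arbitrary layer sizes) showing that stacking two linear layers without an activation in between collapses into a single linear map:

```python
import torch
from torch import nn

torch.manual_seed(123)
lin1 = nn.Linear(4, 8, bias=False)
lin2 = nn.Linear(8, 3, bias=False)

x = torch.randn(5, 4)
stacked = lin2(lin1(x))  # two linear layers, no activation in between

# the composition is itself a single linear layer with weight W2 @ W1
single = x @ (lin2.weight @ lin1.weight).T
# stacked and single are (numerically) identical
```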

Correct. Without random weight initialization, the hidden layer acts as a layer with only one hidden unit, which doesn’t allow us to learn good decision boundaries.
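The symmetry problem can also be demonstrated in a few lines: if we initialize a layer with identical (non-random) weights, every hidden unit computes exactly the same output, so the layer is no more expressive than a single unit. A sketch:

```python
import torch
from torch import nn

layer = nn.Linear(4, 8)
nn.init.constant_(layer.weight, 0.5)  # identical (non-random) weights everywhere
nn.init.constant_(layer.bias, 0.0)

h = torch.relu(layer(torch.randn(5, 4)))
# with identical weights, all 8 hidden units compute the same function,
# so every column of h is identical -- the symmetry is never broken
```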
