3.3 Model Training with Stochastic Gradient Descent (Parts 1-4)

What we covered in this video lecture

This lecture introduced the training algorithm behind logistic regression: stochastic gradient descent. This is the same training algorithm we use for training deep neural networks.

Stochastic gradient descent is based on calculus: we compute the loss function’s derivatives (or gradients) with respect to the model weights. Why? The loss measures “how wrong” the predictions are. And the gradient tells us how we have to change the weights to minimize (improve) the loss.
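For example, a single gradient descent step for the weight w1 takes the form w1 := w1 - learning_rate · ∂L / ∂w1. Here is a minimal sketch of that update with made-up numbers (my own illustration, not code from the lecture):

    # Minimal sketch of one gradient descent step for a single weight (made-up values)
    learning_rate = 0.1
    grad_w1 = 0.25                        # assume this is the computed derivative dL/dw1
    w1 = 0.5                              # current weight value

    w1 = w1 - learning_rate * grad_w1     # step against the gradient to decrease the loss
    print(w1)                             # updated weight: 0.5 - 0.1 * 0.25 = 0.475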

The loss is correlated with the accuracy, but sadly, we cannot optimize the accuracy directly using stochastic gradient descent. That's because accuracy is not a smooth, differentiable function of the model weights: it only changes in jumps when a prediction flips from wrong to right (or vice versa).
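To make this concrete, here is a small illustration (my own example, not from the lecture): nudging a predicted probability slightly changes the loss, but the accuracy only changes when a prediction crosses the 0.5 threshold, so it usually stays flat:

    import math

    y = 1  # true class label

    for prob in (0.60, 0.61):                    # two nearby predicted probabilities
        loss = -math.log(prob)                   # binary cross-entropy loss for y = 1
        correct = int((prob > 0.5) == bool(y))   # thresholded prediction: right or wrong
        print(f"prob={prob:.2f}  loss={loss:.4f}  correct={correct}")

    # The loss decreases smoothly from about 0.5108 to 0.4943, but "correct" stays at 1,
    # so the accuracy provides no gradient signal for this small change.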

Computing the loss gradients is based on the chain rule from calculus, and if you are not familiar with it, it may look daunting at first. But do not worry. We will introduce PyTorch functions that can handle the differentiation (that is, the calculation of the gradients) automatically for us. This is known as automatic differentiation or autograd.
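To preview what this looks like, here is a minimal autograd sketch for a single logistic regression example (my own made-up values; the actual course code in the following units differs in the details):

    import torch
    import torch.nn.functional as F

    x = torch.tensor([1.2, -0.8])                      # one training example with two features
    y = torch.tensor(1.0)                              # its class label

    w = torch.tensor([0.1, 0.3], requires_grad=True)   # weights tracked by autograd
    b = torch.tensor(0.0, requires_grad=True)          # bias tracked by autograd

    z = torch.dot(x, w) + b                            # weighted sum (net input)
    a = torch.sigmoid(z)                               # logistic activation
    loss = F.binary_cross_entropy(a, y)                # "how wrong" the prediction is

    loss.backward()                                    # autograd applies the chain rule for us
    print(w.grad, b.grad)                              # dL/dw and dL/db, ready for a weight update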

Additional resources if you want to learn more

The following lecture introduces PyTorch functionality that calculates the gradients automatically for us. However, if you are new to calculus or need a refresher and you want to learn more (not required for this course), I have written a concise calculus primer that you might find helpful: Calculus and Differentiation Primer.

Moreover, if you are interested in an alternative introduction to stochastic gradient descent, you may find my article Single-Layer Neural Networks and Gradient Descent helpful.

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 1

When using gradient descent to update the weight w1, which of the following values do we need to compute, or which are part of the computation? (Check all that apply.)

Correct. To update w1, we compute “∂L / ∂w1”.

Incorrect. This value is not needed since we don’t need/want to update x1.

Incorrect. This value is not needed since we don’t need/want to update x1.

Correct. This term is part of the computation via the chain rule. We wrote it as “da / dz”.
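Putting the pieces together, and assuming the lecture’s notation (z for the weighted sum, a for the activation, and L for the loss), the chain rule expands the weight gradient as

    ∂L / ∂w1 = (∂L / ∂a) · (da / dz) · (∂z / ∂w1)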

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 2

We can think of a “gradient” as a fancy term to describe the concept of a derivative in multiple dimensions.

Correct. If we have a function with multiple inputs, we can compute a gradient to capture the slope in multiple dimensions. E.g., if the function takes 2 inputs, the gradient contains 2 partial derivatives, one slope per input dimension.

Incorrect. The concept of a gradient is similar to that of a derivative. We use gradients when we work with functions that have multiple inputs.
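Written out for a function f with two inputs w1 and w2, the gradient is simply the vector of both partial derivatives:

    ∇f(w1, w2) = [ ∂f / ∂w1 , ∂f / ∂w2 ]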

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 3

Comparing the perceptron learning algorithm with gradient descent, which of the following answers is/are correct?

Correct. Based on the predicted label, the weights are immediately updated.

Incorrect. The gradient descent algorithm computes the loss (and gradient) based on the whole training set.

Incorrect. The weights are updated after each wrong prediction.

Correct. The gradient descent algorithm computes the loss (and gradient) based on the whole training set.
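The contrast can be sketched in a few lines of code (my own illustration with a made-up toy dataset, not code from the lecture):

    import numpy as np

    # Toy dataset: 4 examples, 2 features, binary labels (made-up values)
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, 0, 0])
    lr = 0.1

    # Perceptron learning rule: the weights change immediately after each wrong prediction
    w, b = np.zeros(2), 0.0
    for x_i, y_i in zip(X, y):
        pred = int(x_i @ w + b > 0.0)
        if pred != y_i:                          # only misclassified examples trigger an update
            w += lr * (y_i - pred) * x_i
            b += lr * (y_i - pred)

    # (Full-batch) gradient descent for logistic regression: one update per pass,
    # with the gradient computed from the whole training set
    w, b = np.zeros(2), 0.0
    a = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted probabilities for all examples
    w -= lr * X.T @ (a - y) / len(y)             # gradient averaged over the full dataset
    b -= lr * np.mean(a - y)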

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 4

Stochastic gradient descent is a flavor of gradient descent that introduces a certain level of randomness into the training process. In order to do so, stochastic gradient descent …

Correct. For each weight update, we compute the loss based on a single training example or a minibatch, which introduces a certain level of noise (or randomness) compared to regular gradient descent, which computes the weight update based on the whole training set. In this sense, the gradient used for the weight update in stochastic gradient descent is an approximation of the full gradient from regular gradient descent.

Incorrect. We do not explicitly modify the weight update values.

Incorrect. Training examples are usually selected randomly (e.g., by shuffling the training set or drawing examples in random order), but over the course of training, stochastic gradient descent still works through the full training set rather than only a fixed subset of it.
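A rough minibatch SGD loop might then look like the following sketch (again my own illustration with made-up data, not code from the lecture):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))                     # made-up dataset: 100 examples, 2 features
    y = (X[:, 0] + X[:, 1] > 0).astype(float)         # made-up labels
    w, b, lr, batch_size = np.zeros(2), 0.0, 0.1, 10

    for epoch in range(5):
        indices = rng.permutation(len(X))             # shuffling is the "stochastic" part
        for start in range(0, len(X), batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            a = 1.0 / (1.0 + np.exp(-(X_b @ w + b)))  # predictions for the minibatch only
            w -= lr * X_b.T @ (a - y_b) / len(batch)  # noisy estimate of the full gradient
            b -= lr * np.mean(a - y_b)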
