3.3 Model Training with Stochastic Gradient Descent (Parts 1-4)

What we covered in this video lecture

This lecture introduced the training algorithm behind logistic regression: stochastic gradient descent. This is the same training algorithm we use for training deep neural networks.

Stochastic gradient descent is based on calculus: we compute the loss function’s derivatives (or gradients) with respect to the model weights. Why? The loss measures “how wrong” the predictions are. And the gradient tells us how we have to change the weights to minimize (improve) the loss.
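For example, a single gradient descent step for the weight w1 takes the form w1 := w1 - learning_rate · ∂L / ∂w1. Here is a minimal sketch of that update with made-up numbers (my own illustration, not code from the lecture):

    # Minimal sketch of one gradient descent step for a single weight (made-up values)
    learning_rate = 0.1
    grad_w1 = 0.25                        # assume this is the computed derivative dL/dw1
    w1 = 0.5                              # current weight value

    w1 = w1 - learning_rate * grad_w1     # step against the gradient to decrease the loss
    print(w1)                             # updated weight: 0.5 - 0.1 * 0.25 = 0.475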

The loss is correlated with the accuracy, but sadly, we cannot optimize the accuracy directly using stochastic gradient descent. That's because accuracy is not a smooth, differentiable function of the model weights: it only changes in jumps when a prediction flips from wrong to right (or vice versa).
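To make this concrete, here is a small illustration (my own example, not from the lecture): nudging a predicted probability slightly changes the loss, but the accuracy only changes when a prediction crosses the 0.5 threshold, so it usually stays flat:

    import math

    y = 1  # true class label

    for prob in (0.60, 0.61):                    # two nearby predicted probabilities
        loss = -math.log(prob)                   # binary cross-entropy loss for y = 1
        correct = int((prob > 0.5) == bool(y))   # thresholded prediction: right or wrong
        print(f"prob={prob:.2f}  loss={loss:.4f}  correct={correct}")

    # The loss decreases smoothly from about 0.5108 to 0.4943, but "correct" stays at 1,
    # so the accuracy provides no gradient signal for this small change.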

Computing the loss gradients is based on the chain rule from calculus, and if you are not familiar with it, it may look daunting at first. But do not worry. We will introduce PyTorch functions that can handle the differentiation (that is, the calculation of the gradients) automatically for us. This is known as automatic differentiation or autograd.
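To preview what this looks like, here is a minimal autograd sketch for a single logistic regression example (my own made-up values; the actual course code in the following units differs in the details):

    import torch
    import torch.nn.functional as F

    x = torch.tensor([1.2, -0.8])                      # one training example with two features
    y = torch.tensor(1.0)                              # its class label

    w = torch.tensor([0.1, 0.3], requires_grad=True)   # weights tracked by autograd
    b = torch.tensor(0.0, requires_grad=True)          # bias tracked by autograd

    z = torch.dot(x, w) + b                            # weighted sum (net input)
    a = torch.sigmoid(z)                               # logistic activation
    loss = F.binary_cross_entropy(a, y)                # "how wrong" the prediction is

    loss.backward()                                    # autograd applies the chain rule for us
    print(w.grad, b.grad)                              # dL/dw and dL/db, ready for a weight update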

Additional resources if you want to learn more

The following lecture introduces PyTorch functionality that calculates the gradients automatically for us. However, if you are new to calculus or need a refresher and you want to learn more (not required for this course), I have written a concise calculus primer that you might find helpful: Calculus and Differentiation Primer.

Moreover, if you are interested in an alternative introduction to stochastic gradient descent, you may find my article Single-Layer Neural Networks and Gradient Descent helpful.

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 1

When using gradient descent to update the weight w1, which of the following values do we need to compute, or which are part of the computation? (Check all that apply.)

Correct. To update w1, we compute “∂L / ∂w1”.

Incorrect. This value is not needed since we don’t need/want to update x1.

Incorrect. This value is not needed since we don’t need/want to update x1.

Correct. This term is part of the computation via the chain rule. We wrote it as “da / dz”.
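Putting the pieces together, and assuming the lecture’s notation (z for the weighted sum, a for the activation, and L for the loss), the chain rule expands the weight gradient as

    ∂L / ∂w1 = (∂L / ∂a) · (da / dz) · (∂z / ∂w1)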

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 2

We can think of a “gradient” as a fancy term to describe the concept of a derivative in multiple dimensions.

Correct. If we have a function with multiple inputs, we can compute a gradient to capture the slope in multiple dimensions. E.g., if the function takes 2 inputs, the gradient contains 2 partial derivatives, one slope per input dimension.

Incorrect. The concept of a gradient is similar to that of a derivative. We use gradients when we work with functions that have multiple inputs.
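Written out for a function f with two inputs w1 and w2, the gradient is simply the vector of both partial derivatives:

    ∇f(w1, w2) = [ ∂f / ∂w1 , ∂f / ∂w2 ]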

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 3

Comparing the perceptron learning algorithm with gradient descent, which of the following answers is/are correct?

Correct. Based on the predicted label, the weights are immediately updated.

Incorrect. The gradient descent algorithm computes the loss (and gradient) based on the whole training set.

Incorrect. The weights are updated after each wrong prediction.

Correct. The gradient descent algorithm computes the loss (and gradient) based on the whole training set.
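The contrast can be sketched in a few lines of code (my own illustration with a made-up toy dataset, not code from the lecture):

    import numpy as np

    # Toy dataset: 4 examples, 2 features, binary labels (made-up values)
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, 0, 0])
    lr = 0.1

    # Perceptron learning rule: the weights change immediately after each wrong prediction
    w, b = np.zeros(2), 0.0
    for x_i, y_i in zip(X, y):
        pred = int(x_i @ w + b > 0.0)
        if pred != y_i:                          # only misclassified examples trigger an update
            w += lr * (y_i - pred) * x_i
            b += lr * (y_i - pred)

    # (Full-batch) gradient descent for logistic regression: one update per pass,
    # with the gradient computed from the whole training set
    w, b = np.zeros(2), 0.0
    a = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted probabilities for all examples
    w -= lr * X.T @ (a - y) / len(y)             # gradient averaged over the full dataset
    b -= lr * np.mean(a - y)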

Quiz: 3.3 Model Training with Stochastic Gradient Descent - PART 4

Stochastic gradient descent is a flavor of gradient descent that introduces a certain level of randomness into the training process. In order to do so, stochastic gradient descent …

Correct. For each weight update, we compute the loss based on a single training example or a minibatch, which introduces a certain level of noise (or randomness) compared to regular gradient descent, which computes the weight update based on the whole training set. In this sense, the gradient used for the weight update in stochastic gradient descent is an approximation of the full gradient from regular gradient descent.

Incorrect. We do not explicitly modify the weight update values.

Incorrect. Training examples are usually selected randomly (e.g., by shuffling the training set or drawing examples in random order), but over the course of training, stochastic gradient descent still works through the full training set rather than only a fixed subset of it.
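A rough minibatch SGD loop might then look like the following sketch (again my own illustration with made-up data, not code from the lecture):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))                     # made-up dataset: 100 examples, 2 features
    y = (X[:, 0] + X[:, 1] > 0).astype(float)         # made-up labels
    w, b, lr, batch_size = np.zeros(2), 0.0, 0.1, 10

    for epoch in range(5):
        indices = rng.permutation(len(X))             # shuffling is the "stochastic" part
        for start in range(0, len(X), batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            a = 1.0 / (1.0 + np.exp(-(X_b @ w + b)))  # predictions for the minibatch only
            w -= lr * X_b.T @ (a - y_b) / len(batch)  # noisy estimate of the full gradient
            b -= lr * np.mean(a - y_b)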
