Deep Learning FundamentalsCertification Exam

Machine learning is a subfield of ... (Check all terms that apply.)

Deep learning

Generating new data can be considered as a subcategory of ...

Supervised learning

Unsupervised learning

Depending on the method, either supervised or unsupervised learning

Typically, we shuffle the dataset before we divide it into training and test set to make sure that the

Class labels are approximately evenly distributed

Training and test sets are equally difficult for the model

Training and test sets contain the same features

Using gradient descent to update the weight w1, which of the following values do we need to compute and/or are part of the computation? (Check all that apply)

The partial derivative of the loss with respect to the weight w1

The partial derivative of the loss with respect to the bias

The partial derivative of the loss with respect to the feature value x1

The derivative of the activation with respect to the net input

Comparing the perceptron learning algorithm with gradient descent, which of the following answers is/are correct?

The perceptron algorithm updates the weights after each training example

Gradient descent updates the weights after each training example

The perceptron algorithm updates the weights after each iteration over the complete training set

Gradient descent updates the weights after each iteration over the complete training set

Suppose you initialize a neural network layer using torch.nn.Linear(in_features=5, out_features=1). How many trainable parameters does this layer have?

1 parameter

5 parameters

6 parameters

10 parameters

Suppose we implement a logistic regression model as a binary classifier for a dataset with 4 features using a linear layer self.linear = torch.nn.Linear(a, b). What are the numeric values for a and b in this case?

a=1, b=2

a=2, b=1

a=4, b=1

For each training example, the softmax function returns one probability membership score for each class. Which of the following statements about the sum of these scores (for one training example) is correct?

The sum of the scores depends on the training example -- it's differs based on the input

The scores always sum up to 1

The scores always sum up to 1 times the number of classes

Suppose you have 2 training examples in a 3-class classification setting. What is the cross-entropy loss for a perfectly random prediction?

0.33

1.1

0.5

0.0

infinity

negative infinity

10.

Which of the following is crucial for producing nonlinear decision boundaries?

Using more than 1 hidden layer

Using nonlinear activation functions

Random weight initialization

11.

Suppose we implemented the following multilayer perceptron architecture for a 2-dimensional dataset with 3 classes:

How many parameters does this neural network approximately have?

1500

150

1000

100

12.

Before we are moving on to using the LightningModule and Trainer, suppose you want to implement a PyTorch training loop for comparison. Can you put the following code into the right order?

(1) loss.backward()
(2) logits = model(features)
(3) optimizer.zero_grad()
(4) optimizer.step()
(5) loss = F.cross_entropy(logits, labels)

(1), (2), (3), (4), (5)

(3), (2), (4), (5), (1)

(2), (5), (3), (1), (4)

(3), (5), (2), (1), (4)

13.

Suppose we have a binary classification dataset with 731 data points from class 0 and 269 data points from class 1. What is the expected classification accuracy if our classifier makes totally random predictions?

73.1%

26.9%

50.0%

Cannot be determined

14.

A drop probability of 0.5 in a dropout layer means that we are dropping 50% of the

weights in a hidden layer

activations in a hidden layer

layers in a network

15.

Suppose we have a 3x3 filter that we slide over an 12x12 dimensional image (with one input and one output channel), how many weight parameters do we need for that kernel (/filter)?

144

16.

What is the primary purpose of skip connections in a deep neural network?

Speed up training (given a fixed number of epochs)

Improve model interpretability

Facilitate deeper network architectures

Reduce the number of trainable parameters

17.

Which of the following is NOT a common data augmentation technique for image data?

Rotation

Flipping

Padding

Stemming

18.

Transfer learning typically involves:

Using a pretrained model as a feature extractor

Training a new model from scratch with additional layers

Creating an ensemble of multiple models

Removing layers from a pretrained model

19.

The attention mechanism for RNNs was introduced to address which limitation of the original sequence-to-sequence models?

Difficulty in handling long input sequences due to fixed-size context vectors

Overfitting on small datasets

High computational complexity during training

Inability to capture local word order information

20.

If we have an input text with 3 words, how many outputs vectors does the attention mechanism yield?

It depends

21.

If we have a multi-head attention layer with 8 heads, how many weight matrices does this include?

22.

How is BERT pretrained?

Predicting the next word in the sentence

Predicting masked words

Predicting randomly replaced words

Predicting sentence order

23.

Which of the following are valid finetuning approaches?

Updating all layers

Updating the last layer

Updating the first two layers

Updating the last two layers

24.

On a GPU with tensor cores that support the bfloat16 type, neural networks with bfloat16 weights train ...

much faster than with float16 weights

about the same as with float16 weights

much slower than with float16 weights

25.

In which type of parallelism is each layer of the model computed across all GPUs, but each GPU only computes a subset of the neurons in each layer?

Model parallelism

Tensor parallelism

Time is Up!

Time's up

Deep Learning FundamentalsCertification Exam

Please log in or sign up to take this exam.