Why does training fail with "does not require grad and does not have a grad_fn"?

Falco · May 15, 2023, 12:44pm

Hi,
I am running a Temporal Fusion Transformer model using a custom data module that provides data as torch.Tensor objects. The loss is a QuantileLoss modelled on the one in pytorch_forecasting, implemented as a heavily customised metric.

Training begins and runs until the last step of the first epoch, then throws the following RuntimeError:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Epoch 0: 100%|█████████▉| 540/541 [00:31<00:00, 17.00it/s, v_num=561]

What I could identify already:

- it is not the optimizer (switching or disabling it makes no difference)
- manual optimization mode is possible (but not desirable)
- only the loss of the LAST step is affected: print(f"{self.loss.requires_grad}") gives False there, whereas it was always True in previous steps
- the parameters of every module in the model are checked using:

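for module in self.modules():
    # Flag any module that owns at least one parameter with requires_grad=False.
    params = list(module.parameters())
    if any(not p.requires_grad for p in params):
        print(f"{module} has parameters with requires_grad=False")
    else:
        print(f"{module} passed grad check")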

The training loop is very basic:

    def training_step(self, batch, batch_idx):
        """Train step on batch."""
        y_hat = self(batch)
        loss = self.loss(
            y_prediction=y_hat["predicted_quantiles"],
            target=batch["future_ts_numeric"],
            desired_quantiles=self.output_quantiles,
        )
        return loss

All entries of y_hat have requires_grad set correctly:

In [3]: for key in y_hat.keys():
   ...:     print(f"{key}: {y_hat[key].requires_grad}")
   ...:
predicted_quantiles: True
static_weights: True
historical_selection_weights: True
future_selection_weights: True
attention_scores: True

So how can my loss suddenly be without gradient information? I am at a loss here… Does anyone have an idea which part of the pipeline could interfere with my loss in that way? Where should I look, and how do I fix it?

Best, Falco
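
One way to narrow down where the graph is lost is to fail early with a message that names the step. The following is only a minimal sketch: assert_loss_has_graph is a hypothetical helper (not part of Lightning or torchmetrics), and the two tensors below are stand-ins for a real loss.

```python
import torch


def assert_loss_has_graph(loss: torch.Tensor, batch_idx: int) -> torch.Tensor:
    """Hypothetical helper: raise a descriptive error if `loss` cannot be backpropagated."""
    if not loss.requires_grad:
        raise RuntimeError(
            f"loss at batch_idx={batch_idx} has requires_grad=False "
            f"(grad_fn={loss.grad_fn}); it was probably re-created or detached"
        )
    return loss


# Stand-in for model parameters and a loss that is still attached to them.
w = torch.ones(3, requires_grad=True)
good_loss = (w * 2.0).sum()

# Stand-in for a loss that was rebuilt from a plain Python number.
bad_loss = torch.tensor(1e9)

checked = assert_loss_has_graph(good_loss, batch_idx=0)  # passes through unchanged

message = ""
try:
    assert_loss_has_graph(bad_loss, batch_idx=540)
except RuntimeError as err:
    message = str(err)
print(message)
```

Calling such a helper on the value returned from training_step would point at the first batch whose loss is detached, instead of the generic autograd error.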

Falco · May 15, 2023, 1:19pm

Sometimes it just needs a push… I already solved it myself.
An update to the loss function had replaced the loss with a freshly created tensor whenever the loss was not finite. A new tensor has no gradient history, so backward fails as expected. The code now reads as follows, with the offending line commented out:

            if not torch.isfinite(losses):
                losses = losses.fill_(1e9)
                # losses = torch.tensor(1e9, device=losses.device)
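
The difference between the two lines can be reproduced in isolation. This is a small sketch with stand-in tensors (w and loss are illustrative, not the model from the thread): a tensor created with torch.tensor(...) is not connected to any parameters, while an in-place fill_ on the existing loss keeps it in the autograd graph.

```python
import torch

# Stand-in for model parameters and a loss computed from them.
w = torch.ones(3, requires_grad=True)
loss = (w * 2.0).sum()

# Variant 1: replace the loss by a brand-new tensor (the commented-out line).
replaced = torch.tensor(1e9, device=loss.device)

# Variant 2: overwrite the value in place (the line that is now used).
filled = loss.fill_(1e9)

print(replaced.requires_grad, replaced.grad_fn)        # no gradient history
print(filled.requires_grad, filled.grad_fn is not None)  # still part of the graph

replaced_error = ""
try:
    replaced.backward()
except RuntimeError as err:
    replaced_error = str(err)
print(replaced_error)

# backward() on the in-place variant runs without raising.
filled.backward()
```

Note that the in-place variant only keeps backward() from raising. The filled value is a constant, so the gradient that reaches the parameters through it should be zero and that step should not move them.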

PhucLee2605 · June 28, 2023, 12:49pm

I’m glad that you were able to solve this problem. I’m wondering where exactly you put that code snippet. I tried to figure it out but I am still stuck.

Falco · August 8, 2023, 9:55am

Hello @PhucLee2605, sorry for the late answer. I put this directly in the update method of my custom torchmetrics.Metric. See the broader code snippet below.

    def update(self, y_prediction: torch.Tensor, target: torch.Tensor):
        """Update the metric state; invoked by the trainer."""
        self.losses = self.loss(y_prediction, target)
        if not torch.isfinite(self.losses).any():
            # Overwrite in place so the tensor keeps its gradient history.
            self.losses = self.losses.fill_(1e9)
            warnings.warn("Loss is not finite. Resetting it to 1e9", stacklevel=2)
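
If self.losses can hold more than one element (an assumption; the thread does not show its shape), it may be worth checking whether .any() or .all() is the intended condition. For a scalar loss the two agree; for several elements they do not, as this sketch with made-up values shows.

```python
import torch

# Made-up per-element losses: one entry is not finite.
losses = torch.tensor([0.5, float("inf"), 1.2])
finite = torch.isfinite(losses)

# True only if NO element is finite (the condition used in the snippet).
none_finite = not bool(finite.any())

# True as soon as at least one element is not finite.
some_not_finite = not bool(finite.all())

print(none_finite)      # False
print(some_not_finite)  # True
```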
