I am training the model but got the error below. How can I solve it? Please help me figure this out.

/home/shree/anaconda3/envs/ellipse_rcnn/bin/python /home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/train.py
2022-11-18 11:00:54.774518: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-18 11:00:54.979053: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-18 11:00:55.632529: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library ‘libnvinfer.so.7’; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-11-18 11:00:55.632616: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library ‘libnvinfer_plugin.so.7’; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-11-18 11:00:55.632625: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torchvision/models/_utils.py:135: UserWarning: Using ‘backbone_name’ as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
warnings.warn(
/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter ‘pretrained’ is deprecated since 0.13 and may be removed in the future, please use ‘weights’ instead.
warnings.warn(
/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for ‘weights’ are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet50_Weights.IMAGENET1K_V1. You can also use weights=ResNet50_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

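  | Name      | Type                     | Params
--------------------------------------------------
0 | transform | GeneralizedRCNNTransform | 0
1 | backbone  | BackboneWithFPN          | 26.8 M
2 | rpn       | RegionProposalNetwork    | 593 K
3 | roi_heads | EllipseRoIHeads          | 27.8 M
--------------------------------------------------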
55.2 M Trainable params
0 Non-trainable params
55.2 M Total params
220.763 Total estimated model params size (MB)
/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/79 [00:00<?, ?it/s]
/home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/data_my.py:37: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /opt/conda/conda-bld/pytorch_1666643003845/work/torch/csrc/utils/tensor_new.cpp:230.)
A_craters = torch.Tensor(dataset[self.group]["para/parameter"])
Epoch 15: 82%|████████▏ | 65/79 [02:03<00:26, 1.90s/it, loss=0.437, v_num=0, loss_classifier=0.00949, loss_box_reg=0.0436, loss_ellipse=0.358, loss_objectness=0.000142, loss_rpn_box_reg=0.00138, total_loss=0.413]/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/autograd/init.py:197: UserWarning: Error detected in BmmBackward0. Traceback of forward call that caused the error:
File “/home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/train.py”, line 25, in
trainer.fit(model, train_dataloaders=train_loader)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 579, in fit
call._call_and_handle_interrupt(
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py”, line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1058, in _run
results = self._run_stage()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1137, in _run_stage
self._run_train()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1160, in _run_train
self.fit_loop.run()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py”, line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py”, line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py”, line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get(“batch_idx”, 0), closure)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/core/module.py”, line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py”, line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py”, line 234, in optimizer_step
return self.precision_plugin.optimizer_step(
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py”, line 121, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/optim/optimizer.py”, line 140, in wrapper
out = func(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/optim/optimizer.py”, line 23, in _use_grad
ret = func(self, *args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/optim/sgd.py”, line 130, in step
loss = closure()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py”, line 107, in _wrap_closure
closure_result = closure()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 147, in call
self._result = self.closure(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 133, in closure
step_output = self._step_fn()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 406, in _training_step
training_step_output = self.trainer._call_strategy_hook(“training_step”, *kwargs.values())
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1440, in _call_strategy_hook
output = fn(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py”, line 378, in training_step
return self.model.training_step(*args, **kwargs)
File “/home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/ellipse_rcnn/core/model.py”, line 172, in training_step
loss_dict = self(images, targets)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1190, in _call_impl
return forward_call(*input, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torchvision/models/detection/generalized_rcnn.py”, line 105, in forward
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1190, in _call_impl
return forward_call(*input, **kwargs)
File “/home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/ellipse_rcnn/core/roi_heads.py”, line 224, in forward
rcnn_loss_ellipse = self.ellipse_loss_fn(
File “/home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/ellipse_rcnn/core/roi_heads.py”, line 94, in ellipse_loss_GA
return gaussian_angle_distance(A_pred, A_target).mean()
File “/home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/ellipse_rcnn/core/metrics.py”, line 222, in gaussian_angle_distance
-0.5 * (m1 - m2).transpose(-1, -2) @ cov1 @ (cov1 + cov2).inverse() @ cov2 @ (m1 - m2)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/fx/traceback.py”, line 57, in format_stack
return traceback.format_stack()
(Triggered internally at /opt/conda/conda-bld/pytorch_1666643003845/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File “/home/shree/aniket_aug2022/MODELS/EllipseR_CNN/3.ellipse-rcnn-main/train.py”, line 25, in
trainer.fit(model, train_dataloaders=train_loader)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 579, in fit
call._call_and_handle_interrupt(
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py”, line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1058, in _run
results = self._run_stage()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1137, in _run_stage
self._run_train()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1160, in _run_train
self.fit_loop.run()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py”, line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py”, line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py”, line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get(“batch_idx”, 0), closure)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/core/module.py”, line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py”, line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py”, line 234, in optimizer_step
return self.precision_plugin.optimizer_step(
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py”, line 121, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/optim/optimizer.py”, line 140, in wrapper
out = func(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/optim/optimizer.py”, line 23, in _use_grad
ret = func(self, *args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/optim/sgd.py”, line 130, in step
loss = closure()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py”, line 107, in _wrap_closure
closure_result = closure()
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 147, in call
self._result = self.closure(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 142, in closure
self._backward_fn(step_output.closure_loss)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py”, line 303, in backward_fn
self.trainer._call_strategy_hook(“backward”, loss, optimizer, opt_idx)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py”, line 1440, in _call_strategy_hook
output = fn(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py”, line 207, in backward
self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, optimizer_idx, *args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py”, line 69, in backward
model.backward(tensor, optimizer, optimizer_idx, *args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/pytorch_lightning/core/module.py”, line 1406, in backward
loss.backward(*args, **kwargs)
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/_tensor.py”, line 487, in backward
torch.autograd.backward(
File “/home/shree/anaconda3/envs/ellipse_rcnn/lib/python3.10/site-packages/torch/autograd/init.py”, line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function ‘BmmBackward0’ returned nan values in its 0th output.
Epoch 15: 82%|████████▏ | 65/79 [02:05<00:27, 1.93s/it, loss=0.437, v_num=0, loss_classifier=0.00949, loss_box_reg=0.0436, loss_ellipse=0.358, loss_objectness=0.000142, loss_rpn_box_reg=0.00138, total_loss=0.413]

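It could be that some operations in your model's forward pass are numerically unstable and cause problems when the gradient is computed in the backward pass. The forward traceback in your log ends in gaussian_angle_distance (metrics.py, line 222), in the expression that contains (cov1 + cov2).inverse(), so that is a reasonable place to start looking.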
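
To illustrate the kind of instability meant here: the log shows a matrix inverse, (cov1 + cov2).inverse(), in the forward pass. If that sum ever becomes singular or nearly singular, the inverse blows up and NaNs can appear in the backward pass. One common workaround is to add a small value to the diagonal before inverting. The sketch below shows the idea in plain Python for a single 2x2 matrix. The names det2, regularized_inverse2, and eps are made up for this example, and this is not necessarily how the repository handles it or how the issue was eventually fixed.

```python
from typing import List

Matrix2 = List[List[float]]


def det2(m: Matrix2) -> float:
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def regularized_inverse2(m: Matrix2, eps: float = 1e-6) -> Matrix2:
    """Invert (m + eps * I) for a 2x2 positive semi-definite matrix m.

    Adding eps to the diagonal keeps the determinant strictly positive
    for such matrices, so the division below cannot hit zero.
    """
    a, b = m[0][0] + eps, m[0][1]
    c, d = m[1][0], m[1][1] + eps
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical degenerate case: the summed covariance has rank 1,
# so its determinant is exactly zero and a plain inverse is undefined.
cov_sum = [[1.0, 1.0], [1.0, 1.0]]
print("determinant:", det2(cov_sum))
print("regularized inverse:", regularized_inverse2(cov_sum))
```

In PyTorch the analogous step would be adding a small multiple of the identity matrix to the summed covariance before calling the inverse. The right size for eps depends on the scale of the covariances, so treat the value above as a placeholder.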

I suggest you turn on Trainer(detect_anomaly=True) to debug this and find out where it happens. It is also possible that this is caused by an outlier or corrupted datapoint in your dataset. I suggest you log the sample index and look at the datapoints for the iteration in which this error occurs.
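
A minimal sketch of the "log the sample index" idea, in plain Python so it runs without the training environment. It assumes each sample's values have already been flattened to a list of floats (for a tensor, something like tensor.flatten().tolist() would give that). The function name find_nonfinite_samples and the example batch are hypothetical.

```python
import math
from typing import Iterable, List, Sequence


def find_nonfinite_samples(
    sample_indices: Sequence[int],
    sample_values: Sequence[Iterable[float]],
) -> List[int]:
    """Return the dataset indices whose values contain NaN or +/-inf."""
    bad = []
    for idx, values in zip(sample_indices, sample_values):
        if any(not math.isfinite(v) for v in values):
            bad.append(idx)
    return bad


# Hypothetical batch: three samples identified by their dataset index.
batch_indices = [120, 121, 122]
batch_values = [
    [0.5, 1.0, 2.0],
    [0.1, float("nan"), 3.0],  # corrupted sample
    [4.0, 5.0, 6.0],
]

bad_indices = find_nonfinite_samples(batch_indices, batch_values)
print("non-finite samples:", bad_indices)
```

Running a check like this over the ellipse parameters and boxes returned by the dataset, before training or inside the training step for the failing iteration, narrows down whether the NaN comes from the data or from the model. A sample can also be finite but degenerate (for example an ellipse with a zero-length axis), so it is worth looking at the flagged iteration's datapoints by eye as well.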

Thanks for the reply. I was able to solve this issue somehow.
