Hi. I’m stuck at the validation sanity check.
I have a GNN model with 25M total params (99.5 MB total estimated model params size) according to the PyTorch Lightning summary printout. It ran without problems when the model was half this size (with fewer NNs), or even 75% of this size. I noticed that the bigger the model, the longer it takes to pass the validation sanity check. Strangely, it has now been stuck at the validation sanity check (0% on the tqdm progress bar) for more than 24 hours. nvidia-smi shows no GPU usage, and htop shows very little CPU utilization.
I wonder why this happens and how I can track down the problem. BTW, the Trainer uses its default settings (so I assume the sanity check runs for 2 validation steps). Has anyone experienced similar behavior?
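One way to see where a silent hang like this sits, as a minimal sketch: Python's standard `faulthandler` module can dump the stack of every thread after a timeout, which shows the line the process is waiting on. The helper names `watch_for_hang` and `stop_watching` and the 60-second interval are assumptions for illustration, and the `trainer.fit(...)` call in the comment stands in for the real training script.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=60.0, repeat=True, file=None):
    """Dump the stacks of all threads after `timeout_s` seconds.

    With repeat=True the dump is written every `timeout_s` seconds
    until stop_watching() is called.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=target)


def stop_watching():
    """Cancel the pending dump scheduled by watch_for_hang()."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the call that appears to hang:
# watch_for_hang(timeout_s=60.0)
# trainer.fit(model, datamodule)
# stop_watching()
```

If the dumped stack keeps pointing at the same frame (for example inside the data loader or a collate function), that is probably where to look first.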
More info about my system and training setup:
I’ve set num_workers=32 (I have 32 cores across two CPUs) for all the data loaders. I have one GPU with 24 GB of memory, and the batch size for training is 512.
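Since neither the GPU nor the CPU shows much activity, one thing that may be worth ruling out is the data loading itself. The sketch below times how long the first batch takes to arrive from any iterable loader, so the same loader can be compared at different `num_workers` values outside the Trainer. `time_first_batch` is a hypothetical helper, and `val_dataset` in the commented usage is a stand-in for the real validation dataset.

```python
import time


def time_first_batch(loader):
    """Return (seconds, batch) for fetching the first batch from `loader`."""
    start = time.perf_counter()
    batch = next(iter(loader))
    return time.perf_counter() - start, batch


# Hypothetical usage; `val_dataset` is an assumption, not from my script:
# from torch.utils.data import DataLoader
# for workers in (0, 4, 32):
#     loader = DataLoader(val_dataset, batch_size=512, num_workers=workers)
#     seconds, _ = time_first_batch(loader)
#     print(f"num_workers={workers}: first batch in {seconds:.2f}s")
```

If the first batch arrives quickly with `num_workers=0` but never arrives with 32 workers, the hang is more likely in worker startup than in the model.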