Validation_step and validation_epoch_end won't get called in trainer.fit() routine

frengelk · February 22, 2022, 10:58am

Hello,

My problem is the following:

If I use the normal data loader to feed the training data into the trainer.fit() routine, everything works fine (the validation step runs after each epoch).

However, when I create a custom batch sampler (pulling an equal number of events from each class), only the training_step gets executed inside the trainer loop (its behaviour seems as expected).

The validation step is then only executed in the initial validation check.

You can see my Batch sampler here (Susy1LeptonAnalysis/PytorchHelp.py at pytorch_tryout · frengelk/Susy1LeptonAnalysis · GitHub)

In the `__iter__` method, I create an array of indices for each of the steps_per_epoch steps (an int defined by me) and return it with `yield array` (last line of the previous link; I can only put 2 links in my post).
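For readers who cannot open the linked file, a sampler of the kind described above might look like the following minimal sketch. It is not the code from the repository: the class name `BalancedBatchSampler`, its arguments, and the choice to sample with replacement are all assumptions made for illustration. The point it shows is that `__iter__` stops after a fixed number of batches and that `__len__` reports the same number.

```python
import random
from collections import defaultdict


class BalancedBatchSampler:
    """Hypothetical sampler: each batch holds the same number of indices per class."""

    def __init__(self, labels, batch_size, steps_per_epoch, seed=None):
        # Group dataset indices by their class label.
        self.indices_by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.indices_by_class[label].append(idx)

        n_classes = len(self.indices_by_class)
        if batch_size % n_classes != 0:
            raise ValueError("batch_size must be divisible by the number of classes")

        self.per_class = batch_size // n_classes
        self.steps_per_epoch = steps_per_epoch
        self.rng = random.Random(seed)

    def __iter__(self):
        # Yield exactly steps_per_epoch batches, then stop, so the epoch can end.
        for _ in range(self.steps_per_epoch):
            batch = []
            for class_indices in self.indices_by_class.values():
                # Assumption: sample with replacement so small classes never run out.
                batch.extend(self.rng.choices(class_indices, k=self.per_class))
            self.rng.shuffle(batch)
            yield batch

    def __len__(self):
        # Number of batches per epoch, matching what __iter__ yields.
        return self.steps_per_epoch
```

An object like this would be handed to PyTorch as `DataLoader(dataset, batch_sampler=sampler)`. When `batch_sampler` is set, `batch_size`, `shuffle`, `sampler`, and `drop_last` have to be left at their defaults.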

The model is defined here:
(Same file as Batch Sampler, starting in line 49.)

The trainer gets called here:

github.com

frengelk/Susy1LeptonAnalysis/blob/pytorch_tryout/analysis/tasks/pytorch_test.py#L211


      
              n_nodes=self.n_nodes,
          )

          # define data
          data_collection = util.DataModuleClass(
              X_train,
              y_train,
              X_val,
              y_val,
              # X_test,
              # y_test,
              self.batch_size,
              n_processes,
              self.steps_per_epoch,
          )

          # needed for test evaluation
          criterion = nn.CrossEntropyLoss()
          # optimizer = optim.Adam(model.parameters(), lr=self.learning_rate)

          print(model)

I know that the code is nested and embedded in luigi, so it might be difficult to read at some points.

If you have any questions, or need more information, I am happy to make my problem easier to understand.

Best regards and thanks in advance,
Frederic

goku · April 4, 2022, 2:33pm

From a quick look, I don't think you are using the BatchSampler for the validation dataloader.

github.com

frengelk/Susy1LeptonAnalysis/blob/cf3d10a942595c5dc4588bf97debddc3341ddded/analysis/utils/PytorchHelp.py#L103-L108


      
          def val_dataloader(self):
              return data.DataLoader(
                  dataset=self.val_dataset,
                  batch_size=10 * self.batch_size,  # , shuffle=True  # len(val_dataset
                  num_workers=8,
              )  # =1

We have moved the discussions to GitHub Discussions. You might want to check that out instead to get a quick response. The forums will be marked read-only after some time.

Thank you

chengjie11 · April 22, 2022, 8:33pm

Hi Frengelk, have you already solved your problem? I may be running into the same issue. Waiting for your reply. Thanks in advance!

Cynthia_Maldonado · October 26, 2022, 5:30pm

Hi - have you already solved this problem? I am having the same problem!

aniketmaurya · November 2, 2022, 11:51am

Hi @Cynthia_Maldonado, could you check if you are using BatchSampler correctly?

>>> sampler = DistributedSampler(dataset) if is_distributed else None
>>> loader = DataLoader(dataset, shuffle=(sampler is None),
...                     sampler=sampler)
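One way to act on the suggestion to check the sampler is a small standalone test that iterates it outside the trainer. The helper below is a hypothetical sketch, not part of PyTorch or Lightning: the name `check_batch_sampler` and its messages are invented here. It only reports properties of the sampler itself (whether iteration ends, whether `__len__` agrees with the number of batches, whether indices are in range). Nothing in this thread establishes that any of them is the cause of the missing validation.

```python
from itertools import islice


def check_batch_sampler(batch_sampler, dataset_size, max_batches=10_000):
    """Hypothetical sanity check; returns a list of findings (empty if none)."""
    # Pull at most max_batches + 1 batches so an endless iterator cannot hang the check.
    batches = list(islice(iter(batch_sampler), max_batches + 1))
    findings = []

    finite = len(batches) <= max_batches
    if not finite:
        findings.append("iterator did not stop within max_batches")

    if hasattr(batch_sampler, "__len__"):
        # Only compare lengths when the full epoch was actually consumed.
        if finite and len(batch_sampler) != len(batches):
            findings.append("__len__ disagrees with number of yielded batches")
    else:
        findings.append("no __len__ defined")

    # Every index must address a valid dataset item.
    for batch in batches:
        if any(i < 0 or i >= dataset_size for i in batch):
            findings.append("index out of range")
            break

    return findings
```

For example, `check_batch_sampler(my_sampler, len(train_dataset))` (both names are placeholders) returns an empty list when none of these conditions is found.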
