Incorrect number of batches inferred by trainer.fit() even though the dataloader batch size is correct? What could be going wrong? [PyTorch Lightning]

I have a script to fine-tune a HuggingFace model that I wrote using PyTorch Lightning. I'm running into a problem where, when I call trainer.fit(model, train_loader, val_loader), the number of batches shown appears to be the batches of the train_loader plus the val_loader, which makes me believe that my validation data is being included in both training and validation. I'm not sure why this is happening. Here's a snippet of my code:

    train_data = TLDataset(train, tokenizer)
    logger.info(f"Sucessfully loaded SRC training data: 10000 examples")
    val_data = TLDataset(val, tokenizer)
    logger.info(f"Sucessfully loaded SRC validation data: 1200 examples")
    
    train_loader = DataLoader(train_data, batch_size=8, drop_last=True)
    val_loader = DataLoader(val_data, batch_size=8) #, num_workers=num_cpus//num_gpus

    
    
    tb_logger = pl_loggers.TensorBoardLogger(save_dir=f"{args.output_dir}logs/{args.file_name}_logs/")
    strategy = RayStrategy(num_workers=num_gpus, use_gpu=True if num_gpus > 0 else False, find_unused_parameters=False)
    es = EarlyStopping(monitor="val_loss", mode="min", patience=args.src_es_patience)
    checkpoint_callback = ModelCheckpoint(monitor='val_loss', dirpath = args.output_dir, filename = args.file_name, mode="min")
    val_check_interval = args.val_check_interval
    
    
    model = T5FineTuner(args)
    trainer = pl.Trainer(max_steps = args.src_num_train_steps, strategy=strategy, callbacks = [es, checkpoint_callback], val_check_interval=val_check_interval, logger=tb_logger, replace_sampler_ddp=False) 
    logger.info("Succesfully loaded model and trainer...")

    # print(f'TRAINING DATA LENGTH: {len(train_data)}') # 10000 examples
    # print(f"BATCH SIZE: {args.train_bsz}") # 8
    # print(f'NUMBER OF BATCHES: {len(train_data)//args.train_bsz}') # 1250 batches


    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)

When training runs, the progress bar shows 1250 + 150 = 1400 batches for training, and when it enters validation it shows 150 batches. Is this expected behavior (i.e. the progress bar shows the combined number of training + validation batches and only switches to the validation count inside a validation loop)? Or am I doing something wrong?

Hey

Yes, before Lightning 2.0 the progress bar combined both dataloaders in its total, so this is expected behavior. In Lightning 2.0 and later they are separated, and each bar only shows the total for its respective dataloader.
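For example, with the counts from your snippet (10000 training and 1200 validation examples, batch size 8), the totals work out like this; just a quick arithmetic sketch:

    train_batches = 10000 // 8   # 1250 (drop_last=True, so a partial last batch would be dropped)
    val_batches = 1200 // 8      # 150 (drop_last defaults to False; 1200 divides evenly anyway)

    # Main progress bar total before Lightning 2.0: training + validation batches
    print(train_batches + val_batches)   # 1400

    # Lightning 2.0+: the training bar shows 1250 and the validation bar shows 150 separately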

I see you are using RayStrategy. We don't officially support it and I'm not very familiar with it, so I can't comment on that side, but perhaps what confused you is replace_sampler_ddp=False. Your print statements and the comments next to them don't look right to me. You should print the following:

    print(len(train_data))  # 10000
    print(len(val_data))  # 1200

    print(len(train_loader))  # 10000 // 8 = 1250
    print(len(val_loader))  # 1200 // 8 = 150

These numbers correspond to what the progress bar shows.
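If you want to reassure yourself that no validation examples are being trained on, one option is to check the batch counts the Trainer registers once fitting starts. Below is a rough sketch (I'm assuming trainer.num_training_batches and trainer.num_val_batches are available on your Lightning version; attribute and hook names can shift between releases):

    import pytorch_lightning as pl

    class BatchCountCheck(pl.Callback):
        """Print the batch counts the Trainer has registered for its loops."""

        def on_train_start(self, trainer, pl_module):
            # Training batches per epoch -- should be 1250 here, not 1400
            print(f"train batches: {trainer.num_training_batches}")
            # One entry per validation dataloader -- should be [150] here
            print(f"val batches:   {trainer.num_val_batches}")

    # Add it to your existing callbacks:
    # trainer = pl.Trainer(..., callbacks=[es, checkpoint_callback, BatchCountCheck()])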