Incorrect batch count being inferred, correct batch size in dataloader? What could be going wrong? [PyTorch Lightning]

I have a script, written using PyTorch Lightning, to fine-tune a HuggingFace model. I'm running into a problem: when I call `, train_dataloaders=train_loader, val_dataloaders=val_loader)`, the total number of batches shown is the batch count of the train_loader plus the val_loader, which makes me believe that my validation data is being included in both training and validation. I'm not sure why this is happening. Here's a snippet of my code:

    train_data = TLDataset(train, tokenizer)
    print("Successfully loaded SRC training data: 10000 examples")
    val_data = TLDataset(val, tokenizer)
    print("Successfully loaded SRC validation data: 1200 examples")
    train_loader = DataLoader(train_data, batch_size=8, drop_last=True)
    val_loader = DataLoader(val_data, batch_size=8)  # num_workers=num_cpus//num_gpus

    tb_logger = pl_loggers.TensorBoardLogger(save_dir=f"{args.output_dir}logs/{args.file_name}_logs/")
    strategy = RayStrategy(num_workers=num_gpus, use_gpu=True if num_gpus > 0 else False, find_unused_parameters=False)
    es = EarlyStopping(monitor="val_loss", mode="min", patience=args.src_es_patience)
    checkpoint_callback = ModelCheckpoint(monitor='val_loss', dirpath=args.output_dir, filename=args.file_name, mode="min")
    val_check_interval = args.val_check_interval
    model = T5FineTuner(args)
    trainer = pl.Trainer(max_steps=args.src_num_train_steps, strategy=strategy, callbacks=[es, checkpoint_callback], val_check_interval=val_check_interval, logger=tb_logger, replace_sampler_ddp=False)
    print("Successfully loaded model and trainer...")

    # print(f'TRAINING DATA LENGTH: {len(train_data)}') # 10000 examples
    # print(f"BATCH SIZE: {args.train_bsz}") # 8
    # print(f'NUMBER OF BATCHES: {len(train_data)//args.train_bsz}') # 1250 batches, train_dataloaders=train_loader, val_dataloaders=val_loader)

When training occurs, the progress bar shows training data = 1250 + 150 = 1400 batches and when it goes into validation it shows 150 batches. Is this expected behavior (i.e. the progress bar shows the entire number of batches for training+val and then shifts to validation only when in a validation loop)? Or am I doing something wrong?


Yes, before Lightning 2.0 the progress bar combined both sets in its display, so this was expected behavior. In Lightning 2.0 and later these are separated, and each bar only shows the total for its respective dataloader.

I see you are using the RayStrategy. We don't officially support it and I'm not familiar with it, so I can't comment on that, but maybe what confused you is `replace_sampler_ddp=False`. Your print statements and comments don't look right to me; you should print the following:

print(len(train_data))  # 10000
print(len(val_data))  # 1200

print(len(train_loader))  # 10000 // 8 = 1250
print(len(val_loader))  # 1200 // 8 = 150

These will correspond to what the progress bar shows.
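To make the arithmetic concrete, here is a minimal sketch (plain Python, no torch required) of how the batch counts come out, assuming the dataset sizes from your comments; `num_batches` is a hypothetical helper mirroring what `len()` returns on a `DataLoader`:

```python
import math

def num_batches(n_examples, batch_size, drop_last=False):
    """Batch count as len(DataLoader) reports it."""
    if drop_last:
        # drop_last=True discards the final partial batch (your train_loader)
        return n_examples // batch_size
    # otherwise the last, possibly smaller, batch is kept
    return math.ceil(n_examples / batch_size)

train_batches = num_batches(10000, 8, drop_last=True)  # 1250
val_batches = num_batches(1200, 8)                     # 150
print(train_batches, val_batches, train_batches + val_batches)
```

The 1400 you saw is just `1250 + 150`: the pre-2.0 progress bar total, not evidence that validation examples leaked into the training set.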