Is it documented anywhere how predictions are returned when 2 or more devices are used? I am asking because a user would expect to get the same results when testing/predicting with 1 device as with 2 or more.
To check what is going on, I ran the same model with 1 and 2 devices as follows:
import torch
import lightning as L

# litmodel and dm (a LightningDataModule) are defined elsewhere.
L.seed_everything(1, workers=True)

# Predict on a single device.
trainer = L.Trainer(fast_dev_run=20, devices=1)
out1 = trainer.predict(model=litmodel, dataloaders=dm.test_dataloader())
preds1 = torch.cat(out1)

print('\n Using 2 devices \n')
# Predict with the same model on two devices.
trainer = L.Trainer(fast_dev_run=20, devices=2)
out2 = trainer.predict(model=litmodel, dataloaders=dm.test_dataloader())
preds2 = torch.cat(out2)

print(torch.all(preds1 == preds2))  # This should be True.
However, I get the following output:
Seed set to 1
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 20 batch(es). Logging and checkpointing is suppressed.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00, 1.96it/s]
Using 2 devices
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 20 batch(es). Logging and checkpointing is suppressed.
[rank: 0] Seed set to 1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Seed set to 1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00, 1.94it/s]
Using 2 devices
[rank: 1] Seed set to 1
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00, 1.85it/s]
tensor(False)
tensor(False)
It is unclear from this output what is going on. I would expect the message “Using 2 devices” to be printed only once; why is it printed twice? The same goes for the sanity check print(torch.all(preds1 == preds2)), which itself prints tensor(False) twice.
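If I had to guess, with devices=2 Lightning launches a second process that re-runs the script, so every bare print() executes once per rank, which would explain the duplicated lines. A minimal sketch of how I would restrict printing to a single process, assuming I am reading the Trainer.is_global_zero property correctly:

if trainer.is_global_zero:  # True only in the rank-0 process
    print(torch.all(preds1 == preds2))

Is that the intended way to handle per-rank side effects, or is there a cleaner hook for this?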
Can someone explain what happens under the hood, and whether it is safe/straightforward to use 2 or more devices when doing inference?
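In case it clarifies what I am after: my naive attempt to make the two runs comparable would be to gather every rank's outputs inside predict_step via the LightningModule.all_gather helper, roughly as in the sketch below. LitModel is a hypothetical module standing in for my own, and the flattening assumes batches are simply stacked per rank, which may not match the single-device ordering at all.

class LitModel(L.LightningModule):
    def predict_step(self, batch, batch_idx):
        preds = self(batch)                # this rank's shard of predictions
        gathered = self.all_gather(preds)  # shape: (world_size, batch, ...)
        # Flatten the rank dimension; whether this reproduces the
        # single-device ordering is exactly what I am unsure about.
        return gathered.flatten(0, 1)

Is something like this necessary, or does trainer.predict already take care of returning the full, correctly ordered set of predictions?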