How does trainer.test/predict work when 2 devices are used?

Is it documented anywhere how predictions are returned when 2 or more devices are used? I am asking because a user would expect to get the same results when testing/predicting with 2 or more devices as with one.

To check what is going on, I ran the same model with 1 and 2 devices as follows:

import torch
import lightning as L
from lightning.pytorch import seed_everything

# litmodel and dm come from my actual code
# (see the stand-in sketch below).
seed_everything(1, workers=True)

# Run 1: single device.
trainer = L.Trainer(fast_dev_run=20, devices=1)
out1 = trainer.predict(model=litmodel, dataloaders=dm.test_dataloader())
preds1 = torch.cat(out1)

print('\n Using 2 devices \n')

# Run 2: two devices.
trainer = L.Trainer(fast_dev_run=20, devices=2)
out2 = trainer.predict(model=litmodel, dataloaders=dm.test_dataloader())
preds2 = torch.cat(out2)

print(torch.all(preds1 == preds2))  # This should be True.
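
For reference, litmodel and dm in the snippet come from my actual code; a minimal hypothetical stand-in (the names LitModel/DataModule are made up for illustration) would be:

import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def predict_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x)

class DataModule(L.LightningDataModule):
    def __init__(self):
        super().__init__()
        # Fixed inputs so both runs see identical data.
        self.x = torch.arange(640.0).reshape(80, 8)

    def test_dataloader(self):
        return DataLoader(TensorDataset(self.x), batch_size=4)

litmodel = LitModel()
dm = DataModule()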

However, I get the following output:

Seed set to 1
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 20 batch(es). Logging and checkpointing is suppressed.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.96it/s]

 Using 2 devices 

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 20 batch(es). Logging and checkpointing is suppressed.
[rank: 0] Seed set to 1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Seed set to 1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.94it/s]

 Using 2 devices 

[rank: 1] Seed set to 1
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.85it/s]
tensor(False)
tensor(False)

It is unclear from this output what is going on. I would expect the message "Using 2 devices" to be printed only once, so why is it printed twice? The same goes for the sanity check print(torch.all(preds1 == preds2)), which prints tensor(False) twice.
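
My guess is that with devices=2 Lightning launches a second process that re-executes the whole script from the top, so every top-level print runs once per rank. If that is right, I would have to guard the prints myself; a sketch of what I mean, assuming trainer.is_global_zero is the intended check:

trainer = L.Trainer(fast_dev_run=20, devices=2)
out2 = trainer.predict(model=litmodel, dataloaders=dm.test_dataloader())
preds2 = torch.cat(out2)

# Every rank reaches this line, so only let rank 0 report:
if trainer.is_global_zero:
    print(torch.all(preds1 == preds2))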

Can someone explain what happens under the hood and whether it is safe/straightforward to use 2 or more devices when doing inference?
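
For context, my working theory is that a DistributedSampler gives each rank only its own interleaved shard of the dataloader, so each rank's preds2 covers only half of the data and the element-wise comparison with preds1 cannot be True. If so, something like the following gather inside the model would be needed before comparing (a sketch extending the stand-in LitModel above, assuming LightningModule.all_gather behaves as documented; the interleaved ordering would still have to be undone by hand):

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def predict_step(self, batch, batch_idx):
        (x,) = batch
        preds = self.layer(x)
        # Collect this batch's predictions from all ranks;
        # the result has shape (world_size, batch_size, 1).
        return self.all_gather(preds)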