What is the batch size for distributed training with FSDP?

Let’s say I set up Lightning Fabric like this:

fabric = Fabric(accelerator="gpu", devices=1, num_nodes=1, strategy="fsdp")
fabric.launch()

When I specify the batch size for my dataloader, is that the batch size per GPU or the total batch size across all GPUs? I am trying to do contrastive learning with a large batch size. If it is per GPU, how do I hold off on computing the loss until I can assemble the embeddings from all GPUs into one large final batch? A sketch of what I have in mind is below.
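
Concretely, is something like the following the right idea? This is only a minimal sketch of what I am imagining: it assumes fabric.all_gather(..., sync_grads=True) keeps the gathered embeddings differentiable, the tiny encoder and the loss at the end are stand-ins for my real model and contrastive loss, and it pretends there are 2 GPUs on one node.

import torch
import torch.nn as nn
from lightning.fabric import Fabric

fabric = Fabric(accelerator="gpu", devices=2, num_nodes=1, strategy="fsdp")
fabric.launch()

# Toy encoder just to make the sketch self-contained (stand-in for my real model)
encoder = nn.Linear(128, 64)
encoder = fabric.setup_module(encoder)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
optimizer = fabric.setup_optimizers(optimizer)

x = torch.randn(32, 128, device=fabric.device)   # per-GPU batch of 32
local_emb = encoder(x)                           # (32, 64) embeddings on this rank

# Gather embeddings from every rank; sync_grads=True so gradients flow back
global_emb = fabric.all_gather(local_emb, sync_grads=True)   # (world_size, 32, 64)
global_emb = global_emb.reshape(-1, local_emb.shape[-1])     # (world_size * 32, 64)

# Stand-in for the real contrastive loss computed over the global batch
loss = global_emb.pow(2).mean()
fabric.backward(loss)
optimizer.step()
optimizer.zero_grad()

Is this how the per-GPU batches are supposed to be combined, or is there a better-supported way to get a large effective batch for the contrastive loss under FSDP?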