NewConnectionError with DDP on a grave GPU cloud

Early in my training run on the cloud, I get some connection-broken messages (see the log below). Training itself looks fine, but validation shows that something is seriously wrong, as if my complex validation dataset is not being iterated correctly.

I previously deployed another codebase that uses dist.init_xxx for multi-node, multi-GPU training on this platform, and it worked. Here I don’t know what is blocking the communication (if that is even the cause). Even with devices=1 and strategy=None the same problem appears, while training on my own machine is fine with 1 or 2 GPUs.

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
              Preparing Data...                                                 
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
              Added key: store_based_barrier_key:1 to store for rank: 0         
              Rank 0: Completed store-based barrier for                         
              key:store_based_barrier_key:1 with 1 nodes.                       
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Retrying (Retry(total=2, connect=None, read=None, redirect=None,  
              status=None)) after connection broken by                          
              'NewConnectionError('<urllib3.connection.HTTPSConnection object at
              0x2ba6dea76280>: Failed to establish a new connection: [Errno -2] 
              Name or service not known')': /api/5288891/envelope/              
              Retrying (Retry(total=1, connect=None, read=None, redirect=None,  
              status=None)) after connection broken by                          
              'NewConnectionError('<urllib3.connection.HTTPSConnection object at
              0x2ba6dea76490>: Failed to establish a new connection: [Errno -2] 
              Name or service not known')': /api/5288891/envelope/              
              Retrying (Retry(total=0, connect=None, read=None, redirect=None,  
              status=None)) after connection broken by                          
              'NewConnectionError('<urllib3.connection.HTTPSConnection object at
              0x2ba6dea765e0>: Failed to establish a new connection: [Errno -2] 
              Name or service not known')': /api/5288891/envelope/
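
For context, the `[Errno -2] Name or service not known` in the log above is a hostname-resolution failure raised for an HTTPS request made through urllib3. Below is a minimal sketch of a check that could be run on the cloud node to see whether hostnames resolve there at all. The helper name `can_resolve` and the default port are my own placeholders, not something taken from the log, and the hostname to test has to be filled in by hand since the log does not show it.

```python
import socket


def can_resolve(hostname: str, port: int = 443) -> bool:
    """Return True if `hostname` resolves to at least one address.

    Hypothetical helper for illustration; it only tests name resolution,
    not whether the remote service is actually reachable.
    """
    try:
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        # Same failure class as "[Errno -2] Name or service not known".
        return False
```

Calling `can_resolve("<host the job contacts>")` on both the cloud node and my own machine should show whether the two environments differ in name resolution. If the cloud node returns False for public hostnames, the node probably has no outbound DNS, which would match the retries in the log but would not by itself explain the validation problem.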