Early in my training run on the cloud, I get some "connection broken" messages (see the log below). Training looks fine, but validation shows that something is seriously wrong, as if my complex validation dataset is not being iterated correctly.
I previously ran a different codebase on this platform that used dist.init_xxx for multi-node, multi-GPU training, and it worked. Here I don't know what is blocking the communication (if that is even the cause). Even with devices=1 and strategy=None, the same problem appears, while training on my own machine works fine with 1 or 2 GPUs.
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Preparing Data...
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2ba6dea76280>: Failed to establish a new connection: [Errno -2] Name or service not known')': /api/5288891/envelope/
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2ba6dea76490>: Failed to establish a new connection: [Errno -2] Name or service not known')': /api/5288891/envelope/
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2ba6dea765e0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /api/5288891/envelope/
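
The "[Errno -2] Name or service not known" part of the retry messages above is the error reported when a hostname cannot be resolved, so one thing worth checking is whether the cloud node has working DNS at all. Below is a minimal sketch of such a check using only the Python standard library. The function name can_resolve and the default port 443 are my own choices for illustration, and the log does not show which host the HTTPS client was trying to reach, so the hostname to test is something you would have to fill in yourself.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if `host` resolves to at least one address on this machine.

    Hypothetical helper for illustration; it is not part of any library.
    """
    try:
        # getaddrinfo performs the same name lookup an HTTPS client would do
        # before opening a connection.
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Raised on lookup failures such as
        # "[Errno -2] Name or service not known".
        return False
```

Running something like print(can_resolve("example.com")) on both the cloud node and my own machine would show whether name resolution differs between the two environments. This only tests DNS for outgoing HTTPS requests; by itself it says nothing about whether the validation dataset is iterated correctly.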