Hi, I’m currently working on GCP using a custom training job, so all the code for my experiment lives in a single Python script.
The script doesn’t just do the training: it also creates a Vertex AI Experiment run, logs parameters, metrics, and artifacts, and so on.
The problem I’m running into is that DeepSpeed distributes training across multiple GPUs by re-running the entire script once per process. When the code that creates the experiment run is executed more than once, an exception is raised, because two experiment runs cannot have the same name.
When I read about the DeepSpeed strategy, I thought only the trainer.fit() part would run in parallel. Is there any way to have some code that runs only once? (I sketch one idea I’m considering after the code below.)
The code is something like this:
...
# this should only be executed once
aiplatform.start_run(run_id)
aiplatform.log_params(params)
...
# this is the part that needs to be distributed across multiple GPUs
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="deepspeed_stage_2", ...)
trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)
...
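For what it’s worth, one workaround I’m considering (but haven’t tried yet, so I’m not sure it’s idiomatic) is guarding the tracking calls behind a rank check. My understanding is that Lightning sets the LOCAL_RANK environment variable in the worker processes it re-launches, so something like this might run the Vertex calls only once:

import os

from google.cloud import aiplatform

# Only the first process (no LOCAL_RANK set, or LOCAL_RANK == 0)
# creates the experiment run; the re-launched workers skip it.
if int(os.environ.get("LOCAL_RANK", "0")) == 0:
    aiplatform.start_run(run_id)   # same run_id as in the snippet above
    aiplatform.log_params(params)  # same params as in the snippet above

Alternatively, if I understand it correctly, the rank_zero_only decorator from pytorch_lightning.utilities might achieve the same thing. Is either of these the right approach?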
Thanks!