I am implementing a dual-encoder for Entity Linking. I was able to train the dual-encoder on 8 GPUs with DDPStrategy.
As the next step, I tried to add hard-negative mining during training. At the start of every epoch I need to encode all candidates (say, all Wikipedia articles) and save them. The code looks like this:
```python
def on_train_epoch_start(self):
    all_vecs = []
    all_wikipedia_dataloader = create_dataloader(...)
    with torch.no_grad():
        for batch in tqdm(all_wikipedia_dataloader):
            # hidden vector of the [CLS] token
            all_vecs.append(self.encoder(batch)[0][:, 0, :])
```
Although this code works, I noticed that each GPU encodes all of the candidates (the data is NOT split!). This is time-consuming, even though I have 8 GPUs available for this step. In training_step the code does apply the DDP strategy and splits the data.
So my question is: how can I use multiple GPUs outside of training_step?
My code is similar to this, but that code is separate from the training code.
Hey, you would have to do the splitting yourself. In your case it's probably easiest to use the DistributedSampler from PyTorch in your dataloader and then call all_gather on the resulting all_vecs. Note, however, that the DistributedSampler repeats samples in the last batch so that every rank sees the same number of items.
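A minimal sketch of what that could look like inside your hook. It assumes a map-style `all_wikipedia_dataset` and batches that are dicts of tensors; both of those, the batch size, and the `candidate_vecs` attribute are placeholders for whatever your `create_dataloader(...)` actually builds:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm

def on_train_epoch_start(self):
    # Shard the candidate set so each GPU only encodes
    # roughly len(dataset) / world_size articles instead of all of them.
    sampler = DistributedSampler(
        all_wikipedia_dataset,              # placeholder for your candidate dataset
        num_replicas=self.trainer.world_size,
        rank=self.global_rank,
        shuffle=False,                      # keep a deterministic order per rank
    )
    loader = DataLoader(all_wikipedia_dataset, batch_size=256, sampler=sampler)

    local_vecs = []
    self.encoder.eval()
    with torch.no_grad():
        for batch in tqdm(loader, disable=self.global_rank != 0):
            # assumed: each batch is a dict of tensors that must be moved to the GPU
            batch = {k: v.to(self.device) for k, v in batch.items()}
            # hidden vector of the [CLS] token for this rank's shard
            local_vecs.append(self.encoder(**batch)[0][:, 0, :])
    local_vecs = torch.cat(local_vecs, dim=0)        # (n_local, dim)

    # Gather every rank's shard onto every GPU: (world_size, n_local, dim).
    gathered = self.all_gather(local_vecs)
    # With shuffle=False, rank r holds items r, r + world_size, r + 2*world_size, ...,
    # so interleaving the shards restores the original dataset order.
    all_vecs = gathered.permute(1, 0, 2).reshape(-1, gathered.size(-1))
    # The DistributedSampler pads by repeating samples so all ranks have equal
    # counts; after interleaving, those duplicates end up at the tail, so trim them.
    self.candidate_vecs = all_vecs[: len(all_wikipedia_dataset)]
    self.encoder.train()
```

If the exact ordering of candidates doesn't matter for your mining step, you can skip the interleave-and-trim part and simply flatten the gathered tensor, at the cost of a few duplicated rows from the sampler's padding.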