I am implementing a dual-encoder for Entity Linking. I was able to train the dual-encoder on 8 GPUs with DDPStrategy.
As the next step, I tried to add hard-negative mining during training. At the start of every epoch I need to encode all candidates (say, all Wikipedia articles) and save them. The code looks like this:
```python
def on_train_epoch_start(self):
    all_vecs = []
    all_wikipedia_dataloader = create_dataloader(...)
    with torch.no_grad():
        for batch in tqdm(all_wikipedia_dataloader):
            # hidden vector of the [CLS] token
            all_vecs.append(self.encoder(batch)[0][:, 0, :])
```
Although this code works, I noticed that each GPU encodes all of the candidates (the data is NOT split!). This is time-consuming, even though I have 8 GPUs available for this step. In training_step the code does apply the DDP strategy and splits the data.
So my question is: how can I use multiple GPUs outside of training_step?
My code is similar to this, but that code is separate from the training code.
Hey, you would have to do the splitting yourself. In your case it's probably easiest to use the DistributedSampler from PyTorch in your dataloader and then call all_gather on the resulting all_vecs. Note, however, that the DistributedSampler repeats samples in the last batch so that every rank sees the same number of items.
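A minimal sketch of what that could look like inside your hook. It assumes a map-style `all_wikipedia_dataset` and batches that are dicts of tensors; both of those, the batch size, and the `candidate_vecs` attribute are placeholders for whatever your `create_dataloader(...)` actually builds:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm

def on_train_epoch_start(self):
    # Shard the candidate set so each GPU only encodes
    # roughly len(dataset) / world_size articles instead of all of them.
    sampler = DistributedSampler(
        all_wikipedia_dataset,              # placeholder for your candidate dataset
        num_replicas=self.trainer.world_size,
        rank=self.global_rank,
        shuffle=False,                      # keep a deterministic order per rank
    )
    loader = DataLoader(all_wikipedia_dataset, batch_size=256, sampler=sampler)

    local_vecs = []
    self.encoder.eval()
    with torch.no_grad():
        for batch in tqdm(loader, disable=self.global_rank != 0):
            # assumed: each batch is a dict of tensors that must be moved to the GPU
            batch = {k: v.to(self.device) for k, v in batch.items()}
            # hidden vector of the [CLS] token for this rank's shard
            local_vecs.append(self.encoder(**batch)[0][:, 0, :])
    local_vecs = torch.cat(local_vecs, dim=0)        # (n_local, dim)

    # Gather every rank's shard onto every GPU: (world_size, n_local, dim).
    gathered = self.all_gather(local_vecs)
    # With shuffle=False, rank r holds items r, r + world_size, r + 2*world_size, ...,
    # so interleaving the shards restores the original dataset order.
    all_vecs = gathered.permute(1, 0, 2).reshape(-1, gathered.size(-1))
    # The DistributedSampler pads by repeating samples so all ranks have equal
    # counts; after interleaving, those duplicates end up at the tail, so trim them.
    self.candidate_vecs = all_vecs[: len(all_wikipedia_dataset)]
    self.encoder.train()
```

If the exact ordering of candidates doesn't matter for your mining step, you can skip the interleave-and-trim part and simply flatten the gathered tensor, at the cost of a few duplicated rows from the sampler's padding.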