Trained weights are on CPU despite the model being trained on GPU

I’ve been tackling this problem for several hours without any luck.

I’m training a CNN on GPU and everything works great. However, once the model finishes training and I print the weights using model.state_dict(), I see them residing on the CPU. Even more perplexing: if I save the weights, load them back onto the GPU as follows, and print them again, they are still on the CPU:

torch.save(FacesModel.state_dict(), 'outputs/model.pth')
FacesModel = LitFacesModel()
FacesModel.load_state_dict(torch.load('outputs/model.pth', map_location='cuda:0'))

Here’s a reduced version of the model I’m training:

class LitFacesModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(32)
        self.bn2 = nn.BatchNorm2d(64)
        self.bn3 = nn.BatchNorm2d(128)
        self.bn4 = nn.BatchNorm2d(256)
        self.cnv1 = nn.Conv2d(3, 32, kernel_size=3)
        self.cnv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.cnv3 = nn.Conv2d(64, 128, kernel_size=3)
        self.cnv4 = nn.Conv2d(128, 256, kernel_size=3)
        self.rel = nn.ReLU()
        self.avg = nn.AvgPool2d(2, 2)
        self.flat = nn.Flatten()
        self.fc1 = nn.Linear(25600, 132)
        self.fc2 = nn.Linear(132, CLASSES)


    def forward(self, x):
        out = self.avg(self.bn1(self.rel(self.cnv1(x))))
        out = self.avg(self.bn2(self.rel(self.cnv2(out))))
        out = self.avg(self.bn3(self.rel(self.cnv3(out))))
        out = self.avg(self.bn4(self.rel(self.cnv4(out))))
        out = self.flat(out)
        out = self.fc1(out)
        out = self.fc2(out)
        return out

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), LR)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

    def training_step(self, batch):
        images, labels = batch
        pred = self(images)
        # compute and return the training loss (cross-entropy assumed for this multi-class setup)
        loss = nn.functional.cross_entropy(pred, labels)
        return loss

Any idea why this is happening?

Hey

Can you show me the Trainer settings you used?

Absolutely, here are the trainer settings:
trainer = pl.Trainer(max_epochs=EPOCHS, log_every_n_steps=10, profiler="simple", logger=logger, accelerator="gpu", devices=1)

How did you check that the weights are on CPU during training? Did you print them?
Note that after training finishes, the model gets moved back to CPU. So if you do this:

trainer.fit(model)
print(model.device)  # prints cpu

It will always print “cpu” once fit() has returned. Maybe that’s what confused you.
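
For example, to confirm this yourself (a minimal check, reusing the trainer and model from above):

trainer.fit(model)
print(next(model.parameters()).device)  # cpu - Lightning moves the model back to CPU after fit()

# While fit() is running with accelerator="gpu", the same check from inside the
# LightningModule (for example print(self.device) in training_step) reports cuda:0.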


Yes, I believe that got me confused. So after the model is trained, let’s say I’d like to test it on an image using pred = model(img). How can I run this prediction on GPU, given that img is on GPU as well?

It depends on whether you want to do it with or without Lightning.

With Lightning: You can implement the predict_step hook in the LightningModule and then call trainer.predict(model). Docs
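
A rough sketch of the Lightning route (predict_loader is a hypothetical DataLoader of the images you want to predict on):

# Inside LitFacesModel:
def predict_step(self, batch, batch_idx):
    # Lightning has already moved the model and the batch to the GPU at this point
    images = batch[0] if isinstance(batch, (list, tuple)) else batch
    return self(images)

# Then, reusing the trainer from above:
preds = trainer.predict(model, dataloaders=predict_loader)  # list of per-batch outputs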

Without Lightning: You need to do model = model.cuda() and then also move the input data to the GPU. Docs
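
And a minimal sketch of the plain-PyTorch route, assuming img is a single image tensor of shape (3, H, W):

model = model.cuda()
model.eval()
with torch.no_grad():
    img = img.cuda()
    pred = model(img.unsqueeze(0))  # unsqueeze(0) adds the batch dimension the Conv2d layers expect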


Works like a charm. Thanks!


That means you have to create a dataloader just to call predict. It seems to me that unless I need to run large, high-throughput inference, I’d rather rely on the GPU for training only and am okay with doing inference on the CPU.
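
For what it’s worth, that dataloader can be a very thin wrapper (a sketch assuming images is a float tensor of shape (N, 3, H, W) holding the samples to predict on):

from torch.utils.data import DataLoader, TensorDataset

# Wrap the tensor in a dataset so trainer.predict() can iterate over it in batches
predict_loader = DataLoader(TensorDataset(images), batch_size=32)
preds = trainer.predict(model, dataloaders=predict_loader)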