I want to calculate the Area Under the Receiver Operating Characteristic curve (AUROC) for my multi-class predictions, but I don't know which value to trust.
PyTorch Lightning comes with an AUROC metric. However, it gives different results depending on whether you call it on the final activation values or on the categorized (argmax) predictions.
Moreover, neither value matches the AUROC computed with scikit-learn.
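For reference, what I expect the metric to compute is the usual one-vs-rest construction: one ROC curve per class built from that class's predicted probability, averaged over the classes. With plain scikit-learn that would look roughly like this (toy data just to show the call; the real inputs come from the script below):

import numpy as np
from sklearn.metrics import roc_auc_score

# toy stand-ins: 200 samples, 10 classes, every class present
y_true = np.tile(np.arange(10), 20)                    # class indices, shape (200,)
logits = np.random.default_rng(0).normal(size=(200, 10))
y_score = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax -> probabilities

# one-vs-rest ROC per class, macro-averaged over the classes
print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))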
Here is a standalone version of the comparison:
from pl_bolts.models import LitMNIST
import pytorch_lightning as pl
from pytorch_lightning.metrics.classification import AUROC
from pytorch_lightning.metrics.functional import to_categorical
# https://github.com/reiinakano/scikit-plot
# pip install scikit-plot | conda install -c conda-forge scikit-plot
import matplotlib.pyplot as plt
import scikitplot as skplt

pl.seed_everything(0)

# model
model = LitMNIST(batch_size=64)
trainer = pl.Trainer(max_epochs=1, deterministic=True)

# train for 1 epoch so the AUROC is computed on decent predictions
trainer.fit(model)

# prevent grad error
model.freeze()

pl_auroc = AUROC()

for i, batch in enumerate(model.test_dataloader()):
    # get a prediction
    x, target = batch
    print(i, x.shape, target.shape)
    predict = model(x)

    # AUROC on the raw activations per class
    print(pl_auroc(predict, target))

    # AUROC after converting the activations to class indices
    predict_cat = to_categorical(predict)
    print(pl_auroc(predict_cat, target))

    # scikit-learn via scikit-plot's simplified interface
    predict_np = predict.numpy()  # add .cpu() first when running on GPU
    target_np = target.numpy()    # add .cpu() first when running on GPU
    skplt.metrics.plot_roc(target_np, predict_np)  # note: (target, predict) order, reversed w.r.t. pl_auroc
    plt.show()

    # predictions versus targets
    print(list(zip(predict_cat.numpy(), target.numpy())))

    # 3 predictions are enough
    if i >= 2:
        break
And this is the output from the above script:
# output batch 1:
# AUROC on activation: tensor(0.3796)
# AUROC on category: tensor(0.1111)
# AUROC scikit-learn micro-average: 0.98
# macro-average: 0.94
# (predict, target):
# [(7, 7), (2, 2), (1, 1), (0, 0), (4, 4), (1, 1), (4, 4), (9, 9), (6, 5), (9, 9), (0, 0), (6, 6), (9, 9), (0, 0), (1, 1), (5, 5), (9, 9), (7, 7), (2, 3), (4, 4), (9, 9), (6, 6), (6, 6), (5, 5), (4, 4), (0, 0), (7, 7), (4, 4), (0, 0), (1, 1), (3, 3), (1, 1), (3, 3), (4, 4), (7, 7), (2, 2), (7, 7), (1, 1), (2, 2), (1, 1), (1, 1), (7, 7), (4, 4), (2, 2), (3, 3), (5, 5), (1, 1), (2, 2), (4, 4), (4, 4), (6, 6), (3, 3), (5, 5), (5, 5), (6, 6), (0, 0), (4, 4), (1, 1), (9, 9), (5, 5), (7, 7), (2, 8), (9, 9), (3, 3)]
# output batch 2:
# AUROC on activation: tensor(0.4407)
# AUROC on category: tensor(0.0678)
# AUROC scikit-learn micro-average: 0.97
# macro-average: 0.94
# output batch 3:
# AUROC on activation: tensor(0.4038)
# AUROC on category: tensor(0.0962)
# AUROC scikit-learn micro-average: 0.93
# macro-average: 0.93
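To get comparable numbers instead of only the plot, this is roughly how I would compute the micro- and macro-averaged AUROC for one batch directly with scikit-learn. A sketch only: check_batch_auroc is a helper I made up for this question, and I have not verified that it matches scikit-plot's curve averaging exactly:

import torch
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def check_batch_auroc(predict, target, n_classes=10):
    # predict: raw per-class outputs, shape [batch, n_classes]; target: class indices, shape [batch]
    probs = torch.softmax(predict, dim=1).numpy()  # use predict.exp() instead if the model already returns log-probabilities
    target_np = target.numpy()
    target_bin = label_binarize(target_np, classes=list(range(n_classes)))  # shape [batch, n_classes]
    micro = roc_auc_score(target_bin, probs, average="micro")
    # note: the macro call raises if a class is missing from the batch
    macro = roc_auc_score(target_np, probs, multi_class="ovr", average="macro",
                          labels=list(range(n_classes)))
    return micro, macro

# inside the loop above: print(check_batch_auroc(predict, target))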
The AUROC values from pytorch_lightning.metrics.classification.AUROC
seem to be completely off.
Am I using AUROC wrong here?
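Part of why the "category" values look suspicious to me: an ROC curve needs a confidence score to rank the samples by, and after to_categorical all that is left is an argmax class index, which does not rank anything. A toy illustration (made-up, perfectly correct predictions):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.tile(np.arange(10), 20)  # 200 samples, all 10 classes present
y_pred_idx = y_true.copy()           # a "perfect" argmax prediction for every sample
# treat the predicted class index as if it were a score for class 3
print(roc_auc_score((y_true == 3).astype(int), y_pred_idx))  # ~0.33 although every prediction is correct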
P.S. Maybe a dedicated metrics category would be appropriate for this type of question?