I want to calculate the Area Under the Receiver Operating Characteristic curve (AUROC) for my multi-class predictions, but I don't know which value to trust.
PyTorch Lightning comes with an AUROC metric. However, it gives different results depending on whether you call it on the final activation values or on the categorized (argmax) predictions.
Moreover, neither value matches the AUROC computed with scikit-learn.
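For reference, what I expect the metric to compute is the usual one-vs-rest construction: one ROC curve per class built from that class's predicted probability, averaged over the classes. With plain scikit-learn that would look roughly like this (toy data just to show the call; the real inputs come from the script below):

import numpy as np
from sklearn.metrics import roc_auc_score

# toy stand-ins: 200 samples, 10 classes, every class present
y_true = np.tile(np.arange(10), 20)                    # class indices, shape (200,)
logits = np.random.default_rng(0).normal(size=(200, 10))
y_score = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax -> probabilities

# one-vs-rest ROC per class, macro-averaged over the classes
print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))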
Here is a standalone version of the comparison:
from pl_bolts.models import LitMNIST
import pytorch_lightning as pl
from pytorch_lightning.metrics.classification import AUROC
from pytorch_lightning.metrics.functional import to_categorical
# https://github.com/reiinakano/scikit-plot
# pip install scikit-plot | conda install -c conda-forge scikit-plot
import matplotlib.pyplot as plt
import scikitplot as skplt

pl.seed_everything(0)

# model
model = LitMNIST(batch_size=64)
trainer = pl.Trainer(max_epochs=1, deterministic=True)

# train for 1 epoch so the AUROC is computed on decent predictions
trainer.fit(model)

# prevent grad error
model.freeze()

pl_auroc = AUROC()

for i, batch in enumerate(model.test_dataloader()):
    # get a prediction
    x, target = batch
    print(i, x.shape, target.shape)
    predict = model(x)

    # AUROC on the raw activations per class
    print(pl_auroc(predict, target))

    # AUROC after converting the activations to class indices
    predict_cat = to_categorical(predict)
    print(pl_auroc(predict_cat, target))

    # scikit-learn via scikit-plot's simplified interface
    predict_np = predict.numpy()  # add .cpu() first when running on GPU
    target_np = target.numpy()    # add .cpu() first when running on GPU
    skplt.metrics.plot_roc(target_np, predict_np)  # note: (target, predict) order, reversed w.r.t. pl_auroc
    plt.show()

    # predictions versus targets
    print(list(zip(predict_cat.numpy(), target.numpy())))

    # 3 predictions are enough
    if i >= 2:
        break
And this is the output from the above script:
# output batch 1:
# AUROC on activation: tensor(0.3796)
# AUROC on category: tensor(0.1111)
# AUROC scikit-learn micro-average: 0.98
# macro-average: 0.94
# (predict, target):
# [(7, 7), (2, 2), (1, 1), (0, 0), (4, 4), (1, 1), (4, 4), (9, 9), (6, 5), (9, 9), (0, 0), (6, 6), (9, 9), (0, 0), (1, 1), (5, 5), (9, 9), (7, 7), (2, 3), (4, 4), (9, 9), (6, 6), (6, 6), (5, 5), (4, 4), (0, 0), (7, 7), (4, 4), (0, 0), (1, 1), (3, 3), (1, 1), (3, 3), (4, 4), (7, 7), (2, 2), (7, 7), (1, 1), (2, 2), (1, 1), (1, 1), (7, 7), (4, 4), (2, 2), (3, 3), (5, 5), (1, 1), (2, 2), (4, 4), (4, 4), (6, 6), (3, 3), (5, 5), (5, 5), (6, 6), (0, 0), (4, 4), (1, 1), (9, 9), (5, 5), (7, 7), (2, 8), (9, 9), (3, 3)]
# output batch 2:
# AUROC on activation: tensor(0.4407)
# AUROC on category: tensor(0.0678)
# AUROC scikit-learn micro-average: 0.97
# macro-average: 0.94
# output batch 3:
# AUROC on activation: tensor(0.4038)
# AUROC on category: tensor(0.0962)
# AUROC scikit-learn micro-average: 0.93
# macro-average: 0.93
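To get comparable numbers instead of only the plot, this is roughly how I would compute the micro- and macro-averaged AUROC for one batch directly with scikit-learn. A sketch only: check_batch_auroc is a helper I made up for this question, and I have not verified that it matches scikit-plot's curve averaging exactly:

import torch
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def check_batch_auroc(predict, target, n_classes=10):
    # predict: raw per-class outputs, shape [batch, n_classes]; target: class indices, shape [batch]
    probs = torch.softmax(predict, dim=1).numpy()  # use predict.exp() instead if the model already returns log-probabilities
    target_np = target.numpy()
    target_bin = label_binarize(target_np, classes=list(range(n_classes)))  # shape [batch, n_classes]
    micro = roc_auc_score(target_bin, probs, average="micro")
    # note: the macro call raises if a class is missing from the batch
    macro = roc_auc_score(target_np, probs, multi_class="ovr", average="macro",
                          labels=list(range(n_classes)))
    return micro, macro

# inside the loop above: print(check_batch_auroc(predict, target))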
The AUROC values from pytorch_lightning.metrics.classification.AUROC
seem to be completely off.
Am I using AUROC wrong here?
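Part of why the "category" values look suspicious to me: an ROC curve needs a confidence score to rank the samples by, and after to_categorical all that is left is an argmax class index, which does not rank anything. A toy illustration (made-up, perfectly correct predictions):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.tile(np.arange(10), 20)  # 200 samples, all 10 classes present
y_pred_idx = y_true.copy()           # a "perfect" argmax prediction for every sample
# treat the predicted class index as if it were a score for class 3
print(roc_auc_score((y_true == 3).astype(int), y_pred_idx))  # ~0.33 although every prediction is correct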
P.S. Maybe a dedicated metrics category would be appropriate for this type of question?