I am trying to write a simple object detection system (using Lightning) based on this tutorial. I am using a COCO-like data set, and the problem I am facing is with the metrics.
In the tutorial, the training loop looks like:
for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(
        model, optimizer, data_loader, device, epoch, print_freq=len(dataset_val)
    )
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_val, device=device)
where evaluate is:
@torch.no_grad()
def evaluate(model, data_loader, device):
    n_threads = torch.get_num_threads()
    # FIXME remove this and make paste_masks_in_image run on the GPU
    torch.set_num_threads(1)
    cpu_device = torch.device("cpu")
    model.eval()
    metric_logger = utils.MetricLogger(delimiter="  ")
    header = "Test:"

    coco = get_coco_api_from_dataset(data_loader.dataset)
    iou_types = _get_iou_types(model)
    coco_evaluator = CocoEvaluator(coco, iou_types)

    for images, targets, batch_id in metric_logger.log_every(data_loader, 100, header):
        images = list(img.to(device) for img in images)

        if torch.cuda.is_available():
            torch.cuda.synchronize()
        model_time = time.time()
        outputs = model(images)

        outputs = [{k: v.to(cpu_device) for k, v in t.items()} for t in outputs]
        model_time = time.time() - model_time

        res = {
            target["image_id"].item(): output
            for target, output in zip(targets, outputs)
        }
        evaluator_time = time.time()
        coco_evaluator.update(res)
        evaluator_time = time.time() - evaluator_time
        metric_logger.update(model_time=model_time, evaluator_time=evaluator_time)

    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    print("Averaged stats:", metric_logger)
    coco_evaluator.synchronize_between_processes()

    # accumulate predictions from all images
    coco_evaluator.accumulate()
    coco_evaluator.summarize()
    torch.set_num_threads(n_threads)
    return coco_evaluator
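For context, _get_iou_types just inspects the model to decide which COCO IoU types to evaluate; it is essentially the helper from the tutorial's engine.py, reproduced here for completeness:

import torch
import torchvision

def _get_iou_types(model):
    # Unwrap DistributedDataParallel so the isinstance checks see the real model.
    model_without_ddp = model
    if isinstance(model, torch.nn.parallel.DistributedDataParallel):
        model_without_ddp = model.module
    iou_types = ["bbox"]
    if isinstance(model_without_ddp, torchvision.models.detection.MaskRCNN):
        iou_types.append("segm")
    if isinstance(model_without_ddp, torchvision.models.detection.KeypointRCNN):
        iou_types.append("keypoints")
    return iou_types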
Based on this notebook, I wrote the following. val_dataloader instantiates the coco_evaluator:
def val_dataloader(self):
    valid_loader = torch.utils.data.DataLoader(
        self.valid_dataset,
        batch_size=self.cfg.data.batch_size,
        num_workers=self.cfg.data.num_workers,
        shuffle=False,
        pin_memory=True,
        collate_fn=collate_fn,
    )
    self.val = valid_loader.dataset

    # prepare coco evaluator
    coco = get_coco_api_from_dataset(valid_loader.dataset)
    self.iou_types = _get_iou_types(self.model)
    self.coco_evaluator = CocoEvaluator(coco, self.iou_types)

    return valid_loader
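For completeness, collate_fn here is the usual detection collate (essentially the helper from the tutorial's utils.py), which keeps variable-sized images and targets as tuples instead of stacking them:

def collate_fn(batch):
    # Transpose a list of (image, target, image_id) samples into
    # a tuple of images, a tuple of targets, and a tuple of image ids.
    return tuple(zip(*batch))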
The validation_step updates the coco_evaluator:
def validation_step(self, batch, batch_idx):
    images, targets, image_ids = batch
    outputs = self.model(images)

    targets_cpu = []
    outputs_cpu = []
    for target, output in zip(targets, outputs):
        t_cpu = {k: v.cpu() for k, v in target.items()}
        o_cpu = {k: v.cpu() for k, v in output.items()}
        targets_cpu.append(t_cpu)
        outputs_cpu.append(o_cpu)

    res = {
        target["image_id"].item(): output
        for target, output in zip(targets_cpu, outputs_cpu)
    }
    self.coco_evaluator.update(res)
    return res
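Each entry of res maps an image id to the raw model output that CocoEvaluator.update expects. Illustratively (dummy tensors; the keys follow torchvision's detection convention, and masks is only present for Mask R-CNN):

import torch

res_example = {
    42: {  # image_id
        "boxes": torch.zeros((3, 4)),                # predicted boxes, xyxy, FloatTensor[N, 4]
        "labels": torch.ones(3, dtype=torch.int64),  # predicted class ids, Int64Tensor[N]
        "scores": torch.rand(3),                     # confidences in [0, 1], FloatTensor[N]
        "masks": torch.zeros((3, 1, 300, 400)),      # soft instance masks, FloatTensor[N, 1, H, W]
    }
}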
At validation_epoch_end, the accumulation and summarization are carried out:
def validation_epoch_end(self, outputs):
    self.coco_evaluator.accumulate()
    self.coco_evaluator.summarize()

    # coco main metric: mAP @ IoU=0.50:0.95
    metric = self.coco_evaluator.coco_eval["bbox"].stats[0]
    self.log(self.validation_metric, metric)
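For reference, stats is pycocotools' 12-element summary vector, in the same order as the printed table, so stats[0] is the headline mAP (index mapping below for convenience):

# Meaning of coco_eval.stats[i], following pycocotools' COCOeval.summarize():
COCO_STATS_NAMES = [
    "AP @[ IoU=0.50:0.95 | area=   all | maxDets=100 ]",  # stats[0], the "mAP"
    "AP @[ IoU=0.50      | area=   all | maxDets=100 ]",
    "AP @[ IoU=0.75      | area=   all | maxDets=100 ]",
    "AP @[ IoU=0.50:0.95 | area= small | maxDets=100 ]",
    "AP @[ IoU=0.50:0.95 | area=medium | maxDets=100 ]",
    "AP @[ IoU=0.50:0.95 | area= large | maxDets=100 ]",
    "AR @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ]",
    "AR @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ]",
    "AR @[ IoU=0.50:0.95 | area=   all | maxDets=100 ]",
    "AR @[ IoU=0.50:0.95 | area= small | maxDets=100 ]",
    "AR @[ IoU=0.50:0.95 | area=medium | maxDets=100 ]",
    "AR @[ IoU=0.50:0.95 | area= large | maxDets=100 ]",
]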
The WRONG behavior:
On the PennFudanPed data set, the tutorial yields something like:
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.00s).
Accumulating evaluation results...
DONE (t=0.00s).
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=0.00s).
Accumulating evaluation results...
DONE (t=0.00s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.025
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.036
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.036
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.045
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.070
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.070
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.087
IoU metric: segm
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.033
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.036
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.036
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.058
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.090
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.090
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.113
That's it!
Whereas my rewrite yields (on the same data set):
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.00s).
Accumulating evaluation results...
DONE (t=0.00s).
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=0.01s).
Accumulating evaluation results...
DONE (t=0.00s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.232
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.329
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.252
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.238
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.600
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.600
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.600
IoU metric: segm
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.202
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.252
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.252
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.400
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.400
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.400
I know that there are no occurrences of small objects (area < 32**2, pycocotools' default threshold) in the validation data set:
{'small': 0, 'medium': 11, 'large': 29}
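For reference, that count comes from binning the ground-truth annotation areas; a minimal sketch using pycocotools' default thresholds (count_area_bins is my own helper name):

from collections import Counter

def count_area_bins(coco):
    # Bin ground-truth annotations by pycocotools' default area ranges:
    # small < 32**2 <= medium < 96**2 <= large.
    bins = Counter({"small": 0, "medium": 0, "large": 0})
    for ann in coco.anns.values():
        if ann["area"] < 32 ** 2:
            bins["small"] += 1
        elif ann["area"] < 96 ** 2:
            bins["medium"] += 1
        else:
            bins["large"] += 1
    return dict(bins)

# e.g. on the COCO object built in val_dataloader:
# count_area_bins(coco)  ->  {'small': 0, 'medium': 11, 'large': 29}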
So the tutorial outputs -1 (pycocotools' placeholder for area ranges with no ground truth) for:
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
IoU metric: segm
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
That looks consistent (is it really?). The same behavior is observed across all runs.
Things I have tested:
- The models used in both codes match.
- The data sets are the same in both codes.
- I have stored all the res dicts in a list for later processing in validation_epoch_end, as is done in the tutorial code, but that has not worked either.