
Fix/mAP #1834

Open · wants to merge 13 commits into base: develop

Conversation

@rafaelpadilla commented Apr 28, 2025

Description

The current implementation of Mean Average Precision (mAP) in supervision.metrics produces results that diverge from both pycocotools and published benchmarks (e.g., the Roboflow leaderboard).

This PR aligns supervision's mAP implementation with pycocotools, the official COCO evaluation tool, ensuring reliable, standardized metrics. 🚀

Key points 🏆

  • Self-contained: No external dependencies are needed to compute the metric.
  • Seamless integration: Metrics are computed directly in supervision/metrics/mean_average_precision.py and work natively with supervision.Detections objects.
  • Unchanged public API: Existing user code continues to work with the same interface. No impact for users! 🔥
import supervision as sv
from supervision.metrics import MeanAveragePrecision
preds   = sv.Detections(...)
targets = sv.Detections(...)
metric  = MeanAveragePrecision()
result  = metric.update(preds, targets).compute()
print(result.map50_95)

  • Same result object: MeanAveragePrecisionResult class still exposes the same properties:
print("Overall:")
print(f"map50_95: {result.map50_95}")
print(f"map50:     {result.map50}")
print(f"map75:     {result.map75}")
print(f"map50_95: {result.small_objects.map50_95}")
print(f"map50_95: {result.medium_objects.map50_95}")
print(f"map50_95: {result.large_objects.map50_95}")
  • Numerical parity: Evaluations match pycocotools to within 2e-5. ✅
    • A Colab notebook was created to compute mAP results for RT-Detr models purely with pycocotools here.
    • Another Colab notebook was created to compute mAP results for RT-Detr models using this proposed branch here.
    • A detailed results table comparing the results of both Colab notebooks (pure pycocotools vs. this PR) is here, showing a maximum deviation of 2e-5. 🎯

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

@rafaelpadilla rafaelpadilla changed the title Fix/m ap Fix/mAP Apr 28, 2025
@SkalskiP SkalskiP mentioned this pull request May 5, 2025
class EvaluationDataset:

Collaborator:
I'll be honest. I don't really like the idea of converting Detections into this intermediate format. What do we gain?

Author:
The main benefit of introducing an EvaluationDataset intermediate is alignment with pycocotools. It mirrors their internal structure and makes our implementation easier to understand, debug, and compare against their results — which is crucial for metric consistency and correctness.

Also, Detections doesn't currently expose image IDs or a clean way to separate predictions from ground truth. EvaluationDataset fills that gap by organizing data in a way that's much closer to what pycocotools expects, making it easier to implement metrics like mAP and potentially recall in the future.

If the Detections class evolves to include this structure natively, we could definitely revisit. But for now, this abstraction helps keep the logic cleaner, more testable, and easier to extend without bloating the evaluator.
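
For a sense of what this intermediate looks like, here is a minimal, hypothetical sketch (class and field names are illustrative, not the PR's exact definitions):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EvaluationRecord:
    # Hypothetical record mirroring pycocotools' per-annotation structure.
    image_id: int
    category_id: int
    bbox: List[float]   # [x, y, width, height], COCO convention
    score: float = 1.0  # meaningful only for predictions
    iscrowd: int = 0    # meaningful only for ground truth

@dataclass
class EvaluationDatasetSketch:
    # Annotations grouped by image_id, with predictions and ground truth
    # kept separate, as pycocotools expects.
    ground_truth: Dict[int, List[EvaluationRecord]] = field(default_factory=dict)
    predictions: Dict[int, List[EvaluationRecord]] = field(default_factory=dict)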

Comment on lines 529 to 593
def _iou_with_jaccard(
    dt: List[List[float]], gt: List[List[float]], is_crowd: List[bool]
) -> np.ndarray:
    """
    Calculate the intersection over union (IoU) between detection bounding boxes (dt)
    and ground-truth bounding boxes (gt).
    Reference: https://github.com/rafaelpadilla/review_object_detection_metrics

    Args:
        dt (List[List[float]]): List of detection bounding boxes in the \
            format [x, y, width, height].
        gt (List[List[float]]): List of ground-truth bounding boxes in the \
            format [x, y, width, height].
        is_crowd (List[bool]): List indicating if each ground-truth bounding box \
            is a crowd region or not.

    Returns:
        np.ndarray: Array of IoU values of shape (len(dt), len(gt)).
    """
    assert len(is_crowd) == len(gt), "iou(iscrowd=) must have the same length as gt"
    if len(dt) == 0 or len(gt) == 0:
        return np.array([])
    ious = np.zeros((len(dt), len(gt)), dtype=np.float64)
    for g_idx, g in enumerate(gt):
        for d_idx, d in enumerate(dt):
            ious[d_idx, g_idx] = _jaccard(d, g, is_crowd[g_idx])
    return ious


def _jaccard(box_a: List[float], box_b: List[float], is_crowd: bool) -> float:
    """
    Calculate the Jaccard index (intersection over union) between two bounding boxes.
    If a gt object is marked as "iscrowd", a dt is allowed to match any subregion
    of the gt. Choosing gt' in the crowd gt that best matches the dt can be done using
    gt' = intersect(dt, gt). Since by definition union(gt', dt) = dt, computing
    iou(gt, dt, iscrowd) = iou(gt', dt) = area(intersect(gt, dt)) / area(dt)

    Args:
        box_a (List[float]): Box coordinates in the format [x, y, width, height].
        box_b (List[float]): Box coordinates in the format [x, y, width, height].
        is_crowd (bool): Flag indicating if the second box is a crowd region or not.

    Returns:
        float: Jaccard index between the two bounding boxes.
    """
    xa, ya, x2a, y2a = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    xb, yb, x2b, y2b = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]

    # Innermost left x
    xi = max(xa, xb)
    # Innermost right x
    x2i = min(x2a, x2b)
    # Same for y
    yi = max(ya, yb)
    y2i = min(y2a, y2b)

    # Calculate areas
    Aa = max(x2a - xa, 0.0) * max(y2a - ya, 0.0)
    Ab = max(x2b - xb, 0.0) * max(y2b - yb, 0.0)
    Ai = max(x2i - xi, 0.0) * max(y2i - yi, 0.0)

    if is_crowd:
        return Ai / (Aa + EPS)

    return Ai / (Aa + Ab - Ai + EPS)
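
To make the crowd behavior concrete, a small worked example (assuming EPS is a tiny constant such as 1e-10, as the code above implies):

dt_box = [0.0, 0.0, 10.0, 10.0]  # [x, y, w, h]; area = 100
gt_box = [5.0, 0.0, 10.0, 10.0]  # overlaps dt in a 5x10 = 50 region

# Standard IoU: 50 / (100 + 100 - 50) = 1/3
print(_jaccard(dt_box, gt_box, is_crowd=False))  # ~0.3333

# Crowd IoU: intersection over the detection area only, 50 / 100
print(_jaccard(dt_box, gt_box, is_crowd=True))   # ~0.5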

Collaborator:

Let’s reimplement this in a vectorized way using numpy and move it to detection/utils.py.

Author:

Done!

else:
    area = None

if use_iscrowd or use_precomputed_area:

Collaborator:
Does it make sense to have two separate flags if, in the end, they are connected with `or`?

Author:
Thanks for bringing this up — you're absolutely right. Initially, I aimed to keep each part independent to provide more flexibility for external users. However, after reviewing it more closely, I agree that a single flag is sufficient and cleaner. I've updated the code accordingly to keep only one flag.

Collaborator:
I'm a bit unsure about the name of this flag. Since we're loading both iscrowd and area, calling it use_iscrowd feels a bit misleading.

Author:
I agree that use_iscrowd might be a bit misleading when you first see it. However, the evaluation logic needs to account for the iscrowd property when it’s present in the dataset. This is necessary to produce results consistent with pycocotools.

Also, the use_iscrowd flag (which defaults to False) is essential for maintaining compatibility with existing tests, particularly test/dataset/formats/test_coco.py::test_coco_annotations_to_detections.

In short, the use_iscrowd flag serves two purposes:

  • It includes the iscrowd property from the dataset in the Detection objects, which is critical for matching pycocotools behavior.
  • It ensures backward compatibility with existing tests.

To keep things simple and closer to the original logic, I kept the use_iscrowd flag only in the coco_annotations_to_detections function. I hope that addresses your concern, but happy to adjust if you have other suggestions.
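
As a rough illustration of the intent (a sketch only; the actual coco_annotations_to_detections in this PR may differ in its details), the flag gates whether iscrowd is carried into the Detections.data dictionary:

import numpy as np
import supervision as sv

def coco_annotations_to_detections_sketch(image_annotations, use_iscrowd=False):
    if not image_annotations:
        return sv.Detections.empty()
    # Convert COCO [x, y, w, h] boxes to the [x1, y1, x2, y2] format
    # that sv.Detections expects.
    xywh = np.array([a["bbox"] for a in image_annotations], dtype=np.float64)
    xyxy = xywh.copy()
    xyxy[:, 2:] += xyxy[:, :2]
    class_id = np.array([a["category_id"] for a in image_annotations])

    data = {}
    if use_iscrowd:
        # Default missing fields to 0 so plain COCO/YOLO exports still load.
        data["iscrowd"] = np.array([a.get("iscrowd", 0) for a in image_annotations])
    return sv.Detections(xyxy=xyxy, class_id=class_id, data=data)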

@@ -159,16 +183,29 @@ def detections_to_coco_annotations(
    return coco_annotations, annotation_id


def get_coco_class_index_mapping(annotations_path: str) -> Dict[int, int]:

Collaborator:
This function is not used.

Author:
You're correct — this function is not currently invoked in the main pipeline. However, it is essential for evaluating certain models whose class ID schemes differ from those used in the COCO dataset.

Specifically, some models use sequential class IDs (e.g., 0 to 79 for 80 classes), whereas COCO's official annotations intentionally skip some IDs. You can see a detailed breakdown of these skipped IDs in this spreadsheet.

To address this mismatch, this function is very useful. A practical example of this mapping is in this Colab notebook, where get_coco_class_index_mapping is applied to reproduce results consistent with the roboflow/model-leaderboard.
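
To illustrate the idea (a hypothetical helper, not the PR's exact code): COCO's 80 categories use non-contiguous IDs between 1 and 90 (e.g., 12 and 26 are skipped), so a sequential-to-COCO mapping can be built by sorting the category IDs found in the annotation file:

import json

def sequential_to_coco_id_mapping(annotations_path: str) -> dict:
    # Map a model's contiguous class IDs (0..79) to COCO's sparse IDs.
    with open(annotations_path) as f:
        categories = json.load(f)["categories"]
    coco_ids = sorted(category["id"] for category in categories)
    return {sequential: coco for sequential, coco in enumerate(coco_ids)}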

Collaborator:

Since in your examples you always reverse the dictionary right after get_coco_class_index_mapping, maybe it would be easier to just return the reversed mapping directly?

class_mapping = get_coco_class_index_mapping(annotation_file)
inv_class_mapping = {v: k for k, v in class_mapping.items()}

Author (Jun 1, 2025):

✔️ I have updated the code to return the reversed mapping. Thanks for the suggestion!

@rafaelpadilla (Author):

Hey @SkalskiP,

Thank you for your review! 🙌
I've addressed all the comments and updated the code accordingly.
Let me know if everything looks good on your side.

@rafaelpadilla rafaelpadilla requested a review from SkalskiP May 8, 2025 03:07
@rishabh-mondal:

Hi @SkalskiP @onuralpszr,
Do you have any updates on the mAP calculation error issue? Could you please let me know when it might be resolved?
Thank you!

@@ -1325,3 +1325,73 @@ def spread_out_boxes(
    xyxy_padded[:, [2, 3]] += force_vectors

    return pad_boxes(xyxy_padded, px=-1)


def _jaccard(box_a: List[float], box_b: List[float], is_crowd: bool) -> float:

Collaborator:
What I meant was to enable batch processing of the boxes, so that we wouldn't need the double for loop

for g_idx, g in enumerate(gt):
    for d_idx, d in enumerate(dt):

in the iou_with_jaccard function.

Author (Jun 1, 2025):

Please see my explanation in the next conversation; I addressed this issue there.

Comment on lines 1371 to 1373
def iou_with_jaccard(
    dt: List[List[float]], gt: List[List[float]], is_crowd: List[bool]
) -> np.ndarray:

Collaborator:
I’d like to keep the supervision API consistent. So far, we have box_iou_batch and mask_iou_batch. I think it would make sense to rename this function to box_iou_batch_with_jaccard. Alternatively, we could consider merging box_iou_batch with iou_with_jaccard, and add support for an optional is_crowd: Optional[np.ndarray] argument.

If you decide to keep box_iou_batch_with_jaccard as a separate function, please rename dt and gt to be consistent with the naming in box_iou_batch and mask_iou_batch. The arguments should be named boxes_true: np.ndarray and boxes_detection: np.ndarray (or, for masks, masks_true: np.ndarray and masks_detection: np.ndarray).

Collaborator:
I think merging box_iou_batch with iou_with_jaccard makes the most sense.

Author:
Thanks for the thoughtful suggestions! I agree that aligning with the supervision API and aiming for consistency is important.

I also agree that merging box_iou_batch and iou_with_jaccard would be ideal in theory, especially to avoid maintaining two implementations. However, doing so leads to a measurable divergence from pycocotools mAP results.

In earlier tests, I tried replacing iou_with_jaccard with the existing box_iou_batch, but noticed small discrepancies in the IoU values (starting at a distant decimal place). While those differences seem minor, they add up across multiple detections and lead to different mAP scores. 😞

I believe the root cause is that gt_boxes are in float64 while dt_boxes are in float32. When passed together to box_iou_batch (which expects uniform np.ndarray inputs), the type coercion or mixed precision results in slightly different outcomes compared to the pycocotools implementation.

The original pycocotools IoU logic is implemented in Cython (source) and has its own internal handling of precision and memory layout. Since our goal here is to provide pure-Python code without relying on .pyx dependencies, the most accurate match I’ve been able to get is via the current iou_with_jaccard and _jaccard functions.

I haven’t found a reliable way to vectorize the logic using numpy that still replicates pycocotools outputs exactly, especially when accounting for is_crowd. If you have an idea on how to achieve that while maintaining numerical parity, I would genuinely love to learn. 🙏

For now, I have renamed the input arguments as you suggested to maintain naming consistency. But I would recommend keeping iou_with_jaccard and _jaccard as-is, since they yield results that are numerically aligned with pycocotools.
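
For reference, a straightforward NumPy vectorization of the crowd-aware IoU would look roughly like the sketch below (hypothetical name and an assumed epsilon); per the explanation above, it is numerically close but was not adopted because it does not reproduce pycocotools bit-for-bit:

import numpy as np

def box_iou_batch_with_jaccard_sketch(
    boxes_true: np.ndarray,       # (N, 4) ground truth in [x, y, w, h]
    boxes_detection: np.ndarray,  # (M, 4) detections in [x, y, w, h]
    is_crowd: np.ndarray,         # (N,) booleans
) -> np.ndarray:
    # Bottom-right corner coordinates.
    gt_br = boxes_true[:, :2] + boxes_true[:, 2:]
    dt_br = boxes_detection[:, :2] + boxes_detection[:, 2:]

    # Pairwise intersection areas, shape (M, N).
    lt = np.maximum(boxes_detection[:, None, :2], boxes_true[None, :, :2])
    rb = np.minimum(dt_br[:, None, :], gt_br[None, :, :])
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]

    area_dt = boxes_detection[:, 2] * boxes_detection[:, 3]
    area_gt = boxes_true[:, 2] * boxes_true[:, 3]

    union = area_dt[:, None] + area_gt[None, :] - inter
    # For crowd ground truths, normalize by the detection area instead.
    denominator = np.where(is_crowd[None, :], area_dt[:, None], union)
    return inter / (denominator + 1e-10)  # 1e-10: assumed epsilon, like EPS above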

Comment on lines 117 to 118
iscrowd = [0] * len(image_annotations)
area = None

Collaborator:
It looks like we’re not actually using iscrowd or area; we’re just returning empty dictionaries. I think it makes sense to remove both iscrowd and area.

Author:
Good point, those values weren’t being used.
✔️ I removed both iscrowd and area from this part of the code.

dt_boxes = [d["bbox"] for d in dt]

# Get the iscrowd flag for each gt
is_crowd = [int(o["iscrowd"]) for o in gt]

Collaborator:
When you load a dataset that doesn’t have iscrowd, this line causes the entire code to crash. We want users to be able to run evaluation on their own datasets, which might be in YOLO format or even COCO format without iscrowd. The code shouldn’t crash in this case.

Author:
You're absolutely right — thanks for pointing that out!
✔️ I've updated the evaluation logic so that if iscrowd is missing from the dataset, it now defaults to iscrowd=0.
This ensures compatibility with datasets in YOLO format or COCO variants that don’t include the iscrowd field, and prevents the code from crashing in those cases.
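
Concretely, the defensive change amounts to something like this sketch (replacing the direct key access shown earlier):

# Fall back to iscrowd=0 when the annotation lacks the field.
is_crowd = [int(o.get("iscrowd", 0)) for o in gt]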

@SkalskiP (Collaborator):

Hi @rafaelpadilla, I just tested YOLO11 by comparing the results obtained with the updated supervision to those from pycocotools, and we’re seeing about a ~1% discrepancy.

@SkalskiP (Collaborator):

Hi @rafaelpadilla, short update: I spent some time today diving a bit deeper. I independently benchmarked YOLO11 using both the new supervision and pycocotools, and I’m getting identical results between the two methods. However, these results differ from what our ML team reports in the paper. I reached out to them to learn more.

Here are Google Colabs I prepared:

I'll keep you posted. For now, let's focus on other comments in this PR.

…r datasets.

2) Making `get_coco_class_index_mapping` return the inversed mapping, to simplify its usage.

@rafaelpadilla (Author) commented Jun 1, 2025

Hi @SkalskiP,

Thanks again for your review!

I've addressed all your comments and incorporated your suggestions throughout the PR.

Regarding results consistency:

  • I re-ran the tests using your YOLO Colab notebook and confirmed that the results remain identical 🚀
  • I also re-tested with RT-DETR using my notebook and the results are consistent as well. 🚀
  • To make things easier to compare, I’ve consolidated both sets of results (YOLO and RT-DETR) into this Google Sheet.

All points that could be implemented without impacting metric accuracy have been addressed. For the others, I provided context explaining the rationale behind the current implementation and its alignment with pycocotools.

I believe this PR will make Supervision a simple and reliable tool for testing object detection models.

Let me know if anything else comes up.

@rafaelpadilla rafaelpadilla requested a review from SkalskiP June 1, 2025 18:25
@rafaelpadilla (Author) commented Jun 3, 2025

Sorry, I think I accidentally closed this PR. 🤭
Reopening it as it is ready to be reviewed and potentially merged.

@rafaelpadilla rafaelpadilla reopened this Jun 3, 2025