
Fix/mAP #1834

Open · wants to merge 13 commits into base: develop

Conversation

@rafaelpadilla commented Apr 28, 2025

Description

The current implementation of Mean Average Precision (mAP) in supervision.metrics produces results that diverge from both pycocotools and published benchmarks (e.g., the Roboflow leaderboard).

This PR aligns supervision's mAP implementation with pycocotools, the official COCO evaluation tool, ensuring reliable, standardized metrics. 🚀

Key points 🏆

  • Self-contained: No external dependencies are needed to compute the metric.
  • Seamless integration: Metrics are computed directly in supervision/metrics/mean_average_precision.py and work natively with supervision.Detections objects.
  • Unchanged public API: Existing user code continues to work with the same interface. No impact for users! 🔥
import supervision as sv
from supervision.metrics import MeanAveragePrecision
preds   = sv.Detections(...)
targets = sv.Detections(...)
metric  = MeanAveragePrecision()
result  = metric.update(preds, targets).compute()
print(result.map50_95)

  • Same result object: MeanAveragePrecisionResult class still exposes the same properties:
print("Overall:")
print(f"map50_95: {result.map50_95}")
print(f"map50:     {result.map50}")
print(f"map75:     {result.map75}")
print(f"map50_95: {result.small_objects.map50_95}")
print(f"map50_95: {result.medium_objects.map50_95}")
print(f"map50_95: {result.large_objects.map50_95}")
  • Numerical parity: Evaluations match pycocotools to within 2e-5. ✅
    • A Colab notebook was created to compute mAP results for RT-Detr models purely with pycocotools here.
    • Another Colab notebook was created to compute mAP results for RT-Detr models using this proposed branch here.
    • A detailed results table comparing the results of both Colab notebooks (pure pycocotools vs. this PR) is here, showing a maximum deviation of 2e-5. 🎯

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

@rafaelpadilla rafaelpadilla changed the title Fix/m ap Fix/mAP Apr 28, 2025
@SkalskiP SkalskiP mentioned this pull request May 5, 2025
class EvaluationDataset:

Collaborator:
I'll be honest. I don't really like the idea of converting Detections into this intermediate format. What do we gain?

Author:
The main benefit of introducing an EvaluationDataset intermediate is alignment with pycocotools. It mirrors their internal structure and makes our implementation easier to understand, debug, and compare against their results — which is crucial for metric consistency and correctness.

Also, Detections doesn't currently expose image IDs or a clean way to separate predictions from ground truth. EvaluationDataset fills that gap by organizing data in a way that's much closer to what pycocotools expects, making it easier to implement metrics like mAP and potentially recall in the future.

If the Detections class evolves to include this structure natively, we could definitely revisit. But for now, this abstraction helps keep the logic cleaner, more testable, and easier to extend without bloating the evaluator.
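
For a sense of what this intermediate looks like, here is a minimal, hypothetical sketch (class and field names are illustrative, not the PR's exact definitions):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EvaluationRecord:
    # Hypothetical record mirroring pycocotools' per-annotation structure.
    image_id: int
    category_id: int
    bbox: List[float]   # [x, y, width, height], COCO convention
    score: float = 1.0  # meaningful only for predictions
    iscrowd: int = 0    # meaningful only for ground truth

@dataclass
class EvaluationDatasetSketch:
    # Annotations grouped by image_id, with predictions and ground truth
    # kept separate, as pycocotools expects.
    ground_truth: Dict[int, List[EvaluationRecord]] = field(default_factory=dict)
    predictions: Dict[int, List[EvaluationRecord]] = field(default_factory=dict)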

Comment on lines 529 to 593
def _iou_with_jaccard(
    dt: List[List[float]], gt: List[List[float]], is_crowd: List[bool]
) -> np.ndarray:
    """
    Calculate the intersection over union (IoU) between detection bounding boxes (dt)
    and ground-truth bounding boxes (gt).
    Reference: https://github.com/rafaelpadilla/review_object_detection_metrics

    Args:
        dt (List[List[float]]): List of detection bounding boxes in the \
            format [x, y, width, height].
        gt (List[List[float]]): List of ground-truth bounding boxes in the \
            format [x, y, width, height].
        is_crowd (List[bool]): List indicating if each ground-truth bounding box \
            is a crowd region or not.

    Returns:
        np.ndarray: Array of IoU values of shape (len(dt), len(gt)).
    """
    assert len(is_crowd) == len(gt), "iou(iscrowd=) must have the same length as gt"
    if len(dt) == 0 or len(gt) == 0:
        return np.array([])
    ious = np.zeros((len(dt), len(gt)), dtype=np.float64)
    for g_idx, g in enumerate(gt):
        for d_idx, d in enumerate(dt):
            ious[d_idx, g_idx] = _jaccard(d, g, is_crowd[g_idx])
    return ious


def _jaccard(box_a: List[float], box_b: List[float], is_crowd: bool) -> float:
    """
    Calculate the Jaccard index (intersection over union) between two bounding boxes.
    If a gt object is marked as "iscrowd", a dt is allowed to match any subregion
    of the gt. Choosing gt' in the crowd gt that best matches the dt can be done using
    gt' = intersect(dt, gt). Since by definition union(gt', dt) = dt, computing
    iou(gt, dt, iscrowd) = iou(gt', dt) = area(intersect(gt, dt)) / area(dt)

    Args:
        box_a (List[float]): Box coordinates in the format [x, y, width, height].
        box_b (List[float]): Box coordinates in the format [x, y, width, height].
        is_crowd (bool): Flag indicating if the second box is a crowd region or not.

    Returns:
        float: Jaccard index between the two bounding boxes.
    """
    xa, ya, x2a, y2a = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    xb, yb, x2b, y2b = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]

    # Innermost left x
    xi = max(xa, xb)
    # Innermost right x
    x2i = min(x2a, x2b)
    # Same for y
    yi = max(ya, yb)
    y2i = min(y2a, y2b)

    # Calculate areas
    Aa = max(x2a - xa, 0.0) * max(y2a - ya, 0.0)
    Ab = max(x2b - xb, 0.0) * max(y2b - yb, 0.0)
    Ai = max(x2i - xi, 0.0) * max(y2i - yi, 0.0)

    if is_crowd:
        return Ai / (Aa + EPS)

    return Ai / (Aa + Ab - Ai + EPS)
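
To make the crowd behavior concrete, a small worked example (assuming EPS is a tiny constant such as 1e-10, as the code above implies):

dt_box = [0.0, 0.0, 10.0, 10.0]  # [x, y, w, h]; area = 100
gt_box = [5.0, 0.0, 10.0, 10.0]  # overlaps dt in a 5x10 = 50 region

# Standard IoU: 50 / (100 + 100 - 50) = 1/3
print(_jaccard(dt_box, gt_box, is_crowd=False))  # ~0.3333

# Crowd IoU: intersection over the detection area only, 50 / 100
print(_jaccard(dt_box, gt_box, is_crowd=True))   # ~0.5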

Collaborator:

Let’s reimplement this in a vectorized way using numpy and move it to detection/utils.py.

Author:

Done!

else:
    area = None

if use_iscrowd or use_precomputed_area:

Collaborator:
Does it make sense to have two separate flags if, in the end, they are connected with `or`?

Author:
Thanks for bringing this up — you're absolutely right. Initially, I aimed to keep each part independent to provide more flexibility for external users. However, after reviewing it more closely, I agree that a single flag is sufficient and cleaner. I've updated the code accordingly to keep only one flag.

Collaborator:
I'm a bit unsure about the name of this flag. Since we're loading both iscrowd and area, calling it use_iscrowd feels a bit misleading.

Author:
I agree that use_iscrowd might be a bit misleading when you first see it. However, the evaluation logic needs to account for the iscrowd property when it’s present in the dataset. This is necessary to produce results consistent with pycocotools.

Also, the use_iscrowd flag (which defaults to False) is essential for maintaining compatibility with existing tests, particularly test/dataset/formats/test_coco.py::test_coco_annotations_to_detections.

In short, the use_iscrowd flag serves two purposes:

  • It includes the iscrowd property from the dataset in the Detection objects, which is critical for matching pycocotools behavior.
  • It ensures backward compatibility with existing tests.

To keep things simple and closer to the original logic, I kept the use_iscrowd flag only in the coco_annotations_to_detections function. I hope that addresses your concern, but happy to adjust if you have other suggestions.
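
As a rough illustration of the intent (a sketch only; the actual coco_annotations_to_detections in this PR may differ in its details), the flag gates whether iscrowd is carried into the Detections.data dictionary:

import numpy as np
import supervision as sv

def coco_annotations_to_detections_sketch(image_annotations, use_iscrowd=False):
    if not image_annotations:
        return sv.Detections.empty()
    # Convert COCO [x, y, w, h] boxes to the [x1, y1, x2, y2] format
    # that sv.Detections expects.
    xywh = np.array([a["bbox"] for a in image_annotations], dtype=np.float64)
    xyxy = xywh.copy()
    xyxy[:, 2:] += xyxy[:, :2]
    class_id = np.array([a["category_id"] for a in image_annotations])

    data = {}
    if use_iscrowd:
        # Default missing fields to 0 so plain COCO/YOLO exports still load.
        data["iscrowd"] = np.array([a.get("iscrowd", 0) for a in image_annotations])
    return sv.Detections(xyxy=xyxy, class_id=class_id, data=data)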

@@ -159,16 +183,29 @@ def detections_to_coco_annotations(
    return coco_annotations, annotation_id


def get_coco_class_index_mapping(annotations_path: str) -> Dict[int, int]:

Collaborator:
This function is not used.

Author:
You're correct — this function is not currently invoked in the main pipeline. However, it is essential for evaluating certain models whose class ID schemes differ from those used in the COCO dataset.

Specifically, some models use sequential class IDs (e.g., 0 to 79 for 80 classes), whereas COCO's official annotations intentionally skip some IDs. You can see a detailed breakdown of these skipped IDs in this spreadsheet.

To address this mismatch, this function is very useful. A practical example of this mapping is in this Colab notebook, where get_coco_class_index_mapping is applied to reproduce results consistent with the roboflow/model-leaderboard.
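
To illustrate the idea (a hypothetical helper, not the PR's exact code): COCO's 80 categories use non-contiguous IDs between 1 and 90 (e.g., 12 and 26 are skipped), so a sequential-to-COCO mapping can be built by sorting the category IDs found in the annotation file:

import json

def sequential_to_coco_id_mapping(annotations_path: str) -> dict:
    # Map a model's contiguous class IDs (0..79) to COCO's sparse IDs.
    with open(annotations_path) as f:
        categories = json.load(f)["categories"]
    coco_ids = sorted(category["id"] for category in categories)
    return {sequential: coco for sequential, coco in enumerate(coco_ids)}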

Collaborator:

Since in your examples you always reverse the dictionary right after get_coco_class_index_mapping, maybe it would be easier to just return the reversed mapping directly?

class_mapping = get_coco_class_index_mapping(annotation_file)
inv_class_mapping = {v: k for k, v in class_mapping.items()}

Author (Jun 1, 2025):

✔️ I have updated the code to return the reversed mapping. Thanks for the suggestion!

@rafaelpadilla (Author):

Hey @SkalskiP,

Thank you for your review! 🙌
I've addressed all the comments and updated the code accordingly.
Let me know if everything looks good on your side.

@rafaelpadilla rafaelpadilla requested a review from SkalskiP May 8, 2025 03:07
@rishabh-mondal:

Hi @SkalskiP @onuralpszr,
Do you have any updates on the mAP calculation error issue? Could you please let me know when it might be resolved?
Thank you!

@@ -1325,3 +1325,73 @@ def spread_out_boxes(
    xyxy_padded[:, [2, 3]] += force_vectors

    return pad_boxes(xyxy_padded, px=-1)


def _jaccard(box_a: List[float], box_b: List[float], is_crowd: bool) -> float:

Collaborator:
What I meant was to enable batch processing of the boxes, so that we wouldn't need the double for loop

for g_idx, g in enumerate(gt):
    for d_idx, d in enumerate(dt):

in the iou_with_jaccard function.

Author (Jun 1, 2025):

Please see my explanation in the next conversation; I addressed this issue there.

Comment on lines 1371 to 1373
def iou_with_jaccard(
    dt: List[List[float]], gt: List[List[float]], is_crowd: List[bool]
) -> np.ndarray:

Collaborator:
I’d like to keep the supervision API consistent. So far, we have box_iou_batch and mask_iou_batch. I think it would make sense to rename this function to box_iou_batch_with_jaccard. Alternatively, we could consider merging box_iou_batch with iou_with_jaccard, and add support for an optional is_crowd: Optional[np.ndarray] argument.

If you decide to keep box_iou_batch_with_jaccard as a separate function, please rename dt and gt to be consistent with the naming in box_iou_batch and mask_iou_batch. The arguments should be named boxes_true: np.ndarray and boxes_detection: np.ndarray (or, for masks, masks_true: np.ndarray and masks_detection: np.ndarray).

Collaborator:
I think merging box_iou_batch with iou_with_jaccard makes the most sense.

Author:
Thanks for the thoughtful suggestions! I agree that aligning with the supervision API and aiming for consistency is important.

I also agree that merging box_iou_batch and iou_with_jaccard would be ideal in theory, especially to avoid maintaining two implementations. However, doing so leads to a measurable divergence from pycocotools mAP results.

In earlier tests, I tried replacing iou_with_jaccard with the existing box_iou_batch, but noticed small discrepancies in the IoU values (starting at a distant decimal place). While those differences seem minor, they add up across multiple detections and lead to different mAP scores. 😞

I believe the root cause is that gt_boxes are in float64 while dt_boxes are in float32. When passed together to box_iou_batch (which expects uniform np.ndarray inputs), the type coercion or mixed precision results in slightly different outcomes compared to the pycocotools implementation.

The original pycocotools IoU logic is implemented in Cython (source) and has its own internal handling of precision and memory layout. Since our goal here is to provide pure-Python code without relying on .pyx dependencies, the most accurate match I’ve been able to get is via the current iou_with_jaccard and _jaccard functions.

I haven’t found a reliable way to vectorize the logic using numpy that still replicates pycocotools outputs exactly, especially when accounting for is_crowd. If you have an idea on how to achieve that while maintaining numerical parity, I would genuinely love to learn. 🙏

For now, I have renamed the input arguments as you suggested to maintain naming consistency. But I would recommend keeping iou_with_jaccard and _jaccard as-is, since they yield results that are numerically aligned with pycocotools.
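
For reference, a straightforward NumPy vectorization of the crowd-aware IoU would look roughly like the sketch below (hypothetical name and an assumed epsilon); per the explanation above, it is numerically close but was not adopted because it does not reproduce pycocotools bit-for-bit:

import numpy as np

def box_iou_batch_with_jaccard_sketch(
    boxes_true: np.ndarray,       # (N, 4) ground truth in [x, y, w, h]
    boxes_detection: np.ndarray,  # (M, 4) detections in [x, y, w, h]
    is_crowd: np.ndarray,         # (N,) booleans
) -> np.ndarray:
    # Bottom-right corner coordinates.
    gt_br = boxes_true[:, :2] + boxes_true[:, 2:]
    dt_br = boxes_detection[:, :2] + boxes_detection[:, 2:]

    # Pairwise intersection areas, shape (M, N).
    lt = np.maximum(boxes_detection[:, None, :2], boxes_true[None, :, :2])
    rb = np.minimum(dt_br[:, None, :], gt_br[None, :, :])
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]

    area_dt = boxes_detection[:, 2] * boxes_detection[:, 3]
    area_gt = boxes_true[:, 2] * boxes_true[:, 3]

    union = area_dt[:, None] + area_gt[None, :] - inter
    # For crowd ground truths, normalize by the detection area instead.
    denominator = np.where(is_crowd[None, :], area_dt[:, None], union)
    return inter / (denominator + 1e-10)  # 1e-10: assumed epsilon, like EPS above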

Comment on lines 117 to 118
iscrowd = [0] * len(image_annotations)
area = None

Collaborator:
It looks like we’re not actually using iscrowd or area; we’re just returning empty dictionaries. I think it makes sense to remove both iscrowd and area.

Author:
Good point, those values weren’t being used.
✔️ I removed both iscrowd and area from this part of the code.

dt_boxes = [d["bbox"] for d in dt]

# Get the iscrowd flag for each gt
is_crowd = [int(o["iscrowd"]) for o in gt]

Collaborator:
When you load a dataset that doesn’t have iscrowd, this line causes the entire code to crash. We want users to be able to run evaluation on their own datasets, which might be in YOLO format or even COCO format without iscrowd. The code shouldn’t crash in this case.

Author:
You're absolutely right — thanks for pointing that out!
✔️ I've updated the evaluation logic so that if iscrowd is missing from the dataset, it now defaults to iscrowd=0.
This ensures compatibility with datasets in YOLO format or COCO variants that don’t include the iscrowd field, and prevents the code from crashing in those cases.
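
Concretely, the defensive change amounts to something like this sketch (replacing the direct key access shown earlier):

# Fall back to iscrowd=0 when the annotation lacks the field.
is_crowd = [int(o.get("iscrowd", 0)) for o in gt]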

@SkalskiP (Collaborator):

Hi @rafaelpadilla, I just tested YOLO11 by comparing the results obtained with the updated supervision to those from pycocotools, and we’re seeing about a ~1% discrepancy.

@SkalskiP (Collaborator):

Hi @rafaelpadilla, short update: I spent some time today diving a bit deeper. I independently benchmarked YOLO11 using both the new supervision and pycocotools, and I’m getting identical results between the two methods. However, these results differ from what our ML team reports in the paper. I reached out to them to learn more.

Here are Google Colabs I prepared:

I'll keep you posted. For now, let's focus on other comments in this PR.

…r datasets.

2) Making `get_coco_class_index_mapping` return the inversed mapping, to simplify its usage.

@rafaelpadilla (Author) commented Jun 1, 2025

Hi @SkalskiP,

Thanks again for your review!

I've addressed all your comments and incorporated your suggestions throughout the PR.

Regarding results consistency:

  • I re-ran the tests using your YOLO Colab notebook and confirmed that the results remain identical 🚀
  • I also re-tested with RT-DETR using my notebook and the results are consistent as well. 🚀
  • To make things easier to compare, I’ve consolidated both sets of results (YOLO and RT-DETR) into this Google Sheet.

All points that could be implemented without impacting metric accuracy have been addressed. For the others, I provided context explaining the rationale behind the current implementation and its alignment with pycocotools.

I believe this PR will make Supervision a simple and reliable tool for testing object detection models.

Let me know if anything else comes up.

@rafaelpadilla rafaelpadilla requested a review from SkalskiP June 1, 2025 18:25
@rafaelpadilla (Author) commented Jun 3, 2025

Sorry, I think I accidentally closed this PR. 🤭
Reopening it as it is ready to be reviewed and potentially merged.

@rafaelpadilla rafaelpadilla reopened this Jun 3, 2025