base: develop
Fix/mAP #1834
Conversation
```python
class EvaluationDataset:
```
I'll be honest. I don't really like the idea of converting `Detections` into this intermediate format. What do we gain?
The main benefit of introducing an `EvaluationDataset` intermediate is alignment with `pycocotools`. It mirrors their internal structure and makes our implementation easier to understand, debug, and compare against their results, which is crucial for metric consistency and correctness.

Also, `Detections` doesn't currently expose image IDs or a clean way to separate predictions from ground truth. `EvaluationDataset` fills that gap by organizing data in a way that's much closer to what `pycocotools` expects, making it easier to implement metrics like mAP and potentially recall in the future.

If the `Detections` class evolves to include this structure natively, we could definitely revisit. But for now, this abstraction helps keep the logic cleaner, more testable, and easier to extend without bloating the evaluator.
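For a concrete picture, here is a minimal sketch of the kind of per-image grouping such an intermediate could hold; the field names and annotation keys below are illustrative assumptions, not the PR's actual definition:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvaluationDataset:
    # Annotations grouped per image id, mirroring pycocotools' indexing.
    # Each annotation is a plain dict, for example:
    #   {"image_id": 0, "category_id": 1, "bbox": [x, y, w, h],
    #    "score": 0.9, "iscrowd": 0, "area": 123.0}
    ground_truth: Dict[int, List[dict]] = field(default_factory=dict)
    predictions: Dict[int, List[dict]] = field(default_factory=dict)
```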
```python
from typing import List

import numpy as np

# EPS is a small module-level epsilon used to avoid division by zero.


def _iou_with_jaccard(
    dt: List[List[float]], gt: List[List[float]], is_crowd: List[bool]
) -> np.ndarray:
    """
    Calculate the intersection over union (IoU) between detection bounding boxes (dt)
    and ground-truth bounding boxes (gt).
    Reference: https://github.com/rafaelpadilla/review_object_detection_metrics

    Args:
        dt (List[List[float]]): List of detection bounding boxes in the \
            format [x, y, width, height].
        gt (List[List[float]]): List of ground-truth bounding boxes in the \
            format [x, y, width, height].
        is_crowd (List[bool]): List indicating if each ground-truth bounding box \
            is a crowd region or not.

    Returns:
        np.ndarray: Array of IoU values of shape (len(dt), len(gt)).
    """
    assert len(is_crowd) == len(gt), "iou(iscrowd=) must have the same length as gt"
    if len(dt) == 0 or len(gt) == 0:
        return np.array([])
    ious = np.zeros((len(dt), len(gt)), dtype=np.float64)
    for g_idx, g in enumerate(gt):
        for d_idx, d in enumerate(dt):
            ious[d_idx, g_idx] = _jaccard(d, g, is_crowd[g_idx])
    return ious


def _jaccard(box_a: List[float], box_b: List[float], is_crowd: bool) -> float:
    """
    Calculate the Jaccard index (intersection over union) between two bounding boxes.
    If a gt object is marked as "iscrowd", a dt is allowed to match any subregion
    of the gt. Choosing gt' in the crowd gt that best matches the dt can be done
    using gt' = intersect(dt, gt). Since by definition union(gt', dt) = dt, computing
    iou(gt, dt, iscrowd) = iou(gt', dt) = area(intersect(gt, dt)) / area(dt).

    Args:
        box_a (List[float]): Box coordinates in the format [x, y, width, height].
        box_b (List[float]): Box coordinates in the format [x, y, width, height].
        is_crowd (bool): Flag indicating if the second box is a crowd region or not.

    Returns:
        float: Jaccard index between the two bounding boxes.
    """
    xa, ya, x2a, y2a = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    xb, yb, x2b, y2b = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]

    # Innermost left x
    xi = max(xa, xb)
    # Innermost right x
    x2i = min(x2a, x2b)
    # Same for y
    yi = max(ya, yb)
    y2i = min(y2a, y2b)

    # Calculate areas
    Aa = max(x2a - xa, 0.0) * max(y2a - ya, 0.0)
    Ab = max(x2b - xb, 0.0) * max(y2b - yb, 0.0)
    Ai = max(x2i - xi, 0.0) * max(y2i - yi, 0.0)

    if is_crowd:
        # Crowd gt: the denominator is the detection area only.
        return Ai / (Aa + EPS)

    return Ai / (Aa + Ab - Ai + EPS)
```
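A quick usage sketch of the helpers above, with made-up boxes:

```python
dt = [[10.0, 10.0, 20.0, 20.0]]  # one detection, [x, y, w, h]
gt = [[12.0, 12.0, 20.0, 20.0], [0.0, 0.0, 100.0, 100.0]]
is_crowd = [False, True]

ious = _iou_with_jaccard(dt, gt, is_crowd)
# ious[0, 0] is a regular IoU; for the crowd gt, ious[0, 1] divides the
# intersection by the detection area only, so a detection fully inside
# the crowd region scores 1.0.
```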
Let's reimplement this in a vectorized way using `numpy` and move it to `detection/utils.py`.
Done!
supervision/dataset/formats/coco.py (Outdated)
```python
else:
    area = None

if use_iscrowd or use_precomputed_area:
```
Does it make sense to have two separate flags if, in the end, they are connected with `or`?
Thanks for bringing this up — you're absolutely right. Initially, I aimed to keep each part independent to provide more flexibility for external users. However, after reviewing it more closely, I agree that a single flag is sufficient and cleaner. I've updated the code accordingly to keep only one flag.
I'm a bit unsure about the name of this flag. Since we're loading both `iscrowd` and `area`, calling it `use_iscrowd` feels a bit misleading.
I agree that `use_iscrowd` might be a bit misleading when you first see it. However, the evaluation logic needs to account for the `iscrowd` property when it's present in the dataset. This is necessary to produce results consistent with pycocotools.

Also, the `use_iscrowd` flag (which defaults to `False`) is essential for maintaining compatibility with existing tests, particularly `test/dataset/formats/test_coco.py::test_coco_annotations_to_detections`.

In short, the `use_iscrowd` flag serves two purposes:

- It includes the `iscrowd` property from the dataset in the `Detections` objects, which is critical for matching pycocotools behavior.
- It ensures backward compatibility with existing tests.

To keep things simple and closer to the original logic, I kept the `use_iscrowd` flag only in the `coco_annotations_to_detections` function, as sketched below. I hope that addresses your concern, but happy to adjust if you have other suggestions.
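For context, a rough sketch of where that flag sits; the surrounding signature is abbreviated from memory and may not match the PR exactly:

```python
from typing import List, Tuple

from supervision.detection.core import Detections


def coco_annotations_to_detections(
    image_annotations: List[dict],
    resolution_wh: Tuple[int, int],
    with_masks: bool = False,
    use_iscrowd: bool = False,  # PR addition: also carry iscrowd/area into Detections
) -> Detections:
    ...
```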
```python
@@ -159,16 +183,29 @@ def detections_to_coco_annotations(
    return coco_annotations, annotation_id


def get_coco_class_index_mapping(annotations_path: str) -> Dict[int, int]:
```
This function is not used.
You're correct: this function is not currently invoked in the main pipeline. However, it is essential for evaluating certain models whose class ID schemes differ from those used in the COCO dataset.

Specifically, some models use sequential class IDs (e.g., 0 to 79 for 80 classes), whereas COCO's official annotations intentionally skip some IDs. You can see a detailed breakdown of these skipped IDs in this spreadsheet.

To address this mismatch, this function is very useful. A practical example of this mapping is used in this Colab notebook, where `get_coco_class_index_mapping` is applied to reproduce results consistent with the roboflow/model-leaderboard.
Since in your examples you always reverse the dictionary right after `get_coco_class_index_mapping`, maybe it would be easier to just return the reversed mapping directly?

```python
class_mapping = get_coco_class_index_mapping(annotation_file)
inv_class_mapping = {v: k for k, v in class_mapping.items()}
```
✔️ I have updated the code to return the reversed mapping. Thanks for the suggestion!
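A minimal sketch of what the updated helper might now look like, assuming a standard COCO JSON with a top-level `categories` list (the exact implementation in the PR may differ):

```python
import json
from typing import Dict


def get_coco_class_index_mapping(annotations_path: str) -> Dict[int, int]:
    # Return the reversed mapping directly: dense model index -> COCO category id.
    with open(annotations_path) as f:
        categories = json.load(f)["categories"]
    coco_ids = sorted(category["id"] for category in categories)
    return {index: coco_id for index, coco_id in enumerate(coco_ids)}
```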
… detection/utils.py
Hey @SkalskiP, thank you for your review! 🙌
Hi @SkalskiP @onuralpszr,
```python
@@ -1325,3 +1325,73 @@ def spread_out_boxes(
    xyxy_padded[:, [2, 3]] += force_vectors

    return pad_boxes(xyxy_padded, px=-1)


def _jaccard(box_a: List[float], box_b: List[float], is_crowd: bool) -> float:
```
What I meant was to enable batch processing of the boxes, so that we wouldn't need the double `for` loop

```python
for g_idx, g in enumerate(gt):
    for d_idx, d in enumerate(dt):
```

in the `iou_with_jaccard` function.
Please, see my explanation in the next conversation. I addressed this issue there.
supervision/detection/utils.py (Outdated)
```python
def iou_with_jaccard(
    dt: List[List[float]], gt: List[List[float]], is_crowd: List[bool]
) -> np.ndarray:
```
I'd like to keep the supervision API consistent. So far, we have `box_iou_batch` and `mask_iou_batch`. I think it would make sense to rename this function to `box_iou_batch_with_jaccard`. Alternatively, we could consider merging `box_iou_batch` with `iou_with_jaccard`, and add support for an optional `is_crowd: Optional[np.ndarray]` argument.

If you decide to keep `box_iou_batch_with_jaccard` as a separate function, please rename `dt` and `gt` to be consistent with the naming in `box_iou_batch` and `mask_iou_batch`. The arguments should be named `boxes_true: np.ndarray` and `boxes_detection: np.ndarray` (or, for masks, `masks_true: np.ndarray` and `masks_detection: np.ndarray`).
I think merging `box_iou_batch` with `iou_with_jaccard` makes the most sense.
Thanks for the thoughtful suggestions! I agree that aligning with the supervision API and aiming for consistency is important.

I also agree that merging `box_iou_batch` and `iou_with_jaccard` would be ideal in theory, especially to avoid maintaining two implementations. However, doing so leads to a measurable divergence from pycocotools mAP results.

In earlier tests, I tried replacing `iou_with_jaccard` with the existing `box_iou_batch`, but noticed small discrepancies in IoU values (starting at distant decimal places). While those differences seem minor, they add up across multiple detections and lead to different mAP scores. 😞

I believe the root cause is that `gt_boxes` are in `float64` while `dt_boxes` are in `float32`. When passed together to `box_iou_batch` (which expects uniform `np.ndarray` inputs), the type coercion and mixed precision produce slightly different outcomes compared to the pycocotools implementation.
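As a small illustration of the precision point (this is not the actual supervision code), the same arithmetic lands on slightly different values in `float32` versus `float64`:

```python
import numpy as np

a32 = np.float32(0.1) * np.float32(3.0)  # computed in float32
a64 = np.float64(0.1) * np.float64(3.0)  # computed in float64
print(float(a32) - float(a64))           # tiny but nonzero difference
```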
The original pycocotools IoU logic is implemented in Cython (source) and has its own internal handling of precision and memory layout. Since our goal here is to provide a pure-Python implementation without relying on `.pyx` dependencies, the most accurate match I've been able to get is via the current `iou_with_jaccard` and `_jaccard` functions.

I haven't found a reliable way to vectorize the logic using numpy that still replicates pycocotools outputs exactly, especially when accounting for `is_crowd`. If you have an idea on how to achieve that while maintaining numerical parity, I would genuinely love to learn. 🙏

For now, I have renamed the input arguments as you suggested to maintain naming consistency. But I would recommend keeping `iou_with_jaccard` and `_jaccard` as-is, since they yield results that are numerically aligned with pycocotools.
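For reference, a vectorized numpy version along the lines the review suggests might look like the sketch below. The function and argument names are illustrative, and, per the discussion above, this form has not been validated to match pycocotools bit-for-bit:

```python
import numpy as np


def box_iou_with_crowd(
    boxes_true: np.ndarray,       # ground truth, shape (G, 4) in [x, y, w, h]
    boxes_detection: np.ndarray,  # detections, shape (D, 4) in [x, y, w, h]
    is_crowd: np.ndarray,         # shape (G,) bool flags
) -> np.ndarray:
    gt = np.asarray(boxes_true, dtype=np.float64)
    dt = np.asarray(boxes_detection, dtype=np.float64)
    is_crowd = np.asarray(is_crowd, dtype=bool)

    # Corner coordinates from xywh.
    gt_br = gt[:, :2] + gt[:, 2:4]
    dt_br = dt[:, :2] + dt[:, 2:4]

    # Pairwise intersection areas, shape (D, G).
    top_left = np.maximum(dt[:, None, :2], gt[None, :, :2])
    bottom_right = np.minimum(dt_br[:, None, :], gt_br[None, :, :])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    intersection = wh[..., 0] * wh[..., 1]

    area_dt = dt[:, 2] * dt[:, 3]
    area_gt = gt[:, 2] * gt[:, 3]

    # For crowd gt, the denominator is the detection area alone;
    # otherwise it is the usual union.
    union = area_dt[:, None] + area_gt[None, :] - intersection
    denominator = np.where(is_crowd[None, :], area_dt[:, None], union)
    return intersection / (denominator + np.finfo(np.float64).eps)
```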
supervision/dataset/formats/coco.py (Outdated)
```python
iscrowd = [0] * len(image_annotations)
area = None
```
It looks like we're not actually using `iscrowd` or `area`; we're just returning empty dictionaries. I think it makes sense to remove both `iscrowd` and `area`.
Good point, those values weren't being used.

✔️ I removed both `iscrowd` and `area` from this part of the code.
dt_boxes = [d["bbox"] for d in dt] | ||
|
||
# Get the iscrowd flag for each gt | ||
is_crowd = [int(o["iscrowd"]) for o in gt] |
When you load a dataset that doesn't have `iscrowd`, this line causes the entire code to crash. We want users to be able to run evaluation on their own datasets, which might be in YOLO format or even COCO format without `iscrowd`. The code shouldn't crash in this case.
You're absolutely right, thanks for pointing that out!

✔️ I've updated the evaluation logic so that if `iscrowd` is missing from the dataset, it now defaults to `iscrowd=0`.

This ensures compatibility with datasets in YOLO format or COCO variants that don't include the `iscrowd` field, and prevents the code from crashing in those cases.
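A minimal sketch of that defaulting change, assuming the dict-based annotations shown in the diff above:

```python
# Fall back to iscrowd=0 for datasets (e.g., YOLO-style) that omit the field.
is_crowd = [int(o.get("iscrowd", 0)) for o in gt]
```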
Hi @rafaelpadilla, I just tested YOLO11 by comparing the results obtained with the updated
Hi @rafaelpadilla, short update: I spent some time today diving a bit deeper. I independently benchmarked YOLO11 using both the new
Here are Google Colabs I prepared:
I'll keep you posted. For now, let's focus on other comments in this PR.
…r datasets. 2) Making `get_coco_class_index_mapping` return the inverted mapping, to simplify its usage.
Hi @SkalskiP, thanks again for your review! I've addressed all your comments and incorporated your suggestions throughout the PR.

Regarding results consistency: all points that could be implemented without impacting metric accuracy have been addressed. For the others, I provided context explaining the rationale behind the current implementation and its alignment with pycocotools. I believe this PR will make Supervision a simple and reliable tool for testing object detection models. Let me know if anything else comes up.
Sorry, I think I accidentally closed this PR. 🤭
Description
The current implementation of Mean Average Precision (mAP) in `supervision.metrics` produces results that diverge from both pycocotools and published benchmarks (e.g., the Roboflow leaderboard). This PR aligns supervision's mAP implementation with pycocotools, the official COCO evaluation tool, ensuring reliable, standardized metrics. 🚀
Key points 🏆

- The new mAP implementation lives in `supervision/metrics/mean_average_precision.py` and works directly with `supervision.Detections` objects.
- The `MeanAveragePrecisionResult` class still exposes the same properties.
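Based on the points above, driving the metric should look roughly like the sketch below; the property and argument names are taken from the existing supervision.metrics interface and may differ slightly from the final PR:

```python
import numpy as np
import supervision as sv
from supervision.metrics import MeanAveragePrecision

# Toy single-image example: one ground-truth box and one prediction.
targets = sv.Detections(
    xyxy=np.array([[10.0, 10.0, 30.0, 30.0]]), class_id=np.array([0])
)
predictions = sv.Detections(
    xyxy=np.array([[11.0, 11.0, 29.0, 29.0]]),
    class_id=np.array([0]),
    confidence=np.array([0.9]),
)

map_metric = MeanAveragePrecision()
map_metric.update(predictions, targets)
result = map_metric.compute()
print(result.map50_95, result.map50, result.map75)
```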
Type of change