Metric Decomposition

In this page, we first provide the proof for the metric decomposition used in the paper, and we define Compound Decomposable Metrics for more complex metrics.

Decomposable Metrics

Below, for a complex vision task, V, that can be represented as a sequential composition of the subtasks, i.e., V=vn...v2v1, we remind the metric compositionality definition.

MV=F(MV), where MV=i=1Nmvi,

where MV is a directly decomposable metric of the task V, mvi is a metric of the i-th subtask, and F is a monotonic function.

Performance metric AP

The performance of an MVC or a human on a given vision task is usually evaluated using task-specific performance metrics. For the complex vision tasks object detection (VD) and instance segmentation (VI) used in the paper, a commonly used performance metric is Average textit{Precision} (AP) measured for each object class. Each AP of a class is calculated by taking the area under the curve (AuC) of a Precision-Recall curve (PR-curve). The PR-curve is obtained as follows. For each δ[0,1], first collect all output bounding boxes with confidence score below δ; then calculate the precision of these boxes as TPTP+FP and recall as TPTP+FN, where TP is true positive, FP is false positive and FN is false negative. Finally, each pair of (precision, recall) value becomes a point on the PR-curve. A threshold of intersection over union (IoU), usually predefined (e.g., 0.5), is used to determine whether the predicted boundary of the object overlaps with the ground truth enough to be considered a true positive over a false positive. For example, for a predicted bounding box of a bus b and a ground truth bounding box b0, b is considered a true positive if IoU(b,b0)0.5 and false positive otherwise. However, calculating precision and recall for each δ[0,1] can be expensive; therefore, in practice, a δ value is only selected if there exists at least one bounding box with the confidence score that equals to δ. The procedure for calculating AP for instance segmentation is the same as that for object detection, except each point on the PR-curve is obtained with precision and recall values with the areas enclosed in the outlines rather than bounding boxes.

Decomposition of AP

Below, we show that AP is decomposable. First, AP for a class c is the AuC of the PR-curve. Therefore, by the definition of decomposable metrics, if each PR-curve is directly decomposable, AP is decomposable. Each point on the PR-curve is defined with precision and recall for a class label c given a δ and a IoU threshold tIoU. For object detection, for all detected objects d such that the confidence score confdδ, where c represents the ground truth class and IoUd is the max IoU of the bounding box compared to all ground truth boxes, we can represent Precision and Recall as: Precisionδ=|{d|IoUdtIoUcd=c}||{d|cd=c}| Recallδ=|{d|IoUdtIoUcd=c}|{d|c=c}| .

Thus, Precisionδ and Recallδ can be decomposed as follows:

decompose AP metric for detection

Note that IoUdtIoUcd=c is the condition for a detection d to be a true positive. We can see that for Precisionδ, |{d|IoUdtIoUcd=c}||{d|cd=c}| is the percentage of bounding boxes matched ground truth out of all output boxes of this class, which is precision for vL; |{d|cd=cIoUdtIoUcd=c}||{d|IoUdtIoUcd=c}| is the percentage of correct labels out of all bounding boxes matched ground truth of this class, which is precision for vC|L. Similarly for Recallδ, |{d|IoUdtIoUc=c}||{d|c=c}| is recall for vL and |{d|IoUdtIoUc=c}||{d|c=c}| is recall for vC|L. Since both Precisionδ and Recallδ are decomposable, each point on the PR-curve is decomposable, the PR-curve for each class c can be decomposed into the two subtasks, i.e., PRD= PRLPRC|L. As a result, AP for object detection is decomposable following Metric Decomposition.

Similarly, AP for instance segmentation is decomposable because we can decompose Precisionδ and Recallδ. For a segmented object s, let IoUsseg be the max IoU of the area enclosed by os compared to all ground truth segmentation, IoUsbox be the max IoU of bs compared to all tight bounding boxes around areas enclosed by ground truth segmentation. Precisionδ and Recallδ can be decomposed as the following:

decompose AP metric for instance segmentation

Following the decomposition of Precisionδ and Recall δ, we can decompose PR-curves for instance segmentation, i.e., PRI= PRL PR C|L PR S|C,L. AP used for instance segmentation can also be decomposed following Metric Decomposition.

Compound Decomposable Metrics

Some metrics, such as mean Average Precision (mAP), are more complex and are not decomposable according to our decomposition definition. mAP is defined as an average of AP for each class label c; therefore, mAP can be represented as a function of the precision-recall curve, PR, that is directly decomposable. For such metrics MV, we extend the decomposable metric definition into compound decomposable as follows:

MV=F(MV1,...,MVK), where MVk=i=1Nk,

where MVk is a decomposable metric of the task V, mvik is a metric of the i-th subtask, and F is a function that is monotonic with respect to every argument.

Now we show that mAP is decomposable. Given all class labels C, mAP is the average of each AP value of cC. Since AP for each c is decomposable as we showed above, mAP is also decomposable: mAPV=mean( AP 1,..., AP |C|).