Metric Decomposition

On this page, we first provide the proof of the metric decomposition used in the paper, and we then define Compound Decomposable Metrics for more complex metrics.

Decomposable Metrics

Below, for a complex vision task \(\mathbf{V}\) that can be represented as a sequential composition of subtasks, i.e., \(\mathbf{V} = \mathbf{v}_N \odot \dots \odot \mathbf{v}_2 \odot \mathbf{v}_1\), we recall the metric decomposability definition.

\(M_\mathbf{V} = F(M^{'}_\mathbf{V}) \text{, where } M^{'}_\mathbf{V} = \prod_{i=1}^{N}{m_{\mathbf{v}_i}}\),

where \(M^{'}_\mathbf{V}\) is a directly decomposable metric of the task \(\mathbf{V}\), \(m_{\mathbf{v}_i}\) is a metric of the i-th subtask, and \(F\) is a monotonic function.
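To make the definition concrete, here is a minimal Python sketch; the subtask scores and the choice of \(F\) (square root) are illustrative assumptions, not values from the paper:

```python
import math

def directly_decomposable_metric(subtask_scores):
    """M'_V: the product of the per-subtask metrics m_{v_1}, ..., m_{v_N}."""
    prod = 1.0
    for m in subtask_scores:
        prod *= m
    return prod

def task_metric(subtask_scores, F=math.sqrt):
    """M_V = F(M'_V) for a monotonic F (sqrt is an arbitrary example)."""
    return F(directly_decomposable_metric(subtask_scores))

# Two hypothetical subtasks, e.g., localization at 0.9 and classification at 0.8.
print(task_metric([0.9, 0.8]))
```

Because \(F\) is monotonic, improving any subtask metric can only improve the task metric, which is what the decomposition relies on.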

Performance metric AP

The performance of an MVC or a human on a given vision task is usually evaluated using task-specific performance metrics. For the complex vision tasks object detection (\(\mathbf{V}_D\)) and instance segmentation (\(\mathbf{V}_I\)) used in the paper, a commonly used performance metric is Average Precision (AP), measured for each object class. The AP of a class is calculated by taking the area under the curve (AuC) of a Precision-Recall curve (PR-curve). The PR-curve is obtained as follows. For each \(\delta\in[0, 1]\), first collect all output bounding boxes with confidence score of at least \(\delta\); then calculate the precision of these boxes as \(\frac{TP}{TP+FP}\) and the recall as \(\frac{TP}{TP+FN}\), where TP is the number of true positives, FP of false positives, and FN of false negatives. Each resulting (precision, recall) pair becomes a point on the PR-curve. A threshold on the intersection over union (IoU), usually predefined (e.g., \(0.5\)), determines whether the predicted boundary of the object overlaps with the ground truth enough to be counted as a true positive rather than a false positive. For example, for a predicted bounding box of a bus \(b\) and a ground truth bounding box \(b_0\), \(b\) is considered a true positive if \(IoU(b, b_0) \geq 0.5\) and a false positive otherwise. However, calculating precision and recall for every \(\delta\in[0, 1]\) can be expensive; therefore, in practice, a \(\delta\) value is only selected if there exists at least one bounding box whose confidence score equals \(\delta\). The procedure for calculating AP for instance segmentation is the same as that for object detection, except that each point on the PR-curve is obtained with precision and recall computed over the areas enclosed by the outlines rather than bounding boxes.
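The AP procedure described above can be sketched in Python as follows, assuming the TP/FP matching against ground truth (via the IoU threshold) has already been done; the detections below are invented for illustration:

```python
def average_precision(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) pairs.
    Returns the area under the precision-recall curve (AP)."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # one (recall, precision) point per existing confidence score
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    # Area under the PR-curve via a simple rectangle rule
    # (real benchmarks typically interpolate precision first).
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
print(average_precision(dets, num_ground_truth=4))
```

As in the text, thresholds \(\delta\) are taken only at confidence scores that actually occur among the detections.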

Decomposition of AP

Below, we show that AP is decomposable. First, AP for a class \(c\) is the AuC of the PR-curve. Therefore, by the definition of decomposable metrics, if each PR-curve is directly decomposable, AP is decomposable. Each point on the PR-curve is defined by the precision and recall for a class label \(c\) given a \(\delta\) and an IoU threshold \(t_{IoU}\). For object detection, consider all detected objects \(d\) with confidence score conf\(_d \geq \delta\); let \(c^*\) denote the ground truth class and \(IoU_d\) the maximum \(IoU\) of \(d\)'s bounding box compared to all ground truth boxes. We can then represent Precision and Recall as: Precision\(^{\delta} = \frac{|\{d|IoU_d \geq t_{IoU}\land c_d = c^*\}|}{|\{d|c_d = c\}|}\) and Recall\(^{\delta} = \frac{|\{d|IoU_d \geq t_{IoU} \land c_d = c^*\}|}{|\{d|c^* = c\}|}\).

Thus, Precision\(^{\delta}\) and Recall\(^{\delta}\) can be decomposed as follows:

Precision\(^{\delta} = \frac{|\{d|IoU_d \geq t_{IoU}\land c_d = c\}|}{|\{d|c_d = c\}|} \cdot \frac{|\{d|c_d = c^* \land IoU_d \geq t_{IoU} \land c_d = c\}|}{|\{d|IoU_d \geq t_{IoU}\land c_d = c\}|}\)

Recall\(^{\delta} = \frac{|\{d|IoU_d \geq t_{IoU}\land c^* = c\}|}{|\{d|c^* = c\}|} \cdot \frac{|\{d|c_d = c^* \land IoU_d \geq t_{IoU}\land c^* = c\}|}{|\{d|IoU_d \geq t_{IoU}\land c^* = c\}|}\)

Note that \(IoU_d \geq t_{IoU} \land c_d = c^*\) is the condition for a detection \(d\) to be a true positive. For Precision\(^{\delta}\), \(\frac{|\{d|IoU_d \geq t_{IoU}\land c_d = c\}|}{|\{d|c_d = c\}|}\) is the fraction of output boxes of this class that match a ground truth box, which is the precision for \(\mathbf{v}_L\); \(\frac{|\{d|c_d = c^* \land IoU_d \geq t_{IoU} \land c_d = c\}|}{|\{d|IoU_d \geq t_{IoU}\land c_d = c\}|}\) is the fraction of matched boxes of this class that carry the correct label, which is the precision for \(\mathbf{v}_{C|L}\). Similarly for Recall\(^{\delta}\), \(\frac{|\{d|IoU_d \geq t_{IoU}\land c^* = c\}|}{|\{d|c^* = c\}|}\) is the recall for \(\mathbf{v}_L\) and \(\frac{|\{d|c_d = c^* \land IoU_d \geq t_{IoU}\land c^* = c\}|}{|\{d|IoU_d \geq t_{IoU}\land c^* = c\}|}\) is the recall for \(\mathbf{v}_{C|L}\). Since both Precision\(^{\delta}\) and Recall\(^{\delta}\) are decomposable, each point on the PR-curve is decomposable, and the PR-curve for each class \(c\) can be decomposed into the two subtasks, i.e., PR\(_D =\) PR\(_L \cdot\) PR\(_{C|L}\). As a result, AP for object detection is decomposable following Metric Decomposition.
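As a sanity check, the factorization of Precision\(^{\delta}\) and Recall\(^{\delta}\) can be verified numerically on a toy set of detections (invented for illustration; any detections above the confidence threshold would do):

```python
T_IOU = 0.5
C = "bus"
# Each detection d: (predicted class c_d, matched ground-truth class c*, max IoU).
dets = [
    ("bus", "bus", 0.8),   # localized and correctly labeled
    ("bus", "bus", 0.3),   # predicted bus, poorly localized
    ("bus", "car", 0.7),   # localized but mislabeled (matched a car)
    ("car", "bus", 0.6),   # ground-truth bus predicted as car
]

pred_c   = [d for d in dets if d[0] == C]         # c_d = c
loc_pred = [d for d in pred_c if d[2] >= T_IOU]   # ... and IoU_d >= t_IoU
tp_pred  = [d for d in loc_pred if d[0] == d[1]]  # ... and c_d = c*
precision     = len(tp_pred) / len(pred_c)
precision_L   = len(loc_pred) / len(pred_c)       # precision for v_L
precision_C_L = len(tp_pred) / len(loc_pred)      # precision for v_{C|L}
assert abs(precision - precision_L * precision_C_L) < 1e-12

gt_c   = [d for d in dets if d[1] == C]           # c* = c
loc_gt = [d for d in gt_c if d[2] >= T_IOU]       # ... and IoU_d >= t_IoU
tp_gt  = [d for d in loc_gt if d[0] == d[1]]      # ... and c_d = c*
recall     = len(tp_gt) / len(gt_c)
recall_L   = len(loc_gt) / len(gt_c)              # recall for v_L
recall_C_L = len(tp_gt) / len(loc_gt)             # recall for v_{C|L}
assert abs(recall - recall_L * recall_C_L) < 1e-12
```

The two factors multiply back to the overall precision and recall because the numerator of the first fraction cancels the denominator of the second.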

Similarly, AP for instance segmentation is decomposable because we can decompose Precision\(^{\delta}\) and Recall\(^{\delta}\). For a segmented object \(s\), let \(IoU^{\textit{seg}}_s\) be the maximum \(IoU\) of the area enclosed by its outline \(o_s\) compared to all ground truth segmentations, and \(IoU^{box}_s\) be the maximum \(IoU\) of its bounding box \(b_s\) compared to all tight bounding boxes around the areas enclosed by the ground truth segmentations. Precision\(^{\delta}\) and Recall\(^{\delta}\) can be decomposed as follows:

Precision\(^{\delta} = \frac{|\{s|IoU^{box}_s \geq t_{IoU}\land c_s = c\}|}{|\{s|c_s = c\}|} \cdot \frac{|\{s|c_s = c^* \land IoU^{box}_s \geq t_{IoU} \land c_s = c\}|}{|\{s|IoU^{box}_s \geq t_{IoU}\land c_s = c\}|} \cdot \frac{|\{s|IoU^{\textit{seg}}_s \geq t_{IoU} \land c_s = c^* \land IoU^{box}_s \geq t_{IoU} \land c_s = c\}|}{|\{s|c_s = c^* \land IoU^{box}_s \geq t_{IoU} \land c_s = c\}|}\)

Recall\(^{\delta} = \frac{|\{s|IoU^{box}_s \geq t_{IoU}\land c^* = c\}|}{|\{s|c^* = c\}|} \cdot \frac{|\{s|c_s = c^* \land IoU^{box}_s \geq t_{IoU} \land c^* = c\}|}{|\{s|IoU^{box}_s \geq t_{IoU}\land c^* = c\}|} \cdot \frac{|\{s|IoU^{\textit{seg}}_s \geq t_{IoU} \land c_s = c^* \land IoU^{box}_s \geq t_{IoU} \land c^* = c\}|}{|\{s|c_s = c^* \land IoU^{box}_s \geq t_{IoU} \land c^* = c\}|}\)

Following the decomposition of Precision\(^{\delta}\) and Recall\(^{\delta}\), we can decompose the PR-curves for instance segmentation, i.e., PR\(_I =\) PR\(_L \cdot\) PR\(_{C|L} \cdot\) PR\(_{S|C,L}\). The AP used for instance segmentation can therefore also be decomposed following Metric Decomposition.
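The three-factor decomposition can be checked the same way on a toy set of segmented objects (invented for illustration), using the chain localization → classification → segmentation:

```python
T_IOU = 0.5
C = "bus"
# Each object s: (predicted class c_s, matched ground-truth class c*,
#                 IoU^box_s, IoU^seg_s).
segs = [
    ("bus", "bus", 0.8, 0.7),  # box matched, label right, mask matched
    ("bus", "bus", 0.7, 0.3),  # box matched, label right, mask poor
    ("bus", "car", 0.6, 0.6),  # box matched but mislabeled
    ("bus", "bus", 0.2, 0.1),  # poorly localized
]

pred = [s for s in segs if s[0] == C]      # c_s = c
loc  = [s for s in pred if s[2] >= T_IOU]  # ... and IoU^box_s >= t_IoU
cls  = [s for s in loc if s[0] == s[1]]    # ... and c_s = c*
seg  = [s for s in cls if s[3] >= T_IOU]   # ... and IoU^seg_s >= t_IoU

precision = len(seg) / len(pred)
p_L   = len(loc) / len(pred)  # precision for v_L
p_CL  = len(cls) / len(loc)   # precision for v_{C|L}
p_SCL = len(seg) / len(cls)   # precision for v_{S|C,L}
assert abs(precision - p_L * p_CL * p_SCL) < 1e-12
```

The recall factors telescope identically, with the denominators restricted to ground truth objects of class \(c\) instead of predictions.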

Compound Decomposable Metrics

Some metrics, such as mean Average Precision (mAP), are more complex and are not decomposable according to our decomposition definition. mAP is defined as the average of the AP over all class labels \(c\); therefore, mAP can be represented as a function of the per-class PR-curves, each of which is directly decomposable. For such metrics \(M_\mathbf{V}\), we extend the decomposable metric definition to compound decomposable as follows:

\(M_\mathbf{V} = F(M_\mathbf{V}^{1}, ..., M_\mathbf{V}^{K})\text{, where }M_\mathbf{V}^k = \prod_{i=1}^{N_k}{m^k_{\mathbf{v}_i}}\),

where \(M_\mathbf{V}^k\) is a decomposable metric of the task \(\mathbf{V}\), \(m^k_{\mathbf{v}_i}\) is a metric of the i-th subtask, and \(F\) is a function that is monotonic with respect to every argument.

Now we show that mAP is compound decomposable. Given the set of all class labels \(\mathbf{C}\), mAP is the average of the AP values over \(c\in \mathbf{C}\). Since the AP for each \(c\) is decomposable as we showed above, and the mean is monotonic with respect to every argument, mAP is compound decomposable: mAP\(_{\mathbf{V}} = mean(\)AP\(^1, ...,\) AP\(^{|\mathbf{C}|})\).
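As a small illustration (with placeholder per-class AP values, not results from the paper):

```python
# mAP in the compound form above: F is the mean over per-class AP values.
per_class_ap = {"bus": 0.62, "car": 0.71, "person": 0.54}
mAP = sum(per_class_ap.values()) / len(per_class_ap)
print(mAP)
# The mean is monotonic in every argument: raising any single class's AP
# cannot lower mAP, so the compound decomposition definition applies.
```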