This section provides brief descriptions of the most popular performance evaluation measures for rank aggregation algorithms. These measures are computed by the evaluation tool of FLAGR when a valid qrels file is provided as input.

\(\text{Precision}@k\)
Precision measures the ability of an algorithm to retrieve only relevant elements. It is defined as the number of relevant elements among the top-\(k\) elements of a list, divided by the number of retrieved elements (i.e., \(k\)): $$\text{Precision@}k=\frac{\text{true positives}@k}{(\text{true positives}@k) + (\text{false positives}@k)}$$
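
A minimal sketch of this computation over a list of binary relevance labels; the function name `precision_at_k` and its inputs are illustrative assumptions, not part of FLAGR's API:

```python
def precision_at_k(rel, k):
    """Fraction of the top-k elements that are relevant.

    rel: list of 0/1 relevance labels, in ranked order.
    """
    return sum(rel[:k]) / k

# Relevance labels from the worked example below (positions 1, 3, 4, 6 relevant)
rel = [1, 0, 1, 1, 0, 1, 0, 0]
print(precision_at_k(rel, 5))  # 0.6
```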

\(\text{Recall}@k\)
Recall measures the ability of an algorithm to retrieve all the relevant elements. It is defined as the number of relevant elements among the top-\(k\) elements of a list, divided by the total number of relevant elements: $$\text{Recall@}k=\frac{\text{true positives}@k}{(\text{true positives}@k) + (\text{false negatives}@k)}$$
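
A matching sketch under the same assumptions (binary labels, hypothetical function name):

```python
def recall_at_k(rel, k):
    """Fraction of all relevant elements that appear in the top-k positions."""
    total_relevant = sum(rel)
    return sum(rel[:k]) / total_relevant if total_relevant else 0.0

rel = [1, 0, 1, 1, 0, 1, 0, 0]
print(recall_at_k(rel, 5))  # 0.75
```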

\(F1@k\)
\(F1\) is a well-established measure that combines Precision and Recall into a single scoring formula: $$F1@k=\frac{2 \cdot \text{Precision@}k \cdot \text{Recall@}k }{\text{Precision@}k + \text{Recall@}k}$$
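
A self-contained sketch that inlines the two previous definitions; names are again illustrative:

```python
def f1_at_k(rel, k):
    """Harmonic mean of precision@k and recall@k (0.0 when both are zero)."""
    p = sum(rel[:k]) / k          # precision@k, inlined
    r = sum(rel[:k]) / sum(rel)   # recall@k, inlined
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

rel = [1, 0, 1, 1, 0, 1, 0, 0]
print(round(f1_at_k(rel, 5), 2))  # 0.67
```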

Discounted Cumulative Gain (\(DCG@k\))
In contrast to the previous measures, \(DCG\) can handle non-binary relevance judgments; that is, the relevance score of an element may be a real value, and not just a relevant/non-relevant label. It is defined by the following formula: $$DCG@k=\sum_{i=1}^{k}\frac{2^{ {rel}_i } - 1}{\log_2(i + 1)}$$ where \({rel}_i\) is the relevance score of the element at index \(i\). For binary problems, we set \({rel}_i=1\) if the \(i\)-th list element is relevant and \({rel}_i=0\) otherwise.
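
A direct transcription of the formula as a sketch (hypothetical function name; `rel` may hold graded, real-valued scores):

```python
import math

def dcg_at_k(rel, k):
    """DCG@k with the 2^rel - 1 gain of the formula above."""
    # enumerate() is 0-based, so list index i corresponds to position i + 1
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rel[:k]))

rel = [1, 0, 1, 1, 0, 1, 0, 0]
print(round(dcg_at_k(rel, 5), 2))  # 1.93
```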

Normalized Discounted Cumulative Gain (\(nDCG@k\))
One disadvantage of \(DCG\) is that it is non-decreasing: it either stays the same (if the current element is non-relevant) or increases (if the current element is relevant). This means that queries returning larger result sets will usually have higher \(DCG\) scores than queries returning small result sets. The Normalized Discounted Cumulative Gain (\(nDCG\)) addresses this problem by dividing \(DCG\) by the maximum possible \(DCG\) at each threshold \(k\): $$nDCG@k=\frac{DCG@k}{IDCG@k}$$ where \(IDCG@k\) is the Ideal \(DCG@k\). To compute it, we first create an ideal ranking, where the elements are sorted in decreasing relevance order. Then, \(IDCG@k\) is simply equal to \(DCG@k\) in that ideal ranking, namely: $$IDCG@k=\sum_{i=1}^{\text{relevant elements @}k}\frac{2^{ {rel}_i } - 1}{\log_2(i + 1)}$$
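
A sketch of the normalization step, assuming the same binary labels and an illustrative function name:

```python
import math

def ndcg_at_k(rel, k):
    """nDCG@k: DCG@k divided by the DCG@k of the ideally sorted list."""
    def dcg(scores):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(scores[:k]))
    idcg = dcg(sorted(rel, reverse=True))  # ideal: decreasing relevance order
    return dcg(rel) / idcg if idcg > 0 else 0.0

rel = [1, 0, 1, 1, 0, 1, 0, 0]
print(round(ndcg_at_k(rel, 5), 2))  # 0.75
```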

Average Precision (\(AP\))
\(AP\) is another evaluation measure that quantifies the ability of an algorithm to rank the relevant elements in the highest list positions. It is defined by the following equation, where the sum runs over the positions \(k\) of the ranked list: $$AP=\sum_{k}\left(\text{Recall@}k-\text{Recall@}(k-1)\right)\cdot\text{Precision@}k$$
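
Since Recall changes only at the relevant positions, the recall-weighted sum reduces to averaging \(\text{Precision@}k\) over those positions; a sketch under the same assumptions as above:

```python
def average_precision(rel):
    """AP over a full ranked list of 0/1 relevance labels."""
    total_relevant = sum(rel)
    hits, ap = 0, 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            ap += (hits / k) / total_relevant  # recall gain is 1/total_relevant
    return ap

rel = [1, 0, 1, 1, 0, 1, 0, 0]
print(round(average_precision(rel), 2))  # 0.77
```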

Mean Average Precision (MAP)
Average Precision (\(AP\)) quantifies the quality of a single ranked list compared with the ground truth; in other words, \(AP\) examines a single ranking that is generated in response to a single query. In contrast, Mean Average Precision evaluates a ranking model over a set of queries \(Q\). MAP is simply defined as the mean of the Average Precision values over all queries: the \(AP_q\) values are summed over all queries \(q\) in the dataset, and the sum is divided by the number of queries. $$MAP=\frac{1}{|Q|}\sum_{q=1}^{|Q|}{AP}_q$$
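
A self-contained sketch; the nested `ap` helper repeats the \(AP\) computation above, and the two query rankings are hypothetical:

```python
def mean_average_precision(rankings):
    """MAP over a set of queries: the mean of the per-query AP values."""
    def ap(rel):
        total, hits, s = sum(rel), 0, 0.0
        for k, r in enumerate(rel, start=1):
            if r:
                hits += 1
                s += hits / k
        return s / total if total else 0.0
    return sum(ap(rel) for rel in rankings) / len(rankings)

# Two hypothetical queries with binary relevance labels
print(mean_average_precision([[1, 0, 1, 1, 0, 1, 0, 0], [0, 1, 1, 0]]))  # ~0.68
```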

Example
Consider a ranked list of 8 elements that has been submitted in response to a query. From these elements, the 1st, 3rd, 4th, and 6th are relevant to the query; the rest are considered non-relevant. The first two columns of the following table show the list and the relevance of its elements. The remaining columns contain the running values of Precision, Recall, \(F1\), \(DCG\), \(IDCG\), \(nDCG\) and Average Precision (\(AP\)) at each list element.


| Rank | Relevant | \(\text{Precision@}k\) | \(\text{Recall@}k\) | \(F1@k\) | \(DCG@k\) | \(IDCG@k\) | \(nDCG@k\) | \(AP@k\) |
|------|----------|------|------|------|------|------|------|------|
| 1 | Yes | 1.00 | 0.25 | 0.40 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | No  | 0.50 | 0.25 | 0.33 | 1.00 | 1.63 | 0.61 | 1.00 |
| 3 | Yes | 0.67 | 0.50 | 0.57 | 1.50 | 2.13 | 0.70 | 0.83 |
| 4 | Yes | 0.75 | 0.75 | 0.75 | 1.93 | 2.56 | 0.75 | 0.81 |
| 5 | No  | 0.60 | 0.75 | 0.67 | 1.93 | 2.56 | 0.75 | 0.81 |
| 6 | Yes | 0.67 | 1.00 | 0.80 | 2.29 | 2.56 | 0.89 | 0.77 |
| 7 | No  | 0.57 | 1.00 | 0.73 | 2.29 | 2.56 | 0.89 | 0.77 |
| 8 | No  | 0.50 | 1.00 | 0.67 | 2.29 | 2.56 | 0.89 | 0.77 |

Example calculations at the 5th element of the list (@5)
According to the aforementioned definitions, the following calculations are performed: $$\text{Precision@}5=\frac{\text{relevant elements found up to position 5}}{\text{retrieved elements up to position 5}}=\frac{3}{5}=0.60$$ $$\text{Recall@}5=\frac{\text{relevant elements found up to position 5}}{\text{all relevant elements}}=\frac{3}{4}=0.75$$ $$F1@5=\frac{2 \cdot \text{Precision@}5 \cdot \text{Recall@}5 }{\text{Precision@}5 + \text{Recall@}5}=\frac{2 \cdot 0.60 \cdot 0.75}{0.60+0.75}=0.67$$ $$DCG@5=\frac{2^1 - 1}{\log_2(1 + 1)}+\frac{2^0 - 1}{\log_2(2 + 1)}+\frac{2^1 - 1}{\log_2(3 + 1)}+\frac{2^1 - 1}{\log_2(4 + 1)}+\frac{2^0 - 1}{\log_2(5 + 1)}=1.93$$ $$IDCG@5=\frac{2^1 - 1}{\log_2(1 + 1)}+\frac{2^1 - 1}{\log_2(2 + 1)}+\frac{2^1 - 1}{\log_2(3 + 1)}+\frac{2^1 - 1}{\log_2(4 + 1)}+\frac{2^0 - 1}{\log_2(5 + 1)}=2.56$$ $$nDCG@5=\frac{DCG@5}{IDCG@5}=\frac{1.93}{2.56}=0.75$$ $$AP@5=\frac{1+0.67+0.75}{3}=0.81$$
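
As a cross-check, the whole table above can be reproduced with a short script. This is an illustrative sketch over the example's binary labels, not FLAGR's evaluation tool:

```python
import math

rel = [1, 0, 1, 1, 0, 1, 0, 0]   # relevance labels of the 8-element example
R = sum(rel)                      # total number of relevant elements (4)
ideal = sorted(rel, reverse=True)

def dcg(scores, k):
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(scores[:k]))

hits, ap_sum = 0, 0.0
print("k  rel  P@k   R@k   F1@k  DCG@k IDCG@k nDCG@k AP@k")
for k in range(1, len(rel) + 1):
    if rel[k - 1]:
        hits += 1
        ap_sum += hits / k        # precision at this relevant position
    p, r = hits / k, hits / R
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    d, i_d = dcg(rel, k), dcg(ideal, k)
    flag = "Yes" if rel[k - 1] else "No"
    # ap_sum / hits is safe here: the example's first element is relevant
    print(f"{k}  {flag:<3}  {p:.2f}  {r:.2f}  {f1:.2f}  {d:.2f}  {i_d:.2f}"
          f"   {d / i_d:.2f}   {ap_sum / hits:.2f}")
```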