Deep Dive: All the Ranking Metrics for Recommender Systems Explained!
Machine Learning System Design
Recommender systems are usually a set of Machine Learning models that rank items and recommend them to users. We tend to care primarily about the top-ranked items, the rest being less critical, so when we assess the quality of a recommendation list, typical ML metrics may be less relevant. We present here a set of metrics commonly used to measure recommender system performance. We look at the following metrics:
The hit rate
Precision and recall
The average precision at N (AP@N) and the average recall at N (AR@N)
The mean average precision (mAP@N) and the mean average recall at N (mAR@N)
The cumulative gains (CG)
The discounted cumulative gains (DCG)
The normalized discounted cumulative gains (NDCG)
The Kendall rank correlation coefficient
Average reciprocal hit rate
The Hit Rate
In many cases, the window size matters when we recommend many items. For example, on Netflix, the most important recommended movies are the ones we can see at first glance. One could argue that the order does not matter as long as the relevant movies appear within that visible window.
So the important question becomes:
Out of the recommended lists, how many users watched a movie in that visible window?
We measure this with the hit rate, or hit ratio. Formally, if we present personalized recommendation lists to N users and only M of them watch a movie within a window of size L, the hit rate is
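$$\text{hit rate} = \frac{M}{N}$$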
L is defined by the product's business requirements. In the case of Netflix movie recommendations, there can be multiple relevant windows: the visible window could be the first relevant one, and the size of the carousel menu could define the second.
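As a minimal sketch, assuming we store each user's ranked list and watch history in dictionaries (the names recommendations and watched are illustrative), the hit rate at a given window size can be computed as follows:

```python
def hit_rate(recommendations, watched, window_size):
    """Fraction of users who watched at least one movie within the top
    `window_size` items of their personalized recommendation list.

    recommendations: dict {user_id: ordered list of recommended item ids}
    watched: dict {user_id: set of item ids the user actually watched}
    """
    hits = sum(
        any(item in watched.get(user, set()) for item in ranked[:window_size])
        for user, ranked in recommendations.items()
    )
    return hits / len(recommendations)
```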
Mean Average Precision and Mean Average Recall at N (mAP@N and mAR@N)
Instead of considering multiple windows, we could consider all possible windows. That can be helpful when the size of the visible window changes depending on the device. For example, when you open YouTube, the number of videos you see at the top depends on the resolution of the monitor and on the device (laptop, Android, iPhone).
As much as possible, we want the most relevant videos to be at the top, so the metrics should penalize relevant items that appear too far down the list. Let's look at the Mean Average Precision and Mean Average Recall at N (mAP@N and mAR@N) metrics to assess recommenders.
Precision is a natural metric to measure the relevance of a recommendation. The question that precision answers is:
Out of the recommended videos, how many are relevant?
Formally, if we recommend N videos and the user clicks on r of them, we define precision as
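$$\text{precision} = \frac{r}{N}$$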
Recall is another natural metric to consider. The question that recall answers is
Out of the possible relevant videos, how many are we recommending?
Formally, if there exist R relevant videos and the user clicks on r of them within the recommended list, we define recall as
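$$\text{recall} = \frac{r}{R}$$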
Typically, for binary classification problems, we expect to balance precision and recall. The F1 score, for example, is a metric that captures that balance:
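$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$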
It is the harmonic mean of the precision and recall metrics, but it is not as meaningful a metric for recommender systems. For good classifiers, if we reduce the number N of recommended items, precision increases and recall decreases. This is because the classifier is more confident about the items at the top of the list, so the density of relevant items increases. Similarly, when we increase N, precision decreases and recall increases. This is because we include more and more relevant items, but also irrelevant items at a higher rate.
In recommender systems, it matters less whether we include all the relevant items, as long as at least one relevant item belongs to the list. That is why we tend to optimize more for precision than for recall in those use cases.
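As a minimal sketch for a single user's ranked list (the argument names are illustrative), precision@N and recall@N can be computed as follows:

```python
def precision_at_n(ranked_items, relevant_items, n):
    """Fraction of the top-n recommended items that are relevant."""
    top_n = ranked_items[:n]
    return sum(item in relevant_items for item in top_n) / n


def recall_at_n(ranked_items, relevant_items, n):
    """Fraction of all relevant items that appear in the top-n recommendations."""
    top_n = ranked_items[:n]
    return sum(item in relevant_items for item in top_n) / len(relevant_items)


# Example: 2 of the top-3 items are relevant, out of 4 relevant items overall.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "d", "f"}
print(precision_at_n(ranked, relevant, 3))  # 2/3
print(recall_at_n(ranked, relevant, 3))     # 2/4
```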
Average Precision at N (AP@N)
Precision and recall do not take into account how far down the list the relevant items may be. The average precision metric is a way to account for this. It asks the question
What is the average precision for any window size?
We still fix a maximum window size N, but we average the precision over all the window sizes k within it that end on a relevant item:
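$$AP@N = \frac{1}{\min(R, N)} \sum_{k=1}^{N} P(k) \cdot rel(k)$$

Here, P(k) is the precision computed over the first k recommended items and rel(k) equals 1 if the item at rank k is relevant, 0 otherwise (normalization conventions vary; dividing by min(R, N) keeps the metric between 0 and 1). As a minimal sketch, assuming the same per-user dictionaries as above, AP@N and its average over users, mAP@N, could be computed like this:

```python
def average_precision_at_n(ranked_items, relevant_items, n):
    """AP@N: average of the precisions P(k) over the ranks k <= n
    that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked_items[:n], start=1):
        if item in relevant_items:
            hits += 1
            precision_sum += hits / k  # precision over the first k items
    # Normalize by the best achievable number of hits within the window.
    denominator = min(len(relevant_items), n)
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision_at_n(recommendations, relevant, n):
    """mAP@N: AP@N averaged over all users."""
    return sum(
        average_precision_at_n(recommendations[user], relevant.get(user, set()), n)
        for user in recommendations
    ) / len(recommendations)
```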