Objective of a Recommendation System:
A good recommendation should be:
- relevant to the user
- not already well known to that user
- something the user could not easily discover on their own
- able to offer a personalized experience
- helpful in surfacing new, relevant items beyond what the user would think to look for
Recommendation Metrics:
- Prediction / statistical accuracy metrics (MAE, RMSE): The two most popular metrics in this group are MAE (mean absolute error) and RMSE (root mean squared error), both special cases of the Minkowski metric. They measure how numerically close the predicted ratings are to the true ratings. MAE penalizes every error equally, while RMSE penalizes larger errors more heavily (see the sketch after this list).
- Decision support accuracy metrics (precision, recall, F1 score): These help users select items that are most relevant among the available set. They treat recommendation as a binary decision that separates good/relevant items from those that are not. Precision is the proportion of recommended items that are good recommendations, and recall is the proportion of good items that appear among the top recommendations.
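A minimal sketch of MAE and RMSE with NumPy; the rating arrays below are invented purely for illustration:
import numpy as np

# hypothetical true ratings and model predictions for five user-item pairs
y_true = np.array([4.0, 3.5, 5.0, 2.0, 4.5])
y_pred = np.array([3.8, 3.0, 4.5, 2.5, 4.0])

mae = np.mean(np.abs(y_true - y_pred))           # every error weighted equally
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # larger errors weighted more
print("MAE :", mae)
print("RMSE:", rmse)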
Precision-Recall-F1 Score in Detail:
In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned.
A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the true labels.
A system with high precision but low recall is the opposite: it returns very few results, but most of its predicted labels are correct.
An ideal system with high precision and high recall returns many results, with all results labeled correctly.

In the recommendation domain, the F1 score is a single value that combines precision and recall (F1 = 2 * precision * recall / (precision + recall)) and indicates the overall utility of the recommendation list.
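As a concrete illustration, here is a small hand-rolled computation of precision, recall and F1 for a single recommendation list; the item IDs and the relevant set below are invented for this example:
# hypothetical top-5 recommendations for one user and the items they actually liked
recommended = ["A", "B", "C", "D", "E"]   # ranked list returned by the recommender
relevant = {"A", "C", "F", "G"}           # ground-truth relevant items for this user

hits = [item for item in recommended if item in relevant]
precision = len(hits) / len(recommended)  # 2 / 5 = 0.4
recall = len(hits) / len(relevant)        # 2 / 4 = 0.5
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)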


Sample Code : Precision-Recall-F1 Score
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the first two classes, and split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)

# Train a simple linear classifier and predict on the test set
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# y_score = classifier.decision_function(X_test)  # useful for precision-recall curves

print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
The average argument used in the scores above controls how per-class results are aggregated:
- 'binary': Only report results for the class specified by pos_label. This is applicable only if the targets (y_true, y_pred) are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification, where this differs from accuracy_score).
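A short sketch showing how the different average settings can disagree on an imbalanced label set; the label arrays are invented for illustration:
from sklearn.metrics import precision_score

# hypothetical imbalanced multiclass labels: class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 2, 1]

for avg in ("micro", "macro", "weighted"):
    print(avg, precision_score(y_true, y_pred, average=avg, zero_division=0))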

NDCG (Normalized Discounted Cumulative Gain)
MAP is a metric for binary feedback only, while NDCG can be used whenever you can assign a relevance score to a recommended item (binary, integer or real).
Two assumptions are made in using DCG and its related measures.
- Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks)
- Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents.
DCG originates from an earlier, more primitive, measure called Cumulative Gain.
NDCG
Intimidating as the name might be, the idea behind NDCG is pretty simple. A recommender returns a list of items and we'd like to compute how good that list is. Each item has a relevance score, usually a non-negative number; that is its gain. For items with no user feedback, the gain is usually set to zero.
Now we add up those scores; that is the cumulative gain. We'd prefer to see the most relevant items at the top of the list, so before summing we divide each score by a growing discount (usually the logarithm of the item's position); this discounting gives the DCG (discounted cumulative gain).
DCGs are not directly comparable between users, so we normalize them. The worst possible DCG when using non-negative relevance scores is zero. To get the best, we arrange all the items in the test set in the ideal order, take the first K items and compute their DCG. Dividing the raw DCG by this ideal DCG gives NDCG@K, a number between 0 and 1.
Code
import numpy as np

# precomputed log2 discounts; change this if using K > 100
denominator_table = np.log2(np.arange(2, 102))

def dcg_at_k(r, k, method=1):
    """Score is discounted cumulative gain (DCG).
    Relevance is positive real values. Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    >>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    >>> dcg_at_k(r, 1)
    3.0
    >>> dcg_at_k(r, 1, method=1)
    3.0
    >>> dcg_at_k(r, 2, method=0)
    5.0
    >>> dcg_at_k(r, 2, method=1)
    4.2618595071429155
    >>> dcg_at_k(r, 10, method=0)
    9.6051177391888114
    >>> dcg_at_k(r, 11, method=0)
    9.6051177391888114
    Args:
        r: Relevance scores (list or numpy array) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
            If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Discounted cumulative gain
    """
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            # return np.sum(r / np.log2(np.arange(2, r.size + 2)))
            return np.sum(r / denominator_table[:r.shape[0]])
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.
def get_ndcg(r, k, method=1):
    """Score is normalized discounted cumulative gain (NDCG).
    Relevance originally was positive real values. Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    >>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    >>> get_ndcg(r, 1)
    1.0
    >>> get_ndcg([0], 1)
    0.0
    >>> get_ndcg([1], 2)
    1.0
    Note: this variant normalizes as (dcg - dcg_min) / (dcg_max - dcg_min),
    so the worst possible ordering of the test items scores 0 and the ideal
    ordering scores 1. The conventional NDCG is simply dcg / dcg_max, so
    values computed that way (such as the remaining examples in the handout
    above) will differ from this function's output.
    Args:
        r: Relevance scores (list or numpy array) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
            If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Normalized discounted cumulative gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    dcg_min = dcg_at_k(sorted(r), k, method)
    # assert dcg_max >= dcg_min
    if not dcg_max:
        return 0.
    if dcg_max == dcg_min:
        # every ordering is equally good (e.g. all relevances equal); avoid division by zero
        return 1.
    dcg = dcg_at_k(r, k, method)
    # print(dcg_min, dcg, dcg_max)
    return (dcg - dcg_min) / (dcg_max - dcg_min)

# NDCG with explicitly given best and worst possible relevances,
# for recommendations that include unrated movies
def get_ndcg_2(r, best_r, worst_r, k, method=1):
    dcg_max = dcg_at_k(sorted(best_r, reverse=True), k, method)
    if worst_r is None:
        dcg_min = 0.
    else:
        dcg_min = dcg_at_k(sorted(worst_r), k, method)
    # assert dcg_max >= dcg_min
    if not dcg_max:
        return 0.
    if dcg_max == dcg_min:
        return 1.
    dcg = dcg_at_k(r, k, method)
    # print(dcg_min, dcg, dcg_max)
    return (dcg - dcg_min) / (dcg_max - dcg_min)
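A quick usage sketch, assuming the functions above are in scope; the relevance values are made up for illustration:
# relevance scores of a hypothetical top-5 recommendation list, in ranked order
relevances = [3, 2, 0, 0, 1]
print(dcg_at_k(relevances, 5))  # discounted gain of the list as returned
print(get_ndcg(relevances, 5))  # 1.0 would mean the list is ideally ordered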