Sklearn Metrics

Avik Das
3 min read · Dec 30, 2021


The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. Some metrics require probability estimates of the positive class, confidence values, or binary decision values. Most implementations allow each sample to provide a weighted contribution to the overall score, through the sample_weight parameter.
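As a minimal sketch (the labels and weights here are made up for illustration), sample_weight scales each sample's contribution to the score:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Unweighted: 3 of 4 samples are correct.
print(accuracy_score(y_true, y_pred))  # 0.75

# Give the third (misclassified) sample twice the weight of the others:
# (1 + 1 + 0 + 1) / (1 + 1 + 2 + 1) = 0.6
print(accuracy_score(y_true, y_pred, sample_weight=[1, 1, 2, 1]))  # 0.6
```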

Model selection and evaluation tools, such as model_selection.GridSearchCV and model_selection.cross_val_score, take a scoring parameter that controls which metric they apply to the estimators being evaluated.
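For example, a quick sketch passing a scorer name to cross_val_score (the dataset and estimator are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 'f1_macro' selects metrics.f1_score with macro averaging.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```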

Common cases:

For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics which measure the distance between the model and the data, such as metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric (see the sketch after the table).

Classification

| Scoring | Function | Comment |
| --- | --- | --- |
| ‘accuracy’ | metrics.accuracy_score | |
| ‘balanced_accuracy’ | metrics.balanced_accuracy_score | |
| ‘top_k_accuracy’ | metrics.top_k_accuracy_score | |
| ‘average_precision’ | metrics.average_precision_score | |
| ‘neg_brier_score’ | metrics.brier_score_loss | |
| ‘f1’ | metrics.f1_score | for binary targets |
| ‘neg_log_loss’ | metrics.log_loss | requires predict_proba support |
| ‘precision’ etc. | metrics.precision_score | suffixes apply as with ‘f1’ |
| ‘recall’ etc. | metrics.recall_score | suffixes apply as with ‘f1’ |
| ‘jaccard’ etc. | metrics.jaccard_score | suffixes apply as with ‘f1’ |
| ‘roc_auc’ | metrics.roc_auc_score | |

Clustering

| Scoring | Function |
| --- | --- |
| ‘adjusted_mutual_info_score’ | metrics.adjusted_mutual_info_score |
| ‘adjusted_rand_score’ | metrics.adjusted_rand_score |
| ‘completeness_score’ | metrics.completeness_score |
| ‘fowlkes_mallows_score’ | metrics.fowlkes_mallows_score |
| ‘homogeneity_score’ | metrics.homogeneity_score |
| ‘mutual_info_score’ | metrics.mutual_info_score |
| ‘normalized_mutual_info_score’ | metrics.normalized_mutual_info_score |
| ‘rand_score’ | metrics.rand_score |
| ‘v_measure_score’ | metrics.v_measure_score |

Regression

| Scoring | Function |
| --- | --- |
| ‘explained_variance’ | metrics.explained_variance_score |
| ‘max_error’ | metrics.max_error |
| ‘neg_mean_absolute_error’ | metrics.mean_absolute_error |
| ‘neg_mean_squared_error’ | metrics.mean_squared_error |
| ‘neg_mean_squared_log_error’ | metrics.mean_squared_log_error |
| ‘neg_median_absolute_error’ | metrics.median_absolute_error |
| ‘r2’ | metrics.r2_score |
| ‘neg_mean_poisson_deviance’ | metrics.mean_poisson_deviance |
| ‘neg_mean_gamma_deviance’ | metrics.mean_gamma_deviance |
| ‘neg_mean_absolute_percentage_error’ | metrics.mean_absolute_percentage_error |
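A small sketch of the neg_* convention (on synthetic regression data, with an arbitrary estimator): the scorer returns the negated error so that "greater is better" holds during model selection.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, random_state=0)

scores = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_mean_squared_error")
print(scores)  # all values are <= 0; negate them to recover the MSE
```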

Some of these are restricted to the binary classification case:

- precision_recall_curve(y_true, probas_pred, *): Compute precision-recall pairs for different probability thresholds.
- roc_curve(y_true, y_score, *[, pos_label, ...]): Compute Receiver operating characteristic (ROC).
- det_curve(y_true, y_score[, pos_label, ...]): Compute error rates for different probability thresholds.
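These curve functions share a calling convention, sketched here on toy scores (the labels and scores are made up):

```python
from sklearn.metrics import precision_recall_curve, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each function returns arrays describing the curve, one point per threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

print(precision, recall)
print(fpr, tpr)
```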

Others also work in the multiclass case:

- accuracy_score(y_true, y_pred, *[, ...]): Accuracy classification score.
- classification_report(y_true, y_pred, *[, ...]): Build a text report showing the main classification metrics.
- f1_score(y_true, y_pred, *[, labels, ...]): Compute the F1 score, also known as balanced F-score or F-measure.
- fbeta_score(y_true, y_pred, *, beta[, ...]): Compute the F-beta score.
- hamming_loss(y_true, y_pred, *[, sample_weight]): Compute the average Hamming loss.
- jaccard_score(y_true, y_pred, *[, labels, ...]): Jaccard similarity coefficient score.
- log_loss(y_true, y_pred, *[, eps, ...]): Log loss, aka logistic loss or cross-entropy loss.
- multilabel_confusion_matrix(y_true, y_pred, *): Compute a confusion matrix for each class or sample.
- precision_recall_fscore_support(y_true, ...): Compute precision, recall, F-measure and support for each class.
- precision_score(y_true, y_pred, *[, labels, ...]): Compute the precision.
- recall_score(y_true, y_pred, *[, labels, ...]): Compute the recall.
- roc_auc_score(y_true, y_score, *[, average, ...]): Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
- zero_one_loss(y_true, y_pred, *[, ...]): Zero-one classification loss.
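For instance, a small sketch on made-up multiclass labels, combining two of the functions above:

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(accuracy_score(y_true, y_pred))
# Per-class precision, recall, F1, and support in one text report.
print(classification_report(y_true, y_pred))
```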

Some work with binary and multilabel (but not multiclass) problems:

- average_precision_score(y_true, y_score, *): Compute average precision (AP) from prediction scores.
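A brief sketch with a made-up multilabel indicator matrix (rows are samples, columns are labels):

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.8, 0.3],
                    [0.8, 0.6, 0.4]])

# By default, AP is computed per label and macro-averaged.
print(average_precision_score(y_true, y_score))
```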

From binary to multiclass and multilabel:

Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In these cases, only the positive label is evaluated by default, with the positive class assumed to be labelled 1 (though this may be configurable through the pos_label parameter).
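For example, a sketch with made-up string labels showing how pos_label selects which class counts as "positive":

```python
from sklearn.metrics import f1_score

y_true = ["spam", "ham", "spam", "spam"]
y_pred = ["spam", "ham", "spam", "ham"]

# The same predictions score differently depending on the positive class.
print(f1_score(y_true, y_pred, pos_label="spam"))  # 0.8
print(f1_score(y_true, y_pred, pos_label="ham"))   # ~0.667
```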

In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class. There are then a number of ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among these using the average parameter.

"macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.

"weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.

"micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample-weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
