EvalMetrics.jl
Utility package for scoring binary classification models. Performance measures for general classification tasks can be found in MLJ.jl.
Installation
Execute the following command in the Julia Pkg REPL (EvalMetrics.jl requires Julia 1.0 or higher):
(v1.5) pkg> add EvalMetrics
Usage
Quickstart
The fastest way of getting started is to use a simple binary_eval_report
function in the following way:
julia> using EvalMetrics, Random
julia> Random.seed!(123);
julia> targets = rand(0:1, 100);
julia> scores = rand(100);
julia> binary_eval_report(targets, scores)
Dict{String,Real} with 8 entries:
"precision@fpr0.05" => 0.0
"recall@fpr0.05" => 0.0
"accuracy@fpr0.05" => 0.45
"au_prcurve" => 0.460134
"samples" => 100
"true negative rate@fpr0.05" => 0.957447
"au_roccurve" => 0.42232
"prevalence" => 0.53
julia> binary_eval_report(targets, scores, 0.001)
Dict{String,Real} with 8 entries:
"recall@fpr0.001" => 0.0
"au_prcurve" => 0.460134
"samples" => 100
"precision@fpr0.001" => 1.0
"au_roccurve" => 0.42232
"accuracy@fpr0.001" => 0.47
"prevalence" => 0.53
"true negative rate@fpr0.001" => 1.0
Confusion Matrix
The core of the package is the ConfusionMatrix structure, which represents the confusion matrix in the following form:
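|                     | Actual positives       | Actual negatives       |
|---------------------|------------------------|------------------------|
| Predicted positives | tp (# true positives)  | fp (# false positives) |
| Predicted negatives | fn (# false negatives) | tn (# true negatives)  |
|                     | p (# positives)        | n (# negatives)        |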
The confusion matrix can be calculated from targets and predicted values or from targets, scores, and one or more decision thresholds
julia> thres = 0.6;
julia> predicts = scores .>= thres;
julia> cm1 = ConfusionMatrix(targets, predicts)
ConfusionMatrix{Int64}(53, 47, 18, 24, 23, 35)
julia> cm2 = ConfusionMatrix(targets, scores, thres)
ConfusionMatrix{Int64}(53, 47, 18, 24, 23, 35)
julia> cm3 = ConfusionMatrix(targets, scores, thres)
ConfusionMatrix{Int64}(53, 47, 18, 24, 23, 35)
julia> cm4 = ConfusionMatrix(targets, scores, [thres, thres])
2-element Array{ConfusionMatrix{Int64},1}:
ConfusionMatrix{Int64}(53, 47, 18, 24, 23, 35)
ConfusionMatrix{Int64}(53, 47, 18, 24, 23, 35)
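To make the four counts concrete, the following is a minimal sketch that tallies them by hand in plain Julia. The helper `confusion_counts` is hypothetical and is not part of EvalMetrics.jl. The field order it returns (p, n, tp, tn, fp, fn) is only inferred from the printed output above, where 18/53 matches the recall reported later in this document.

```julia
# Hypothetical helper (not an EvalMetrics.jl function): count the four
# confusion-matrix cells for Bool targets and Bool predictions.
function confusion_counts(targets::AbstractVector{Bool}, predicts::AbstractVector{Bool})
    tp = count(targets .& predicts)        # positive and predicted positive
    tn = count(.!targets .& .!predicts)    # negative and predicted negative
    fp = count(.!targets .& predicts)      # negative but predicted positive
    fn = count(targets .& .!predicts)      # positive but predicted negative
    # Assumed field order, inferred from the printed ConfusionMatrix.
    return (p = tp + fn, n = tn + fp, tp = tp, tn = tn, fp = fp, fn = fn)
end

targets = [true, true, false, false, true]
scores = [0.9, 0.4, 0.7, 0.2, 0.6]
predicts = scores .>= 0.6    # same thresholding rule as in the example above

cm = confusion_counts(targets, predicts)
println(cm)
```

Thresholding the scores first and then counting corresponds to the two ways of building the matrix shown above: from predictions directly, or from scores and a threshold.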
The package provides many basic classification metrics based on the confusion matrix. The following table lists all available metrics and their aliases:
| Classification metric            | Aliases                         |
|----------------------------------|---------------------------------|
| true_positive                    |                                 |
| true_negative                    |                                 |
| false_positive                   |                                 |
| false_negative                   |                                 |
| true_positive_rate               | sensitivity, recall, hit_rate   |
| true_negative_rate               | specificity, selectivity        |
| false_positive_rate              | fall_out, type_I_error          |
| false_negative_rate              | miss_rate, type_II_error        |
| precision                        | positive_predictive_value       |
| negative_predictive_value        |                                 |
| false_discovery_rate             |                                 |
| false_omission_rate              |                                 |
| threat_score                     | critical_success_index          |
| accuracy                         |                                 |
| balanced_accuracy                |                                 |
| error_rate                       |                                 |
| balanced_error_rate              |                                 |
| f1_score                         |                                 |
| fβ_score                         |                                 |
| matthews_correlation_coefficient | mcc                             |
| quant                            |                                 |
| positive_likelihood_ratio        |                                 |
| negative_likelihood_ratio        |                                 |
| diagnostic_odds_ratio            |                                 |
| prevalence                       |                                 |
Each metric can be computed from the ConfusionMatrix
structure
julia> recall(cm1)
0.33962264150943394
julia> recall(cm2)
0.33962264150943394
julia> recall(cm3)
0.33962264150943394
julia> recall(cm4)
2-element Array{Float64,1}:
0.33962264150943394
0.33962264150943394
The other option is to compute the metric directly from targets and predicted values or from targets, scores, and one or more decision thresholds
julia> recall(targets, predicts)
0.33962264150943394
julia> recall(targets, scores, thres)
0.33962264150943394
julia> recall(targets, scores, thres)
0.33962264150943394
julia> recall(targets, scores, [thres, thres])
2-element Array{Float64,1}:
0.33962264150943394
0.33962264150943394
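As a cross-check of the numbers above, a few of the listed metrics can be recomputed by hand from the printed counts. This is a sketch in plain Julia that does not call the package. It assumes the printed ConfusionMatrix fields are in the order (p, n, tp, tn, fp, fn), which is inferred from the output rather than stated in this document, and the variable names are illustrative.

```julia
# Counts read off the printed ConfusionMatrix{Int64}(53, 47, 18, 24, 23, 35),
# under the assumed field order (p, n, tp, tn, fp, fn).
p, n, tp, tn, fp, fn = 53, 47, 18, 24, 23, 35

recall_val = tp / p                    # true positive rate
specificity_val = tn / n               # true negative rate
precision_val = tp / (tp + fp)         # positive predictive value
accuracy_val = (tp + tn) / (p + n)     # fraction of correct decisions

println("recall      = ", recall_val)  # compare with recall(cm1) above
println("specificity = ", specificity_val)
println("precision   = ", precision_val)
println("accuracy    = ", accuracy_val)
```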
User defined classification metrics
It may occur that some useful metric is not defined in the package. To simplify the process of defining a new metric, the package provides the @metric
macro and apply
function.
import EvalMetrics: @metric, apply
@metric MyRecall
apply(::Type{MyRecall}, x::ConfusionMatrix) = x.tp/x.p
In the previous example, the @metric macro defines a new abstract type MyRecall (used for dispatch) and a function myrecall (for easy use of the new metric). With the abstract type MyRecall defined, the next step is to define a new method for the apply function. This method must have exactly two input arguments: Type{MyRecall} and ConfusionMatrix. If another argument is needed, it can be added as a keyword argument.
apply(::Type{Fβ_score}, x::ConfusionMatrix; β::Real = 1) =
    (1 + β^2)*precision(x)*recall(x)/(β^2*precision(x) + recall(x))
It is easy to check that the myrecall
metric returns the same outputs as the recall
metric defined in the package
julia> myrecall(cm1)
0.33962264150943394
julia> myrecall(cm2)
0.33962264150943394
julia> myrecall(cm3)
0.33962264150943394
julia> myrecall(cm4)
2-element Array{Float64,1}:
0.33962264150943394
0.33962264150943394
julia> myrecall(targets, predicts)
0.33962264150943394
julia> myrecall(targets, scores, thres)
0.33962264150943394
julia> myrecall(targets, scores, thres)
0.33962264150943394
julia> myrecall(targets, scores, [thres, thres])
2-element Array{Float64,1}:
0.33962264150943394
0.33962264150943394
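The keyword-argument pattern used for Fβ_score above can be tried without the package. The sketch below is a stand-alone version of the same formula that takes precision and recall as plain numbers. The function `fbeta` is hypothetical, and the counts come from the confusion matrix printed earlier under the assumed field order (p, n, tp, tn, fp, fn).

```julia
# Hypothetical stand-alone Fβ formula (not a package function); β enters
# as a keyword argument, mirroring the apply method shown above.
fbeta(prec::Real, rec::Real; β::Real = 1) =
    (1 + β^2) * prec * rec / (β^2 * prec + rec)

tp, fp, fn = 18, 23, 35      # counts assumed from the printed matrix
prec = tp / (tp + fp)
rec = tp / (tp + fn)

f1 = fbeta(prec, rec)            # β = 1 weighs precision and recall equally
f2 = fbeta(prec, rec; β = 2)     # β = 2 puts more weight on recall

println((f1 = f1, f2 = f2))
```

In terms of raw counts the same quantity equals (1 + β^2)·tp / ((1 + β^2)·tp + β^2·fn + fp), which is a convenient way to check the result.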
Label encodings
Different label encodings are considered common in different machine learning applications. For example, support vector machines use 1 as a positive label and -1 as a negative label. On the other hand, it is common for neural networks to use 0 as a negative label. The package provides some basic label encodings listed in the following table:
| Encoding                                             | positive label(s) | negative label(s) |
|------------------------------------------------------|-------------------|-------------------|
| OneZero(::Type{T})                                   | one(T)            | zero(T)           |
| OneMinusOne(::Type{T})                               | one(T)            | -one(T)           |
| OneTwo(::Type{T})                                    | one(T)            | 2*one(T)          |
| OneVsOne(::Type{T}, pos::T, neg::T)                  | pos               | neg               |
| OneVsRest(::Type{T}, pos::T, neg::AbstractVector{T}) | pos               | neg               |
| RestVsOne(::Type{T}, pos::AbstractVector{T}, neg::T) | pos               | neg               |
The current_encoding
function can be used to verify which encoding is currently in use (by default it is OneZero
encoding)
julia> enc = current_encoding()
OneZero{Float64}:
positive class: 1.0
negative class: 0.0
One way to use a different encoding is to pass the new encoding as the first argument
julia> enc_new = OneVsOne(:positive, :negative)
OneVsOne{Symbol}:
positive class: positive
negative class: negative
julia> targets_recoded = recode.(enc, enc_new, targets);
julia> predicts_recoded = recode.(enc, enc_new, predicts);
julia> recall(enc, targets, predicts)
0.33962264150943394
julia> recall(enc_new, targets_recoded, predicts_recoded)
0.33962264150943394
The second way is to change the current encoding to the one you want
julia> set_encoding(OneVsOne(:positive, :negative))
OneVsOne{Symbol}:
positive class: positive
negative class: negative
julia> recall(targets_recoded, predicts_recoded)
0.33962264150943394
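Conceptually, recoding maps each label from one (positive, negative) pair to another. The sketch below illustrates that idea in plain Julia for the one-vs-one case only. The helper `relabel` and the named tuples are hypothetical and do not reflect how the package's recode or its encoding types are implemented.

```julia
# Hypothetical helper: translate a single label between two encodings,
# each described by a (pos, neg) named tuple.
function relabel(x, from::NamedTuple, to::NamedTuple)
    x == from.pos && return to.pos
    x == from.neg && return to.neg
    throw(ArgumentError("label $x is not part of the source encoding"))
end

onezero = (pos = 1, neg = 0)                   # like OneZero
plusminus = (pos = 1, neg = -1)                # like OneMinusOne
symbols = (pos = :positive, neg = :negative)   # like OneVsOne(:positive, :negative)

targets = [1, 0, 0, 1]
as_pm = [relabel(t, onezero, plusminus) for t in targets]
as_sym = [relabel(t, onezero, symbols) for t in targets]

println(as_pm)
println(as_sym)
```

A metric such as recall depends only on which samples are positive, so its value is unchanged by a consistent recoding of both targets and predictions, as the recall calls above show.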
Decision thresholds for classification
The package provides the function thresholds(scores::RealVector, n::Int), which returns n decision thresholds corresponding to n evenly spaced quantiles of the given scores vector. The default value of n is length(scores) + 1. The thresholds function has two keyword arguments, reduced::Bool and zerorecall::Bool:
- If reduced is true (default), then the function returns min(length(scores) + 1, n) thresholds.
- If zerorecall is true (default), then the largest threshold is maximum(scores)*(1 + eps()), otherwise maximum(scores).
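The idea of placing thresholds at evenly spaced quantiles can be sketched with the Statistics standard library. The helper `quantile_thresholds` is hypothetical. It does not model the reduced or zerorecall keyword arguments described above, and the package may compute its quantiles differently.

```julia
using Statistics

# Hypothetical helper: n thresholds at evenly spaced quantiles of the scores,
# from the minimum (probability 0) to the maximum (probability 1).
quantile_thresholds(scores::AbstractVector{<:Real}, n::Int) =
    quantile(scores, range(0, stop = 1, length = n))

scores = [0.74, 0.48, 0.23, 0.91, 0.33, 0.92, 0.83, 0.61, 0.68, 0.09]
ts = quantile_thresholds(scores, 5)

println(ts)
```

With n = 5 the thresholds sit at the 0%, 25%, 50%, 75% and 100% quantiles, so the first and last values coincide with the smallest and largest score.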
The package also provides some other useful utilities:
- threshold_at_tpr(targets::AbstractVector, scores::RealVector, tpr::Real) returns the largest threshold t that satisfies true_positive_rate(targets, scores, t) >= tpr
- threshold_at_tnr(targets::AbstractVector, scores::RealVector, tnr::Real) returns the smallest threshold t that satisfies true_negative_rate(targets, scores, t) >= tnr
- threshold_at_fpr(targets::AbstractVector, scores::RealVector, fpr::Real) returns the smallest threshold t that satisfies false_positive_rate(targets, scores, t) <= fpr
- threshold_at_fnr(targets::AbstractVector, scores::RealVector, fnr::Real) returns the largest threshold t that satisfies false_negative_rate(targets, scores, t) <= fnr
All four functions can be called with an encoding of type AbstractEncoding
as the first parameter to use a different encoding than default.
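To illustrate what "the smallest threshold t that satisfies false_positive_rate(targets, scores, t) <= fpr" means, here is a brute-force sketch in plain Julia. It assumes Bool targets and the rule "predict positive when score >= t", and it only considers the observed scores as candidate thresholds. The helpers `fpr_at` and `smallest_threshold_at_fpr` are hypothetical, and the package's own handling of ties and edge cases may differ.

```julia
# False positive rate at threshold t: share of negatives scored at or above t.
fpr_at(targets::AbstractVector{Bool}, scores::AbstractVector{<:Real}, t::Real) =
    count((scores .>= t) .& .!targets) / count(.!targets)

# Hypothetical helper: scan candidate thresholds in increasing order and
# return the first one whose false positive rate is within the budget.
function smallest_threshold_at_fpr(targets::AbstractVector{Bool},
                                   scores::AbstractVector{<:Real}, fpr::Real)
    for t in sort(unique(scores))
        fpr_at(targets, scores, t) <= fpr && return t
    end
    # No observed score qualifies: go just above the maximum (nothing predicted positive).
    return nextfloat(maximum(scores))
end

targets = [false, false, true, true]
scores = [0.1, 0.4, 0.35, 0.8]

t_half = smallest_threshold_at_fpr(targets, scores, 0.5)
t_zero = smallest_threshold_at_fpr(targets, scores, 0.0)
println((t_half, t_zero))
```

Because the false positive rate can only decrease as the threshold grows, the first candidate that meets the budget is also the smallest one among the candidates.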
Evaluation curves
Functionality for measuring performance with curves is implemented in the package as well. For example, a precision-recall (PR) curve can be computed as follows:
julia> scores = [0.74, 0.48, 0.23, 0.91, 0.33, 0.92, 0.83, 0.61, 0.68, 0.09];
julia> targets = collect(1:10 .>= 3);
julia> prcurve(targets, scores)
([1.0, 0.875, 0.75, 0.625, 0.625, 0.5, 0.375, 0.375, 0.25, 0.125, 0.0],
[0.8, 0.7777777777777778, 0.75, 0.7142857142857143, 0.8333333333333334, 0.8, 0.75, 1.0, 1.0, 1.0, 1.0])
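The points above can be reproduced by brute force, which may help when reading the output: the first vector holds recall values and the second holds precision values, one pair per threshold. The sketch below uses one threshold per score plus one just above the maximum, and it assumes precision is taken as 1 when nothing is predicted positive. The helper `pr_points` is hypothetical and is not how the package computes the curve.

```julia
# Hypothetical helper: recompute PR points one threshold at a time.
function pr_points(targets::AbstractVector{Bool}, scores::AbstractVector{<:Real})
    # One threshold per score, plus one above the maximum (zero recall).
    thresholds = vcat(sort(scores), nextfloat(maximum(scores)))
    npos = count(targets)
    recalls = Float64[]
    precisions = Float64[]
    for t in thresholds
        predicts = scores .>= t
        tp = count(targets .& predicts)
        npred = count(predicts)
        push!(recalls, tp / npos)
        # Assumed convention: precision is 1 when no sample is predicted positive.
        push!(precisions, npred == 0 ? 1.0 : tp / npred)
    end
    return recalls, precisions
end

scores = [0.74, 0.48, 0.23, 0.91, 0.33, 0.92, 0.83, 0.61, 0.68, 0.09]
targets = collect(1:10 .>= 3)

recalls, precisions = pr_points(targets, scores)
println(recalls)
println(precisions)
```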
All possible calls:
- prcurve(targets::AbstractVector, scores::RealVector) returns all length(targets) + 1 points
- prcurve(enc::AbstractEncoding, target::AbstractVector, scores::RealVector) makes different encodings possible
- prcurve(targets::AbstractVector, scores::RealVector, thres::RealVector) uses the provided thresholds to compute individual points
- prcurve(enc::AbstractEncoding, target::AbstractVector, scores::RealVector, thres::RealVector)
- prcurve(cms::AbstractVector{<:ConfusionMatrix})
We can also compute area under the curve using the auc_trapezoidal
function which uses the trapezoidal rule as follows:
julia> auc_trapezoidal(prcurve(targets, scores)...)
0.8595734126984128
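For reference, the trapezoidal rule itself is short enough to write out. The sketch below applies it to the PR points printed earlier. The helper `trapezoid_area` is hypothetical and only illustrates the rule; it is not the package's auc_trapezoidal.

```julia
# Hypothetical helper: trapezoidal rule over (x, y) points given in curve order.
function trapezoid_area(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    area = 0.0
    for i in 1:length(x)-1
        # Signed width times the average height of the two neighbouring points.
        area += (x[i+1] - x[i]) * (y[i] + y[i+1]) / 2
    end
    return abs(area)   # x may run right to left, as recall does here
end

# Recall and precision values copied from the prcurve output above.
rec = [1.0, 0.875, 0.75, 0.625, 0.625, 0.5, 0.375, 0.375, 0.25, 0.125, 0.0]
prec = [0.8, 0.7777777777777778, 0.75, 0.7142857142857143, 0.8333333333333334,
        0.8, 0.75, 1.0, 1.0, 1.0, 1.0]

area = trapezoid_area(rec, prec)
println(area)   # compare with the auc_trapezoidal result above
```

Segments where recall does not change have zero width and contribute nothing, so only the eight steps of width 0.125 add to the area.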
However, a convenience function au_prcurve is provided with exactly the same signature as the prcurve function. Moreover, any curve(PRCurve, args...) or auc(PRCurve, args...) call is equivalent to prcurve(args...) and au_prcurve(args...), respectively.
Besides the PR curve, the receiver operating characteristic (ROC) curve is also available out of the box, with analogous definitions of roccurve and au_roccurve.
All points of the curve, as well as the area under the curve, are computed using the highest possible resolution by default. This can be changed with the keyword argument npoints:
julia> length.(prcurve(targets, scores))
(11, 11)
julia> length.(prcurve(targets, scores; npoints=9))
(9, 9)
julia> au_prcurve(targets, scores)
0.8595734126984128
julia> au_prcurve(targets, scores; npoints=9)
0.8826388888888889
Plotting
For plotting purposes, EvalMetrics.jl
provides recipes for the Plots
library:
julia> using Plots; pyplot()
julia> using Random, MLBase; Random.seed!(42);
julia> scores = sort(rand(10000));
julia> targets = scores .>= 0.99;
julia> targets[MLBase.sample(findall(0.98 .<= scores .< 0.99), 30; replace = false)] .= true;
julia> targets[MLBase.sample(findall(0.99 .<= scores .< 0.995), 30; replace = false)] .= false;
Then, any of the following can be used:
- prplot(targets::AbstractVector, scores::RealVector) to use the full resolution:
julia> prplot(targets, scores)
- prplot(targets::AbstractVector, scores::RealVector, thresholds::RealVector) to specify thresholds that will be used
- prplot!(enc::AbstractEncoding, targets::AbstractVector, scores::RealVector) to use a different encoding than default
- prplot!(enc::AbstractEncoding, targets::AbstractVector, scores::RealVector, thresholds::RealVector)
Furthermore, one can use vectors of vectors like [targets1, targets2] and [scores1, scores2] to plot multiple curves at once. The calls stay the same:
julia> prplot([targets, targets], [scores, scores .+ rand(10000) ./ 5])
For the ROC curve, use rocplot analogously:
julia> rocplot(targets, scores)
julia> rocplot([targets, targets], [scores, scores .+ rand(10000) ./ 5])
'Modifying' versions with exclamation marks prplot!
and rocplot!
work as well.
The appearance of the plot can be changed in exactly the same way as with the Plots library. Therefore, keyword arguments such as xguide, xlims, grid, and fill can all be used:
julia> prplot(targets, scores; xguide="RECALL", fill=:green, grid=false, xlims=(0.8, 1.0))
julia> rocplot(targets, scores, title="Title", label="experiment", xscale=:log10)
Here, the limits on the x axis are changed appropriately unless overridden with the xlims keyword argument.
julia> rocplot([targets, targets], [scores, scores .+ rand(10000) ./ 5], label=["a" "b";])
By default, plotted curves have 300 points, which are sampled to retain as much information as possible. This amounts to sampling the false positive rate in the case of ROC curves and the true positive rate in the case of PR curves, instead of raw thresholds. The number of points can again be changed with the keyword argument npoints:
julia> prplot(targets, scores; npoints=Inf, label="Original")
julia> prplot!(targets, scores; npoints=10, label="Sampled (10 points)")
julia> prplot!(targets, scores; npoints=100, label="Sampled (100 points)")
julia> prplot!(targets, scores; npoints=1000, label="Sampled (1000 points)")
julia> prplot!(targets, scores; npoints=5000, label="Sampled (5000 points)")
Note that even though we visualize a smaller number of points, the displayed auc score is computed from all points. When a logarithmic scale is used, the sampling is also done in logarithmic scale.
Other than that, the diagonal keyword indicates the diagonal in the plot, and aucshow toggles whether the auc score is appended to a label:
julia> rocplot(targets, scores; aucshow=false, label="a", diagonal=true)
User-defined curves
PR and ROC curves are available out of the box. Additional curve definitions can be provided in a similar way to new metrics, using the @curve macro and defining an apply function that computes a point on the curve. For instance, the ROC curve can be defined this way:
julia> import EvalMetrics: @curve, apply
julia> @curve MyROCCurve
julia> apply(::Type{MyROCCurve}, cms::AbstractVector{ConfusionMatrix{T}}) where T <: Real =
(false_positive_rate(cms), true_positive_rate(cms))
julia> myroccurve(targets, scores) == roccurve(targets, scores)
true
In order to be able to sample from the x axis while plotting, sampling_function and lowest_metric_value must be provided as well.