ClusteringBenchmarks.jl

Data sets and metrics to evaluate clustering algorithms

This is a repository of data sets and metrics which may be used to evaluate clustering algorithms. The data sets include:

  • Marek Gagolewski's collection (direct access to data). See ?load_gagolewski for details about how to load the data sets. Quite a few of these are not very "naturalistic"; among the low-dimensional data sets, the "sipu" and "fcps" batteries arguably have a larger proportion of problems that one might imagine coming from real data.
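
A minimal loading sketch follows; the exact call form here is an assumption, so consult ?load_gagolewski for the real signature and the available battery/data-set names.

```julia
using ClusteringBenchmarks

# Hypothetical call form: "sipu" is one of the low-dimensional batteries
# mentioned above and "aggregation" one of its data sets. The return value
# is assumed to bundle the points with the reference labels.
# Check ?load_gagolewski for the actual signature.
dataset = load_gagolewski("sipu", "aggregation")
```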

The package also provides metrics for quantifying the agreement between two clusterings of the same data set.

These metrics support only hard clusterings; at present there is no metric for soft clusterings. Of these, ami (adjusted mutual information) is recommended because it does not require that both clusterings have the same number of clusters, and it scales reasonably well when there are many clusters.

Clustering.jl contains additional evaluation metrics. Its mutualinfo(x, y; normed=true) is closest to this package's ami; see the AMI paper for details on the differences.
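
A minimal usage sketch (the two-label-vector call form for ami is an assumption; mutualinfo's signature is quoted above):

```julia
using ClusteringBenchmarks, Clustering

# Two hard clusterings of the same six points, encoded as integer labels;
# note that they need not have the same number of clusters for ami.
a = [1, 1, 1, 2, 2, 2]
b = [1, 1, 2, 2, 3, 3]

score = ami(a, b)                   # assumed call form; see the package docs
mi = mutualinfo(a, b; normed=true)  # Clustering.jl's closest analogue
```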

Benchmarking Clustering.jl

ami scores were computed for several algorithms in Clustering.jl on a subset of the Gagolewski collection (see demos/bench_clustering.jl for details). The results are shown as a violin plot below:

[Figure: violin plot of ami scores for each algorithm. Higher AMI score is better.]
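
A rough sketch of the per-algorithm scoring step (the actual driver is demos/bench_clustering.jl; the data layout here is an assumption):

```julia
using ClusteringBenchmarks, Clustering

# Score one "hinted" algorithm on one data set, assuming X is a
# features × points matrix and y the reference labels.
function score_kmeans(X, y)
    k = length(unique(y))               # the hint: true number of clusters
    result = kmeans(X, k)
    return ami(assignments(result), y)  # assumed two-vector form of ami
end
```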

Some algorithms depend on computing the pairwise distance matrix, and thus scale poorly to large data sets. For these algorithms, any data set with more than 3000 points was excluded. Here are the numbers of benchmarks that were runnable for each algorithm:

5×2 DataFrame
 Row │ Algorithm  Count
     │ Any        Int64
─────┼──────────────────
   1 │ kmeans        45
   2 │ kmedoids      27
   3 │ hclust        27
   4 │ affprop       27
   5 │ dbscan        26
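
The exclusion reflects the O(n²) cost of the distance matrix; a minimal sketch of the guard, assuming X is a features × points matrix:

```julia
using Clustering, Distances

# Distance-matrix-based algorithms (e.g., kmedoids, hclust, affprop)
# need an n × n matrix, so data sets above 3000 points are skipped.
if size(X, 2) <= 3000
    D = pairwise(Euclidean(), X; dims=2)  # O(n²) time and memory
    tree = hclust(D; linkage=:average)    # linkage choice is illustrative
end
```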

Algorithms (or variants) marked with a * are fully automated and do not require anything beyond the input data. The remaining algorithms received one or more "hints" from the reference clustering (e.g., the number of clusters). Some choices were made for the fully-automated variants (sketched in code after this list):

  • dbscan's minimum neighbor count was set to 2d for data in d dimensions, and the radius was set to the mean distance to the 2d-th nearest neighbor.
  • hclust* cut the dendrogram at the largest gap between successive merge heights.
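
Minimal sketches of both automation rules, assuming X is a d × n matrix of points and D a pairwise distance matrix (the helper names and the linkage choice are hypothetical):

```julia
using Clustering, NearestNeighbors, Statistics

# dbscan rule: require 2d neighbors in d dimensions, with the radius set
# to the mean distance to the 2d-th nearest neighbor.
function auto_dbscan(X)
    d = size(X, 1)
    k = 2d
    tree = KDTree(X)
    _, dists = knn(tree, X, k + 1, true)  # k + 1: each point is its own 0-distance neighbor
    radius = mean(last.(dists))           # mean distance to the 2d-th true neighbor
    return dbscan(X, radius; min_neighbors = k)
end

# hclust* rule: cut at the largest gap between successive merge heights.
function auto_hclust(D)
    tree = hclust(D; linkage = :average)
    gaps = diff(tree.heights)
    i = argmax(gaps)                      # stop merging just before this gap
    k = length(tree.heights) + 1 - i      # clusters remaining after i merges
    return cutree(tree; k = k)
end
```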

There may be better automation strategies than these. In particular, all algorithms can be automated by optimizing an evaluation metric, although one still needs to choose, e.g., the range of numbers of clusters considered.
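
For instance, a sketch of metric-driven automation using the mean silhouette score (the choice of metric and the search range 2:10 are assumptions):

```julia
using Clustering, Distances, Statistics

# Pick k for kmeans by maximizing the mean silhouette score over krange.
function best_k_kmeans(X; krange = 2:10)
    D = pairwise(Euclidean(), X; dims = 2)
    scores = map(krange) do k
        a = assignments(kmeans(X, k))
        mean(silhouettes(a, D))
    end
    return krange[argmax(scores)]
end
```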