KMeansPNNSmoothing.jl

This code implements the PNN-smoothing seeding method for k-means described in the paper "Systematically and efficiently improving existing k-means initialization algorithms by pairwise-nearest-neighbor smoothing" by C. Baldassi, submitted for publication, (2022) arXiv.

The code is written in Julia. It requires Julia 1.6 or later.

It provides a multi-threaded implementation of Lloyd's algorithm, also known as k-means, with several seeding options:

sample centroids uniformly at random from the dataset, without replacement
kmeans++, including the "greedy" variant which is also used by scikit-learn
furthest-point heuristic, also called "maxmin" from this paper
kmeans‖, also called "scalable kmeans++"
pairwise nearest-neighbor hierarchical clustering (note: this scales more than quadratically with the number of points)
the PNN-smoothing meta-method
the refine meta-method

It also implements a number of different techniques that can accelerate the iterations of Lloyd's algorithm:

the naive standard one (no acceleration)
the "reduced comparison" method from this paper
methods based on Hamerly 2010
methods based on Elkan 2003
variants of the "yinyang" method from this paper
the "exponion" method from this paper
the "ball" method from this paper

The package also provides two functions to compute the centroid index as defined in this paper, an asymmetric one called CI and a symmetric one called CI_sym. These are not exported.

It also provides a function to compute the variation of information metric to quantify the distance between two partitions as defined in this paper. The function is called VI and is not exported.

Installation and setup

To install the module, just clone it from GitHub into some directory. Then enter in such directory and run julia with the "project" option:

$ julia --project

(Alternatively, if you start Julia from some other directory, you can press ; to enter in shell mode, cd into the project's directory, enter in pkg mode with ] and use the activate command.)

The first time you do this, you will then need to setup the project's environment. To do that, when you're in the Julia REPL, press the ] key to enter in pkg mode, then resolve the dependencies:

(KMeansPNNSmoothing) pkg> resolve

This should download all the required packages. You can subsequently type test to check that everything works. After this, you can press the backspace key to get back to the standard Julia prompt, and load the package:

julia> using KMeansPNNSmoothing

Usage

The format of the data must be a Matrix{Float64} with the data points organized by column. (Typically, this means that if you're reading a dataset you'll need to transpose it. See for example the runfile.jl script in the test directory.)

Here is an example run, assuming we want to cluster a data matrix into k clusters with the original kmeans++ algorithm (the {1} type parameter deactivates the "greedy" version) and using the "reduced yinyang" acceleration method:

result = kmeans(data, k; kmseeder=KMSeed.PlusPlus{1}(), accel=KMAccel.Ryy)

and here is an example running the PNN-smoothing scheme, using the non-greedy kmeans++ to seed the initial sub-sets (this is actually the default if no keyword arguments are passed):

result = kmeans(data, k; kmseeder=KMSeed.PNNS(KMSeed.PlusPlus{1}()))

and here is again PNN-smoothing but this time using "maxmin" at 2 levels of recursion:

result = kmeans(data, k; kmseeder=KMSeed.PNNS(KMSeed.MaxMin(), rlevel=2))

For the complete documentation you can use the Julia help (press the ? key in the REPL, then type kmeans, or KMSeed or KMAccel)

All codes are parallellized (in most cases over the data points) if there are threads available: either run Julia with the -t option or use the JULIA_NUM_THREADS environment variable.

Licence

The code is released under the MIT licence.

The code has originated from the code of RecombinatorKMeans.jl, see the licence information there.

Suggest Category

KMeansPNNSmoothing.jl

Installation and setup

Usage

Licence

Julia Packages