Clusterpath.jl

Julia implementation of *l*_1-norm clusterpath (Hocking et al., 2011, Radchenko & Mukerjee, 2017)
Author naturale0
Popularity
1 Star
Updated Last
3 Years Ago
Started In
December 2020

Clusterpath.jl

Julia implementation of $\ell_1$ Clusterpath, described in the paper1.

installation

import Pkg
Pkg.add("Clusterpath")

Quick Start

Sample procedure (with Big Merge Tracker)

  • generate_mixture_normal()
    • generate n observations from mixture of univariate normals each with standard deviation $1$ and mean parameters m and proportion p.
Random.seed!(0)
x1 = generate_mixture_normal(1000, [-4.5, 4.5], [0.35, 0.65])
  • clusterpath()
    • inputs:
      • x: observation vector
      • alpha: Big Merge Tracker threshold
cc = clusterpath(x1, α=0., return_split=true)["splits"][end]
-1.3447486506416237
  • Another toy data
N = 100
Random.seed!(1)
xx = [randn(N, 2) .* .5; (randn(N, 2) .* 0.3 .+ 3)]
gt = repeat([1, 2], inner=N);
  • plot_path()
    • plot clusterpath with the data(x) and the solution path casted by cast_solution().
    • If the dimension of x is greater than 4, only plot combinations of first four dimensions.
      Gaston.jl and gnuplot should be installed and on the PATH of your system. Install gnuplot here.
    • x: data
    • solution: solution path dataframe from cast_solution()
    • gt: ground truth labels
    • savefig: whether to save the figure as a PNG file. (default: false)
    • fname: image file name to be used when savefig is true. (default: "path_plot")
    • show: whether to show the plot in the notebook. Highly recommended not to show if the number of samples is large. (default: true)
plot_path(xx[:, 1], α=0., gt=gt, show=true)

png

plot_path(xx; α=0., gt=gt, show=true)

png

  • plot_cluster()
    • Plots the scatter plot of the data x colored according to the cluster assigned by clusterpath algorithm.
    • If the dimension of x is greater than 2, perform PCA and plot two PCs.
      ***Gaston.jl and gnuplot should be installed and on the PATH of your system. Install gnuplot here. ***
    • x: data
    • α: threshold for BMT-clusterpath
    • n_node: if greater than 1, will assign clusters from previous merge status. (default: 1)
    • show: whether to show the figure.
    • savefig: whether to save the figure as a png file. (default: false)
    • fname: file name to save if savefig is true. (default: "plot_clst")
    • verbose: print out current iteration. (default: false)
plot_cluster(xx, α=0.2; show=true, savefig=false)

png

  • assign_clusters()
    • assign cluster to each of the observations in x.
    • returns an array of length=size(x, 1) of cluster indices.
    • x: data
    • α: threshold for BMT-clusterpath
    • n_node: if greater than 1, will assign clusters from previous merge status. (default: 1)
assign_cluster(xx, α=.2)'
1×200 Adjoint{Int64,Array{Int64,1}}:
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  2  2  2  2  2  2  2  2  2  2  2  2

Population Procedure

include("PopulationSplit.jl");
  • cond_mean_on_LR()

    • Conditional mean on $(L, R)$, defined as $\mu_{L,R} = \big(\int_L^R f(x) dx\big)^{-1} \cdot \int_L^R x f(x) dx$
  • find_split()

    • Find a split point if find_split=true, or $\delta_1, \delta_2$ for truncation point searching if find_deltas=true.
  • find_truncation()

    • Find the population split points.
  • clusterpath_pop()

    • population-equivalent version of sample clusterpath() procedure.
splits = Array{Float64, 1}()
Lstars = Array{Float64, 1}()
Rstars = Array{Float64, 1}()

for p=0.5:0.05:0.9
    cp = clusterpath_pop(p, 4.5)
    push!(splits, cp["s"])
    push!(Lstars, cp["L*"])
    push!(Rstars, cp["R*"])
end

println([round(s, digits=2) for s in splits]')
println([round(l, digits=2) for l in Lstars]')
println([round(r, digits=2) for r in Rstars]')
[0.0 -0.45 -0.9 -1.36 -1.82 -2.31 -2.89 -3.82 NaN]
[-8.98 -8.54 -8.09 -7.63 -7.17 -6.67 -6.09 -5.18 NaN]
[8.98 9.44 9.89 10.34 10.79 11.24 11.7 12.17 NaN]
splits = Array{Float64, 1}()

for p=0.5:0.05:0.9
    push!(splits, clusterpath_pop(p, 4.5)["s"])
end

splits'
1×9 Adjoint{Float64,Array{Float64,1}}:
 0.0  -0.4495  -0.9005  -1.355  -1.8195  -2.314  -2.8935  -3.816  NaN

: exactly the same results as in the paper (supp. p.29 Table 1).


Footnotes

  1. Radchenko, P. and Mukherjee, G. (2017), Convex clustering via l1 fusion penalization. J. R. Stat. Soc. B, 79: 1527-1546. https://doi.org/10.1111/rssb.12226