Clusterpath.jl

Julia implementation of *l*_1-norm clusterpath (Hocking et al., 2011, Radchenko & Mukerjee, 2017)
Author naturale0
Popularity
1 Star
Updated Last
9 Months Ago
Started In
December 2020

Clusterpath.jl

Julia implementation of $\ell_1$ Clusterpath, described in the paper[^1].

installation

import Pkg
Pkg.add("Clusterpath")

Quick Start

Sample procedure (with Big Merge Tracker)

  • generate_mixture_normal()
    • generate n observations from mixture of univariate normals each with standard deviation $1$ and mean parameters m and proportion p.
Random.seed!(0)
x1 = generate_mixture_normal(1000, [-4.5, 4.5], [0.35, 0.65])
  • clusterpath()
    • inputs:
      • x: observation vector
      • alpha: Big Merge Tracker threshold
cc = clusterpath(x1, α=0., return_split=true)["splits"][end]
-1.3447486506416237
  • Another toy data
N = 100
Random.seed!(1)
xx = [randn(N, 2) .* .5; (randn(N, 2) .* 0.3 .+ 3)]
gt = repeat([1, 2], inner=N);
  • plot_path()
    • plot clusterpath with the data(x) and the solution path casted by cast_solution().
    • If the dimension of x is greater than 4, only plot combinations of first four dimensions.
      Gaston.jl and gnuplot should be installed and on the PATH of your system. Install gnuplot here.
    • x: data
    • solution: solution path dataframe from cast_solution()
    • gt: ground truth labels
    • savefig: whether to save the figure as a PNG file. (default: false)
    • fname: image file name to be used when savefig is true. (default: "path_plot")
    • show: whether to show the plot in the notebook. Highly recommended not to show if the number of samples is large. (default: true)
plot_path(xx[:, 1], α=0., gt=gt, show=true)

png

plot_path(xx; α=0., gt=gt, show=true)

png

  • plot_cluster()
    • Plots the scatter plot of the data x colored according to the cluster assigned by clusterpath algorithm.
    • If the dimension of x is greater than 2, perform PCA and plot two PCs.
      ***Gaston.jl and gnuplot should be installed and on the PATH of your system. Install gnuplot here. ***
    • x: data
    • α: threshold for BMT-clusterpath
    • n_node: if greater than 1, will assign clusters from previous merge status. (default: 1)
    • show: whether to show the figure.
    • savefig: whether to save the figure as a png file. (default: false)
    • fname: file name to save if savefig is true. (default: "plot_clst")
    • verbose: print out current iteration. (default: false)
plot_cluster(xx, α=0.2; show=true, savefig=false)

png

  • assign_clusters()
    • assign cluster to each of the observations in x.
    • returns an array of length=size(x, 1) of cluster indices.
    • x: data
    • α: threshold for BMT-clusterpath
    • n_node: if greater than 1, will assign clusters from previous merge status. (default: 1)
assign_cluster(xx, α=.2)'
1×200 Adjoint{Int64,Array{Int64,1}}:
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  2  2  2  2  2  2  2  2  2  2  2  2

Population Procedure

include("PopulationSplit.jl");
  • cond_mean_on_LR()

    • Conditional mean on $(L, R)$, defined as $\mu_{L,R} = \big(\int_L^R f(x) dx\big)^{-1} \cdot \int_L^R x f(x) dx$
  • find_split()

    • Find a split point if find_split=true, or $\delta_1, \delta_2$ for truncation point searching if find_deltas=true.
  • find_truncation()

    • Find the population split points.
  • clusterpath_pop()

    • population-equivalent version of sample clusterpath() procedure.
splits = Array{Float64, 1}()
Lstars = Array{Float64, 1}()
Rstars = Array{Float64, 1}()

for p=0.5:0.05:0.9
    cp = clusterpath_pop(p, 4.5)
    push!(splits, cp["s"])
    push!(Lstars, cp["L*"])
    push!(Rstars, cp["R*"])
end

println([round(s, digits=2) for s in splits]')
println([round(l, digits=2) for l in Lstars]')
println([round(r, digits=2) for r in Rstars]')
[0.0 -0.45 -0.9 -1.36 -1.82 -2.31 -2.89 -3.82 NaN]
[-8.98 -8.54 -8.09 -7.63 -7.17 -6.67 -6.09 -5.18 NaN]
[8.98 9.44 9.89 10.34 10.79 11.24 11.7 12.17 NaN]
splits = Array{Float64, 1}()

for p=0.5:0.05:0.9
    push!(splits, clusterpath_pop(p, 4.5)["s"])
end

splits'
1×9 Adjoint{Float64,Array{Float64,1}}:
 0.0  -0.4495  -0.9005  -1.355  -1.8195  -2.314  -2.8935  -3.816  NaN

: exactly the same results as in the paper (supp. p.29 Table 1).


[^1]: Radchenko, P. and Mukherjee, G. (2017), Convex clustering via l1 fusion penalization. J. R. Stat. Soc. B, 79: 1527-1546. https://doi.org/10.1111/rssb.12226

Used By Packages

No packages found.