This package provides an implementation of ParticleMDI, a particle Gibbs version of MDI, allowing for the integrative cluster analysis of multiple datasets. ParticleMDI is built within the framework of MDI (multiple data integration).
] add ParticleMDI
The function pmdi()
provides the primary functionality for ParticleMDI
. It requires the specification of:
dataFiles::Vector
a vector of K data matrices to be analyseddataTypes::Vector
a vector of K datatypes. Independent multivariate normals can be specified withParticleMDI.GaussianCluster
, categorical data withParticleMDI.CategoricalCluster
and negative binomial data withParticleMDI.NegBinomCluster
.N::Int64
the maximum number of clusters to fitparticles::Int64
the number of particlesρ::Float64
proportion of allocations assumed known in each MCMC iteration, a value in (0, 1).iter::Int64
number of iterations to runoutputFile::String
specification of a CSV file to store outputfeatureSelect::Union{String, Nothing}
defaults tonothing
, setting a string value means feature selection will be performed and output will be stored in the CSV file specified.thin::Int
an integer, thinning the resulting MCMC samples to everythin
th sample.
Outputs a .csv file, each row containing:
- Mass parameter for datasets
1:K
- Φ value for
binomial(K, 2)
pairs of datasets - c cluster allocations for observations
1:n
in datasets1:k
using ParticleMDI
using RDatasets
data = [Matrix(dataset("datasets", "iris")[:, 1:4])]
gaussian_normalise!(data[1])
dataTypes = [ParticleMDI.GaussianCluster]
# Run a simple run first to allow for compilation
pmdi(data, dataTypes, 10, 2, 0.99, 1, "output/file.csv")
pmdi(data, dataTypes, 10, 32, 0.25, 1000, "output/file.csv")
# View output; first generate a posterior similarity matrix
psm = generate_psm("output/file.csv", burnin = 500, thin = 2)
# Can be plotted as a heatmap
consensus_map(psm, k = 3)
ParticleMDI
includes functionality for clustering Gaussian and categorical data, however this can easily be extended to other data types. Consider a trivial case where we wish to cluster data according to their sign.
The first step is to define a struct containing each cluster. Typically this will contain information on the number of observations in the cluster as well sufficient statistics for calculating the posterior predictive of assigning new observations to this cluster.
mutable struct SignCluster
n::Int64 # No. of observations in cluster
isneg::Bool # Positive/negative flag
SignCluster(dataFile) = new(0, false)
end
Given this, we then need to define a function which returns the log posterior predictive of an observation belonging to this cluster, given the allocations already assigned to it. In this case, all we need to know is does the cluster contain positive or negative numbers.
function ParticleMDI.calc_logprob(obs, cl::SignCluster)
if cl.n == 0
return log(0.5)
else
return ((obs[1] <= 0) == cl.isneg) ? 0 : - 10
end
end
And finally, a function needs to be specified explaining how to update a cluster when new observations are added to it.
function ParticleMDI.cluster_add!(cl::SignCluster, obs)
cl.n += 1
cl.isneg = (obs[1] < 0)
end
Optionally a function which returns the log marginal likelihood of each feature in a cluster. This is used to perform feature selection by comparison between the inferred allocations and the situation where all observations within a feature are assigned to a single cluster. This need not be specified if featureSelect = false
, however if you want to do feature selection for any dataType you'll need to have this specified. In such a case, you can specify this to return a large number (not Inf
) and features should always be selected. The assumption of independence across features underlies this step and so should not be used if this assumption does not hold.
function ParticleMDI.calc_logmarginal!(cl::SignCluster)
# return a vector of log-marginal likelihoods
end
This can then be run by specifying SignCluster
as a data type in pmdi()
.