# ParticleMDI.jl

This package provides an implementation of ParticleMDI, a particle Gibbs version of MDI, allowing for the integrative cluster analysis of multiple datasets. ParticleMDI is built within the framework of MDI (multiple data integration).

## Installation

`] add ParticleMDI`

## Usage

The function `pmdi()`

provides the primary functionality for `ParticleMDI`

. It requires the specification of:

`dataFiles::Vector`

a vector of K data matrices to be analysed`dataTypes::Vector`

a vector of K datatypes. Independent multivariate normals can be specified with`ParticleMDI.GaussianCluster`

, categorical data with`ParticleMDI.CategoricalCluster`

and negative binomial data with`ParticleMDI.NegBinomCluster`

.`N::Int64`

the maximum number of clusters to fit`particles::Int64`

the number of particles`ρ::Float64`

proportion of allocations assumed known in each MCMC iteration, a value in (0, 1).`iter::Int64`

number of iterations to run`outputFile::String`

specification of a CSV file to store output`featureSelect::Union{String, Nothing}`

defaults to`nothing`

, setting a string value means feature selection will be performed and output will be stored in the CSV file specified.`thin::Int`

an integer, thinning the resulting MCMC samples to every`thin`

th sample.

## Output

Outputs a .csv file, each row containing:

- Mass parameter for datasets
`1:K`

- Φ value for
`binomial(K, 2)`

pairs of datasets - c cluster allocations for observations
`1:n`

in datasets`1:k`

```
using ParticleMDI
using RDatasets
data = [Matrix(dataset("datasets", "iris")[:, 1:4])]
gaussian_normalise!(data[1])
dataTypes = [ParticleMDI.GaussianCluster]
# Run a simple run first to allow for compilation
pmdi(data, dataTypes, 10, 2, 0.99, 1, "output/file.csv")
pmdi(data, dataTypes, 10, 32, 0.25, 1000, "output/file.csv")
# View output; first generate a posterior similarity matrix
psm = generate_psm("output/file.csv", burnin = 500, thin = 2)
# Can be plotted as a heatmap
consensus_map(psm, k = 3)
```

## Extending ParticleMDI for user-defined data types

`ParticleMDI`

includes functionality for clustering Gaussian and categorical data, however this can easily be extended to other data types. Consider a trivial case where we wish to cluster data according to their sign.
The first step is to define a struct containing each cluster. Typically this will contain information on the number of observations in the cluster as well sufficient statistics for calculating the posterior predictive of assigning new observations to this cluster.

```
mutable struct SignCluster
n::Int64 # No. of observations in cluster
isneg::Bool # Positive/negative flag
SignCluster(dataFile) = new(0, false)
end
```

Given this, we then need to define a function which returns the log posterior predictive of an observation belonging to this cluster, given the allocations already assigned to it. In this case, all we need to know is does the cluster contain positive or negative numbers.

```
function ParticleMDI.calc_logprob(obs, cl::SignCluster)
if cl.n == 0
return log(0.5)
else
return ((obs[1] <= 0) == cl.isneg) ? 0 : - 10
end
end
```

And finally, a function needs to be specified explaining how to update a cluster when new observations are added to it.

```
function ParticleMDI.cluster_add!(cl::SignCluster, obs)
cl.n += 1
cl.isneg = (obs[1] < 0)
end
```

Optionally a function which returns the log marginal likelihood of each feature in a cluster. This is used to perform feature selection by comparison between the inferred allocations and the situation where all observations within a feature are assigned to a single cluster. This need not be specified if `featureSelect = false`

, however if you want to do feature selection for *any* dataType you'll need to have this specified. In such a case, you can specify this to return a large number (**not** `Inf`

) and features should always be selected. The assumption of independence across features underlies this step and so should not be used if this assumption does not hold.

```
function ParticleMDI.calc_logmarginal!(cl::SignCluster)
# return a vector of log-marginal likelihoods
end
```

This can then be run by specifying `SignCluster`

as a data type in `pmdi()`

.