# MultiModalMuSig

A Julia package implementing several topic models used for mutation signature estimation.

The Multi-modal correlated topic model (MMCTM) takes an array of arrays of matrices, where the first column of each matrix is a mutation type index, and the second column is the mutation count for a particular sample.

The following example shows how to perform inference using the MMCTM on SNV and SV counts:

```
using MultiModalMuSig
using CSV
using DataFrames
using VegaLite
using Random
Random.seed!(42)
snv_counts = CSV.read("data/brca-eu_snv_counts.tsv", delim='\t')
sv_counts = CSV.read("data/brca-eu_sv_counts.tsv", delim='\t')
X = format_counts_mmctm(snv_counts, sv_counts)
model = MMCTM([7, 7], [0.1, 0.1], X)
fit!(model, tol=1e-5)
snv_signatures = DataFrame(hcat(model.ϕ[1]...))
sv_signatures = DataFrame(hcat(model.ϕ[2]...))
snv_signatures[:term] = snv_counts[:term]
snv_signatures = melt(
snv_signatures, :term, variable_name=:signature, value_name=:probability
)
snv_signatures |> @vlplot(
:bar, x={:term, sort=:null}, y=:probability, row=:signature,
resolve={scale={y=:independent}}
)
```

This code runs the MMCTM for 7 SNV and 7 SV signatures, with signature hyperparameters set to 0.1. Since these types of models can get stuck in poor local optima, it's a good idea to fit many models and pick the best one.

Sample-signature probabilities can be extracted like so:

```
# sample 3, SNV signature probabilities (modality 1)
model.props[3][1]
# SV signature probabilities (modality 2)
model.props[3][2]
# SNV probabilities for all samples
snv_props = hcat(
[model.props[i][1] for i in 1:length(model.props)]...
)
```

The MMCTM can be run on multiple modalities, *e.g.*

```
X = format_counts_mmctm(snv_counts, sv_counts, indel_counts)
model = MMCTM([7, 7, 5], [0.1, 0.1, 0.1], X)
```

The `DataFrame`

inputs to `format_counts_mmctm`

have an optional `term`

column, and further columns for each sample.

To run the CTM instead, just run the MMCTM with a single modality:

```
X = format_counts_ctm(snv_counts)
model = MMCTM([7], [0.1], X)
fit!(model, tol=1e-5)
```

The LDA implementation can be run like so:

```
X = format_counts_lda(snv_counts)
model = LDA(7, 0.1, 0.1, X)
fit!(model, tol=1e-5)
```

In the above code, both the sample-signature and signature-term hyperparameters have been set to 0.1, respectively. After fitting LDA, signatures can be found in `model.β`

, and signature probabilities can be found in `model.θ`

.