## Simpsons.jl

Julia module to check data for a Simpson's statistical paradox

### Usage

```
using Simpsons
has_simpsons_paradox(df, cause, effect, factor; continuous_threshold = 5, cmax = 5, verbose = true)
```

Returns true if the data in DataFrame `df`

aggregated by `factor`

exhibits
Simpson's paradox. Note that the `cause`

and `effect`

columns will be converted
to Int columns if they are not already numeric in type. A continuous data
`factor`

column (one with `continuous_threshold`

or more discrete levels) will
be grouped into at most `cmax`

clusters so as to avoid too many clusters. Prints
the regression slope directions for overall data and groups if `verbose`

is true.

```
simpsons_analysis(df, cause_column, effect_column; verbose = true, show_plots = true)
```

Analyze the DataFrame `df`

assuming a cause is in `cause_column`

and an effect in
`effect_column`

of the dataframe. Output data including any Simpson's paradox type
first degree slope reversals in subgroups found. Plots shown if `show_plots`

is true (default).

```
make_paradox(nsubgroups = 3 , N = 1024)
```

Return a dataframe containing `N`

rows of random data in 3 columns `:x`

(cause),
`:y`

(effect), and `:z`

(cofactor) which displays the Simpson's paradox.

```
plot_clusters(df, cause, effect)
```

Plot, with subplots, clustering of the dataframe `df`

using `cause`

and `effect`

plotted and
color coded by clusterings. Use kmeans clustering analysis on all fields of dataframe.
Use 2 to 5 as cluster numbers.

```
plot_kmeans_by_factor(df, cause_column, effect_column, factor_column)
```

Plot clustering of the dataframe using cause plotted as X, effect as Y, with the `factor_column`

used for kmeans clustering into between 2 and 5 clusters on the plot.

```
find_clustering_elbow(dataarray::AbstractMatrix{<:Real}, cmin = 1, cmax = 5; fclust = kmeans, kwargs...)
```

Find the "elbow" of the totalcost versus cluster number curve, where
cmin <= elbow <= cmax. Note that in pathological cases where the actual
minimum of the totalcosts occurs at a cluster count less than that of the
curve "elbow", the function will return either cmin or the actual cluster
count at which the totalcost is at minimum, whichever is larger.

Returns a tuple: the cluster count and the ClusteringResult at the "elbow" optimum.

### Examples

```
using Simpsons
# Create a dataframe with cause :x, effect :y, and cofactor :z columns
dfp = make_paradox()
# Test for a Simpson's paradox, where the regression direction of :x with :y
# reverses if the data is split by factor :z.
has_simpsons_paradox(dfp, :x, :y, :z) # true with this data
# Analyze with plots made of data clustering.
# To see the plots, run in REPL to prevent premature display closure.
simpsons_analysis(dfp, :x, :y)
```

### Installation

Install the package using the package manager (Press ] to enter pkg> mode):

```
(v1) pkg> add Simpsons
```