Julia module to check data for a Simpson's statistical paradox
using Simpsons
has_simpsons_paradox(df, cause, effect, factor; continuous_threshold = 5, cmax = 5, verbose = true)
Returns true if the data in DataFrame df aggregated by factor exhibits
Simpson's paradox. Note that the cause and effect columns will be converted
to Int columns if they are not already numeric in type. A continuous data
factor column (one with continuous_threshold or more discrete levels) will
be grouped into at most cmax clusters so as to avoid too many clusters. Prints
the regression slope directions for overall data and groups if verbose is true.
simpsons_analysis(df, cause_column, effect_column; verbose = true, show_plots = true)
Analyze the DataFrame df assuming a cause is in cause_column and an effect in
effect_column of the dataframe. Output data including any Simpson's paradox type
first degree slope reversals in subgroups found. Plots shown if show_plots is true (default).
make_paradox(nsubgroups = 3 , N = 1024)
Return a dataframe containing N rows of random data in 3 columns :x (cause),
:y (effect), and :z (cofactor) which displays the Simpson's paradox.
plot_clusters(df, cause, effect)
Plot, with subplots, clustering of the dataframe df using cause and effect plotted and
color coded by clusterings. Use kmeans clustering analysis on all fields of dataframe.
Use 2 to 5 as cluster numbers.
plot_kmeans_by_factor(df, cause_column, effect_column, factor_column)
Plot clustering of the dataframe using cause plotted as X, effect as Y, with the factor_column
used for kmeans clustering into between 2 and 5 clusters on the plot.
find_clustering_elbow(dataarray::AbstractMatrix{<:Real}, cmin = 1, cmax = 5; fclust = kmeans, kwargs...)
Find the "elbow" of the totalcost versus cluster number curve, where
cmin <= elbow <= cmax. Note that in pathological cases where the actual
minimum of the totalcosts occurs at a cluster count less than that of the
curve "elbow", the function will return either cmin or the actual cluster
count at which the totalcost is at minimum, whichever is larger.
Returns a tuple: the cluster count and the ClusteringResult at the "elbow" optimum.
using Simpsons
# Create a dataframe with cause :x, effect :y, and cofactor :z columns
dfp = make_paradox()
# Test for a Simpson's paradox, where the regression direction of :x with :y
# reverses if the data is split by factor :z.
has_simpsons_paradox(dfp, :x, :y, :z) # true with this data
# Analyze with plots made of data clustering.
# To see the plots, run in REPL to prevent premature display closure.
simpsons_analysis(dfp, :x, :y)
Install the package using the package manager (Press ] to enter pkg> mode):
(v1) pkg> add Simpsons