SyntheticDatasets.jl

SyntheticDatasets.jl is a Julia library of functions for generating synthetic datasets.

Installation

The package can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add SyntheticDatasets

Or, equivalently, via the Pkg API:

julia> import Pkg; Pkg.add("SyntheticDatasets")

Examples

A set of Pluto notebooks and scripts demonstrating the project's current functionality is available in the examples folder.

Here are a few examples that show the package's capabilities:

using StatsPlots, SyntheticDatasets

# Two Gaussian blobs centered at (-1, 1) and (-0.5, 0.5)
blobs = SyntheticDatasets.make_blobs(n_samples = 1000,
                                     n_features = 2,
                                     centers = [-1 1; -0.5 0.5],
                                     cluster_std = 0.25,
                                     center_box = (-2.0, 2.0),
                                     shuffle = true,
                                     random_state = nothing);

@df blobs scatter(:feature_1, :feature_2, group = :label, title = "Blobs")

# Isotropic Gaussian centered at (10, 1), labeled into 3 classes by quantile
gauss = SyntheticDatasets.make_gaussian_quantiles(mean = [10, 1],
                                                  cov = 2.0,
                                                  n_samples = 1000,
                                                  n_features = 2,
                                                  n_classes = 3,
                                                  shuffle = true,
                                                  random_state = 2);

@df gauss scatter(:feature_1, :feature_2, group = :label, title = "Gaussian Quantiles")

spirals = SyntheticDatasets.make_twospirals(n_samples = 2000,
                                            start_degrees = 90,
                                            total_degrees = 570,
                                            noise = 0.1);

@df spirals scatter(:feature_1, :feature_2, group = :label, title = "Two Spirals")

kernel = SyntheticDatasets.make_halfkernel(n_samples = 1000,
                                           minx = -20,
                                           r1 = 20,
                                           r2 = 35,
                                           noise = 3.0,
                                           ratio = 0.6);

@df kernel scatter(:feature_1, :feature_2, group = :label, title = "Half Kernel")

Datasets

SyntheticDatasets.jl provides functions for generating synthetic datasets. Some of these functions are interfaces to the dataset generators of ScikitLearn.

ScikitLearn

List of package datasets:

Function	Description	Reference
make_blobs	Generate isotropic Gaussian blobs for clustering.	link
make_moons	Make two interleaving half circles.	link
make_s_curve	Generate an S curve dataset.	link
make_regression	Generate a random regression problem.	link
make_classification	Generate a random n-class classification problem.	link
make_friedman1	Generate the “Friedman #1” regression problem.	link
make_friedman2	Generate the “Friedman #2” regression problem.	link
make_friedman3	Generate the “Friedman #3” regression problem.	link
make_circles	Make a large circle containing a smaller circle in 2D.	link
make_low_rank_matrix	Generate a mostly low rank matrix with bell-shaped singular values.	link
make_swiss_roll	Generate a swiss roll dataset.	link
make_hastie_10_2	Generate data for binary classification used in Hastie et al.	link
make_gaussian_quantiles	Generate isotropic Gaussian and label samples by quantile.	link
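
The generators listed above are called in the same way as the examples earlier in this document. The snippet below is a minimal sketch, not taken from the package documentation: it assumes that make_moons accepts the same keyword arguments as scikit-learn's make_moons (n_samples, shuffle, noise, random_state) and that it returns a table with feature_1, feature_2 and label columns, like the make_blobs example above.

```julia
using SyntheticDatasets

# Assumption: keyword names mirror scikit-learn's make_moons.
moons = SyntheticDatasets.make_moons(n_samples = 500,
                                     shuffle = true,
                                     noise = 0.1,
                                     random_state = 42)

# Inspect the dimensions and the first few rows of the result.
println(size(moons))
println(first(moons, 5))
```

If these assumptions hold, the result can be plotted like the other examples, for instance with @df moons scatter(:feature_1, :feature_2, group = :label, title = "Moons").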

Disclaimer: SyntheticDatasets.jl borrows code and documentation from the datasets module of scikit-learn, but it is not an official part of that project. It is licensed under the MIT license.

Other Functions

Function	Description	Reference
make_twospirals	Generate a two-spirals dataset.	link
make_halfkernel	Generate a half-kernel dataset.	link
make_outlier	Generate an outlier dataset.	link