SupervisedLearning

Work in progress for a front-end supervised learning framework. Currently the focus is on creating a pure Julia package for SVMs in KSVM.jl

The goal of this library is manyfold:

Education: allow the user to play around with the models, solvers, etc. for educational purposes. Provide a good base for course exercises. For example visualizing the learning curve of neural networks using different optimization algorithms.
Research: Swap out parts of the machine learning pipeline with custom implementations without losing the ability to utilize the rest of the framework. For example to prototype new prediction models.
Application: Porcelain interface to apply machine learning to given datasets in a convenient way. There might be multiple high-level interface for different usergroups (e.g. one that mimics R's caret package)

Planned High-level API

The following code should already work

using SupervisedLearning
using RDatasets
using UnicodePlots

data = dataset("datasets", "mtcars")

# In this case the dataset will be in-memory and encoded to -1, 1
# There will also be support for datastreaming from HDF5
problemSet = dataSource(AM ~ DRat + WT + DRat&WT, data, SignedClassEncoding)

# Convenient to use with UnicodePlots
print(barplot(classDistribution(problemSet)...))

# Methods for splitting the abstract data sets
trainSet, testSet = splitTrainTest!(problemSet, p_train = .75)

# Specifies the model and modelspecific parameter
model = Classifier.LogisticRegression(l2_coef=0.1)

# Backend for neural networks will be Mocha.jl or OnlineAI.jl
# model = Classifier.FeedForwardNeuralNetwork([4,5,7],[ReLu,ReLu,ReLu])

# train! mutates the model state
#  * the do-block is the callback function which also allows for early stopping
#  * In the regression case Solver.GradientDescent() will result in using Regression.jl, 
#    otherwise (in most deterministic cases) Optim.jl
#  * There will also be stochastic gradient descent with minibatches
train!(model, trainSet, Solver.GradientDescent(), max_iter = 10000, break_every = 100) do
  # You can also use the callback to execute any code
  # For example to print informative messages
  println("Testset accuracy: ", accuracy(model, testSet))
  
  # You can easily store custom learning curves or other arbitrary values
  # They will be linked to the correct iteration automatically
  remember!(model, :testsetCost, cost(model, testSet))
end

# The loss of the training set is stored by default and can be accessed with trainingCurve
# x is a Vector{Int} of iterations with stepsize break_every,
# y is a Vector{Float64} where y[i] is the cost of the trainSet at x[i]
x, y = trainingCurve(model)
print(lineplot(x, y, title = "Learning curve for trainSet"))

# Customly stored curves can be accessed with "history"
# x is a Vector{Int} of iterations (exact values depend on when you called remember!),
# y is a Vector{Float64} where y[i] is the cost of the testSet at x[i]
x, y = history(model, :testsetCost)
print(lineplot(x, y, title = "Learning curve for testSet"))

ŷ = predict(model, testSet) # what the model says
t = groundtruth(testSet) # what it should be

Planned Mid-level API

This is just a rough draft and still object to change

using SupervisedLearning
using RDatasets

data = dataset("datasets", "mtcars")

# In this case the dataset will be in-memory.
# Specifying the encoding is not necessary.
# The model will select the encoding it needs automatically
# Trees for example don't need an encoding at all.
problemSet = dataSource(AM ~ DRat + WT, data)

# Methods for splitting the abstract data sets
trainSet, testSet = splitTrainTest!(problemSet, p_train = .75)

# Perform a gridsearch over an arbitrary modelspace
gsResult = gridsearch([.001, .01, .1], [.0001, .0003]) do lr, lambda

  # Perform cross validation to get a good estimate for the hyperparameter performance
  cvResult = crossvalidate(trainSet, k = 5) do trainFold, valFold

    # Specify the model and model-specific parameters
    model = Classifier.LogisticRegression(l2_coef = lambda)

    # Specify the solver and solver-specific parameters
    solver = Solver.NaiveGradientDescent(learning_rate = lr, normalize_gradient = false)

    # train! mutates the model state
    train!(model, trainFold, solver, max_iter = 1000)

    # make sure to return the trained model
    model
  end

  # You can return a model or a cvResult to gridsearch
  cvResult
end

# Plot the final accuracy of all trained models using UnicodePlots
print(barplot(accuracy(gsResult, testSet)...))

# Get the best model
bestModel = gsResult.bestModel
ŷ = predict(bestModel, testSet)

SupervisedLearning

Planned High-level API

Planned Mid-level API

Julia Packages