AMLPipelineBase.jl is the Base package of TSML.jl, AutoMLPipeline.jl, and Lale.jl.
AMLPipelineBase is written in pure Julia. It exposes the abstract types commonly shared by TSML and AutoMLPipeline. It also contains basic data preprocessing routines and learners for rapid prototyping. TSML extends AMLPipelineBase capability by specializing in Time-Series workflow while AutoMLPipeline focuses in ML pipeline optimization. Since AMLPipelineBase is written in pure Julia including its dependencies, the future target will be to exploit Julia's native multi-threading using thread-safe ML Julia libraries for scalability and performance.
AMLPipelineBase declares the following abstract data types:
abstract type Machine end abstract type Computer <: Machine end abstract type Workflow <: Machine end abstract type Learner <: Computer end abstract type Transformer <: Computer end
AMLPipelineBase dynamically dispatches the fit! and transform! functions which must be overloaded by different subtypes of Machine.
function fit!(mc::Machine, input::DataFrame, output::Vector) error(typeof(mc),"not implemented") end function transform!(mc::Machine, input::DataFrame) error(typeof(mc),"not implemented") end
- To provide a Base package for common functions and abstractions shared by:
- AutoMLPipeline: A package for ML Pipeline Optimization
- TSML: A package for Time-Series ML
- Lale: A Julia wrapper of python's Lale for semi-automated data science.
- To implement efficient multi-threading reduction workflow
- Symbolic pipeline API for high-level description and easy expression of complex pipeline structures and workflows
- Easily extensible architecture by overloading just two main interfaces: fit! and transform!
- Meta-ensembles that allow composition of ensembles of ensembles (recursively if needed) for robust prediction routines
- Categorical and numerical feature selectors for specialized preprocessing routines based on types
- Normalizers (zscore, unitrange, pca, fa) and Ensemble learners (voting, stacks, best)
If you want to add your own filter/transformer/learner, it is trivial.
Just take note that filters and transformers process the first
input features and ignores the target output while learners process both
the input features and target output arguments of the
transform! function always expect one input argument in all cases.
First, import the abstract types and define your own mutable structure
as subtype of either Learner or Transformer. Next, import the
transform! functions to be overloaded. Also, load the DataFrames package
to be used for data interchange.
using DataFrames using AMLPipelineBase: AbsTypes # import fit! and transform! for function overloading import AMLPipelineBase.AbsTypes: fit!, transform! # export new definitions for dynamic dispatch export fit!, transform!, MyFilter # define your filter structure mutable struct MyFilter <: Transformer name::String model::Dict function MyFilter(args::Dict()) .... end end # filters and transformers ignore the target argument. # learners process both the input features and target argument. function fit!(fl::MyFilter, inputfeatures::DataFrame, target::Vector=Vector()) .... end # transform! function expects an input dataframe and outputs a dataframe function transform!(fl::MyFilter, inputfeatures::DataFrame)::DataFrame .... end
Note that the main data interchange format is a dataframe so transform! output should always be a dataframe as well as the input for fit! and transform!. This is necessary so that the pipeline passes the dataframe format consistently to its filters or transformers or learners. Once you created a filter, you can use it as part of the pipeline together with the other learners and filters.
AMLPipelineBase is in the Julia Official package registry.
The latest release can be installed at the Julia
prompt using Julia's package management which is triggered
] at the julia prompt:
julia> ] pkg> update pkg> add AMLPipelineBase
Below outlines some typical way to preprocess and model any dataset.
1. Load Data, Extract Input (X) and Target (Y)
# Make sure that the input feature is a dataframe and the target output is a 1-D vector. using AMLPipelineBase profbdata = getprofb() X = profbdata[:,2:end] Y = profbdata[:,1] |> Vector; head(x)=first(x,5) head(profbdata)
2. Load Filters, Transformers, and Learners
#### categorical preprocessing ohe = OneHotEncoder() #### Column selector catf = CatFeatureSelector() numf = NumFeatureSelector() #### Learners rf = RandomForest() ada = Adaboost() pt = PrunedTree() stack = StackEnsemble() best = BestLearner() vote = VoteEnsemble()
3. Filter categories and hot-encode them
pohe = catf |> ohe tr = fit_transform!(pohe,X,Y)
4. Filter numeric features
pdec = numf tr = fit_transform!(pdec,X,Y)
5. A Pipeline for the Voting Ensemble Classification
# take all categorical columns and hot-bit encode each, # concatenate them to the numerical features, # and feed them to the voting ensemble pvote = (catf |> ohe) + (numf) |> vote pred = fit_transform!(pvote,X,Y) sc = score(:accuracy,pred,Y) println(sc) ### cross-validate acc(X,Y) = score(:accuracy,X,Y) crossvalidate(pvote,X,Y,acc,10,true)
@pipelinex instead of
@pipeline to print the corresponding function calls in 6
julia> @pipelinex (catf |> ohe) + (numf) |> vote :(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), vote)) # another way is to use @macroexpand with @pipeline julia> @macroexpand @pipeline (catf |> ohe) + (numf) |> vote :(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), vote))
7. A Pipeline for the Random Forest (RF) Classification
# compute the pca, ica, fa of the numerical columns, # combine them with the hot-bit encoded categorical features # and feed all to the random forest classifier prf = (catf|> ohe) + numf |> rf pred = fit_transform!(prf,X,Y) score(:accuracy,pred,Y) |> println crossvalidate(prf,X,Y,acc,10)
9. A Pipeline for Random Forest Regression
using Statistics iris = getiris() Xreg = iris[:,1:3] Yreg = iris[:,4] |> Vector rfreg = (catf |> ohe) + (numf) |> rf pred = fit_transform!(rfreg,Xreg,Yreg) rmse(X,Y) = mean((X .- Y).^2) |> sqrt res = crossvalidate(rfreg,Xreg,Yreg,rmse,10,true)
Note: More examples can be found in the TSML and AutoMLPipeline packages. Since the code is written in Julia, you are highly encouraged to read the source code and feel free to extend or adapt the package to your problem. Please feel free to submit PRs to improve the package features.
10. Performance Comparison of Several Learners
10.1 Sequential Processing
using Random using DataFrames Random.seed!(1) disc = CatNumDiscriminator() catf = CatFeatureSelector() numf = NumFeatureSelector() ohe = OneHotEncoder() rf = RandomForest() ada = Adaboost() tree = PrunedTree() stack = StackEnsemble() best = BestLearner() vote = VoteEnsemble() learners = DataFrame() for learner in [rf,ada,tree,stack,vote,best] pcmc = disc |> ((catf |> ohe) + numf) |> learner println(learner.name) mean,sd,_ = crossvalidate(pcmc,X,Y,acc,10,true) learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd)) end; @show learners;
10.2 Parallel Processing
using Random using DataFrames using Distributed nprocs() == 1 && addprocs() @everywhere using DataFrames @everywhere using AMLPipelineBase rf = RandomForest() ada = Adaboost() tree = PrunedTree() stack = StackEnsemble() best = BestLearner() vote = VoteEnsemble() disc = CatNumDiscriminator() catf = CatFeatureSelector() numf = NumFeatureSelector() @everywhere acc(X,Y) = score(:accuracy,X,Y) learners = @distributed (vcat) for learner in [rf,ada,tree,stack,vote,best] pcmc = disc |> ((catf |> ohe) + (numf)) |> learner println(learner.name) mean,sd,_ = crossvalidate(pcmc,X,Y,acc,10,true) DataFrame(name=learner.name,mean=mean,sd=sd) end @show learners;
11. Automatic Selection of Best Learner
You can use
* operation as a selector function which outputs the result of the best learner.
If we use the same pre-processing pipeline in 10, we expect that the average performance of
best learner which is
lsvc will be around 73.0.
Random.seed!(1) pcmc = disc |> ((catf |> ohe) + (numf)) |> (rf * ada * tree) crossvalidate(pcmc,X,Y,acc,10,true)
12. Learners as Transformers
It is also possible to use learners in the middle of expression to serve as transformers and their outputs become inputs to the final learner as illustrated below.
expr = ( ((numf)+(catf |> ohe) |> rf) + ((numf)+(catf |> ohe) |> ada) + ((numf)+(catf |> ohe) |> tree) ) |> ohe |> rf; crossvalidate(expr,X,Y,acc,10,true)
One can even include selector function as part of transformer preprocessing routine:
pjrf = disc |> ((catf |> ohe) + (numf |> rf)) |> ((rf * ada ) + (rf * tree * vote)) |> ohe |> ada crossvalidate(pjrf,X,Y,acc,10,true)
ohe is necessary in both examples
because the outputs of the learners and selector function are categorical
values that need to be hot-bit encoded before feeding to the final
13. Tree Visualization of the Pipeline Structure
You can visualize the pipeline by using AbstractTrees Julia package.
# package installation using Pkg Pkg.update() Pkg.add("AbstractTrees") # load the packages using AbstractTrees using AMLPipelineBase julia> expr = @pipelinex (catf |> ohe) + (numf) |> rf :(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), rf)) julia> print_tree(stdout, expr) :(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), rf)) ├─ :Pipeline ├─ :(ComboPipeline(Pipeline(catf, ohe), numf)) │ ├─ :ComboPipeline │ ├─ :(Pipeline(catf, ohe)) │ │ ├─ :Pipeline │ │ ├─ :catf │ │ └─ :ohe │ └─ :numf └─ :rf
Feature Requests and Contributions
We welcome contributions, feature requests, and suggestions. Here is the link to open an issue for any problems you encounter. If you want to contribute, please follow the guidelines in contributors page.
Usage questions can be posted in: