Julia implementation of Decision Tree (CART) and Random Forest algorithms

Created and developed by Ben Sadeghi (@bensadeghi). Now maintained by the JuliaAI organization.

Available via:

- AutoMLPipeline.jl - create complex ML pipeline structures using simple expressions
- CombineML.jl - a heterogeneous ensemble learning package
- MLJ.jl - a machine learning framework for Julia
- ScikitLearn.jl - Julia implementation of the scikit-learn API

- pre-pruning (max depth, min leaf size)
- post-pruning (pessimistic pruning)
- multi-threaded bagging (random forests)
- adaptive boosting (decision stumps), using SAMME
- cross validation (n-fold)
- support for ordered features (encoded as
`Real`

s or`String`

s)

- pre-pruning (max depth, min leaf size)
- multi-threaded bagging (random forests)
- cross validation (n-fold)
- support for numerical features

**Note that regression is implied if labels/targets are of type Array{Float}**

You can install DecisionTree.jl using Julia's package manager

`Pkg.add("DecisionTree")`

DecisionTree.jl supports the ScikitLearn.jl interface and algorithms (cross-validation, hyperparameter tuning, pipelines, etc.)

Available models: `DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor, AdaBoostStumpClassifier`

.
See each model's help (eg. `?DecisionTreeRegressor`

at the REPL) for more information

Load DecisionTree package

`using DecisionTree`

Separate Fisher's Iris dataset features and labels

```
features, labels = load_data("iris") # also see "adult" and "digits" datasets
# the data loaded are of type Array{Any}
# cast them to concrete types for better performance
features = float.(features)
labels = string.(labels)
```

Pruned Tree Classifier

```
# train depth-truncated classifier
model = DecisionTreeClassifier(max_depth=2)
fit!(model, features, labels)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)
# apply learned model
predict(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
predict_proba(model, [5.9,3.0,5.1,1.9])
println(get_classes(model)) # returns the ordering of the columns in predict_proba's output
# run n-fold cross validation over 3 CV folds
# See ScikitLearn.jl for installation instructions
using ScikitLearn.CrossValidation: cross_val_score
accuracy = cross_val_score(model, features, labels, cv=3)
```

Also, have a look at these classification and regression notebooks.

Decision Tree Classifier

```
# train full-tree classifier
model = build_tree(labels, features)
# prune tree: merge leaves having >= 90% combined purity (default: 100%)
model = prune_tree(model, 0.9)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)
# apply learned model
apply_tree(model, [5.9,3.0,5.1,1.9])
# apply model to all the sames
preds = apply_tree(model, features)
# generate confusion matrix, along with accuracy and kappa scores
DecisionTree.confusion_matrix(labels, preds)
# get the probability of each label
apply_tree_proba(model, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
# run 3-fold cross validation of pruned tree,
n_folds=3
accuracy = nfoldCV_tree(labels, features, n_folds)
# set of classification parameters and respective default values
# pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
# max_depth: maximum depth of the decision tree (default: -1, no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
# min_samples_split: the minimum number of samples in needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# n_subfeatures: number of features to select at random (default: 0, keep all)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
n_subfeatures=0; max_depth=-1; min_samples_leaf=1; min_samples_split=2
min_purity_increase=0.0; pruning_purity = 1.0; seed=3
model = build_tree(labels, features,
n_subfeatures,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
rng = seed)
accuracy = nfoldCV_tree(labels, features,
n_folds,
pruning_purity,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
verbose = true,
rng = seed)
```

Random Forest Classifier

```
# train random forest classifier
# using 2 random features, 10 trees, 0.5 portion of samples per tree, and a maximum tree depth of 6
model = build_forest(labels, features, 2, 10, 0.5, 6)
# apply learned model
apply_forest(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_forest_proba(model, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
# add 7 more trees
model = build_forest(model, labels, features, 2, 7, 0.5, 6)
# run 3-fold cross validation for forests, using 2 random features per split
n_folds=3; n_subfeatures=2
accuracy = nfoldCV_forest(labels, features, n_folds, n_subfeatures)
# set of classification parameters and respective default values
# n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
# n_trees: number of trees to train (default: 10)
# partial_sampling: fraction of samples to train each tree on (default: 0.7)
# max_depth: maximum depth of the decision trees (default: no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples in needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
# multi-threaded forests must be seeded with an `Int`
n_subfeatures=-1; n_trees=10; partial_sampling=0.7; max_depth=-1
min_samples_leaf=5; min_samples_split=2; min_purity_increase=0.0; seed=3
model = build_forest(labels, features,
n_subfeatures,
n_trees,
partial_sampling,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
rng = seed)
accuracy = nfoldCV_forest(labels, features,
n_folds,
n_subfeatures,
n_trees,
partial_sampling,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
verbose = true,
rng = seed)
```

Adaptive-Boosted Decision Stumps Classifier

```
# train adaptive-boosted stumps, using 7 iterations
model, coeffs = build_adaboost_stumps(labels, features, 7);
# apply learned model
apply_adaboost_stumps(model, coeffs, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_adaboost_stumps_proba(model, coeffs, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
# run 3-fold cross validation for boosted stumps, using 7 iterations
n_iterations=7; n_folds=3
accuracy = nfoldCV_stumps(labels, features,
n_folds,
n_iterations;
verbose = true)
```

```
n, m = 10^3, 5
features = randn(n, m)
weights = rand(-2:2, m)
labels = features * weights
```

Regression Tree

```
# train regression tree
model = build_tree(labels, features)
# apply learned model
apply_tree(model, [-0.9,3.0,5.1,1.9,0.0])
# run 3-fold cross validation, returns array of coefficients of determination (R^2)
n_folds = 3
r2 = nfoldCV_tree(labels, features, n_folds)
# set of regression parameters and respective default values
# pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
# max_depth: maximum depth of the decision tree (default: -1, no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples in needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# n_subfeatures: number of features to select at random (default: 0, keep all)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
n_subfeatures = 0; max_depth = -1; min_samples_leaf = 5
min_samples_split = 2; min_purity_increase = 0.0; pruning_purity = 1.0 ; seed=3
model = build_tree(labels, features,
n_subfeatures,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
rng = seed)
r2 = nfoldCV_tree(labels, features,
n_folds,
pruning_purity,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
verbose = true,
rng = seed)
```

Regression Random Forest

```
# train regression forest, using 2 random features, 10 trees,
# averaging of 5 samples per leaf, and 0.7 portion of samples per tree
model = build_forest(labels, features, 2, 10, 0.7, 5)
# apply learned model
apply_forest(model, [-0.9,3.0,5.1,1.9,0.0])
# run 3-fold cross validation on regression forest, using 2 random features per split
n_subfeatures=2; n_folds=3
r2 = nfoldCV_forest(labels, features, n_folds, n_subfeatures)
# set of regression build_forest() parameters and respective default values
# n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
# n_trees: number of trees to train (default: 10)
# partial_sampling: fraction of samples to train each tree on (default: 0.7)
# max_depth: maximum depth of the decision trees (default: no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples in needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
# multi-threaded forests must be seeded with an `Int`
n_subfeatures=-1; n_trees=10; partial_sampling=0.7; max_depth=-1
min_samples_leaf=5; min_samples_split=2; min_purity_increase=0.0; seed=3
model = build_forest(labels, features,
n_subfeatures,
n_trees,
partial_sampling,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
rng = seed)
r2 = nfoldCV_forest(labels, features,
n_folds,
n_subfeatures,
n_trees,
partial_sampling,
max_depth,
min_samples_leaf,
min_samples_split,
min_purity_increase;
verbose = true,
rng = seed)
```

Models can be saved to disk and loaded back with the use of the JLD2.jl package.

```
using JLD2
@save "model_file.jld2" model
```

Note that even though features and labels of type `Array{Any}`

are supported, it is highly recommended that data be cast to explicit types (ie with `float.(), string.()`

, etc). This significantly improves model training and prediction execution times, and also drastically reduces the size of saved models.

To use DecsionTree.jl models in MLJ, first ensure MLJ.jl and MLJDecisionTreeInterface.jl are both in your Julia environment. For example, to install in a fresh environment:

```
using Pkg
Pkg.activate("my_fresh_mlj_environment", shared=true)
Pkg.add("MLJ")
Pkg.add("MLJDecisionTreeInterface")
```

Detailed usage instructions are available for each model using the
`doc`

method. For example:

```
using MLJ
doc("DecisionTreeClassifier", pkg="DecisionTree")
```

Available models are: `AdaBoostStumpClassifier`

,
`DecisionTreeClassifier`

, `DecisionTreeRegressor`

,
`RandomForestClassifier`

, `RandomForestRegressor`

.

The following methods provide measures of feature importance for all models:
`impurity_importance`

, `split_importance`

, `permutation_importance`

. Query the document
strings for details.

A `DecisionTree`

model can be visualized using the `print_tree`

-function of its native interface
(for an example see above in section 'Classification Example').

In addition, an abstraction layer using `AbstractTrees.jl`

has been implemented with the intention to facilitate visualizations, which don't rely on any implementation details of `DecisionTree`

. For more information have a look at the docs in `src/abstract_trees.jl`

and the `wrap`

-function, which creates this layer for a `DecisionTree`

model.

Apart from this, `AbstractTrees.jl`

brings its own implementation of `print_tree`

.

BibTeX entry:

```
@software{ben_sadeghi_2022_7359268,
author = {Ben Sadeghi and
Poom Chiarawongse and
Kevin Squire and
Daniel C. Jones and
Andreas Noack and
Cédric St-Jean and
Rik Huijzer and
Roland Schätzle and
Ian Butterworth and
Yu-Fong Peng and
Anthony Blaom},
title = {{DecisionTree.jl - A Julia implementation of the
CART Decision Tree and Random Forest algorithms}},
month = nov,
year = 2022,
publisher = {Zenodo},
version = {0.11.3},
doi = {10.5281/zenodo.7359268},
url = {https://doi.org/10.5281/zenodo.7359268}
}
```