Fast Subset Selection algorithm for Statistics/Machine Learning
Author jeanpauphilet
11 Stars
Updated Last
1 Year Ago
Started In
September 2017



SubsetSelection is a Julia package that computes sparse L2-regularized estimators. Sparsity is enforced through explicit cardinality constraint or L0-penalty. Supported loss functions for regression are least squares, L1 and L2 SVR; for classification, logistic, L1 and L2 Hinge loss. The algorithm formulates the problem as a mixed-integer saddle-point problem and solves its boolean relaxation using a dual sub-gradient approach.

Quick start

To install the package:

julia> Pkg.install("SubsetSelection")

or the have the latest version

julia> Pkg.clone("git://")

To fit a basic model:

julia> using SubsetSelection, StatsBase

julia> n = 100; p = 10000; k = 10;
julia> indices = sort(sample(1:p, StatsBase.Weights(ones(p)/p), k, replace=false));
julia> w = sample(-1:2:1, k);
julia> X = randn(n,p); Y = X[:,indices]*w;
julia> Sparse_Regressor = subsetSelection(OLS(), Constraint(k), Y, X)

The algorithm returns a SparseEstimator object with the following fields: loss (loss function used), sparsity (model to enforce sparsity), indices (features selected), w (value of the estimator on the selected features only), α (values of the associated dual variables), b (bias term), iter (number of iterations required by the algorithm).

For instance, you can access the selected features directly in the indices field:

julia> Sparse_Regressor.indices
10-element Array{Int64,1}:

or compute predictions

julia> Y_pred = X[:,Sparse_Regressor.indices]*Sparse_Regressor.w
100-element Array{Float64,1}:

For classification, we use +1/-1 labels and the convention P ( Y = y | X = x ) = 1 / (1+e^{- y x^T w}).

Required and optional parameters

subsetSelection has four required parameters:

  • the loss function to be minimized, to be chosen among least squares (OLS()), L1SVR (L1SVR(ɛ)), L2SVR (L2SVR(ɛ)), Logistic loss (LogReg()), Hinge Loss (L1SVM()), L2-SVM (L2SVM()). For classification, we recommend using Hinge loss or L2-SVM functions. Indeed, the Fenchel conjugate of the Logistic loss exhibits unbounded gradients, which largely hinders convergence of the algorithm and might require smaller and more steps (see optional parameters).
  • the model used to enforce sparsity; either by adding a hard constraint of the form "||w||_0 < k" (Constraint(k)) or by adding a penalty of the form "+ λ ||w||_0" (Penalty(λ)) to the objective. For tractability issues, we highly recommend using an explicit constraint instead of a penalty, for it ensures the size of the support remains bounded through the algorithm.
  • the vector of outputs Y of size n, the sample size. In classification settings, Y should be a vector of ±1s.
  • the matrix of covariates X of size n×p, where n and p are the number of samples and features respectively.

In addition, subsetSelection accepts the following optional parameters:

  • an initialization for the selected features, indInit.
  • an initialization for the dual variables, αInit.
  • the value of the ℓ2-regularization parameter γ, set to 1/√n by default.
  • intercept, a boolean. If true, an intercept/bias term is computed as well. By default, set to false.
  • the maximum number of iterations in the sub-gradient algorithm, maxIter.
  • the value of the gradient stepsize δ. By default, the stepsize is set to 1e-3, which demonstrates very good empirical performance. However, smaller stepsizes might be needed when dealing with very large datasets or when the Logistic loss is used.
  • the number of gradient updates of dual variable α performed per update of primal variable s, gradUp.
  • anticycling a boolean. If true, the algorithm stops as soon as the support is not unchanged from one iteration to another. Empirically, the accuracy of the resulting support is strongly sensitive to noise - to use with caution. By default, set to false.
  • averaging a boolean. If true, the dual solution is averaged over past iterates. By default, set to true.

Best practices

  • Tuning the regularization parameter γ: By default, γ is set to 1/√n, which is an appropriate scaling in most regression instances. For an optimal performance, and especially in classification or noisy settings, we recommend performing a grid search and using cross-validation to assess out-of-sample performance. The grid search should start with a very low value for γ, such as
    γ = 1.*p / k / n / maximum(sum(X[train,:].^2,2))

and iteratively increase it by a factor 2. Mean square error or Area Under the Curve (see ROCAnalysis for implementation) are commonly used performance metrics for regression and classification tasks respectively.

  • Instances where the algorithm fails to converge have been reported. If you occur such cases, try normalize the data matrix X and relaunch the algorithm. If the algorithm still fails to converge, reduce the stepsize δ by a factor 10 or 100 and increase the number of iterations maxIter by a factor at least 2.