MultiStochGrad

This package provides an implementation of two stochastic gradient algorithms: SCSG and SVRG.

The algorithms implemented are described in the following papers:

  1. The so-called SCSG algorithm, described and analyzed in the two papers by L. Lei and M. I. Jordan:

    • "On the adaptivity of stochastic gradient based optimization" (2019) SCSG-1

    • "Less than a single pass: stochastically controlled stochastic gradient" (2019) SCSG-2

  2. The SVRG algorithm, described in the paper by R. Johnson and T. Zhang, "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction", Advances in Neural Information Processing Systems, pages 315–323, 2013.

These algorithms minimize functions given by an expression:

    f(x) = 1/n ∑ᵢ₌₁ⁿ fᵢ(x), where each fᵢ is a convex function.

Both algorithms alternate between some form of large-batch computation (computing the gradient over many terms of the sum) and small or mini-batch computations (computing the gradient over a small number of terms, possibly just one), updating the position by combining these global and local gradients, as sketched below.
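
For instance, SVRG periodically computes a full (large-batch) gradient at a reference point and corrects each small-batch gradient with it, which keeps the combined estimate unbiased while reducing its variance. Below is a minimal sketch of one such step; the function grad_term(i, x), standing for the gradient of fᵢ at x, is a hypothetical placeholder, not part of the package API.

```julia
# Minimal sketch of one SVRG-style variance-reduced step (illustrative,
# not the package's actual code). grad_term(i, x) stands for ∇fᵢ(x).
function variance_reduced_step(grad_term, x, x_ref, full_grad_ref, n, step)
    i = rand(1:n)                   # pick one term of the sum at random
    # local gradient at x, corrected with the gradient of the same term
    # at the reference point x_ref and the full gradient computed there
    g = grad_term(i, x) .- grad_term(i, x_ref) .+ full_grad_ref
    return x .- step .* g           # gradient step with the corrected estimate
end
```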

Structure of the package and documentation

The user API relies mainly on two structures defining the optimization problem (detailed in the file evaluator.jl); they specify how to compute the value and the gradient of each term of the sum at a given position:

- struct TermFunction{F <: Function}

- struct TermGradient{G <: Function}

The directory src/applis shows examples of definitions of these two structures.

These two structures are grouped into a struct Evaluator{F,G}, which can thus do all the value and gradient computations needed during the iterations.
Each algorithm, SVRG or SCSG, is described by its own structure with its specific parameters and iteration strategy. A minimize function, taking as arguments an SCSG or SVRG structure, an evaluator structure, the maximum number of iterations and an initial position, then returns the minimum value obtained and its corresponding position, as in the sketch below.
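
The exact constructor signatures live in evaluator.jl and may differ from the simplified forms assumed here; the sketch below only illustrates how the pieces fit together on a toy least-squares problem, and every constructor argument shown is a placeholder.

```julia
using LinearAlgebra

# Toy least-squares data (placeholders for a real problem).
n, dim = 100, 5
A = [randn(dim) for _ in 1:n]
b = randn(n)

# Value and gradient of one term fᵢ of the sum.
term_value(x, i)      = 0.5 * (dot(x, A[i]) - b[i])^2
term_grad!(x, i, out) = (out .= (dot(x, A[i]) - b[i]) .* A[i])

# Assumed constructor forms; see evaluator.jl for the real signatures.
tf = TermFunction(term_value)
tg = TermGradient(term_grad!)
ev = Evaluator(tf, tg)

scsg = SCSG(0.02, 0.006, 0.1)   # placeholder parameters (B_0, m_0, step_0)
best_value, best_pos = minimize(scsg, ev, 50, zeros(dim))
```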

Further documentation can be found in the docs of the Julia package (run julia make.jl), with some more in the documentation of the equivalent Rust crate multistochgrad.rs and in the reference papers.

We compare the performance of the Rust and Julia versions below.

Examples and tests

The small tests consist of a line-fitting problem and a logistic regression.

The logistic regression test and example run on the training files of the MNIST digits database, as in the second SCSG paper.

The data files can be downloaded from the MNIST site; then modify the paths declared in the file TestLogisiticRegression.jl accordingly.

The database has 60000 images of 784 pixels, corresponding to handwritten digits from 0 to 9.
The logistic regression, with 10 classes, is tested with the two algorithms, and some comments comparing the results are provided. Times are obtained by launching the example twice, to avoid counting the compilation time of the first pass. Run times are those obtained on a laptop with a hyperthreaded 4-core i7 at 2.7 GHz.

SCSG logistic regression

The identifiability constraint was set on the class corresponding to the 0 digit (contrary to the Rust tests, where the 9-digit class was chosen; this explains the different initial error and the fact that the best step is not the same).

For the meaning of the parameters B_0, m_0 and step_0, see the documentation of SCSG. Here we give some results:

  • initialization position: 9 images with constant pixel = 0.5; error at initial position: 8.88

| nb iter | B_0  | m_0   | step_0 | y value | time(s) |
|---------|------|-------|--------|---------|---------|
| 50      | 0.02 | 0.004 | 0.1    | 0.29    | 11.3    |
| 50      | 0.02 | 0.006 | 0.1    | 0.281   | 14      |
| 70      | 0.02 | 0.006 | 0.1    | 0.272   | 19.5    |
| 100     | 0.02 | 0.006 | 0.1    | 0.265   | 27      |
  • initialization position: 9 images with constant pixel = 0.0; error at initial position: 2.3

| nb iter | B_0   | m_0   | step_0 | y value | time(s) |
|---------|-------|-------|--------|---------|---------|
| 50      | 0.015 | 0.003 | 0.1    | 0.285   | 10.2    |
| 50      | 0.02  | 0.004 | 0.1    | 0.28    | 11      |
| 50      | 0.02  | 0.006 | 0.1    | 0.274   | 14      |
| 100     | 0.02  | 0.004 | 0.1    | 0.266   | 22      |
| 100     | 0.02  | 0.006 | 0.1    | 0.262   | 27.5    |

SVRG logistic regression

  • initialization position: 9 images with constant pixel = 0.5; error at initial position: 8.88

| nb iter | nb mini batch | step | y value | time(s) |
|---------|---------------|------|---------|---------|
| 50      | 1000          | 0.05 | 0.269   | 25      |
| 50      | 2000          | 0.05 | 0.255   | 27      |
  • initialization position: 9 images with constant pixel = 0.0; error at initial position: 2.3

| nb iter | nb mini batch | step | y value | time(s) |
|---------|---------------|------|---------|---------|
| 50      | 500           | 0.05 | 0.28    | 23.5    |
| 50      | 1000          | 0.05 | 0.264   | 24.5    |
| 50      | 2000          | 0.05 | 0.25    | 27      |
| 100     | 1000          | 0.05 | 0.253   | 50      |

Some comments on CPU times

All times were obtained with the @time macro and Julia running with JULIA_NUM_THREADS = 8.
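
A timing of this kind can be reproduced as sketched below; the minimize call and its arguments are the placeholders from the usage sketch above.

```julia
using Base.Threads: nthreads

@info "Julia running with $(nthreads()) threads"   # set via JULIA_NUM_THREADS

# The first call pays the JIT compilation cost; only the second run is timed.
minimize(scsg, ev, 50, zeros(dim))         # warm-up run
@time minimize(scsg, ev, 50, zeros(dim))   # reported run time
```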

We see that SCSG reaches a value of 0.28 quickly; it has more difficulty reaching 0.26-0.27, but it is nevertheless quite competitive with SVRG.
The efficiency gain with respect to SVRG is not as important as with the Rust version (see below), where we have a factor of 1.8 in CPU time, due to a multithreading effect. Our test uses 60000 observations, and SCSG runs on at most 1/10 of the terms in a batch (i.e. 6000 terms), whereas SVRG batches run over the full 60000 terms.
Batch sizes in SCSG are therefore not large enough to fully benefit from the multithreading. This can be verified by setting JULIA_NUM_THREADS = 1 and comparing how the SCSG and SVRG timings benefit from threading:

We did two checks with the initial condition set to pixel = 0.5 and compared with the previous results:

  • For SCSG we ran the case with parameters (70, 0.02, 0.006, 0.1), corresponding to the third line of the first table of results. With 8 threads we had y = 0.27 in 20 s; with one thread we obtain y = 0.27 in 36.5 s, so threading gives us less than a factor of 2.

  • For SVRG we ran the case with parameters (50, 1000, 0.05), corresponding to the first line of the first table of results for SVRG. We had y = 0.269 in 25 s, whereas with one thread we need 72 s. Here threading gives us a gain of nearly a factor of 3.

So the efficiency of SCSG should be more evident on larger problems; note also that threading in Julia is still young.

The logistic regression needed explicit use of the BLAS interface to speed up the vectorized operations.
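
For instance, a hot-loop position update of the form x ← x - η·g can be done with an in-place BLAS call rather than by allocating temporaries (illustrative only; the actual calls used live in the package source):

```julia
using LinearAlgebra

x = rand(784 * 9)     # current position (sizes are placeholders)
g = rand(784 * 9)     # gradient estimate
η = 0.05              # step size

BLAS.axpy!(-η, g, x)  # x .= x .- η .* g, computed in place by BLAS
```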

Nevertheless, the Julia version runs within a factor of 1.5 to 1.8 of the Rust version, which seems a good compromise.

Rust version of this package

There is also a Rust implementation of this package: the crate multistochgrad.rs.

The Rust version also has an implementation of the SAG algorithm:

the Stochastic Averaged Gradient Descent, as described in the paper "Minimizing Finite Sums with the Stochastic Average Gradient" (2013, 2016) by M. Schmidt, N. Le Roux and F. Bach. A sketch of one SAG step is given below.
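
A minimal sketch of one SAG step (illustrative only, not the Rust crate's code): the algorithm remembers the last gradient computed for each term and moves along the running average of those gradients. As before, grad_term(i, x) is a hypothetical placeholder for ∇fᵢ(x).

```julia
# One SAG iteration (sketch). grads stores the last gradient seen for each
# term, avg their running average over the n terms.
function sag_step!(x, grads, avg, grad_term, n, step)
    i = rand(1:n)                      # pick one term of the sum at random
    g_new = grad_term(i, x)
    avg .+= (g_new .- grads[i]) ./ n   # update the running average in place
    grads[i] = g_new                   # remember the new gradient for term i
    x .-= step .* avg                  # step along the averaged gradient
    return x
end
```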

Note

  • version 0.1.2 runs 10-15 % faster on SCSG due to an optimization of random number generation; see the post "The need for rand speed" on Julia Bloggers.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

This software was written on my own while working at CEA, CEA-LIST.
