The algorithms implemented are described in the following papers:
- The so-called SCSG algorithm, described and analyzed in the two papers by L. Lei and M.I. Jordan.
- The SVRG algorithm, described in the paper by R. Johnson and T. Zhang, "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction", Advances in Neural Information Processing Systems, pages 315–323, 2013.
These algorithms minimize functions given by an expression of the form:
f(x) = (1/n) ∑ᵢ fᵢ(x), where each fᵢ is a convex function.
All algorithms alternate between some form of large-batch computation (computing the gradient of many terms of the sum) and small or mini-batch computations (computing the gradient of a small number of terms, possibly just one), and update the position by combining these global and local gradients.
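To fix ideas, here is a minimal sketch of one such round in the SVRG spirit. It is an illustration only, not the package's actual code, and grad_term(i, x) is an assumed helper returning ∇fᵢ(x):

```julia
# Minimal sketch of one variance-reduced round (SVRG-like), for illustration only.
# grad_term(i, x) is an assumed helper returning ∇fᵢ(x).
function vr_round(grad_term, x, n, m, step)
    x_anchor = copy(x)
    # large batch: gradient over all n terms at the anchor point
    mu = sum(grad_term(i, x_anchor) for i in 1:n) / n
    for _ in 1:m
        i = rand(1:n)                                   # mini batch of one term
        g = grad_term(i, x) .- grad_term(i, x_anchor) .+ mu
        x = x .- step .* g                              # combine global and local gradients
    end
    return x
end
```

Roughly speaking, SCSG differs in that the large batch is itself a random subsample of the n terms whose size evolves along the iterations, so mu is only an estimate of the full gradient.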
The user API relies mainly upon two structures defining the optimization problem (detailed in the file evaluator.jl), which specify how to compute the value and the gradient of each term of the sum at a given position:
- struct TermFunction{F <: Function}
- struct TermGradient{G <: Function}
The directory src/applis shows examples of definitions of these two structures.
These two structures are grouped into a struct Evaluator{F,G}, which can thus do all the value and gradient computations needed during the iterations.
Each algorithm, SVRG or SCSG, is described by its own structure and has its specific parameters and iteration strategy.
A minimize function, taking as arguments an SCSG or SVRG structure, an evaluator structure, the maximum number of iterations and an initial position, then returns the minimum value obtained and the corresponding position.
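On a least-squares problem, the whole setup could look as follows. This is a sketch only: the constructor signatures, argument orders and parameter values below are assumptions; the authoritative definitions are in evaluator.jl and the examples in src/applis.

```julia
using LinearAlgebra

# Sketch with assumed signatures; see evaluator.jl and src/applis for the real API.
# Each term is fᵢ(x) = 0.5 * (⟨aᵢ, x⟩ - bᵢ)² for an observation (aᵢ, bᵢ).
obs = [(a = randn(5), b = randn()) for _ in 1:1000]

value_i(x, o) = 0.5 * (dot(o.a, x) - o.b)^2
gradient_i!(g, x, o) = (g .= (dot(o.a, x) - o.b) .* o.a; g)

tf = TermFunction(value_i, obs)       # assumed constructor
tg = TermGradient(gradient_i!, obs)   # assumed constructor
ev = Evaluator(tf, tg)

scsg = SCSG(0.02, 0.004, 0.1)         # illustrative values for B_0, m_0, step_0
fmin, xmin = minimize(scsg, ev, 50, zeros(5))   # at most 50 iterations from the origin
```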
Further documentation can be found in the docs of the Julia package (run julia make.jl), with some more in the doc of the equivalent Rust crate multistochgrad.rs and in the reference papers.
We compare below the performance of the Rust and the Julia versions.
Small tests consist of a line-fitting problem and a logistic regression.
The logistic regression test and example run on the training files of the MNIST digits database, as in the second paper on SCSG.
The data files can be downloaded from the MNIST site; the paths declared in the file TestLogisiticRegression.jl must then be modified accordingly.
The database has 60000 images of 784 pixels corresponding to handwritten digits from 0 to 9.
The logistic regression, with 10 classes, is tested with the two algorithms, and some comments comparing the results are provided.
Times are obtained by launching each example twice, so that the compilation time of the first pass is not counted.
Run times are those obtained on a laptop with a 4-core hyperthreaded i7 CPU at 2.7 GHz.
The identifiability constraint was set on the class corresponding to the digit 0, contrary to the Rust tests where the digit-9 class was chosen; this explains the different initial error and the fact that the best step is not the same.
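For reference, and up to implementation details (such as an intercept term) that are not spelled out here, the objective being minimized is the multinomial logistic negative log-likelihood with the weights of the digit-0 class pinned to zero:

f(w₁, …, w₉) = (1/n) ∑ᵢ [ log(1 + ∑ₖ₌₁⁹ exp(⟨wₖ, xᵢ⟩)) − ⟨w_{yᵢ}, xᵢ⟩ ],  with w₀ ≡ 0

At zero weights this gives log(10) ≈ 2.3, which matches the initial error reported for the pixel = 0.0 runs below.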
For the meaning of the parameters B_0 and m_0, see the documentation of SCSG. Here are some results for SCSG:
- initial position: 9 images with constant pixel = 0.5; error at initial position: 8.88
nb iter | B_0 | m_0 | step_0 | y value | time(s) |
---|---|---|---|---|---|
50 | 0.02 | 0.004 | 0.1 | 0.29 | 11.3 |
50 | 0.02 | 0.006 | 0.1 | 0.281 | 14.0 |
70 | 0.02 | 0.006 | 0.1 | 0.272 | 19.5 |
100 | 0.02 | 0.006 | 0.1 | 0.265 | 27.0 |
- initial position: 9 images with constant pixel = 0.0; error at initial position: 2.3
nb iter | B_0 | m_0 | step_0 | y value | time(s) |
---|---|---|---|---|---|
50 | 0.015 | 0.003 | 0.1 | 0.285 | 10.2 |
50 | 0.02 | 0.004 | 0.1 | 0.28 | 11 |
50 | 0.02 | 0.006 | 0.1 | 0.274 | 14 |
100 | 0.02 | 0.004 | 0.1 | 0.266 | 22.0 |
100 | 0.02 | 0.006 | 0.1 | 0.262 | 27.5 |
For SVRG, the results are the following:
- initial position: 9 images with constant pixel = 0.5; error at initial position: 8.88
nb iter | nb mini batch | step | y value | time(s) |
---|---|---|---|---|
50 | 1000 | 0.05 | 0.269 | 25 |
50 | 2000 | 0.05 | 0.255 | 27 |
- initial position: 9 images with constant pixel = 0.0; error at initial position: 2.3
nb iter | nb mini batch | step | y value | time(s) |
---|---|---|---|---|
50 | 500 | 0.05 | 0.28 | 23.5 |
50 | 1000 | 0.05 | 0.264 | 24.5 |
50 | 2000 | 0.05 | 0.25 | 27 |
100 | 1000 | 0.05 | 0.253 | 50 |
All times were obtained with the @time macro and Julia running with JULIA_NUM_THREADS = 8.
We see that SCSG quickly reaches a minimum of 0.28; reaching 0.26-0.27 is more difficult, but it is nevertheless quite competitive with SVRG.
The efficiency gain with respect to SVRG is not as large as with the Rust version (see below), where we get a factor of 1.8 in CPU time due to a multithreading effect.
Our test uses 60000 observations, and SCSG runs on at most 1/10 of the terms in a batch (i.e. 6000 terms); SVRG batches, on the contrary, run on the full 60000 terms.
Batch sizes in SCSG are thus not large enough to fully benefit from the multithreading.
This can be verified by setting JULIA_NUM_THREADS = 1 and comparing how the SCSG and SVRG timings benefit from the threading.
We did two checks with the initial condition set to pixel = 0.5 and compared with the previous results:
- For SCSG we ran the case with parameters (70, 0.02, 0.006, 0.1), corresponding to the 70-iteration line of the first array of results. With 8 threads we had y = 0.27 in 20 s; with one thread we obtain y = 0.27 in 36.5 s, so the threading gives us less than a factor of 2.
- For SVRG we ran the case with parameters (50, 1000, 0.05), corresponding to the first line of the first array of SVRG results. We had y = 0.269 in 25 s; with one thread we need 72 s. Here the threading gives us a gain of nearly a factor of 3.
So the efficiency of SCSG should be more evident on larger problems; note also that threading in Julia is still young.
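This is consistent with the way a large-batch gradient is typically accumulated: one partial sum per thread, so the thread overhead is only amortized when the batch is large. A minimal sketch (not the package's actual code, with the same assumed grad_term helper as above):

```julia
using Base.Threads

# Sketch of a threaded batch-gradient accumulation, for illustration only.
# Each thread sums a static slice of the batch into its own partial gradient.
function batch_gradient(grad_term, x, batch)
    nt = nthreads()
    partials = [zeros(length(x)) for _ in 1:nt]
    @threads for t in 1:nt
        for j in t:nt:length(batch)
            partials[t] .+= grad_term(batch[j], x)
        end
    end
    return sum(partials) ./ length(batch)
end
```

With SCSG's smaller batches each thread gets little work, which is consistent with the smaller speedup measured above.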
The logistic regression needed the explicit use of the BLAS interface to speed up vectorization.
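For instance, an in-place accumulation of the form g ← g + α·x in the inner gradient loops is better done with a direct BLAS call than with a plain broadcast; something in this spirit (a sketch, not the package's exact code):

```julia
using LinearAlgebra

# Sketch: in-place axpy, g ← g + α * x, without temporary allocations.
g = zeros(784)          # gradient accumulator (one 784-pixel image)
x = rand(784)           # one observation
α = 0.3                 # weight coming from the logistic term
BLAS.axpy!(α, x, g)     # equivalent to g .+= α .* x, but through BLAS
```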
Nevertheless, the Julia version runs within a factor of 1.5 to 1.8 of the Rust version, which seems a good compromise.
There is also a Rust implementation of this package, the crate multistochgrad.rs.
The Rust version also has an implementation of the SAG algorithm, the Stochastic Averaged Gradient descent, as described in the paper "Minimizing Finite Sums with the Stochastic Average Gradient" (2013, 2016) by M. Schmidt, N. Le Roux and F. Bach.
- Version 0.1.2 runs 10-15% faster on SCSG due to some optimizations; see the post "The need for rand speed" on Julia Bloggers.
Licensed under either of
- Apache License, Version 2.0, LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0
- MIT license LICENSE-MIT or http://opensource.org/licenses/MIT
at your option.
This software was written on my own while working at CEA, CEA-LIST.