Jchemo.jl
Dimension reduction, Regression and Discrimination for Chemometrics
Jchemo.jl is a package for data exploration and prediction, with a focus on high-dimensional data.
The package was initially built around partial least squares regression and discrimination models and their variants, in particular locally weighted PLS models (LWPLS) (e.g. https://doi.org/10.1002/cem.3209). It has since been extended to many other methods for analyzing high-dimensional data.
The package was named Jchemo because it is oriented towards chemometrics, but most of the provided methods are fully generic and can be applied to other domains.
Generic functions such as transform, predict, coef and summary are available. Tuning predictive models is facilitated by the generic functions gridscore (validation dataset) and gridcv (cross-validation). Faster versions of these functions are also available for models based on latent variables (LVs) (gridscorelv and gridcvlv) and on ridge regularization (gridscorelb and gridcvlb); see the syntax examples below.
Most of the functions of the package have a help page (providing an example), e.g.:
?plskern
Examples demonstrating Jchemo.jl are available in the project JchemoDemo, intended for training purposes only. The datasets used in these examples are stored in the package JchemoData.jl.
Some of the functions of the package (in particular those using kNN selections) use multi-threading to speed up computations. Taking advantage of this requires specifying a relevant number of threads (e.g. from the 'Settings' menu of the VS Code Julia extension and the file 'settings.json').
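The number of threads available to the running Julia session can be checked with the standard Base function Threads.nthreads() (the environment variable JULIA_NUM_THREADS, visible in the benchmark below, is one way to set it):
Threads.nthreads()   ## nb. of threads available to the current session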
Jchemo.jl uses Makie.jl for plotting. Installing and loading one of the Makie backends (e.g. CairoMakie.jl) is required to display the plots.
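For instance (a minimal sketch, assuming the plotting helper plotsp, which displays the rows of a matrix as curves and returns the Makie figure in field f):
using Jchemo, CairoMakie
X = rand(10, 100)    ## e.g. 10 spectra recorded at 100 wavelengths
plotsp(X).f          ## plot the rows of X as curves and display the figure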
Before updating the package, it is recommended to have a look at What changed, to avoid problems due to possible breaking changes.
Dependent packages
Installation
In order to install Jchemo, run in the Pkg REPL:
pkg> add Jchemo
or for a specific version:
pkg> add Jchemo@0.1.18
or for the current development version (not 100% stable):
pkg> add https://github.com/mlesnoff/Jchemo.jl.git
Benchmark - Computation time for a PLS with 1e6 observations
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 8 on 16 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 8
using Jchemo
## PLS2 with 1e6 observations
## (NB.: multi-threading is not used in plskern)
n = 10^6 # nb. observations (samples)
p = 500 # nb. X-variables (features)
q = 10 # nb. Y-variables to predict
X = rand(n, p)
Y = rand(n, q)
nlv = 25 # nb. PLS latent variables
@time plskern(X, Y; nlv = nlv) ;
8.100469 seconds (299 allocations: 4.130 GiB, 6.58% gc time)
@time plskern!(X, Y; nlv = nlv) ;   ## in-place variant: may overwrite X and Y, hence the lower allocation
7.232234 seconds (6.47 k allocations: 338.617 MiB, 7.39% gc time, 0.13% compilation time)
Examples of syntax for predictive models
Fitting a model
using Jchemo
n = 150 ; p = 200 ; q = 2 ; m = 50
Xtrain = rand(n, p) ; Ytrain = rand(n, q)
Xtest = rand(m, p) ; Ytest = rand(m, q)
## Model fitting
nlv = 5
fm = plskern(Xtrain, Ytrain; nlv = nlv) ;
pnames(fm) # print the names of objects contained in 'fm'
## Some summary
summary(fm, Xtrain)
## Computation of the PLS scores (LVs) for Xtest
Jchemo.transform(fm, Xtest)
Jchemo.transform(fm, Xtest; nlv = 1)
## PLS b-coefficients
Jchemo.coef(fm)
Jchemo.coef(fm; nlv = 2)
## Predictions and performance of the fitted model
res = Jchemo.predict(fm, Xtest)
res.pred
rmsep(res.pred, Ytest)
mse(res.pred, Ytest)
Jchemo.predict(fm, Xtest).pred
Jchemo.predict(fm, Xtest; nlv = 0:3).pred
Tuning a model by grid-search
- With gridscore
using Jchemo, StatsBase, CairoMakie
ntrain = 150 ; p = 200
ntest = 80
Xtrain = rand(ntrain, p) ; ytrain = rand(ntrain)
Xtest = rand(ntest, p) ; ytest = rand(ntest)
## Train is split into Cal+Val to tune the model,
## and the generalization error is estimated on Test.
nval = 50
s = sample(1:ntrain, nval; replace = false)
Xcal = rmrow(Xtrain, s)
ycal = rmrow(ytrain, s)
Xval = Xtrain[s, :]
yval = ytrain[s]
## Computation of the performance over the grid
## (the model is fitted on Cal, and the performance is
## computed on Val)
nlv = 0:10
res = gridscorelv(
Xcal, ycal, Xval, yval;
score = rmsep, fun = plskern, nlv = nlv)
## Plot the results
plotgrid(res.nlv, res.y1,
xlabel = "Nb. LVs", ylabel = "RMSEP").f
## Predictions and performance of the best model
u = findall(res.y1 .== minimum(res.y1))[1]   ## index of the best model (minimal RMSEP)
res[u, :]
fm = plskern(Xtrain, ytrain; nlv = res.nlv[u]) ;
res = Jchemo.predict(fm, Xtest)
rmsep(res.pred, ytest)
## *Note*: For PLSR models, using gridscorelv is much faster
## than using the generic function gridscore.
## In the same manner, for ridge regression models,
## gridscorelb is much faster than gridscore.
## Syntax for the generic gridscore
pars = mpar(nlv = nlv)   ## 'mpar' builds the collection of parameter combinations to evaluate
res = gridscore(
Xcal, ycal, Xval, yval;
score = rmsep, fun = plskern, pars = pars)
- With gridcv
using Jchemo, StatsBase, CairoMakie
ntrain = 150 ; p = 200
ntest = 80
Xtrain = rand(ntrain, p) ; ytrain = rand(ntrain)
Xtest = rand(ntest, p) ; ytest = rand(ntest)
## Train is used to tune the model,
## and the generalization error is estimated on Test.
## Build the cross-validation (CV) segments
## Replicated K-Fold CV
K = 5 # Nb. folds
rep = 10 # Nb. replications (rep = 1 ==> no replication)
segm = segmkf(ntrain, K; rep = rep)
## Or replicated test-set CV
m = 30 # Size of the test-set
rep = 10 # Nb. replications (rep = 1 ==> no replication)
segm = segmts(ntrain, m; rep = rep)
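## (NB.: see ?segmkf and ?segmts for the exact structure of 'segm';
## typically, segm[i] contains the validation segment(s) of row indices
## for replication i, e.g. segm[1][1] is the first segment of replication 1)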
## Computation of the performances over the grid
nlv = 0:10
rescv = gridcvlv(
Xtrain, ytrain; segm = segm,
score = rmsep, fun = plskern, nlv = nlv) ;
pnames(rescv)
res = rescv.res
## Plot the results
plotgrid(res.nlv, res.y1,
xlabel = "Nb. LVs", ylabel = "RMSEP").f
## Predictions and performance of the best model
u = findall(res.y1 .== minimum(res.y1))[1]   ## index of the best model (minimal RMSEP)
res[u, :]
fm = plskern(Xtrain, ytrain; nlv = res.nlv[u]) ;
res = Jchemo.predict(fm, Xtest)
rmsep(res.pred, ytest)
## *Note*: For PLSR models, using gridcvlv is much faster
## than using the generic function gridcv.
## In the same manner, for ridge regression models,
## gridcvlb is much faster than gridcv.
## Using the generic function gridcv:
pars = mpar(nlv = nlv)   ## 'mpar' builds the collection of parameter combinations to evaluate
rescv = gridcv(
Xtrain, ytrain; segm = segm,
score = rmsep, fun = plskern, pars = pars) ;
pnames(rescv)
res = rescv.res
Author
Matthieu Lesnoff
- Cirad, UMR Selmet, Montpellier, France
- ChemHouse, Montpellier
How to cite
Lesnoff, M. 2021. Jchemo: a Julia package for dimension reduction, regression and discrimination for chemometrics. https://github.com/mlesnoff/Jchemo. CIRAD, UMR SELMET, Montpellier, France
Acknowledgments
- G. Cornu (Cirad) https://ur-forets-societes.cirad.fr/en/l-unite/l-equipe
- L. Plagne, F. Févotte (Triscale.innov) https://www.triscale-innov.com
- R. Vezy (Cirad) https://www.youtube.com/channel/UCxArXLI-gxlTmWGGgec5D7w