NMFk: Nonnegative Matrix Factorization + k-means clustering and physics constraints
NMFk is one of the tools in the SmartTensors ML framework (smarttensors.com).
NMFk is a novel unsupervised machine learning methodology that allows for automatic identification of the optimal number of features (signals/signatures) present in the data.
Classical NMF approaches do not allow for automatic estimation of the number of features.
NMFk estimates the number of features k through k-means clustering coupled with regularization constraints (sparsity, physical, mathematical, etc.).
NMFk can be applied to perform:
- Feature extraction (FE)
- Blind source separation (BSS)
- Detection of disruptions / anomalies
- Image recognition
- Text mining
- Data classification
- Separation (deconstruction) of co-occurring (physics) processes
- Discovery of unknown dependencies and phenomena
- Development of reduced-order/surrogate models
- Identification of dependencies between model inputs and outputs
- Guiding the development of physics models representing the ML analyzed data
- Blind predictions
- Optimization of data acquisition (optimal experimental design)
- Labeling of datasets for supervised ML analyses
NMFk provides high-performance computing capabilities to solve problems with Shared and Distributed Arrays in parallel. The parallelization allows for utilization of multi-core / multi-processor environments. GPU and TPU accelerations are available through existing Julia packages.
NMFk provides advanced tools for data visualization, pre- and post-processing. These tools substantially facilitate utilization of the package in various real-world applications.
NMFk methodology and applications are discussed in the research papers and presentations listed below.
NMFk is demonstrated with a series of examples and test problems provided here.
SmartTensors and NMFk were recently awarded:
- 2021 R&D100 Award: Information Technologies (IT)
- 2021 R&D100 Bronze Medal: Market Disruptor in Services
After starting Julia, execute:

    import Pkg
    Pkg.add("NMFk")

to access the latest released version.
To utilize the latest code updates (commits), use:

    import Pkg
    Pkg.add(Pkg.PackageSpec(name="NMFk", rev="master"))

NMFk can also be run using Docker:

    docker run --interactive --tty montyvesselinov/tensors

The docker image provides access to all SmartTensors packages (smarttensors.github.io).
To test the installed package, execute:

    import Pkg
    Pkg.test("NMFk")

A simple problem demonstrating NMFk can be executed as follows.
First, generate 3 random signals in a matrix W:

    a = rand(15)
    b = rand(15)
    c = rand(15)
    W = [a b c]

Then, mix the signals to produce a data matrix X of 5 sensors observing the mixed signals as follows:

    X = [a+c*3 a*10+b b b*5+c a+b*2+c*5]

This is equivalent to generating a mixing matrix H and obtaining X by multiplying W and H:

    H = [1 10 0 0 1; 0 1 1 5 2; 3 0 0 1 5]
    X = W * H

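The equivalence of the two constructions can be checked directly. The short sketch below uses only base Julia; the names `X_direct` and `X_product` are introduced here for illustration and do not appear in NMFk.

```julia
# Three random signals, stored as the columns of W (15 x 3).
a, b, c = rand(15), rand(15), rand(15)
W = [a b c]

# Data matrix built column by column (one column per sensor).
X_direct = [a+c*3 a*10+b b b*5+c a+b*2+c*5]

# Mixing matrix: row i holds the weights of signal i at each of the 5 sensors.
H = [1 10 0 0 1; 0 1 1 5 2; 3 0 0 1 5]
X_product = W * H

# Both constructions give the same 15 x 5 matrix (up to round-off).
@assert size(X_product) == (15, 5)
@assert X_direct ≈ X_product
```

Each column of H lists the coefficients used for the corresponding sensor; for example, the first column `[1, 0, 3]` reproduces `a + c*3`.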
After that, execute NMFk to estimate the number of unknown mixed signals based only on the information in X:

    import NMFk
    We, He, fitquality, robustness, aic, kopt = NMFk.execute(X, 2:5; save=false, method=:simple);

The execution will produce output like this:

    [ Info: Results
    Signals: 2 Fit: 15.489 Silhouette: 0.9980145 AIC: -38.30184
    Signals: 3 Fit: 3.452203e-07 Silhouette: 0.8540085 AIC: -1319.743
    Signals: 4 Fit: 8.503988e-07 Silhouette: -0.5775127 AIC: -1212.129
    Signals: 5 Fit: 2.598571e-05 Silhouette: -0.6757581 AIC: -915.6589
    [ Info: Optimal solution: 3 signals

The code returns the estimated optimal number of signals kopt, which in this case, as expected, is equal to 3.
The code also returns fitquality and robustness; they can be applied to represent how the solutions change with the increase of k:

    NMFk.plot_signal_selecton(2:5, fitquality, robustness)

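As a rough illustration of how the fit and silhouette values can be read, the sketch below picks the largest number of signals whose silhouette stays above a cutoff. The helper `pick_k` and the cutoff value of 0.25 are assumptions made for this illustration; they are not part of the NMFk API and are not claimed to be the rule NMFk applies internally. The numbers are copied from the sample output above.

```julia
# Hypothetical helper (not part of NMFk): return the largest k whose
# silhouette exceeds an assumed cutoff, or nothing if no k qualifies.
function pick_k(krange, silhouette; cutoff=0.25)
    idx = findlast(s -> s > cutoff, silhouette)
    return idx === nothing ? nothing : krange[idx]
end

krange = 2:5
fit = [15.489, 3.452203e-07, 8.503988e-07, 2.598571e-05]
silhouette = [0.9980145, 0.8540085, -0.5775127, -0.6757581]

k = pick_k(krange, silhouette)
println("Selected number of signals: ", k)   # prints 3 for these values
```

For these values the fit improves sharply from 2 to 3 signals, while the silhouette turns negative for 4 and 5 signals, which indicates that the additional signals are not reproducible across runs.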
The code also returns estimates of the matrices W and H.
It can be easily verified that the estimated We[kopt] and He[kopt] are scaled versions of the original W and H matrices.
Note that the order of columns ('signals') in W and We[kopt] is not expected to match.
The order of rows in H and He[kopt], which also correspond to signals (the columns of H correspond to sensors), is not expected to match either.
The estimated orders will be different every time the code is executed.
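The scaling and ordering ambiguity can be illustrated without running NMFk. In the sketch below, `West` and `Hest` are hand-made stand-ins for an estimated factorization (an assumption for illustration, not NMFk output): the signals are reordered and rescaled, yet the product still reproduces X. Correlating columns is one simple way to pair estimated signals with the original ones.

```julia
using Random, Statistics

Random.seed!(1)
a, b, c = rand(15), rand(15), rand(15)
W = [a b c]
H = [1 10 0 0 1; 0 1 1 5 2; 3 0 0 1 5]
X = W * H

# Stand-in for an estimated factorization (assumption, not NMFk output):
# reorder the signals and rescale each one.
perm = [3, 1, 2]
scale = [2.0, 0.5, 4.0]
West = W[:, perm] .* scale'    # column j is scale[j] * W[:, perm[j]]
Hest = H[perm, :] ./ scale     # row j is H[perm[j], :] / scale[j]

# The product is unchanged, so X alone cannot determine order or scale.
@assert West * Hest ≈ X

# Pair each estimated signal with the original column it correlates with most.
C = cor(West, W)               # C[i, j] = cor(West[:, i], W[:, j])
matched = [argmax(abs.(C[i, :])) for i in 1:size(C, 1)]
println(matched)
```

With real NMFk results, `We[kopt]` and `He[kopt]` would take the place of `West` and `Hest`, and the correlations would be close to, rather than exactly, 1.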
The matrices can be visualized using:

    import Pkg; Pkg.add("Mads")
    import Mads
    Mads.plotseries([a b c])
    Mads.plotseries(We[kopt] ./ maximum(We[kopt]))
    NMFk.plotmatrix(H)
    NMFk.plotmatrix(He[kopt] ./ maximum(He[kopt]))

More examples can be found in the notebooks directories of the NMFk repository.
NMFk has been applied in a wide range of real-world applications. The analyzed datasets include model outputs, experimental laboratory data, and field tests:
- Climate data and simulations
- Watershed data and simulations
- Aquifer simulations
- Surface-water and groundwater analyses
- Material characterization
- Reactive mixing
- Molecular dynamics
- Contaminant transport
- Induced seismicity
- Phase separation of co-polymers
- Oil / Gas extraction from unconventional reservoirs
- Geothermal exploration and production
- Geologic carbon storage
Video: progress of the nonnegative matrix factorization process.
More videos are available on YouTube.
A series of Jupyter notebooks demonstrating NMFk has been developed:
The notebooks can also be accessed using:
Alexandrov, B.S., Vesselinov, V.V., Alexandrov, L.B., Stanev, V., Iliev, F.L., Source identification by non-negative matrix factorization combined with semi-supervised clustering, US20180060758A1
- Vesselinov, V.V., Mudunuru, M., Karra, S., O'Malley, D., Alexandrov, B.S., Unsupervised Machine Learning Based on Non-Negative Tensor Factorization for Analyzing Reactive-Mixing, Journal of Computational Physics, 10.1016/j.jcp.2019.05.039, 2019.
- Vesselinov, V.V., Alexandrov, B.S., O'Malley, D., Nonnegative Tensor Factorization for Contaminant Source Identification, Journal of Contaminant Hydrology, 10.1016/j.jconhyd.2018.11.010, 2018.
- O'Malley, D., Vesselinov, V.V., Alexandrov, B.S., Alexandrov, L.B., Nonnegative/binary matrix factorization with a D-Wave quantum annealer, PLoS ONE, 10.1371/journal.pone.0206653, 2018.
- Stanev, V., Vesselinov, V.V., Kusne, A.G., Antoszewski, G., Takeuchi, I., Alexandrov, B.A., Unsupervised Phase Mapping of X-ray Diffraction Data by Nonnegative Matrix Factorization Integrated with Custom Clustering, npj Computational Materials, 10.1038/s41524-018-0099-2, 2018.
- Iliev, F.L., Stanev, V.G., Vesselinov, V.V., Alexandrov, B.S., Nonnegative Matrix Factorization for identification of unknown number of sources emitting delayed signals, PLoS ONE, 10.1371/journal.pone.0193974, 2018.
- Stanev, V.G., Iliev, F.L., Hansen, S.K., Vesselinov, V.V., Alexandrov, B.S., Identification of the release sources in advection-diffusion system by machine learning combined with Green function inverse method, Applied Mathematical Modelling, 10.1016/j.apm.2018.03.006, 2018.
- Vesselinov, V.V., O'Malley, D., Alexandrov, B.S., Contaminant source identification using semi-supervised machine learning, Journal of Contaminant Hydrology, 10.1016/j.jconhyd.2017.11.002, 2017.
- Alexandrov, B., Vesselinov, V.V., Blind source separation for groundwater level analysis based on non-negative matrix factorization, Water Resources Research, 10.1002/2013WR015037, 2014.
- Vesselinov, V.V., Physics-Informed Machine Learning Methods for Data Analytics and Model Diagnostics, M3 NASA DRIVE Workshop, Los Alamos, 2019.
- Vesselinov, V.V., Unsupervised Machine Learning Methods for Feature Extraction, New Mexico Big Data & Analytics Summit, Albuquerque, 2019.
- Vesselinov, V.V., Novel Unsupervised Machine Learning Methods for Data Analytics and Model Diagnostics, Machine Learning in Solid Earth Geoscience, Santa Fe, 2019.
- Vesselinov, V.V., Novel Machine Learning Methods for Extraction of Features Characterizing Datasets and Models, AGU Fall Meeting, Washington D.C., 2018.
- Vesselinov, V.V., Novel Machine Learning Methods for Extraction of Features Characterizing Complex Datasets and Models, Recent Advances in Machine Learning and Computational Methods for Geoscience, Institute for Mathematics and its Applications, University of Minnesota, 10.13140/RG.2.2.16024.03848, 2018.
- Vesselinov, V.V., Mudunuru, M., Karra, S., O'Malley, D., Alexandrov, B.S., Unsupervised Machine Learning Based on Non-negative Tensor Factorization for Analysis of Field Data and Simulation Outputs, Computational Methods in Water Resources (CMWR), Saint-Malo, France, 10.13140/RG.2.2.27777.92005, 2018.
- O'Malley, D., Vesselinov, V.V., Alexandrov, B.S., Alexandrov, L.B., Nonnegative/binary matrix factorization with a D-Wave quantum annealer.
- Vesselinov, V.V., Alexandrov, B.A., Model-free Source Identification, AGU Fall Meeting, San Francisco, CA, 2014.
Installation behind a firewall
Julia uses git and curl to install the necessary packages.
It is important to set proxies if needed:

    export ftp_proxy=http://proxyout.<your_site>:8080
    export rsync_proxy=http://proxyout.<your_site>:8080
    export http_proxy=http://proxyout.<your_site>:8080
    export https_proxy=http://proxyout.<your_site>:8080
    export no_proxy=.<your_site>

For example, if you are doing this at LANL, you will need to execute the following lines in your bash command-line environment:

    export ftp_proxy=http://proxyout.lanl.gov:8080
    export rsync_proxy=http://proxyout.lanl.gov:8080
    export http_proxy=http://proxyout.lanl.gov:8080
    export https_proxy=http://proxyout.lanl.gov:8080
    export no_proxy=.lanl.gov

Proxies can also be set up directly in the Julia REPL:

    ENV["ftp_proxy"] = "http://proxyout.lanl.gov:8080"
    ENV["rsync_proxy"] = "http://proxyout.lanl.gov:8080"
    ENV["http_proxy"] = "http://proxyout.lanl.gov:8080"
    ENV["https_proxy"] = "http://proxyout.lanl.gov:8080"
    ENV["no_proxy"] = ".lanl.gov"

To disable proxies, type these commands in the Julia REPL:

    ENV["ftp_proxy"] = ""
    ENV["rsync_proxy"] = ""
    ENV["http_proxy"] = ""
    ENV["https_proxy"] = ""
    ENV["no_proxy"] = ""

In some situations, you may need to add the following to the .gitconfig file in your home directory:

    [url "git@github.com:"]
        insteadOf = https://github.com/
    [url "git@gitlab.com:"]
        insteadOf = https://gitlab.com/
    [url "https://"]
        insteadOf = git://
    [url "http://"]
        insteadOf = git://

or execute:

    git config --global url."https://".insteadOf git://
    git config --global url."http://".insteadOf git://
    git config --global url."git@gitlab.com:".insteadOf https://gitlab.com/
    git config --global url."git@github.com:".insteadOf https://github.com/

To resolve a "Private key location for 'email@example.com'" error message, execute: