TELF.factorization.NMFk: Non-negative Matrix Factorization with Automatic Model Determination
NMFk is a Non-negative Matrix Factorization module with the capability to do automatic model determination.
Example
First, generate synthetic data with a predetermined k. The input can be either a dense matrix (np.ndarray) or a sparse matrix (scipy.sparse._csr.csr_matrix). Here, we use the provided scripts for matrix generation:
import sys
sys.path.append("../../scripts/")
from generate_X import gen_data, gen_data_sparse
Xsp = gen_data_sparse(shape=[100, 200], density=0.01)["X"]
X = gen_data(R=4, shape=[100, 200])["X"]
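If the helper scripts are not available, a comparable test matrix can be built directly with NumPy and SciPy. The sketch below is not the code in generate_X. It assumes the data is simply a product of two random non-negative factors, and make_low_rank is a hypothetical name used only for illustration.

```python
import numpy as np
import scipy.sparse as sp

def make_low_rank(R, shape, seed=0):
    """Hypothetical stand-in for gen_data: X = W @ H with non-negative factors."""
    rng = np.random.default_rng(seed)
    W = rng.random((shape[0], R))   # (rows, R) non-negative factor
    H = rng.random((R, shape[1]))   # (R, cols) non-negative factor
    return W @ H

# Dense matrix whose true rank is 4.
X_demo = make_low_rank(R=4, shape=[100, 200])

# Sparse CSR matrix with about 1% non-zero entries drawn from [0, 1).
Xsp_demo = sp.random(100, 200, density=0.01, format="csr", random_state=0)
```

Because the dense matrix is generated with a known rank, it gives a ground truth against which the predicted k can be compared.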
Now we can factorize the given matrix:
from TELF.factorization import NMFk
params = {
    "n_perturbs": 36,
    "n_iters": 100,
    "epsilon": 0.015,
    "n_jobs": -1,
    "init": "nnsvd",
    "use_gpu": False,
    "save_path": "../../results/",
    "save_output": True,
    "collect_output": True,
    "predict_k": True,
    "predict_k_method": "sill",
    "verbose": True,
    "nmf_verbose": False,
    "transpose": False,
    "sill_thresh": 0.9,
    "pruned": True,
    "nmf_method": "nmf_kl_mu",
    "calculate_error": True,
    "use_consensus_stopping": 0,
    "calculate_pac": False,
    "perturb_type": "uniform"
}
Ks = range(1,11,1)
name = "Example_NMFk"
note = "This is an example run of NMFk"
model = NMFk(**params)
results = model.fit(X, Ks, name, note)
The resulting plots, which show the estimation of the matrix rank (the number of latent factors), can be found at ../../results/.
Available Functions
| NMFk(...) | NMFk is a Non-negative Matrix Factorization module with the capability to do automatic model determination. |
| NMFk.fit(X, Ks[, name, note]) | Factorize the input matrix |
Module Contents
© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.
- class TELF.factorization.NMFk.NMFk(n_perturbs=20, n_iters=100, epsilon=0.015, perturb_type='uniform', n_jobs=1, n_nodes=1, init='nnsvd', use_gpu=True, save_path='', save_output=True, collect_output=False, predict_k=False, predict_k_method='WH_sill', verbose=True, nmf_verbose=False, perturb_verbose=False, transpose=False, sill_thresh=0.8, nmf_func=None, nmf_method='nmf_fro_mu', clustering_method='kmeans', nmf_obj_params={}, clustering_obj_params={}, pruned=True, calculate_error=True, perturb_multiprocessing=False, consensus_mat=False, use_consensus_stopping=0, mask=None, calculate_pac=False, get_plot_data=False, simple_plot=True, k_search_method='linear', H_sill_thresh=None, factor_thresholding=None, factor_thresholding_H_regression=None, factor_thresholding_obj_params={}, factor_thresholding_H_regression_obj_params={}, device=-1)
Bases: object
NMFk is a Non-negative Matrix Factorization module with the capability to do automatic model determination.
- Parameters:
n_perturbs (int, optional) – Number of bootstrap operations, or random matrices generated around the original matrix. The default is 20.
n_iters (int, optional) – Number of NMF iterations. The default is 100.
epsilon (float or tuple of two elements, optional) –
Error amount for the random matrices generated around the original matrix. The default is 0.015.
When perturb_type='bool' or perturb_type='boolean', the default is (epsilon, epsilon).
epsilon is used when perturb_type='uniform', perturb_type='bool', or perturb_type='boolean'.
Note
If perturb_type='bool' or perturb_type='boolean', pass epsilon as a tuple, where the positive noise flips 0s to 1s (additive noise) and the negative noise flips 1s to 0s (subtractive noise).
perturb_type (str, optional) –
Type of error sampling to perform for the bootstrap operation. The default is “uniform”.
perturb_type='uniform' will use a uniform distribution for sampling.
perturb_type='poisson' will use a Poisson distribution for sampling.
perturb_type='bool' or perturb_type='boolean' will use Boolean perturbations.
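The perturbation step described by epsilon and perturb_type can be pictured with a short NumPy sketch. It is illustrative only and makes two assumptions that the text above does not confirm: uniform noise is applied multiplicatively, and the Boolean tuple is ordered as (positive noise, negative noise). The function names are hypothetical and are not part of TELF.

```python
import numpy as np

def perturb_uniform(X, epsilon, rng):
    """Scale each entry by a factor drawn uniformly from [1 - epsilon, 1 + epsilon].

    Assumption of this sketch: the noise is multiplicative.
    """
    noise = rng.uniform(1.0 - epsilon, 1.0 + epsilon, size=X.shape)
    return X * noise

def perturb_boolean(X, epsilon, rng):
    """Flip 0s to 1s with probability epsilon[0] and 1s to 0s with probability epsilon[1].

    Assumption of this sketch: the tuple is (positive noise, negative noise).
    """
    add_p, sub_p = epsilon
    draws = rng.random(X.shape)
    Y = X.copy()
    Y[(X == 0) & (draws < add_p)] = 1   # additive noise
    Y[(X == 1) & (draws < sub_p)] = 0   # subtractive noise
    return Y

rng = np.random.default_rng(0)
X = rng.random((5, 4))
Y = perturb_uniform(X, 0.015, rng)

B = (rng.random((5, 4)) > 0.5).astype(int)
B_noisy = perturb_boolean(B, (0.1, 0.1), rng)
```

Each of the n_perturbs bootstrap matrices would be one such draw around the original matrix.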
n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is 1.
n_nodes (int, optional) – Number of HPC nodes. The default is 1.
init (str, optional) –
Initialization of the matrices for the NMF procedure. The default is “nnsvd”.
init='nnsvd' will use NNSVD for initialization.
init='random' will use random sampling for initialization.
use_gpu (bool, optional) – If True, uses GPU for operations. The default is True.
save_path (str, optional) – Location to save output. The default is “”.
save_output (bool, optional) – If True, saves the resulting latent factors and plots. The default is True.
collect_output (bool, optional) – If True, collects the resulting latent factors to be returned from the fit() operation. The default is False.
predict_k (bool, optional) –
If True, performs automatic prediction of the number of latent factors. The default is False.
Note
Even when predict_k=False, the number of latent factors can be estimated using the figures saved in save_path.
predict_k_method (str, optional) –
Method to use when performing automatic k prediction. The default is “WH_sill”.
predict_k_method='pvalue' will use L-Statistics with column-wise error for automatically estimating the number of latent factors.
predict_k_method='WH_sill' will use Silhouette scores from the minimum of the W and H latent factors for estimating the number of latent factors.
predict_k_method='W_sill' will use Silhouette scores from the W latent factor for estimating the number of latent factors.
predict_k_method='H_sill' will use Silhouette scores from the H latent factor for estimating the number of latent factors.
predict_k_method='sill' will default to predict_k_method='WH_sill'.
Warning
predict_k_method='pvalue' prediction will result in significantly longer processing time, although it is more accurate! predict_k_method='WH_sill', on the other hand, will be much faster.
verbose (bool, optional) – If True, shows progress in each k. The default is True.
nmf_verbose (bool, optional) – If True, shows progress in each NMF operation. The default is False.
perturb_verbose (bool, optional) – If True, shows progress in each perturbation. The default is False.
transpose (bool, optional) – If True, transposes the input matrix before factorization. The default is False.
sill_thresh (float, optional) – Threshold for the Silhouette score when performing automatic prediction of the number of latent factors. The default is 0.8.
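To make the role of sill_thresh concrete, here is a minimal sketch of one common thresholding rule: pick the largest k whose minimum Silhouette score still reaches the threshold. Whether TELF applies exactly this rule is an assumption of the sketch. The function name and the scores are made up for illustration.

```python
def predict_k_from_silhouette(Ks, min_sils, sill_thresh=0.8):
    """Return the largest k whose minimum silhouette reaches sill_thresh.

    Returns None when no k qualifies. Hypothetical helper, not part of TELF.
    """
    passing = [k for k, s in zip(Ks, min_sils) if s >= sill_thresh]
    return max(passing) if passing else None

Ks = [1, 2, 3, 4, 5, 6]
min_sils = [1.0, 0.97, 0.93, 0.91, 0.42, 0.10]  # made-up scores for illustration
k_hat = predict_k_from_silhouette(Ks, min_sils, sill_thresh=0.9)
```

With these made-up scores the clusters stay stable up to k=4 and collapse afterwards, so a higher threshold makes the estimate more conservative.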
nmf_func (object, optional) – If not None, and if nmf_method='func', used for passing the NMF function. The default is None.
nmf_method (str, optional) –
Which NMF method to use. The default is “nmf_fro_mu”.
nmf_method='nmf_fro_mu' will use NMF with the Frobenius norm.
nmf_method='nmf_kl_mu' will use NMF with Multiplicative Update rules with KL-Divergence.
nmf_method='func' will use the custom NMF function passed using the nmf_func parameter.
nmf_method='nmf_recommender' will use the Recommender NMF method for collaborative filtering.
nmf_method='wnmf' will use the Weighted NMF for missing value completion.
nmf_method='bnmf' will use the Boolean NMF for missing value completion on a boolean matrix.
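For readers unfamiliar with the Frobenius-norm variant, the following is a textbook sketch of multiplicative updates (Lee and Seung). It is not TELF's implementation: it uses random initialization rather than NNSVD, runs on dense NumPy arrays only, and the function name is hypothetical.

```python
import numpy as np

def nmf_fro_mu_sketch(X, k, n_iters=100, seed=0, eps=1e-10):
    """Textbook multiplicative updates minimizing ||X - W @ H||_F with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    for _ in range(n_iters):
        # eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((30, 4)) @ rng.random((4, 20))   # exact rank-4 test matrix

W, H = nmf_fro_mu_sketch(X, k=4, n_iters=200)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The updates keep W and H non-negative by construction, because every factor in the update is non-negative.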
Note
When using nmf_method='nmf_recommender', the RNMFk prediction method can be imported using from TELF.factorization import RNMFk_predict.
In RNMFk_predict(W, H, global_mean, bu, bi, u, i), W and H are the latent factors, and global_mean, bu, and bi are the biases returned from the nmf_recommender method. Finally, u and i are the indices to perform prediction on.
Note
When using nmf_method='wnmf', pass nmf_obj_params={"WEIGHTS":P}, where P is a matrix with the same size as X that carries the weights for each item in X.
For example, P can be used as a mask where 1s in P are the known entries and 0s are the missing values in X that we want to predict (i.e. a recommender system).
Note that nmf_method='wnmf' does not currently support sparse matrices.
Note
When using nmf_method='bnmf', pass nmf_obj_params={"MASK":P}, where P is a mask matrix with the same size as X, and 0s and 1s in P are the known and unknown locations in X.
0s in P are the places we would like to predict.
Note that nmf_method='bnmf' does not currently support sparse matrices.
When nmf_method='bnmf', using perturb_type='bool' or perturb_type='boolean' is recommended. It is not set automatically, but a warning is raised if it is not used.
clustering_method (str, optional) – Clustering used on the W patterns. The default is “kmeans”. Options are “kmeans” and “bool” or “boolean”.
nmf_obj_params (dict, optional) – Parameters used by the NMF function. The default is {}.
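As an illustration of how nmf_obj_params carries a weight matrix for nmf_method='wnmf', the sketch below builds a 0/1 matrix P from a small ratings matrix with missing entries. Marking missing entries with NaN and filling them with zeros before factorization are assumptions of this sketch. Only the "WEIGHTS" key and the meaning of 1s and 0s come from the documentation above.

```python
import numpy as np

# Toy ratings matrix; NaN marks entries we do not know (assumption of this sketch).
X_obs = np.array([
    [5.0, np.nan, 1.0],
    [4.0, 2.0, np.nan],
    [np.nan, 1.0, 5.0],
])

# 1 where the entry is known, 0 where it is missing and should be predicted.
P = (~np.isnan(X_obs)).astype(float)

# Replace NaN with 0 so the matrix is numeric everywhere (assumption of this sketch).
X_filled = np.nan_to_num(X_obs, nan=0.0)

# Keyword arguments that would be passed on to NMFk(...).
wnmf_params = {
    "nmf_method": "wnmf",
    "nmf_obj_params": {"WEIGHTS": P},
}
```

P has the same shape as the input matrix, as the wnmf note requires, and both are dense because wnmf does not support sparse input.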
clustering_obj_params (dict, optional) –
Parameters used by custom clustering functions. The default is {}.
When nmf_method='bnmf', max_iters:int and distance:str can be passed here. distance can be 'hamming', 'FN', or 'FP'. The default is 'hamming'.
For all NMF methods passed in nmf_method, max_iters can also be passed.
pruned (bool, optional) –
When True, removes columns and rows from the input matrix that have only 0 values. The default is True.
Warning
If decomposition is not possible after pruning (for example, if the number of samples left is 1, or the K range is empty based on the rule k < min(X.shape)), fit() will return None.
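The effect of pruned=True and of the k < min(X.shape) rule can be sketched as follows for a dense matrix. This is a simplified illustration with a hypothetical helper name, not TELF's pruning code.

```python
import numpy as np

def prune_zero_rows_cols(X):
    """Drop rows and columns that contain only zeros; also return the kept indices."""
    rows = np.flatnonzero(np.any(X != 0, axis=1))
    cols = np.flatnonzero(np.any(X != 0, axis=0))
    return X[np.ix_(rows, cols)], rows, cols

X = np.array([
    [1, 0, 2],
    [0, 0, 0],   # all-zero row, removed
    [3, 0, 4],
])               # middle column is all zero, removed

X_pruned, rows, cols = prune_zero_rows_cols(X)

# Keep only the ranks that satisfy k < min(X.shape) after pruning.
Ks = [k for k in range(1, 10) if k < min(X_pruned.shape)]
```

Keeping the row and column indices makes it possible to map the factors back to the original matrix layout. If Ks ends up empty, there is nothing left to factorize, which is the situation the warning describes.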
calculate_error (bool, optional) –
When True, calculates the relative reconstruction error. The default is True.
Warning
If calculate_error=True, it will result in longer processing time.
perturb_multiprocessing (bool, optional) –
If perturb_multiprocessing=True, the computation is parallelized over the perturbations. The default is perturb_multiprocessing=False.
When perturb_multiprocessing=False, which is the default, parallelization is done over each K (rank).
consensus_mat (bool, optional) – When True, computes the Consensus Matrices for each k. The default is False.
use_consensus_stopping (str, optional) – When not 0, uses the Consensus matrices criteria for early stopping of the NMF factorization. The default is 0.
mask (np.ndarray, optional) – NumPy array that points out the locations in the input matrix that should be masked during factorization. The default is None.
calculate_pac (bool, optional) – When True, calculates the PAC score for H matrix stability. The default is False.
get_plot_data (bool, optional) – When True, collects the data used in plotting each intermediate k factorization. The default is False.
simple_plot (bool, optional) – When True, creates a simple plot for each intermediate k factorization, which hides statistics such as the average and maximum Silhouette scores. The default is True.
k_search_method (str, optional) –
Which approach to use when searching for the rank or k. The default is “linear”.
k_search_method='linear' will linearly visit each K given in the Ks hyper-parameter of the fit() function.
k_search_method='bst_post' will perform post-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.
k_search_method='bst_pre' will perform pre-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.
k_search_method='bst_in' will perform in-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.
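The idea of pruning lower ranks once a rank passes can be illustrated with plain bisection over a sorted list of ranks. This is a deliberately simplified sketch and not TELF's pre-, post-, or in-order traversal. In particular, moving to lower ranks after a failing rank assumes that scores only get worse as k grows. The function name and the scoring function are hypothetical.

```python
def bisect_rank(Ks, score_fn, thresh):
    """Bisect sorted ranks; return the best passing rank and the ranks visited."""
    Ks = sorted(Ks)
    lo, hi = 0, len(Ks) - 1
    best, visited = None, []
    while lo <= hi:
        mid = (lo + hi) // 2
        k = Ks[mid]
        visited.append(k)
        if score_fn(k) >= thresh:
            best = k
            lo = mid + 1   # this rank passed: prune it and all lower ranks
        else:
            hi = mid - 1   # assumption of this sketch: higher ranks will fail too
    return best, visited

def fake_score(k):
    """Made-up scorer standing in for a silhouette computation."""
    return 0.95 if k <= 4 else 0.30

best_k, visited = bisect_rank(range(1, 11), fake_score, thresh=0.9)
```

With these made-up scores only four of the ten ranks are evaluated, which is the reason to prefer a binary search when each factorization is expensive.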
H_sill_thresh (float, optional) –
Setting for removing higher ranks from the search space.
When searching for the optimal rank with binary search using k_search_method='bst_post' or k_search_method='bst_pre', this hyper-parameter can be used to cut off higher ranks from the search space.
The cut-off of higher ranks from the search space is based on a threshold for the H silhouette. When an H silhouette below H_sill_thresh is found for a given rank or K, all higher ranks are removed from the search space.
If H_sill_thresh=None, it is not used. The default is None.
factor_thresholding (str, optional) –
If not None, the W and H factors are thresholded to be boolean using the specified thresholding method. The default is None.
Options are WH_thresh, coord_desc_thresh, otsu_thresh, and kmeans_thresh.
If nmf_method='bnmf', factor_thresholding='otsu_thresh' is used by default.
factor_thresholding_H_regression (str, optional) –
If not None, the H factor is thresholded to be boolean using the specified thresholding method. The default is None.
Options are coord_desc_thresh, otsu_thresh, and kmeans_thresh.
If nmf_method='bnmf', factor_thresholding_H_regression='kmeans_thresh' is used by default.
factor_thresholding_obj_params (dict, optional) –
Extra settings for the thresholding method selected in factor_thresholding. The default is {}.
For factor_thresholding='coord_desc_thresh', options include max_iter:int, wt, and ht.
For factor_thresholding='WH_thresh', options include npoint.
factor_thresholding_H_regression_obj_params (dict, optional) –
Extra settings for the thresholding method selected in factor_thresholding_H_regression. The default is {}.
For factor_thresholding='coord_desc_thresh', options include max_iter:int, wt, and ht.
For factor_thresholding='WH_thresh', options include npoint.
device (int or list, optional) –
CUDA device or list of CUDA devices (GPUs) to use. The default is -1.
When device is a non-negative integer such as device=0, it will use the GPU with the given id.
When device is -1, it will use all devices.
When device is a list of devices, it will use those devices.
If device is a negative integer other than -1, it will use the number of GPUs minus the device + 1.
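The device rules can be read in more than one way. The sketch below shows one reading, modeled on the familiar n_jobs convention: a negative value d selects n_gpus + d + 1 devices, so -1 means all GPUs and -2 means all but one. That interpretation, the choice of the lowest-numbered ids, and the function name are all assumptions and are not confirmed by the text above.

```python
def resolve_devices(device, n_gpus):
    """Map the device argument to a list of GPU ids (hypothetical helper)."""
    if isinstance(device, (list, tuple)):
        return list(device)          # explicit list of device ids
    if device >= 0:
        return [device]              # a single GPU id
    n_use = n_gpus + device + 1      # assumed reading: -1 -> all, -2 -> all but one
    if n_use < 1:
        raise ValueError("device selects fewer than one GPU")
    return list(range(n_use))        # assumption: take the lowest-numbered ids

all_gpus = resolve_devices(-1, n_gpus=4)
all_but_one = resolve_devices(-2, n_gpus=4)
```

Under this reading, -1 is just the general negative-integer rule applied to the case where no GPU is left out.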
- Return type:
None.
- fit(X, Ks, name='NMFk', note='')
Factorize the input matrix X for each given K value in Ks.
- Parameters:
X (np.ndarray or scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.
Ks (list) –
List of K values to factorize the input matrix.
Example: Ks=range(1, 10, 1).
name (str, optional) – Name of the experiment. The default is “NMFk”.
note (str, optional) – Note for the experiment used in logs. The default is “”.
- Returns:
results – The resulting dict can include all the latent factors, plotting data, predicted latent factors, time taken for factorization, and the predicted k value, depending on the settings specified.
If get_plot_data=True, results will include a field for plot_data.
If predict_k=True, results will include a field for k_predict. This is an integer for the automatically estimated number of latent factors.
If predict_k=True and collect_output=True, results will include fields for W and H, which are the latent factors of type np.ndarray.
results will always include a field for time, which gives the total compute time.
- Return type:
dict
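Since the fields present in results depend on the settings, code that consumes the dict should check for optional keys. The helper below is a hypothetical sketch that relies only on the field names listed above (time, k_predict, W, H). The dict in the example is built by hand to stand in for a real fit() result.

```python
import numpy as np

def summarize_results(results):
    """Collect the documented fields of an NMFk results dict, skipping absent ones."""
    summary = {"time": results["time"]}           # always present
    if "k_predict" in results:                    # present when predict_k=True
        summary["k_predict"] = int(results["k_predict"])
    if "W" in results and "H" in results:         # need predict_k and collect_output
        summary["W_shape"] = results["W"].shape
        summary["H_shape"] = results["H"].shape
    return summary

# Hand-built stand-in for the dict returned by fit(); the values are made up.
fake_results = {
    "time": 12.3,
    "k_predict": 4,
    "W": np.zeros((100, 4)),
    "H": np.zeros((4, 200)),
}
summary = summarize_results(fake_results)
```

When fit() returns None, as described under pruned, the caller needs to handle that case before calling a helper like this.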