TELF.factorization package#
Subpackages#
- TELF.factorization.decompositions package
- Subpackages
- TELF.factorization.decompositions.utilities package
- Submodules
- TELF.factorization.decompositions.utilities.bool_clustering module
- TELF.factorization.decompositions.utilities.bool_noise module
- TELF.factorization.decompositions.utilities.clustering module
- TELF.factorization.decompositions.utilities.concensus_matrix module
- TELF.factorization.decompositions.utilities.data_reshaping module
- TELF.factorization.decompositions.utilities.generic_utils module
- TELF.factorization.decompositions.utilities.math_utils module
- TELF.factorization.decompositions.utilities.nnsvd module
- TELF.factorization.decompositions.utilities.resample module
- TELF.factorization.decompositions.utilities.silhouettes module
- Module contents
- TELF.factorization.decompositions.utilities package
- Submodules
- TELF.factorization.decompositions.nmf_fro_admm module
- TELF.factorization.decompositions.nmf_fro_mu module
- TELF.factorization.decompositions.nmf_kl_admm module
- TELF.factorization.decompositions.nmf_kl_mu module
- TELF.factorization.decompositions.nmf_mc_fro_mu module
- TELF.factorization.decompositions.rescal_fro_mu module
- TELF.factorization.decompositions.tri_nmf_fro_mu module
- Module contents
- Subpackages
- TELF.factorization.utilities package
- Submodules
- TELF.factorization.utilities.clustering module
- TELF.factorization.utilities.co_occurance_matrix module
- TELF.factorization.utilities.organize_n_jobs module
- TELF.factorization.utilities.plot_NMFk module
- TELF.factorization.utilities.pvalue_analysis module
- TELF.factorization.utilities.sppmi_matrix module
- TELF.factorization.utilities.take_note module
- TELF.factorization.utilities.vectorize module
- Module contents
Submodules#
TELF.factorization.NMFk module#
© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.
- class TELF.factorization.NMFk.NMFk(n_perturbs=20, n_iters=100, epsilon=0.015, perturb_type='uniform', n_jobs=1, n_nodes=1, init='nnsvd', use_gpu=True, save_path='./', save_output=True, collect_output=False, predict_k=False, predict_k_method='WH_sill', verbose=True, nmf_verbose=False, perturb_verbose=False, transpose=False, sill_thresh=0.8, nmf_func=None, nmf_method='nmf_fro_mu', nmf_obj_params={}, pruned=True, calculate_error=True, perturb_multiprocessing=False, consensus_mat=False, use_consensus_stopping=0, mask=None, calculate_pac=False, get_plot_data=False, simple_plot=True, k_search_method='linear', H_sill_thresh=None)[source]#
Bases:
object
NMFk is a Non-negative Matrix Factorization module with the capability to do automatic model determination.
- Parameters:
n_perturbs (int, optional) – Number of bootstrap operations, or random matrices generated around the original matrix. The default is 20.
n_iters (int, optional) – Number of NMF iterations. The default is 100.
epsilon (float, optional) –
Error amount for the random matrices generated around the original matrix. The default is 0.015.
epsilon is used when perturb_type='uniform'.
perturb_type (str, optional) –
Type of error sampling to perform for the bootstrap operation. The default is “uniform”.
perturb_type='uniform' will use a uniform distribution for sampling.
perturb_type='poisson' will use a Poisson distribution for sampling.
n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is 1.
n_nodes (int, optional) – Number of HPC nodes. The default is 1.
init (str, optional) –
Initialization of matrices for the NMF procedure. The default is “nnsvd”.
init='nnsvd' will use NNSVD for initialization.
init='random' will use random sampling for initialization.
use_gpu (bool, optional) – If True, uses GPU for operations. The default is True.
save_path (str, optional) – Location to save output. The default is “./”.
save_output (bool, optional) – If True, saves the resulting latent factors and plots. The default is True.
collect_output (bool, optional) – If True, collects the resulting latent factors to be returned from the fit() operation. The default is False.
predict_k (bool, optional) –
If True, performs automatic prediction of the number of latent factors. The default is False.
Note
Even when predict_k=False, the number of latent factors can be estimated using the figures saved in save_path.
predict_k_method (str, optional) –
Method to use when performing automatic k prediction. The default is “WH_sill”.
predict_k_method='pvalue' will use L-Statistics with column-wise error for automatically estimating the number of latent factors.
predict_k_method='WH_sill' will use Silhouette scores from the minimum of the W and H latent factors for estimating the number of latent factors.
predict_k_method='W_sill' will use Silhouette scores from the W latent factor for estimating the number of latent factors.
predict_k_method='H_sill' will use Silhouette scores from the H latent factor for estimating the number of latent factors.
predict_k_method='sill' will default to predict_k_method='WH_sill'.
Warning
predict_k_method='pvalue' prediction will result in significantly longer processing time, although it is more accurate! predict_k_method='WH_sill', on the other hand, will be much faster.
verbose (bool, optional) – If True, shows progress in each k. The default is True.
nmf_verbose (bool, optional) – If True, shows progress in each NMF operation. The default is False.
perturb_verbose (bool, optional) – If True, it shows progress in each perturbation. The default is False.
transpose (bool, optional) – If True, transposes the input matrix before factorization. The default is False.
sill_thresh (float, optional) – Threshold for the Silhouette score when performing automatic prediction of the number of latent factors. The default is 0.8.
nmf_func (object, optional) – If not None, and if nmf_method='func', used for passing the NMF function. The default is None.
nmf_method (str, optional) –
Which NMF method to use. The default is “nmf_fro_mu”.
nmf_method='nmf_fro_mu' will use NMF with the Frobenius norm.
nmf_method='nmf_kl_mu' will use NMF with Multiplicative Update rules with KL-Divergence.
nmf_method='func' will use the custom NMF function passed using the nmf_func parameter.
nmf_method='nmf_recommender' will use the Recommender NMF method for collaborative filtering.
nmf_method='wnmf' will use the Weighted NMF for missing value completion.
nmf_obj_params (dict, optional) –
Parameters used by the NMF function. The default is {}.
pruned (bool, optional) –
When True, removes columns and rows from the input matrix that have only 0 values. The default is True.
Warning
Pruning should not be used with nmf_method='nmf_recommender'.
If decomposition is not possible after pruning (for example, if only one sample is left, or the K range is empty based on the rule k < min(X.shape)), fit() will return None.
calculate_error (bool, optional) –
When True, calculates the relative reconstruction error. The default is True.
Warning
If calculate_error=True, it will result in longer processing time.
perturb_multiprocessing (bool, optional) –
If perturb_multiprocessing=True, parallel computation is done over each perturbation. The default is perturb_multiprocessing=False.
When perturb_multiprocessing=False, which is the default, parallelization is done over each K (rank).
consensus_mat (bool, optional) –
When True, computes the Consensus Matrices for each k. The default is False.
use_consensus_stopping (str, optional) –
When not 0, uses the Consensus matrices criteria for early stopping of NMF factorization. The default is 0.
mask (np.ndarray, optional) –
NumPy array that marks the locations in the input matrix that should be masked during factorization. The default is None.
calculate_pac (bool, optional) –
When True, calculates the PAC score for H matrix stability. The default is False.
get_plot_data (bool, optional) –
When True, collects the data used in plotting each intermediate k factorization. The default is False.
simple_plot (bool, optional) –
When True, creates a simple plot for each intermediate k factorization, which hides statistics such as the average and maximum Silhouette scores. The default is True.
k_search_method (str, optional) –
Which approach to use when searching for the rank or k. The default is “linear”.
k_search_method='linear' will linearly visit each K given in the Ks hyper-parameter of the fit() function.
k_search_method='bst_post' will perform post-order binary search. When an ideal rank is found, as determined by the selected predict_k_method, all lower ranks are pruned from the search space.
k_search_method='bst_pre' will perform pre-order binary search. When an ideal rank is found, as determined by the selected predict_k_method, all lower ranks are pruned from the search space.
k_search_method='bst_in' will perform in-order binary search. When an ideal rank is found, as determined by the selected predict_k_method, all lower ranks are pruned from the search space.
H_sill_thresh (float, optional) –
Setting for removing higher ranks from the search space.
When searching for the optimal rank with binary search using k_search_method='bst_post' or k_search_method='bst_pre', this hyper-parameter can be used to cut off higher ranks from the search space.
The cut-off of higher ranks from the search space is based on a threshold for the H silhouette. When an H silhouette below H_sill_thresh is found for a given rank or K, all higher ranks are removed from the search space.
If H_sill_thresh=None, it is not used. The default is None.
- Return type:
None.
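The binary-search options and the two pruning rules described above can be pictured with a small sketch. This is not TELF's implementation. It is a minimal pre-order bisection under stated assumptions: `search_rank` and `score` are hypothetical names, and `score(k)` is assumed to return the pair (minimum of the W and H silhouettes, H silhouette) for rank `k`. A rank at or above `sill_thresh` prunes all lower ranks, and an H silhouette below `h_sill_thresh` prunes all higher ranks.

```python
def search_rank(ks, score, sill_thresh=0.8, h_sill_thresh=None):
    """Pre-order bisection over candidate ranks (illustrative sketch only).

    score(k) is a hypothetical callable returning (wh_silhouette, h_silhouette).
    Returns the largest rank that met sill_thresh and the ranks actually scored.
    """
    ks = sorted(ks)
    lower, upper = ks[0], ks[-1]   # live bounds of the search space
    best, visited = None, []
    stack = [(0, len(ks) - 1)]     # index intervals still to explore
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        mid = (lo + hi) // 2
        k = ks[mid]
        if lower <= k <= upper:    # skip ranks that were pruned earlier
            visited.append(k)
            wh_sil, h_sil = score(k)
            if wh_sil >= sill_thresh:
                best = k if best is None else max(best, k)
                lower = k + 1      # prune all lower ranks
            elif h_sill_thresh is not None and h_sil < h_sill_thresh:
                upper = k - 1      # prune all higher ranks
        # Push the right half first so the left half is visited first.
        stack.append((mid + 1, hi))
        stack.append((lo, mid - 1))
    return best, visited


# Toy scores (made up): silhouettes stay high up to k=4, then collapse.
def toy_score(k):
    return (0.95, 0.95) if k <= 4 else (0.30, 0.30)


best, visited = search_rank(range(1, 11), toy_score,
                            sill_thresh=0.8, h_sill_thresh=0.5)
print(best, visited)
```

With both thresholds set, only a few of the ten candidate ranks are scored in this toy run, which is the motivation for the binary-search modes over `k_search_method='linear'`.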
- fit(X, Ks, name='NMFk', note='')[source]#
Factorize the input matrix X for each given K value in Ks.
- Parameters:
X (np.ndarray or scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.
Ks (list) –
List of K values to factorize the input matrix.
Example: Ks=range(1, 10, 1).
name (str, optional) – Name of the experiment. Default is “NMFk”.
note (str, optional) – Note for the experiment used in logs. Default is “”.
- Returns:
results – Resulting dict can include all the latent factors, plotting data, predicted latent factors, time taken for factorization, and predicted k value, depending on the settings specified.
If get_plot_data=True, results will include a field for plot_data.
If predict_k=True, results will include a field for k_predict. This is an integer for the automatically estimated number of latent factors.
If predict_k=True and collect_output=True, results will include fields for W and H, which are the latent factors of type np.ndarray.
results will always include a field for time, which gives the total compute time.
- Return type:
dict
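The pruning warning for this class says that fit() returns None when no decomposition is possible after pruning. The sketch below illustrates that rule on a plain list-of-lists matrix. It is not TELF's code: `prune_zero_rows_cols` and `usable_ranks` are hypothetical helpers, and the lower bound `1 <= k` is an assumption added here (the documentation only states `k < min(X.shape)`).

```python
def prune_zero_rows_cols(X):
    """Drop rows and columns that contain only zeros (X is a list of lists)."""
    keep_rows = [i for i, row in enumerate(X) if any(v != 0 for v in row)]
    n_cols = len(X[0]) if X else 0
    keep_cols = [j for j in range(n_cols)
                 if any(X[i][j] != 0 for i in keep_rows)]
    return [[X[i][j] for j in keep_cols] for i in keep_rows]


def usable_ranks(X, Ks):
    """Keep ranks with 1 <= k < min(X.shape); an empty list means no fit."""
    n_rows = len(X)
    n_cols = len(X[0]) if X else 0
    limit = min(n_rows, n_cols)
    return [k for k in Ks if 1 <= k < limit]


X = [
    [1, 0, 2, 0],
    [0, 0, 0, 0],   # all-zero row, removed by pruning
    [3, 0, 4, 5],
    [0, 0, 6, 7],
]                   # the second column is all zeros and is removed as well
pruned = prune_zero_rows_cols(X)
ranks = usable_ranks(pruned, range(1, 10))
print(pruned, ranks)
```

Checking the surviving ranks before a long run is a cheap way to anticipate a None result.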
TELF.factorization.RESCALk module#
- class TELF.factorization.RESCALk.RESCALk(n_perturbs=20, n_iters=100, epsilon=0.015, perturb_type='uniform', n_jobs=1, n_nodes=1, init='nnsvd', use_gpu=True, save_path='./', save_output=True, verbose=True, rescal_verbose=False, perturb_verbose=False, rescal_func=None, rescal_method='rescal_fro_mu', rescal_obj_params={}, pruned=False, calculate_error=False, perturb_multiprocessing=False, get_plot_data=False, simple_plot=True)[source]#
Bases:
object
RESCALk is a RESCAL module with the capability to do automatic model determination.
- Parameters:
n_perturbs (int, optional) – Number of bootstrap operations, or random matrices generated around the original matrix. The default is 20.
n_iters (int, optional) – Number of RESCAL iterations. The default is 100.
epsilon (float, optional) –
Error amount for the random matrices generated around the original matrix. The default is 0.015.
epsilon is used when perturb_type='uniform'.
perturb_type (str, optional) –
Type of error sampling to perform for the bootstrap operation. The default is “uniform”.
perturb_type='uniform' will use a uniform distribution for sampling.
perturb_type='poisson' will use a Poisson distribution for sampling.
n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is 1.
n_nodes (int, optional) – Number of HPC nodes. The default is 1.
init (str, optional) –
Initialization of matrices for the RESCAL procedure. The default is “nnsvd”.
init='nnsvd' will use NNSVD for initialization.
init='random' will use random sampling for initialization.
use_gpu (bool, optional) – If True, uses GPU for operations. The default is True.
save_path (str, optional) – Location to save output. The default is “./”.
save_output (bool, optional) – If True, saves the resulting latent factors and plots. The default is True.
verbose (bool, optional) – If True, shows progress in each k. The default is True.
rescal_verbose (bool, optional) – If True, shows progress in each RESCAL operation. The default is False.
perturb_verbose (bool, optional) – If True, it shows progress in each perturbation. The default is False.
rescal_func (object, optional) – If not None, and if rescal_method='func', used for passing the RESCAL function. The default is None.
rescal_method (str, optional) –
Which RESCAL method to use. The default is “rescal_fro_mu”.
rescal_method='rescal_fro_mu' will use RESCAL with the Frobenius norm.
rescal_obj_params (dict, optional) –
Parameters used by the RESCAL function. The default is {}.
pruned (bool, optional) –
When True, removes columns and rows from the input matrix that have only 0 values. The default is False.
Warning
Pruning is not implemented for RESCALk yet.
calculate_error (bool, optional) –
When True, calculates the relative reconstruction error. The default is False.
Warning
If calculate_error=True, it will result in longer processing time.
perturb_multiprocessing (bool, optional) –
If perturb_multiprocessing=True, parallel computation is done over each perturbation. The default is perturb_multiprocessing=False.
When perturb_multiprocessing=False, which is the default, parallelization is done over each K (rank).
get_plot_data (bool, optional) –
When True, collects the data used in plotting each intermediate k factorization. The default is False.
simple_plot (bool, optional) –
When True, creates a simple plot for each intermediate k factorization, which hides statistics such as the average and maximum Silhouette scores. The default is True.
- Return type:
None.
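The bootstrap parameters above (n_perturbs, epsilon, perturb_type='uniform') can be illustrated with a short sketch. It assumes multiplicative noise, where each entry is scaled by a factor drawn uniformly from [1 - epsilon, 1 + epsilon]. That noise model is an assumption made for illustration and may differ from what TELF does internally. `perturb_uniform` and `make_perturbations` are hypothetical names, and the sketch perturbs one matrix; for RESCALk it would be applied to each matrix in the input list.

```python
import random


def perturb_uniform(X, epsilon=0.015, rng=None):
    """Return a copy of X with every entry scaled by U(1 - eps, 1 + eps)."""
    rng = rng or random.Random()
    return [[v * rng.uniform(1.0 - epsilon, 1.0 + epsilon) for v in row]
            for row in X]


def make_perturbations(X, n_perturbs=20, epsilon=0.015, seed=0):
    """Generate n_perturbs random matrices around the original matrix X."""
    rng = random.Random(seed)   # seeded so the sketch is reproducible
    return [perturb_uniform(X, epsilon, rng) for _ in range(n_perturbs)]


X = [[1.0, 0.0], [2.0, 4.0]]
perturbed = make_perturbations(X, n_perturbs=5, epsilon=0.015)
print(len(perturbed), perturbed[0])
```

Under this noise model zero entries stay zero, so the sparsity pattern of the input is preserved across perturbations.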
- fit(X, Ks, name='RESCALk', note='')[source]#
Factorize the input matrix X for each given K value in Ks.
- Parameters:
X (list of symmetric np.ndarray or list of symmetric scipy.sparse._csr.csr_matrix matrix) – Input matrices to be factorized.
Ks (list) –
List of K values to factorize the input matrix.
Example: Ks=range(1, 10, 1).
name (str, optional) – Name of the experiment. Default is “RESCALk”.
note (str, optional) – Note for the experiment used in logs. Default is “”.
- Returns:
results – Resulting dict can include all the latent factors, plotting data, predicted latent factors, time taken for factorization, and predicted k value, depending on the settings specified.
If get_plot_data=True, results will include a field for plot_data.
results will always include a field for time, which gives the total compute time.
- Return type:
dict
TELF.factorization.TriNMFk module#
- class TELF.factorization.TriNMFk.TriNMFk(experiment_name='TriNMFk', nmfk_params={}, save_path='TriNMFk', nmf_verbose=False, use_gpu=False, n_jobs=-1, mask=None, use_consensus_stopping=0, alpha=(0, 0), n_iters=100, n_inits=10, pruned=True, transpose=False, verbose=True)[source]#
Bases:
object
TriNMFk is a Non-negative Matrix Factorization module with the capability to do automatic model determination for both estimating the number of latent patterns (Wk) and clusters (Hk).
- Parameters:
experiment_name (str, optional) – Name used for the experiment. Default is “TriNMFk”.
nmfk_params (dict, optional) – Parameters for NMFk. See the NMFk documentation for the options. The default is {}.
save_path (str, optional) – Save location used when the TriNMFk fit is performed without first performing the NMFk fit. The default is “TriNMFk”.
nmf_verbose (bool, optional) – If True, shows progress in each NMF operation. The default is False.
use_gpu (bool, optional) – If True, uses GPU for operations. The default is False.
n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is -1.
mask (np.ndarray, optional) – NumPy array that marks the locations in the input matrix that should be masked during factorization. The default is None.
use_consensus_stopping (str, optional) – When not 0, uses the Consensus matrices criteria for early stopping of NMF factorization. The default is 0.
alpha (tuple, optional) – Error rate used in the bootstrap operation. Default is (0, 0).
n_iters (int, optional) – Number of NMF iterations. The default is 100.
n_inits (int, optional) – Number of matrix initializations for the bootstrap operation. The default is 10.
pruned (bool, optional) – When True, removes columns and rows from the input matrix that have only 0 values. The default is True.
transpose (bool, optional) – If True, transposes the input matrix before factorization. The default is False.
verbose (bool, optional) – If True, shows progress in each k. The default is True.
- Return type:
None.
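Because nmfk_params is handed on to NMFk, a misspelled key may only surface later. The sketch below checks a candidate dict against the keyword names listed in the NMFk signature on this page. `NMFK_KEYWORDS` and `check_nmfk_params` are hypothetical helpers written for illustration. Whether TriNMFk performs any such validation itself, or overrides some of these keys with its own arguments (for example use_gpu or n_jobs), is not stated here.

```python
# Keyword names copied from the NMFk signature documented on this page.
NMFK_KEYWORDS = {
    "n_perturbs", "n_iters", "epsilon", "perturb_type", "n_jobs", "n_nodes",
    "init", "use_gpu", "save_path", "save_output", "collect_output",
    "predict_k", "predict_k_method", "verbose", "nmf_verbose",
    "perturb_verbose", "transpose", "sill_thresh", "nmf_func", "nmf_method",
    "nmf_obj_params", "pruned", "calculate_error", "perturb_multiprocessing",
    "consensus_mat", "use_consensus_stopping", "mask", "calculate_pac",
    "get_plot_data", "simple_plot", "k_search_method", "H_sill_thresh",
}


def check_nmfk_params(params):
    """Return a copy of params, or raise if a key is not an NMFk keyword."""
    unknown = sorted(set(params) - NMFK_KEYWORDS)
    if unknown:
        raise ValueError(f"unknown NMFk keyword(s): {unknown}")
    return dict(params)


nmfk_params = check_nmfk_params({
    "n_perturbs": 20,
    "n_iters": 100,
    "epsilon": 0.015,
    "predict_k": True,
    "predict_k_method": "WH_sill",
    "sill_thresh": 0.8,
})
print(sorted(nmfk_params))
```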
- fit_nmfk(X, Ks, note='')[source]#
Factorize the input matrix X for each given K value in Ks.
- Parameters:
X (np.ndarray or scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.
Ks (list) –
List of K values to factorize the input matrix.
Example: Ks=range(1, 10, 1).
note (str, optional) – Note for the experiment used in logs. Default is “”.
- Returns:
results – Resulting dict can include all the latent factors, plotting data, predicted latent factors, time taken for factorization, and predicted k value, depending on the settings specified in nmfk_params.
If get_plot_data=True, results will include a field for plot_data.
If predict_k=True, results will include a field for k_predict. This is an integer for the automatically estimated number of latent factors.
If predict_k=True and collect_output=True, results will include fields for W and H, which are the latent factors of type np.ndarray.
results will always include a field for time, which gives the total compute time.
- Return type:
dict
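Since the fields of the returned dict depend on the settings, code that consumes it usually has to branch on which keys are present. The sketch below shows one way to do that using only the field names documented above (time, k_predict, W, H, plot_data). `summarize_results` is a hypothetical helper, and the None branch assumes the behaviour documented for NMFk's fit() carries over to this method.

```python
def summarize_results(results):
    """Condense a results dict into a small status summary (sketch only)."""
    if results is None:
        # Documented for NMFk.fit() when decomposition is not possible.
        return {"status": "no decomposition"}
    summary = {"status": "ok", "time": results["time"]}
    if "k_predict" in results:          # present when predict_k=True
        summary["k_predict"] = int(results["k_predict"])
    # W and H are present when predict_k=True and collect_output=True.
    summary["has_factors"] = "W" in results and "H" in results
    summary["has_plot_data"] = "plot_data" in results
    return summary


# Mock dict with made-up values, standing in for a real return value.
mock = {"time": 12.5, "k_predict": 4, "W": [[1.0]], "H": [[1.0]]}
print(summarize_results(mock))
print(summarize_results(None))
```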
- fit_tri_nmfk(X, k1k2: tuple)[source]#
Factorize the input matrix X after applying fit_nmfk() to select Wk and Hk. The matrix is factorized with k1k2=(Wk, Hk).
- Parameters:
X (np.ndarray or scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.
k1k2 (tuple) – Tuple of Wk (number of latent patterns) and Hk (number of latent clusters) to factorize the matrix X to. Example: k1k2=(4, 3).
- Returns:
results – Resulting dict will include latent patterns W, H, and mixing matrix S, along with the error from each of the n_inits.
- Return type:
dict
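To make the roles of W, S, and H concrete, the sketch below reconstructs a matrix from a tri-factorization and computes a relative Frobenius error. It assumes the usual tri-factor layout X ≈ W S H with W of shape (n, Wk), S of shape (Wk, Hk), and H of shape (Hk, m). The shapes and the error definition are assumptions for illustration, not a statement of how TELF stores its outputs or measures error. `matmul` and `relative_error` are hypothetical helpers on plain lists.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def relative_error(X, W, S, H):
    """Relative Frobenius error ||X - W S H|| / ||X||."""
    X_hat = matmul(matmul(W, S), H)
    num = math.sqrt(sum((x - y) ** 2
                        for rx, ry in zip(X, X_hat)
                        for x, y in zip(rx, ry)))
    den = math.sqrt(sum(x ** 2 for row in X for x in row))
    return num / den


# Toy factors with Wk=2 latent patterns and Hk=2 clusters (made-up values).
W = [[1, 0], [0, 1], [1, 1]]    # shape (3, 2)
S = [[2, 0], [0, 3]]            # shape (2, 2), mixing matrix
H = [[1, 0, 1], [0, 1, 1]]      # shape (2, 3)
X = matmul(matmul(W, S), H)     # exact product, so the error is zero
print(X, relative_error(X, W, S, H))
```

If several initializations are run, as n_inits suggests, comparing this kind of error across them is one plausible way to pick a factorization.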