TELF.factorization package#

Subpackages#

Submodules#

TELF.factorization.NMFk module#

© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

class TELF.factorization.NMFk.NMFk(n_perturbs=20, n_iters=100, epsilon=0.015, perturb_type='uniform', n_jobs=1, n_nodes=1, init='nnsvd', use_gpu=True, save_path='./', save_output=True, collect_output=False, predict_k=False, predict_k_method='WH_sill', verbose=True, nmf_verbose=False, perturb_verbose=False, transpose=False, sill_thresh=0.8, nmf_func=None, nmf_method='nmf_fro_mu', nmf_obj_params={}, pruned=True, calculate_error=True, perturb_multiprocessing=False, consensus_mat=False, use_consensus_stopping=0, mask=None, calculate_pac=False, get_plot_data=False, simple_plot=True, k_search_method='linear', H_sill_thresh=None)[source]#

Bases: object

NMFk is a Non-negative Matrix Factorization module with the capability to do automatic model determination.

Parameters:
  • n_perturbs (int, optional) – Number of bootstrap operations, or random matrices generated around the original matrix. The default is 20.

  • n_iters (int, optional) – Number of NMF iterations. The default is 100.

  • epsilon (float, optional) –

    Error amount for the random matrices generated around the original matrix. The default is 0.015.

    epsilon is used when perturb_type='uniform'.

  • perturb_type (str, optional) –

    Type of error sampling to perform for the bootstrap operation. The default is “uniform”.

    • perturb_type='uniform' will use uniform distribution for sampling.

    • perturb_type='poisson' will use Poission distribution for sampling.

  • n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is 1.

  • n_nodes (int, optional) – Number of HPC nodes. The default is 1.

  • init (str, optional) –

    Initilization of matrices for NMF procedure. The default is “nnsvd”.

    • init='nnsvd' will use NNSVD for initilization.

    • init='random' will use random sampling for initilization.

  • use_gpu (bool, optional) – If True, uses GPU for operations. The default is True.

  • save_path (str, optional) – Location to save output. The default is “./”.

  • save_output (bool, optional) – If True, saves the resulting latent factors and plots. The default is True.

  • collect_output (bool, optional) – If True, collectes the resulting latent factors to be returned from fit() operation. The default is False.

  • predict_k (bool, optional) –

    If True, performs automatic prediction of the number of latent factors. The default is False.

    Note

    Even when predict_k=False, number of latent factors can be estimated using the figures saved in save_path.

  • predict_k_method (str, optional) –

    Method to use when performing automatic k prediction. Default is “WH_sill”.

    • predict_k_method='pvalue' will use L-Statistics with column-wise error for automatically estimating the number of latent factors.

    • predict_k_method='WH_sill' will use Silhouette scores from minimum of W and H latent factors for estimating the number of latent factors.

    • predict_k_method='W_sill' will use Silhouette scores from W latent factor for estimating the number of latent factors.

    • predict_k_method='H_sill' will use Silhouette scores from H latent factor for estimating the number of latent factors.

    • predict_k_method='sill' will default to predict_k_method='WH_sill'.

    Warning

    predict_k_method='pvalue' prediction will result in significantly longer processing time, altough it is more accurate! predict_k_method='WH_sill', on the other hand, will be much faster.

  • verbose (bool, optional) – If True, shows progress in each k. The default is True.

  • nmf_verbose (bool, optional) – If True, shows progress in each NMF operation. The default is False.

  • perturb_verbose (bool, optional) – If True, it shows progress in each perturbation. The default is False.

  • transpose (bool, optional) – If True, transposes the input matrix before factorization. The default is False.

  • sill_thresh (float, optional) – Threshold for the Silhouette score when performing automatic prediction of the number of latent factors. The default is 0.8.

  • nmf_func (object, optional) – If not None, and if nmf_method=func, used for passing NMF function. The default is None.

  • nmf_method

    What NMF to use. The default is “nmf_fro_mu”.

    • nmf_method='nmf_fro_mu' will use NMF with Frobenious Norm.

    • nmf_method='nmf_kl_mu' will use NMF with Multiplicative Update rules with KL-Divergence.

    • nmf_method='func' will use the custom NMF function passed using the nmf_func parameter.

    • nmf_method='nmf_recommender' will use the Recommender NMF method for collaborative filtering.

    • nmf_method='wnmf' will use the Weighted NMF for missing value completion.

nmf_obj_paramsdict, optional

Parameters used by NMF function. The default is {}.

prunedbool, optional

When True, removes columns and rows from the input matrix that has only 0 values. The default is True.

Warning

  • Pruning should not be used with nmf_method='nmf_recommender'.

  • If after pruning decomposition is not possible (for example if the number of samples left is 1, or K range is empty based on the rule k < min(X.shape), fit() will return None.

calculate_errorbool, optional

When True, calculates the relative reconstruction error. The default is True.

Warning

If calculate_error=True, it will result in longer processing time.

perturb_multiprocessingbool, optional

If perturb_multiprocessing=True, it will make parallel computation over each perturbation. Default is perturb_multiprocessing=False.

When perturb_multiprocessing=False, which is default, parallelization is done over each K (rank).

consensus_matbool, optional

When True, computes the Consensus Matrices for each k. The default is False.

use_consensus_stoppingstr, optional

When not 0, uses Consensus matrices criteria for early stopping of NMF factorization. The default is 0.

masknp.ndarray, optional

Numpy array that points out the locations in input matrix that should be masked during factorization. The default is None.

calculate_pacbool, optional

When True, calculates the PAC score for H matrix stability. The default is False.

get_plot_databool, optional

When True, collectes the data used in plotting each intermidiate k factorization. The default is False.

simple_plotbool, optional

When True, creates a simple plot for each intermidiate k factorization which hides the statistics such as average and maximum Silhouette scores. The default is True.

k_search_methodstr, optional

Which approach to use when searching for the rank or k. The default is “linear”.

  • k_search_method='linear' will linearly visit each K given in Ks hyper-parameter of the fit() function.

  • k_search_method='bst_post' will perform post-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.

  • k_search_method='bst_pre' will perform pre-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.

  • k_search_method='bst_in' will perform in-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.

H_sill_threshfloat, optional

Setting for removing higher ranks from the search space.

When searching for the optimal rank with binary search using k_search='bst_post' or k_search='bst_pre', this hyper-parameter can be used to cut off higher ranks from search space.

The cut-off of higher ranks from the search space is based on threshold for H silhouette. When a H silhouette below H_sill_thresh is found for a given rank or K, all higher ranks are removed from the search space.

If H_sill_thresh=None, it is not used. The default is None.

Return type:

None.

fit(X, Ks, name='NMFk', note='')[source]#

Factorize the input matrix X for the each given K value in Ks.

Parameters:
  • X (np.ndarray or scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.

  • Ks (list) –

    List of K values to factorize the input matrix.

    Example: Ks=range(1, 10, 1).

  • name (str, optional) – Name of the experiment. Default is “NMFk”.

  • note (str, optional) – Note for the experiment used in logs. Default is “”.

Returns:

results – Resulting dict can include all the latent factors, plotting data, predicted latent factors, time took for factorization, and predicted k value depending on the settings specified.

  • If get_plot_data=True, results will include field for plot_data.

  • If predict_k=True, results will include field for k_predict. This is an intiger for the automatically estimated number of latent factors.

  • If predict_k=True and collect_output=True, results will include fields for W and H which are the latent factors in type of np.ndarray.

  • results will always include a field for time, that gives the total compute time.

Return type:

dict

TELF.factorization.RESCALk module#

© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

class TELF.factorization.RESCALk.RESCALk(n_perturbs=20, n_iters=100, epsilon=0.015, perturb_type='uniform', n_jobs=1, n_nodes=1, init='nnsvd', use_gpu=True, save_path='./', save_output=True, verbose=True, rescal_verbose=False, perturb_verbose=False, rescal_func=None, rescal_method='rescal_fro_mu', rescal_obj_params={}, pruned=False, calculate_error=False, perturb_multiprocessing=False, get_plot_data=False, simple_plot=True)[source]#

Bases: object

RESCALk is a RESCAL module with the capability to do automatic model determination.

Parameters:
  • n_perturbs (int, optional) – Number of bootstrap operations, or random matrices generated around the original matrix. The default is 20.

  • n_iters (int, optional) – Number of RESCAL iterations. The default is 100.

  • epsilon (float, optional) –

    Error amount for the random matrices generated around the original matrix. The default is 0.015.

    epsilon is used when perturb_type='uniform'.

  • perturb_type (str, optional) –

    Type of error sampling to perform for the bootstrap operation. The default is “uniform”.

    • perturb_type='uniform' will use uniform distribution for sampling.

    • perturb_type='poisson' will use Poission distribution for sampling.

  • n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is 1.

  • n_nodes (int, optional) – Number of HPC nodes. The default is 1.

  • init (str, optional) –

    Initilization of matrices for RESCAL procedure. The default is “nnsvd”.

    • init='nnsvd' will use NNSVD for initilization.

    • init='random' will use random sampling for initilization.

  • use_gpu (bool, optional) – If True, uses GPU for operations. The default is True.

  • save_path (str, optional) – Location to save output. The default is “./”.

  • save_output (bool, optional) – If True, saves the resulting latent factors and plots. The default is True.

  • verbose (bool, optional) – If True, shows progress in each k. The default is True.

  • rescal_verbose (bool, optional) – If True, shows progress in each RESCAL operation. The default is False.

  • perturb_verbose (bool, optional) – If True, it shows progress in each perturbation. The default is False.

  • rescal_func (object, optional) – If not None, and if rescal_method=func, used for passing RESCAL function. The default is None.

  • rescal_method (str, optional) –

    What RESCAL to use. The default is “rescal_fro_mu”.

    • rescal_method='rescal_fro_mu' will use RESCAL with Frobenious Norm.

rescal_obj_paramsdict, optional

Parameters used by RESCAL function. The default is {}.

prunedbool, optional

When True, removes columns and rows from the input matrix that has only 0 values. The default is False.

Warning

Pruning is not implemented for RESCALk yet.

calculate_errorbool, optional

When True, calculates the relative reconstruction error. The default is False.

Warning

If calculate_error=True, it will result in longer processing time.

perturb_multiprocessingbool, optional

If perturb_multiprocessing=True, it will make parallel computation over each perturbation. Default is perturb_multiprocessing=False.

When perturb_multiprocessing=False, which is default, parallelization is done over each K (rank).

get_plot_databool, optional

When True, collectes the data used in plotting each intermidiate k factorization. The default is False.

simple_plotbool, optional

When True, creates a simple plot for each intermidiate k factorization which hides the statistics such as average and maximum Silhouette scores. The default is True.

Return type:

None.

fit(X, Ks, name='RESCALk', note='')[source]#

Factorize the input matrix X for the each given K value in Ks.

Parameters:
  • X (list of symmetric np.ndarray or list of symmetric scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.

  • Ks (list) –

    List of K values to factorize the input matrix.

    Example: Ks=range(1, 10, 1).

  • name (str, optional) – Name of the experiment. Default is “RESCALk”.

  • note (str, optional) – Note for the experiment used in logs. Default is “”.

Returns:

results – Resulting dict can include all the latent factors, plotting data, predicted latent factors, time took for factorization, and predicted k value depending on the settings specified.

  • If get_plot_data=True, results will include field for plot_data.

  • results will always include a field for time, that gives the total compute time.

Return type:

dict

TELF.factorization.TriNMFk module#

© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

class TELF.factorization.TriNMFk.TriNMFk(experiment_name='TriNMFk', nmfk_params={}, save_path='TriNMFk', nmf_verbose=False, use_gpu=False, n_jobs=-1, mask=None, use_consensus_stopping=0, alpha=(0, 0), n_iters=100, n_inits=10, pruned=True, transpose=False, verbose=True)[source]#

Bases: object

TriNMFk is a Non-negative Matrix Factorization module with the capability to do automatic model determination for both estimating the number of latent patterns (Wk) and clusters (Hk).

Parameters:
  • experiment_name (str, optional) – Name used for the experiment. Default is “TriNMFk”.

  • nmfk_params (str, optional) – Parameters for NMFk. See documentation for NMFk for the options.

  • save_path (str, optional) – Used for save location when NMFk fit is not performed first, and TriNMFk fit is done.

  • nmf_verbose (bool, optional) – If True, shows progress in each NMF operation. The default is False.

  • use_gpu (bool, optional) – If True, uses GPU for operations. The default is True.

  • n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is 1.

  • mask (np.ndarray, optional) – Numpy array that points out the locations in input matrix that should be masked during factorization. The default is None.

  • use_consensus_stopping (str, optional) – When not 0, uses Consensus matrices criteria for early stopping of NMF factorization. The default is 0.

  • alpha (tupl, optional) – Error rate used in bootstrap operation. Default is (0, 0).

  • n_iters (int, optional) – Number of NMF iterations. The default is 100.

  • n_inits (int, optional) – Number of matrix initilization for the bootstrap operation. The default is 10.

  • pruned (bool, optional) – When True, removes columns and rows from the input matrix that has only 0 values. The default is True.

  • transpose (bool, optional) – If True, transposes the input matrix before factorization. The default is False.

  • verbose (bool, optional) – If True, shows progress in each k. The default is False.

Return type:

None.

fit_nmfk(X, Ks, note='')[source]#

Factorize the input matrix X for the each given K value in Ks.

Parameters:
  • X (np.ndarray or scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.

  • Ks (list) –

    List of K values to factorize the input matrix.

    Example: Ks=range(1, 10, 1).

  • name (str, optional) – Name of the experiment. Default is “NMFk”.

  • note (str, optional) – Note for the experiment used in logs. Default is “”.

Returns:

results – Resulting dict can include all the latent factors, plotting data, predicted latent factors, time took for factorization, and predicted k value depending on the settings specified in nmfk_params.

  • If get_plot_data=True, results will include field for plot_data.

  • If predict_k=True, results will include field for k_predict. This is an intiger for the automatically estimated number of latent factors.

  • If predict_k=True and collect_output=True, results will include fields for W and H which are the latent factors in type of np.ndarray.

  • results will always include a field for time, that gives the total compute time.

Return type:

dict

fit_tri_nmfk(X, k1k2: tuple)[source]#

Factorize the input matrix X.

after applying fit_nmfk() to select the Wk and Hk, to factorize the given matrix with k1k2=(Wk, Hk).

Parameters:
  • X (np.ndarray or scipy.sparse._csr.csr_matrix matrix) – Input matrix to be factorized.

  • k1k2 (tuple) – Tuple of Wk (number of latent patterns) and Hk (number of latent clusters), to factorize the matrix X to. Example: Ks=range(4,3).

Returns:

results – Resulting dict will include latent patterns W, H, and mixing matrix S along with the error from each n_inits.

Return type:

dict

Module contents#