TELF.factorization.RESCALk: RESCAL with Automatic Model Determination#
RESCALk is a RESCAL module with the capability to do automatic model determination.
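For orientation, the RESCAL model is usually written as a factorization of each frontal slice of a three-way tensor. The notation below (n entities, m slices, factor matrix A, core matrices R_i) follows the general RESCAL literature and is not taken from the TELF source. The non-negativity constraints are an assumption based on the `rescal_fro_mu` method name (Frobenius norm with multiplicative updates).

```latex
\mathcal{X}_i \approx A R_i A^{\top}, \qquad i = 1, \dots, m,
\qquad A \in \mathbb{R}^{n \times k}, \quad R_i \in \mathbb{R}^{k \times k}

\min_{A \ge 0,\; R_i \ge 0} \; \sum_{i=1}^{m} \left\lVert \mathcal{X}_i - A R_i A^{\top} \right\rVert_F^{2}
```

In this notation, automatic model determination means estimating the number of latent factors k rather than fixing it in advance.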
Example#
First, generate synthetic data with a pre-determined k. The data can be either dense (np.ndarray) or sparse (scipy.sparse._csr.csr_matrix). Here, we use the provided scripts for matrix generation (located here):
import sys
sys.path.append("../../scripts/")
from generate_X import gen_data

# Synthetic data with a known number of latent factors (R=4)
X = gen_data(R=4, shape=[1000, 1000, 8], gen="rescal")["X"]
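If the generation script is not at hand, a small tensor with a known rank can be built directly from the RESCAL form X_i = A R_i A^T. The sketch below is a hypothetical stand-in for `gen_data`, not its implementation: the function name `make_rescal_tensor` and its arguments are made up for illustration, it uses only the standard library, and it returns a list of square slices rather than an array of shape [n, n, n_slices].

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def make_rescal_tensor(n=6, rank=2, n_slices=3, seed=0):
    """Hypothetical helper: build slices X_i = A R_i A^T with non-negative factors."""
    rng = random.Random(seed)
    # Shared factor matrix A (n x rank) with entries in [0, 1)
    A = [[rng.random() for _ in range(rank)] for _ in range(n)]
    slices = []
    for _ in range(n_slices):
        # One core matrix R_i (rank x rank) per slice
        R = [[rng.random() for _ in range(rank)] for _ in range(rank)]
        slices.append(matmul(matmul(A, R), transpose(A)))
    return slices


X_small = make_rescal_tensor()
print(len(X_small), len(X_small[0]), len(X_small[0][0]))  # 3 6 6
```

Because every slice is generated from the same A with `rank` columns, the true number of latent factors is known, which makes such data useful for checking whether the estimated k is recovered.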
Now we can factorize the given matrix:
from TELF.factorization import RESCALk
params = {
    "n_perturbs": 2,
    "n_iters": 2,
    "epsilon": 0.015,
    "n_jobs": 1,
    "init": "nnsvd",
    "use_gpu": False,
    "save_path": "../../results/",
    "save_output": True,
    "verbose": True,
    "pruned": False,
    "rescal_verbose": False,
    "rescal_method": "rescal_fro_mu",
}
model = RESCALk(**params)
Ks = range(1, 7, 1)
name = "RESCALk"
note = "This is an example run of RESCALk"
results = model.fit(X, Ks, name, note)
The resulting plots, which show the estimated matrix rank (the number of latent factors), can be found at ../../results/.
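Since the example unpacks a dictionary with `**params`, any key that the constructor does not accept raises a `TypeError` in Python. A minimal sketch of checking a dictionary against a constructor signature beforehand is given here. `DemoModel` and `split_params` are hypothetical names used only for illustration; `DemoModel` is a stand-in and not the TELF class.

```python
import inspect


class DemoModel:
    """Hypothetical stand-in for a model class; this is not the TELF API."""

    def __init__(self, n_perturbs=20, n_iters=100, epsilon=0.015, verbose=True):
        self.n_perturbs = n_perturbs
        self.n_iters = n_iters
        self.epsilon = epsilon
        self.verbose = verbose


def split_params(cls, params):
    """Split params into (accepted, rejected) using the constructor signature of cls."""
    parameters = inspect.signature(cls).parameters
    # A constructor that declares **kwargs accepts any keyword, so nothing is rejected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()):
        return dict(params), []
    accepted = {k: v for k, v in params.items() if k in parameters}
    rejected = sorted(k for k in params if k not in parameters)
    return accepted, rejected


params = {"n_perturbs": 2, "n_iters": 2, "epsilon": 0.015, "not_an_option": False}
accepted, rejected = split_params(DemoModel, params)
model = DemoModel(**accepted)
print(rejected)  # ['not_an_option']
```

The same helper could be pointed at any class whose constructor signature is introspectable, so mistyped or outdated option names surface before a long factorization run starts.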
Available Functions#
| RESCALk | RESCALk is a RESCAL module with the capability to do automatic model determination. |
| RESCALk.fit | Factorize the input matrix X for each given K value in Ks. |
Module Contents#
© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.
- class TELF.factorization.RESCALk.RESCALk(n_perturbs=20, n_iters=100, epsilon=0.015, perturb_type='uniform', n_jobs=1, n_nodes=1, init='nnsvd', use_gpu=True, save_path='./', save_output=True, verbose=True, rescal_verbose=False, perturb_verbose=False, rescal_func=None, rescal_method='rescal_fro_mu', rescal_obj_params={}, pruned=False, calculate_error=False, perturb_multiprocessing=False, get_plot_data=False, simple_plot=True)#
Bases: object
RESCALk is a RESCAL module with the capability to do automatic model determination.
- Parameters:
n_perturbs (int, optional) – Number of bootstrap operations, i.e. random matrices generated around the original matrix. The default is 20.
n_iters (int, optional) – Number of RESCAL iterations. The default is 100.
epsilon (float, optional) –
Error amount for the random matrices generated around the original matrix. The default is 0.015.
epsilon is used when perturb_type='uniform'.
perturb_type (str, optional) –
Type of error sampling to perform for the bootstrap operation. The default is “uniform”.
perturb_type='uniform' will use a uniform distribution for sampling.
perturb_type='poisson' will use a Poisson distribution for sampling.
n_jobs (int, optional) – Number of parallel jobs. Use -1 to use all available resources. The default is 1.
n_nodes (int, optional) – Number of HPC nodes. The default is 1.
init (str, optional) –
Initialization of matrices for the RESCAL procedure. The default is “nnsvd”.
init='nnsvd' will use NNSVD for initialization.
init='random' will use random sampling for initialization.
use_gpu (bool, optional) – If True, uses GPU for operations. The default is True.
save_path (str, optional) – Location to save output. The default is “./”.
save_output (bool, optional) – If True, saves the resulting latent factors and plots. The default is True.
verbose (bool, optional) – If True, shows progress in each k. The default is True.
rescal_verbose (bool, optional) – If True, shows progress in each RESCAL operation. The default is False.
perturb_verbose (bool, optional) – If True, shows progress in each perturbation. The default is False.
rescal_func (object, optional) – If not None, and if rescal_method=func, used for passing the RESCAL function. The default is None.
rescal_method (str, optional) –
Which RESCAL method to use. The default is “rescal_fro_mu”.
rescal_method='rescal_fro_mu' will use RESCAL with the Frobenius norm.
rescal_obj_params (dict, optional) – Parameters used by the RESCAL function. The default is {}.
pruned (bool, optional) –
When True, removes columns and rows from the input matrix that have only 0 values. The default is False.
Warning
Pruning is not implemented for RESCALk yet.
calculate_error (bool, optional) –
When True, calculates the relative reconstruction error. The default is False.
Warning
If calculate_error=True, it will result in longer processing time.
perturb_multiprocessing (bool, optional) –
If perturb_multiprocessing=True, computation is parallelized over each perturbation. The default is perturb_multiprocessing=False.
When perturb_multiprocessing=False, which is the default, parallelization is done over each K (rank).
get_plot_data (bool, optional) – When True, collects the data used in plotting each intermediate k factorization. The default is False.
simple_plot (bool, optional) – When True, creates a simple plot for each intermediate k factorization, which hides statistics such as the average and maximum Silhouette scores. The default is True.
- Return type:
None.
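The n_perturbs and epsilon parameters describe a bootstrap in which several noisy copies of the input are generated around the original. A minimal sketch of what a uniform perturbation could look like follows. It assumes multiplicative noise, with each entry scaled by a factor drawn from U(1 - epsilon, 1 + epsilon); this is an assumption made for illustration and may differ from what TELF does internally. The function name `perturb_uniform` is hypothetical, and the Poisson variant is not shown.

```python
import random


def perturb_uniform(X, epsilon=0.015, seed=None):
    """Return a copy of X (a list of matrices) with every entry scaled by a
    random factor drawn uniformly from [1 - epsilon, 1 + epsilon]."""
    rng = random.Random(seed)
    return [
        [[v * rng.uniform(1 - epsilon, 1 + epsilon) for v in row] for row in matrix]
        for matrix in X
    ]


# Two small 2 x 2 slices standing in for the input data
X_demo = [
    [[1.0, 2.0], [2.0, 4.0]],
    [[0.0, 3.0], [3.0, 5.0]],
]

# One perturbed copy per bootstrap operation (here, 3 of them)
perturbed = [perturb_uniform(X_demo, epsilon=0.015, seed=s) for s in range(3)]
print(len(perturbed))  # 3
```

With multiplicative noise, zero entries stay zero and every other entry moves by at most a fraction epsilon of its value, so sparsity patterns in the input are preserved under this assumption.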
- fit(X, Ks, name='RESCALk', note='')#
Factorize the input matrix X for each given K value in Ks.
- Parameters:
X (list of symmetric np.ndarray or list of symmetric scipy.sparse._csr.csr_matrix) – Input matrix to be factorized.
Ks (list) –
List of K values for which to factorize the input matrix.
Example: Ks=range(1, 10, 1).
name (str, optional) – Name of the experiment. Default is “RESCALk”.
note (str, optional) – Note for the experiment used in logs. Default is “”.
- Returns:
results – The resulting dict can include all the latent factors, plotting data, predicted latent factors, time taken for factorization, and the predicted k value, depending on the settings specified.
If get_plot_data=True, results will include a field for plot_data.
results will always include a field for time, which gives the total compute time.
- Return type:
dict
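Only two fields of the returned dict are named above: time, which is always present, and plot_data, which is present when get_plot_data=True. A small sketch of reading the dict defensively using just those two fields is shown here. The helper name `summarize_results` and the sample dict are hypothetical, and no other keys are assumed.

```python
def summarize_results(results):
    """Build a short summary from the documented fields of a results dict."""
    # "time" is documented as always present
    lines = [f"total compute time: {results['time']}"]
    # "plot_data" is documented as present only when get_plot_data=True
    if "plot_data" in results:
        lines.append("plot data: collected")
    else:
        lines.append("plot data: not collected")
    return lines


# Stand-in for the dict returned by fit(); the value is made up for illustration
demo_results = {"time": 12.5}
for line in summarize_results(demo_results):
    print(line)
```

Checking for plot_data with `in` rather than indexing it directly avoids a `KeyError` when the model was created with the default get_plot_data=False.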