pyDRESCALk package¶
Submodules¶
pyDRESCALk.config module¶
pyDRESCALk.data_generator module¶
-
class
pyDRESCALk.data_generator.
data_generator
(args)[source]¶ Bases:
object
Generates synthetic data in distributed manner where each MPI process generates a chunk from the data parallelly. The W matrix is generated with gaussian distribution whereas the H matrix is random.
- Parameters
args (class) -- Class which comprises following attributes
fpath (str) -- Directory path of file to be stored
p_r (int) -- Count of row processor in the cartesian grid
p_c (int) -- Count of column processor in the cartesian grid
m (int) -- row dimension of the data
n (int) -- Column dimension of the data
k (int) -- Feature count
-
determine_block_index_range_asymm
()[source]¶ Determines the start and end indices for the Data block for each rank
-
dist_fromfunction
(func, shape, pgrid, *args, unravel_index=<function unravel_index>, **kwargs)[source]¶ produces X_{i,j} = func(i,j) in a distributed manner, so that each processor has an array_split section of X according to the grid.
-
gauss_matrix_generator
(n, k, axis=0)[source]¶ Construct a matrix of dimensions n by k where the ith column is a Gaussian kernel corresponding to approximately N(i*n/k, 0.01*n^2)
- Parameters
n (int) -- the ambient space dimension
k (int) -- the latent space diemnsion
- Returns
W -- A matrix with Gaussian kernel columns of size n x k.
- Return type
ndarray
pyDRESCALk.data_io module¶
-
class
pyDRESCALk.data_io.
data_read
(args)[source]¶ Bases:
object
Class for reading data.
- Parameters
args (class) -- Class which comprises following attributes
fpath (str) -- Directory path of file to be read
pgrid (tuple) -- Cartesian grid configuration
ftype (str) -- Type of data to read(mat/npy/csv/folder)
fname (str) -- Name of the file to read
comm (object) -- comm object for distributed read
-
data_partition
()[source]¶ This function divides the input matrix into chunks as specified by grid configuration.
Return n array of shape (nrows_i, ncols_i) where i is the index of each chunk. Sum_i^n ( nrows_i * ncols_i ) = arr.size
If arr is a 2D array, the returned array should look like n subblocks with each subblock preserving the "physical" layout of arr.
-
class
pyDRESCALk.data_io.
data_write
(args)[source]¶ Bases:
object
Class for writing data/results.
- Parameters
args (class) -- class which comprises following attributes
results_path (str) -- Directory path of file to write
pgrid (tuple) -- Cartesian grid configuration
ftype (str) -- Type of data to read(mat/npy/csv/folder)
comm (object) -- comm object for distributed read
pyDRESCALk.dist_clustering module¶
-
class
pyDRESCALk.dist_clustering.
custom_clustering
(Wall, Hall, params)[source]¶ Bases:
object
Greedy algorithm to approximate a quadratic assignment problem to cluster vectors. Given p groups of k vectors, construct k clusters, each cluster containing a single vector from each of the p groups. This clustering approximation uses cos distances and mean centroids.
- Parameters
A_all (ndarray) -- Order three tensor of shape m by k by p, where m is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.
R_all (ndarray) -- Order three tensor of shape n by k by p, where n is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.
params (class) -- Class object with communication parameters which comprises of grid information (p_r,p_c) , commincator (comm) and epsilon (eps).
-
dist_custom_clustering
(centroids=None, vb=0)[source]¶ Performs the distributed custom clustering
- Parameters
centroids (ndarray, optional) -- The m by k initialization of the centroids of the clusters. None corresponds to using the first slice, A_all[:,:,0], as the initial centroids. Defaults to None.
vb (bool, optional) -- Verbose to display intermediate results
- Returns
centroids (ndarray) -- The m by k centroids of the clusters
A_all (ndarray) -- Clustered organization of the vectors A_all
R_all (ndarray) -- Clustered organization of the vectors R_all
permute_order (list) -- Indices of the permuted features
-
dist_silhouettes
()[source]¶ Computes the cosine distances silhouettes of a distributed clustering of vectors.
- Returns
sils -- The k by p array of silhouettes where sils[i,j] is the silhouette measure for the vector A_all[:,i,j]
- Return type
ndarray
-
fit
()[source]¶ Calls the sub routines to perform distributed custom clustering and compute silhouettes
- Returns
centroids (ndarray) -- The m by k centroids of the clusters
CentStd (ndarray) -- Absolute deviation of the features from the centroid
A_all (ndarray) -- Clustered organization of the vectors A_all
R_all (ndarray) -- Clustered organization of the vectors R_all
S_avg (ndarray) -- mean Silhouette score
permute_order (list) -- Indices of the permuted features
pyDRESCALk.dist_comm module¶
-
class
pyDRESCALk.dist_comm.
MPI_comm
(comm, p_r, p_c)[source]¶ Bases:
object
Initialization of MPI communicator to construct the cartesian topology and sub communicators
- Parameters
comm (object) -- MPI communicator object
p_r (int) -- row processors count
p_c (int) -- column processors count
pyDRESCALk.dist_rescal module¶
-
class
pyDRESCALk.dist_rescal.
rescal_algorithms_2D
(X_ijk, A_i, A_j, R_ijk, params=None)[source]¶ Bases:
object
Performs the distributed RESCAL operation along 2D cartesian grid
- Parameters
X_ijk (ndarray) -- Distributed Data
A_ij (ndarray) -- Distributed factor A
R_ijk (ndarray) -- Distributed factor R
params (class) -- Class which comprises following attributes
params.comm1 (object) -- Global Communicator
params.comm (object) -- Modified communicator object
params.k (int) -- Rank for decomposition
params.m (int) -- Global dimensions m
params.n (int) -- Global dimensions n
params.p_r (int) -- Cartesian grid row count
params.p_c (int) -- Cartesian grid column count
params.row_comm (object) -- Sub communicator along row
params.col_comm (object) -- Sub communicator along columns
params.W_update (bool) -- flag to set W update True/False
params.norm (str) -- NMF norm to be minimized
params.method (str) -- NMF optimization method
params.eps (float) -- Epsilon value
-
Fro_MU_update
(A_update=True)[source]¶ Frobenius norm based multiplicative update of A and R parameter Function computes updated A and R parameter for each mpi rank
- Parameters
self (object) --
- Returns
self.A_i (ndarray)
self.R_ijk (ndarray)
-
column_mm
(A, B)[source]¶ Distributed matrix multiplication along column of matrix
Computes the matrix multiplication of matrix A and B along column sub communicator .. math:: AB
- Parameters
A (ndarray) --
B (ndarray) --
- Returns
AB_glob
- Return type
ndarray
-
global_gram
(A)[source]¶ Distributed gram computation
Computes the global gram operation of matrix A .. math:: A^TA
- Parameters
A (ndarray) --
- Returns
A_TA_glob
- Return type
ndarray
pyDRESCALk.main module¶
pyDRESCALk.plot_results module¶
-
pyDRESCALk.plot_results.
box_plot
(dat, respath)[source]¶ Plots the boxplot from the given data and saves the results
-
pyDRESCALk.plot_results.
plot_W
(W)[source]¶ Reads a factor and plots into subplots for each component
-
pyDRESCALk.plot_results.
plot_err
(err)[source]¶ Plots the relative error for NMF decomposition as a function of number of iterations
-
pyDRESCALk.plot_results.
plot_results
(startProcess, endProcess, stepProcess, RECON, SILL_AVG, SILL_MIN, out_put, name)[source]¶ Plots the relative error and Silhouette results for estimation of k
-
pyDRESCALk.plot_results.
plot_results_paper
(startProcess, endProcess, stepProcess, RECON, SILL_AVG, SILL_MIN, out_put, name, k=-1)[source]¶
-
pyDRESCALk.plot_results.
plot_timing_stats
(fpath, respath)[source]¶ Plots the timing stats for the MPI operation. fpath: Stats data path respath: Path to save graph
pyDRESCALk.pyDRESCAL module¶
-
class
pyDRESCALk.pyDRESCAL.
pyDRESCAL
(X_ijk, factors=None, save_factors=False, params=None)[source]¶ Bases:
object
Performs the distributed NMF decomposition of given matrix X into factors W and H
- Parameters
A_ij (ndarray) -- Distributed Data
factors (tuple) -- Distributed factors W and H
params (class) -- Class which comprises following attributes
params.init (str) -- NMF initialization(rand/nnsvd)
params.comm1 (object) -- Global Communicator
params.comm (object) -- Modified communicator object
params.k (int) -- Rank for decomposition
params.m (int) -- Global dimensions m
params.n (int) -- Global dimensions n
params.p_r (int) -- Cartesian grid row count
params.p_c (int) -- Cartesian grid column count
params.row_comm (object) -- Sub communicator along row
params.col_comm (object) -- Sub communicator along columns
params.W_update (bool) -- flag to set W update True/False
params.norm (str) -- NMF norm to be minimized
params.method (str) -- NMF optimization method
params.eps (float) -- Epsilon value
params.verbose (bool) -- Flag to enable/disable display results
params.save_factors (bool) -- Flag to enable/disable saving computed factors
-
compute_global_dim
()[source]¶ Computes global dimensions m and n from given chunk sizes for any grid configuration
-
fit
()[source]¶ Calls the sub routines to perform distributed NMF decomposition with initialization for a given norm minimization and update method
- Returns
W_i (ndarray) -- Factor W of shape m/p_r * k
H_j (ndarray) -- Factor H of shape k * n/p_c
recon_err (float) -- Reconstruction error for NMF decomposition
pyDRESCALk.pyDRESCALk module¶
-
class
pyDRESCALk.pyDRESCALk.
pyDRESCALk
(X_ijk, factors=None, params=None)[source]¶ Bases:
object
Performs the distributed RESCAL decomposition with custom clustering for estimating hidden factors k
- Parameters
A_ij (ndarray) -- Distributed Data
factors (tuple) -- Distributed factors W and H
params (class) -- Class which comprises following attributes
params.init (str) -- RESCAL initialization(rand/nnsvd)
params.comm1 (object) -- Global Communicator
params.comm (object) -- Modified communicator object
params.k (int) -- Rank for decomposition
params.m (int) -- Global dimensions m
params.n (int) -- Global dimensions n
params.p_r (int) -- Cartesian grid row count
params.p_c (int) -- Cartesian grid column count
params.row_comm (object) -- Sub communicator along row
params.col_comm (object) -- Sub communicator along columns
params.A_update (bool) -- flag to set W update True/False
params.norm (str) -- RESCAL norm to be minimized
params.method (str) -- RESCAL optimization method
params.eps (float) -- Epsilon value
params.verbose (bool) -- Flag to enable/disable display results
params.save_factors (bool) -- Flag to enable/disable saving computed factors
params.perturbations (int) -- Number of Perturbations for clustering
params.noise_var (float) -- Set noise variance for perturbing the data
params.sill_thr (float) -- Set the sillhouette threshold for estimating K with p-test
params.start_k (int) -- Starting range for Feature search K
params.end_k (int) -- Ending range for Feature search K
-
class
pyDRESCALk.pyDRESCALk.
sample
(data, noise_var, method, params, seed=None)[source]¶ Bases:
object
Generates perturbed version of data based on sampling distribution.
- Parameters
data (ndarray, sparse matrix) -- Array of which to find a perturbation.
noise_var (float) -- The perturbation amount.
method (str) -- Method for sampling (uniform/poisson)
seed (float) -- Set seed for random data generation
-
fit
()[source]¶ Calls the sub routines to perform resampling on data
- Returns
X_per -- Perturbed version of data
- Return type
ndarry
pyDRESCALk.utils module¶
-
class
pyDRESCALk.utils.
comm_timing
[source]¶ Bases:
object
Decorator class for computing timing for MPI operations. The class uses the global variables flag and time initialized in config file and updates them for each call dynamically.
- Parameters
flag (bool) -- if Set true, enables the decorator to compute the timings.
time (dict) -- Dictionary to store timing for each function calls
-
class
pyDRESCALk.utils.
data_operations
(data)[source]¶ Bases:
object
Performs various operations on the data
- Parameters
data (ndarray) -- Data to operate on
-
class
pyDRESCALk.utils.
determine_block_params
(comm, pgrid, shape)[source]¶ Bases:
object
Computes the parameters for each chunk to be read by MPI process
- Parameters
comm (object) -- MPI communicator object
pgrid (tuple) -- Cartesian grid configuration
shape (tuple) -- Data shape
-
pyDRESCALk.utils.
norm
(X, comm, norm=2, axis=None, p=-1)[source]¶ Compute the data norm
- Parameters
X (ndarray) -- Data to operate on
comm (object) -- MPI communicator object
norm (int) -- type of norm to be computed
axis (int) -- axis of array for the norm to be computed along
p (int) -- Processor count
- Returns
norm -- Norm of the given data X
- Return type
float
-
class
pyDRESCALk.utils.
parse
[source]¶ Bases:
object
Define a class parse which is used for adding attributes