pyDRESCALk package¶

Submodules¶

pyDRESCALk.config module¶

pyDRESCALk.config.init(arg)[source]¶: Global variables declaration here. The variables declared within this function in this file are shared across other files and functions during import.

pyDRESCALk.data_generator module¶

class pyDRESCALk.data_generator.data_generator(args)[source]¶

Bases: object

Generates synthetic data in distributed manner where each MPI process generates a chunk from the data parallelly. The W matrix is generated with gaussian distribution whereas the H matrix is random.

Parameters

args (class) -- Class which comprises following attributes
fpath (str) -- Directory path of file to be stored
p_r (int) -- Count of row processor in the cartesian grid
p_c (int) -- Count of column processor in the cartesian grid
m (int) -- row dimension of the data
n (int) -- Column dimension of the data
k (int) -- Feature count

create_folder_dir(fpath)[source]¶: Create a folder if doesn't exist

determine_block_index_range_asymm()[source]¶: Determines the start and end indices for the Data block for each rank

determine_block_shape_asymm()[source]¶: Determines the shape for the Data block for each rank

dist_fromfunction(func, shape, pgrid, *args, unravel_index=<function unravel_index>, **kwargs)[source]¶: produces X_{i,j} = func(i,j) in a distributed manner, so that each processor has an array_split section of X according to the grid.

fit()[source]¶: generates and save factors

gauss_matrix_generator(n, k, axis=0)[source]¶

Construct a matrix of dimensions n by k where the ith column is a Gaussian kernel corresponding to approximately N(i*n/k, 0.01*n^2)

Parameters

n (int) -- the ambient space dimension
k (int) -- the latent space diemnsion

Returns

W -- A matrix with Gaussian kernel columns of size n x k.

Return type

ndarray

generate_factors_data()[source]¶: Generates the chunk of factors W,H and data X for each MPI process

random_matrix_generator(k, n, seed)[source]¶: Generator for random matric with given seed

unravel_column()[source]¶: finds the column rank for 2d grid

unravel_row()[source]¶: finds the row rank for 2d grid

pyDRESCALk.data_generator.parser()[source]¶: Reads the input arguments from the user and parses the parameters to the data generator module.

pyDRESCALk.data_io module¶

class pyDRESCALk.data_io.data_read(args)[source]¶

Bases: object

Class for reading data.

Parameters

args (class) -- Class which comprises following attributes
fpath (str) -- Directory path of file to be read
pgrid (tuple) -- Cartesian grid configuration
ftype (str) -- Type of data to read(mat/npy/csv/folder)
fname (str) -- Name of the file to read
comm (object) -- comm object for distributed read

data_partition()[source]¶

This function divides the input matrix into chunks as specified by grid configuration.

Return n array of shape (nrows_i, ncols_i) where i is the index of each chunk. Sum_i^n ( nrows_i * ncols_i ) = arr.size

If arr is a 2D array, the returned array should look like n subblocks with each subblock preserving the "physical" layout of arr.

read()[source]¶: Data read function

read_dat()[source]¶: Function for reading the data and split into chunks to be reach by each MPI rank

read_file_csv()[source]¶: CSV data read function

read_file_mat()[source]¶: mat file read function

read_file_npy()[source]¶: Numpy data read function

read_file_pickle()[source]¶

save_data_to_file(fpath)[source]¶: This function saves the splitted data to numpy array indexed with chunk number

class pyDRESCALk.data_io.data_write(args)[source]¶

Bases: object

Class for writing data/results.

Parameters

args (class) -- class which comprises following attributes
results_path (str) -- Directory path of file to write
pgrid (tuple) -- Cartesian grid configuration
ftype (str) -- Type of data to read(mat/npy/csv/folder)
comm (object) -- comm object for distributed read

create_folder_dir(fpath)[source]¶: Create directory if not present

save_cluster_results(params)[source]¶: Save cluster results to a h5 file with rank 0

save_factors(factors, reg=False)[source]¶: Save the W and H factors for each MPI process

class pyDRESCALk.data_io.read_factors(factors_path, pgrid)[source]¶

Bases: object

Class for reading saved factors.

Parameters

factors_path (str) -- Directory path of factors to read from
pgrid (tuple) -- Cartesian grid configuration

custom_read_npy(fpath)[source]¶: Read numpy files

load_factors()[source]¶: Load the final stacked factors for visualization

read_factor(fpath)[source]¶: Read factors as chunks and stack them

class pyDRESCALk.data_io.split_files_save(data, pgrid, fpath)[source]¶

Bases: object

Rank 0 based data read, split and save

save_data_to_file()[source]¶: Function to save the chunks into numpy files

split_files()[source]¶: Compute the index range for each block and partition the data as per the chunk

pyDRESCALk.dist_clustering module¶

class pyDRESCALk.dist_clustering.custom_clustering(Wall, Hall, params)[source]¶

Bases: object

Greedy algorithm to approximate a quadratic assignment problem to cluster vectors. Given p groups of k vectors, construct k clusters, each cluster containing a single vector from each of the p groups. This clustering approximation uses cos distances and mean centroids.

Parameters

A_all (ndarray) -- Order three tensor of shape m by k by p, where m is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.
R_all (ndarray) -- Order three tensor of shape n by k by p, where n is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.
params (class) -- Class object with communication parameters which comprises of grid information (p_r,p_c) , commincator (comm) and epsilon (eps).

change_order(tens)[source]¶: change the order of features

dist_custom_clustering(centroids=None, vb=0)[source]¶

Performs the distributed custom clustering

Parameters

centroids (ndarray, optional) -- The m by k initialization of the centroids of the clusters. None corresponds to using the first slice, A_all[:,:,0], as the initial centroids. Defaults to None.
vb (bool, optional) -- Verbose to display intermediate results

Returns

centroids (ndarray) -- The m by k centroids of the clusters
A_all (ndarray) -- Clustered organization of the vectors A_all
R_all (ndarray) -- Clustered organization of the vectors R_all
permute_order (list) -- Indices of the permuted features

dist_feature_ordering(centroids, W_sub)[source]¶: return the features in proper order

dist_silhouettes()[source]¶

Computes the cosine distances silhouettes of a distributed clustering of vectors.

Returns: sils -- The k by p array of silhouettes where sils[i,j] is the silhouette measure for the vector A_all[:,i,j]
Return type: ndarray

fit()[source]¶

Calls the sub routines to perform distributed custom clustering and compute silhouettes

Returns

centroids (ndarray) -- The m by k centroids of the clusters
CentStd (ndarray) -- Absolute deviation of the features from the centroid
A_all (ndarray) -- Clustered organization of the vectors A_all
R_all (ndarray) -- Clustered organization of the vectors R_all
S_avg (ndarray) -- mean Silhouette score
permute_order (list) -- Indices of the permuted features

greedy_lsa(A)[source]¶: Return the permutation order

mad(data, flag=1, axis=-1)[source]¶: Compute the median/mean absolute deviation

normalize_by_A()[source]¶: Normalize the factors A and R

pyDRESCALk.dist_comm module¶

class pyDRESCALk.dist_comm.MPI_comm(comm, p_r, p_c)[source]¶

Bases: object

Initialization of MPI communicator to construct the cartesian topology and sub communicators

Parameters

comm (object) -- MPI communicator object
p_r (int) -- row processors count
p_c (int) -- column processors count

Free()[source]¶: Frees the sub communicators

cart_1d_column()[source]¶

Constructs a cartesian column communicator through construction of a sub communicator across columns

Returns: cartesian1d_column -- Sub Communicator object
Return type: object

cart_1d_row()[source]¶

Constructs a cartesian row communicator through construction of a sub communicator across rows

Returns: cartesian1d_row -- Sub Communicator object
Return type: object

pyDRESCALk.dist_rescal module¶

class pyDRESCALk.dist_rescal.rescal_algorithms_2D(X_ijk, A_i, A_j, R_ijk, params=None)[source]¶

Bases: object

Performs the distributed RESCAL operation along 2D cartesian grid

Parameters

X_ijk (ndarray) -- Distributed Data
A_ij (ndarray) -- Distributed factor A
R_ijk (ndarray) -- Distributed factor R
params (class) -- Class which comprises following attributes
params.comm1 (object) -- Global Communicator
params.comm (object) -- Modified communicator object
params.k (int) -- Rank for decomposition
params.m (int) -- Global dimensions m
params.n (int) -- Global dimensions n
params.p_r (int) -- Cartesian grid row count
params.p_c (int) -- Cartesian grid column count
params.row_comm (object) -- Sub communicator along row
params.col_comm (object) -- Sub communicator along columns
params.W_update (bool) -- flag to set W update True/False
params.norm (str) -- NMF norm to be minimized
params.method (str) -- NMF optimization method
params.eps (float) -- Epsilon value

Fro_MU_update(A_update=True)[source]¶

Frobenius norm based multiplicative update of A and R parameter Function computes updated A and R parameter for each mpi rank

Parameters

self (object) --

Returns

self.A_i (ndarray)
self.R_ijk (ndarray)

column_broadcast(A)[source]¶: Performs all reduce along column sub communicator

column_mm(A, B)[source]¶

Distributed matrix multiplication along column of matrix

Computes the matrix multiplication of matrix A and B along column sub communicator .. math:: AB

Parameters

A (ndarray) --
B (ndarray) --

Returns

AB_glob

Return type

ndarray

column_reduce(A)[source]¶: Performs all reduce along column sub communicator

element_op(A, B, operation)[source]¶: Performs Element operations between A and B

global_gram(A)[source]¶

Distributed gram computation

Computes the global gram operation of matrix A .. math:: A^TA

Parameters: A (ndarray) --
Returns: A_TA_glob
Return type: ndarray

gram_mul(A)[source]¶: Computes the gram operation of matrix A

matrix_mul(A, B)[source]¶: Computes the matrix multiplication of matrix A and B

row_broadcast(A)[source]¶: Performs broadcast along row sub communicator

row_mm(A, B)[source]¶

Distributed matrix multiplication along row of matrix

Computes the matrix multiplication of matrix A and B along row sub communicator .. math:: AB

Parameters

A (ndarray) --
B (ndarray) --

Returns

AB_glob

Return type

ndarray

row_reduce(A)[source]¶: Performs all reduce along row sub communicator

update()[source]¶

Performs 1 step Update for factors W and H based on NMF method and corresponding norm minimization

Returns

W_ij (ndarray) -- The m/p X k distributed factor W
H_ij (ndarray) -- The k X n/p distributed factor H

pyDRESCALk.main module¶

pyDRESCALk.main.parser_pyRescal(parser)[source]¶

pyDRESCALk.main.parser_pyRescalk(parser)[source]¶

pyDRESCALk.plot_results module¶

pyDRESCALk.plot_results.box_plot(dat, respath)[source]¶: Plots the boxplot from the given data and saves the results

pyDRESCALk.plot_results.plot_W(W)[source]¶: Reads a factor and plots into subplots for each component

pyDRESCALk.plot_results.plot_err(err)[source]¶: Plots the relative error for NMF decomposition as a function of number of iterations

pyDRESCALk.plot_results.plot_results(startProcess, endProcess, stepProcess, RECON, SILL_AVG, SILL_MIN, out_put, name)[source]¶: Plots the relative error and Silhouette results for estimation of k

pyDRESCALk.plot_results.plot_results_paper(startProcess, endProcess, stepProcess, RECON, SILL_AVG, SILL_MIN, out_put, name, k=-1)[source]¶

pyDRESCALk.plot_results.plot_timing_stats(fpath, respath)[source]¶: Plots the timing stats for the MPI operation. fpath: Stats data path respath: Path to save graph

pyDRESCALk.plot_results.read_plot_factors(factors_path, pgrid)[source]¶: Reads the factors W and H and Plots them

pyDRESCALk.plot_results.timing_stats(fpath)[source]¶: Reads the timing stats dictionary from the stored file and parses the data.

pyDRESCALk.pyDRESCAL module¶

class pyDRESCALk.pyDRESCAL.pyDRESCAL(X_ijk, factors=None, save_factors=False, params=None)[source]¶

Bases: object

Performs the distributed NMF decomposition of given matrix X into factors W and H

Parameters

A_ij (ndarray) -- Distributed Data
factors (tuple) -- Distributed factors W and H
params (class) -- Class which comprises following attributes
params.init (str) -- NMF initialization(rand/nnsvd)
params.comm1 (object) -- Global Communicator
params.comm (object) -- Modified communicator object
params.k (int) -- Rank for decomposition
params.m (int) -- Global dimensions m
params.n (int) -- Global dimensions n
params.p_r (int) -- Cartesian grid row count
params.p_c (int) -- Cartesian grid column count
params.row_comm (object) -- Sub communicator along row
params.col_comm (object) -- Sub communicator along columns
params.W_update (bool) -- flag to set W update True/False
params.norm (str) -- NMF norm to be minimized
params.method (str) -- NMF optimization method
params.eps (float) -- Epsilon value
params.verbose (bool) -- Flag to enable/disable display results
params.save_factors (bool) -- Flag to enable/disable saving computed factors

compute_global_dim()[source]¶: Computes global dimensions m and n from given chunk sizes for any grid configuration

dist_norm(X, proc=-1, norm='fro', axis=None)[source]¶: Computes the distributed norm

fit()[source]¶

Calls the sub routines to perform distributed NMF decomposition with initialization for a given norm minimization and update method

Returns

W_i (ndarray) -- Factor W of shape m/p_r * k
H_j (ndarray) -- Factor H of shape k * n/p_c
recon_err (float) -- Reconstruction error for NMF decomposition

init_factors()[source]¶: Initializes Rescal factors with rand/nnsvd method

normalize_features(Wall, Wall1, Hall)[source]¶: Normalizes features Wall and Hall

relative_err()[source]¶: Computes the relative error for NMF decomposition

pyDRESCALk.pyDRESCALk module¶

class pyDRESCALk.pyDRESCALk.pyDRESCALk(X_ijk, factors=None, params=None)[source]¶

Bases: object

Performs the distributed RESCAL decomposition with custom clustering for estimating hidden factors k

Parameters

A_ij (ndarray) -- Distributed Data
factors (tuple) -- Distributed factors W and H
params (class) -- Class which comprises following attributes
params.init (str) -- RESCAL initialization(rand/nnsvd)
params.comm1 (object) -- Global Communicator
params.comm (object) -- Modified communicator object
params.k (int) -- Rank for decomposition
params.m (int) -- Global dimensions m
params.n (int) -- Global dimensions n
params.p_r (int) -- Cartesian grid row count
params.p_c (int) -- Cartesian grid column count
params.row_comm (object) -- Sub communicator along row
params.col_comm (object) -- Sub communicator along columns
params.A_update (bool) -- flag to set W update True/False
params.norm (str) -- RESCAL norm to be minimized
params.method (str) -- RESCAL optimization method
params.eps (float) -- Epsilon value
params.verbose (bool) -- Flag to enable/disable display results
params.save_factors (bool) -- Flag to enable/disable saving computed factors
params.perturbations (int) -- Number of Perturbations for clustering
params.noise_var (float) -- Set noise variance for perturbing the data
params.sill_thr (float) -- Set the sillhouette threshold for estimating K with p-test
params.start_k (int) -- Starting range for Feature search K
params.end_k (int) -- Ending range for Feature search K

fit()[source]¶

Calls the sub routines to perform distributed RESCAL decomposition and then custom clustering to estimate k

Returns: nopt -- Estimated value of latent features
Return type: int

pyrescalk_per_k()[source]¶: Performs RESCAL decomposition and clustering for each k to estimate silhouette statistics

class pyDRESCALk.pyDRESCALk.sample(data, noise_var, method, params, seed=None)[source]¶

Bases: object

Generates perturbed version of data based on sampling distribution.

Parameters

data (ndarray, sparse matrix) -- Array of which to find a perturbation.
noise_var (float) -- The perturbation amount.
method (str) -- Method for sampling (uniform/poisson)
seed (float) -- Set seed for random data generation

fit()[source]¶

Calls the sub routines to perform resampling on data

Returns: X_per -- Perturbed version of data
Return type: ndarry

poisson()[source]¶: Resamples each element of a matrix from a Poisson distribution with the mean set by that element. Y_{i,j} = Poisson(X_{i,j}

randM()[source]¶: Multiplies each element of X by a uniform random number in (1-epsilon, 1+epsilon).

pyDRESCALk.utils module¶

class pyDRESCALk.utils.comm_timing[source]¶

Bases: object

Decorator class for computing timing for MPI operations. The class uses the global variables flag and time initialized in config file and updates them for each call dynamically.

Parameters

flag (bool) -- if Set true, enables the decorator to compute the timings.
time (dict) -- Dictionary to store timing for each function calls

class pyDRESCALk.utils.count_flops[source]¶: Bases: object

class pyDRESCALk.utils.count_memory[source]¶: Bases: object

class pyDRESCALk.utils.data_operations(data)[source]¶

Bases: object

Performs various operations on the data

Parameters: data (ndarray) -- Data to operate on

commonFactors(intList)[source]¶

cutZero(thresh=1e-08)[source]¶: Prunes zero columns from the data

desampleT(factor, axis=0)[source]¶

matSplit(name, p_r, p_c, format='npy')[source]¶

primeFactors(n)[source]¶

recZero(indexList)[source]¶

remove_bad_factors(Wall, Hall, ErrTol, features_k)[source]¶

class pyDRESCALk.utils.determine_block_params(comm, pgrid, shape)[source]¶

Bases: object

Computes the parameters for each chunk to be read by MPI process

Parameters

comm (object) -- MPI communicator object
pgrid (tuple) -- Cartesian grid configuration
shape (tuple) -- Data shape

determine_block_index_range_asymm()[source]¶: Determines the start and end indices for the Data block for each rank

determine_block_shape_asymm()[source]¶: Determines the shape for the Data block for each rank

pyDRESCALk.utils.norm(X, comm, norm=2, axis=None, p=-1)[source]¶

Compute the data norm

Parameters

X (ndarray) -- Data to operate on
comm (object) -- MPI communicator object
norm (int) -- type of norm to be computed
axis (int) -- axis of array for the norm to be computed along
p (int) -- Processor count

Returns

norm -- Norm of the given data X

Return type

float

class pyDRESCALk.utils.parse[source]¶

Bases: object

Define a class parse which is used for adding attributes

pyDRESCALk.utils.str2bool(v)[source]¶: Returns instance of string parameter to bool type

class pyDRESCALk.utils.transform_H_index(grid)[source]¶

Bases: object

Collected H factors after MPI operation aren't aligned. This operation performs careful reordering of H factors such that the collected factors are aligned

rankidx2blkidx()[source]¶: This is to transform the column index to rank index for H

transform_H_idx(rank)[source]¶: This is to transform H based on new index

pyDRESCALk.utils.var_init(clas, var, default)[source]¶: Checks if class attribute is present and if not, intializes the attribute with given default value

pyDRESCALk package¶

Submodules¶

pyDRESCALk.config module¶

pyDRESCALk.data_generator module¶

pyDRESCALk.data_io module¶

pyDRESCALk.dist_clustering module¶

pyDRESCALk.dist_comm module¶

pyDRESCALk.dist_rescal module¶

pyDRESCALk.main module¶

pyDRESCALk.plot_results module¶

pyDRESCALk.pyDRESCAL module¶

pyDRESCALk.pyDRESCALk module¶

pyDRESCALk.utils module¶

pyDRESCALk.version module¶

Module contents¶