pyDRESCALk package

Submodules

pyDRESCALk.config module

pyDRESCALk.config.init(arg)[source]

Declares the global variables. Variables declared within this function are shared with other files and functions on import.

pyDRESCALk.data_generator module

class pyDRESCALk.data_generator.data_generator(args)[source]

Bases: object

Generates synthetic data in a distributed manner, with each MPI process generating its own chunk of the data in parallel. The W matrix is generated with a Gaussian distribution, whereas the H matrix is random.

Parameters
  • args (class) -- Class which comprises following attributes

  • fpath (str) -- Directory path of file to be stored

  • p_r (int) -- Count of row processor in the cartesian grid

  • p_c (int) -- Count of column processor in the cartesian grid

  • m (int) -- row dimension of the data

  • n (int) -- Column dimension of the data

  • k (int) -- Feature count

create_folder_dir(fpath)[source]

Creates the folder if it doesn't exist

determine_block_index_range_asymm()[source]

Determines the start and end indices for the Data block for each rank

determine_block_shape_asymm()[source]

Determines the shape for the Data block for each rank
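
As a rough sketch of how such index ranges and shapes could be computed, the following assumes an array_split-style partition in which the first n % p blocks receive one extra element. The helpers block_range and block_shape are hypothetical names for illustration and are not part of pyDRESCALk.

```python
def block_range(n, p, i):
    """Return the half-open range [start, end) of block i when n items are split into p blocks."""
    base, extra = divmod(n, p)
    start = i * base + min(i, extra)
    end = start + base + (1 if i < extra else 0)
    return start, end


def block_shape(shape, pgrid, coords):
    """Return the local (rows, cols) of the block at grid coordinates coords."""
    (m, n), (p_r, p_c), (r, c) = shape, pgrid, coords
    r0, r1 = block_range(m, p_r, r)
    c0, c1 = block_range(n, p_c, c)
    return r1 - r0, c1 - c0


# A 10 x 7 matrix on a 3 x 2 grid (assumed sizes, for illustration only).
row_ranges = [block_range(10, 3, i) for i in range(3)]
shape_00 = block_shape((10, 7), (3, 2), (0, 0))
```

Because the remainder is spread over the leading blocks, block sizes differ by at most one, which is what makes the partition "asymmetric" when the dimension is not divisible by the grid size.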

dist_fromfunction(func, shape, pgrid, *args, unravel_index=<function unravel_index>, **kwargs)[source]

Produces X_{i,j} = func(i,j) in a distributed manner, so that each processor holds an array_split section of X according to the grid.
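
The idea can be illustrated without MPI: each grid position evaluates func only on the global indices it owns. This is a serial sketch under the assumption of an array_split-style partition; split_bounds and local_fromfunction are hypothetical helpers, not the package's implementation.

```python
def split_bounds(n, p):
    """Boundaries of an array_split-style partition of range(n) into p parts."""
    base, extra = divmod(n, p)
    bounds = [0]
    for i in range(p):
        bounds.append(bounds[-1] + base + (1 if i < extra else 0))
    return bounds


def local_fromfunction(func, shape, pgrid, coords):
    """Evaluate func(i, j) only on the global indices owned by one grid position."""
    rows = split_bounds(shape[0], pgrid[0])
    cols = split_bounds(shape[1], pgrid[1])
    r, c = coords
    return [[func(i, j) for j in range(cols[c], cols[c + 1])]
            for i in range(rows[r], rows[r + 1])]


# Block held by grid position (1, 0) of a 2 x 2 grid for a 4 x 5 matrix.
block = local_fromfunction(lambda i, j: 10 * i + j, (4, 5), (2, 2), (1, 0))
```

No process ever materialises the full X; the global indices passed to func are what keep the local blocks consistent with each other.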

fit()[source]

Generates and saves the factors

gauss_matrix_generator(n, k, axis=0)[source]

Construct a matrix of dimensions n by k where the ith column is a Gaussian kernel corresponding to approximately N(i*n/k, 0.01*n^2)

Parameters
  • n (int) -- the ambient space dimension

  • k (int) -- the latent space dimension

Returns

W -- A matrix with Gaussian kernel columns of size n x k.

Return type

ndarray
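
A minimal pure-Python sketch of the construction described above, assuming the kernel is left unnormalised and that rows are indexed 0..n-1. The function name gauss_matrix is hypothetical, and the package's own generator may scale or normalise the columns differently.

```python
import math


def gauss_matrix(n, k):
    """Build an n x k nested list whose i-th column samples a Gaussian bump
    centred at i * n / k with variance 0.01 * n**2 (left unnormalised)."""
    var = 0.01 * n ** 2
    return [[math.exp(-((x - i * n / k) ** 2) / (2.0 * var)) for i in range(k)]
            for x in range(n)]


W = gauss_matrix(20, 4)
# Row index at which each column attains its maximum.
peaks = [max(range(20), key=lambda x, i=i: W[x][i]) for i in range(4)]
```

With n = 20 and k = 4 the column centres fall at rows 0, 5, 10 and 15, so each latent feature is a smooth bump over a different part of the ambient dimension.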

generate_factors_data()[source]

Generates the chunk of factors W,H and data X for each MPI process

random_matrix_generator(k, n, seed)[source]

Generator for a random matrix with the given seed

unravel_column()[source]

Finds the column rank for the 2D grid

unravel_row()[source]

Finds the row rank for the 2D grid
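
For illustration, a flat MPI rank can be mapped to grid coordinates with a single divmod. The sketch assumes row-major rank ordering on a p_r x p_c grid, which is an assumption and not confirmed by this reference; unravel_rank is a hypothetical helper.

```python
def unravel_rank(rank, p_r, p_c):
    """Map a flat rank to (row, column) coordinates on a p_r x p_c grid, row-major."""
    if not 0 <= rank < p_r * p_c:
        raise ValueError("rank outside the grid")
    return divmod(rank, p_c)


# Coordinates of all six ranks on a 2 x 3 grid.
coords = [unravel_rank(r, 2, 3) for r in range(6)]
```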

pyDRESCALk.data_generator.parser()[source]

Reads the input arguments from the user and parses the parameters to the data generator module.

pyDRESCALk.data_io module

class pyDRESCALk.data_io.data_read(args)[source]

Bases: object

Class for reading data.

Parameters
  • args (class) -- Class which comprises following attributes

  • fpath (str) -- Directory path of file to be read

  • pgrid (tuple) -- Cartesian grid configuration

  • ftype (str) -- Type of data to read (mat/npy/csv/folder)

  • fname (str) -- Name of the file to read

  • comm (object) -- comm object for distributed read

data_partition()[source]

This function divides the input matrix into chunks as specified by grid configuration.

Returns n arrays of shape (nrows_i, ncols_i), where i is the index of each chunk, such that Sum_i^n ( nrows_i * ncols_i ) = arr.size.

If arr is a 2D array, the returned array should look like n subblocks with each subblock preserving the "physical" layout of arr.

read()[source]

Data read function

read_dat()[source]

Reads the data and splits it into chunks to be read by each MPI rank

read_file_csv()[source]

CSV data read function

read_file_mat()[source]

mat file read function

read_file_npy()[source]

Numpy data read function

read_file_pickle()[source]

save_data_to_file(fpath)[source]

This function saves the split data to numpy arrays indexed by chunk number

class pyDRESCALk.data_io.data_write(args)[source]

Bases: object

Class for writing data/results.

Parameters
  • args (class) -- class which comprises following attributes

  • results_path (str) -- Directory path of file to write

  • pgrid (tuple) -- Cartesian grid configuration

  • ftype (str) -- Type of data to write (mat/npy/csv/folder)

  • comm (object) -- comm object for distributed read

create_folder_dir(fpath)[source]

Create directory if not present

save_cluster_results(params)[source]

Save cluster results to an h5 file from rank 0

save_factors(factors, reg=False)[source]

Save the W and H factors for each MPI process

class pyDRESCALk.data_io.read_factors(factors_path, pgrid)[source]

Bases: object

Class for reading saved factors.

Parameters
  • factors_path (str) -- Directory path of factors to read from

  • pgrid (tuple) -- Cartesian grid configuration

custom_read_npy(fpath)[source]

Read numpy files

load_factors()[source]

Load the final stacked factors for visualization

read_factor(fpath)[source]

Read factors as chunks and stack them

class pyDRESCALk.data_io.split_files_save(data, pgrid, fpath)[source]

Bases: object

Rank 0 based data read, split and save

save_data_to_file()[source]

Function to save the chunks into numpy files

split_files()[source]

Compute the index range for each block and partition the data as per the chunk

pyDRESCALk.dist_clustering module

class pyDRESCALk.dist_clustering.custom_clustering(Wall, Hall, params)[source]

Bases: object

Greedy algorithm to approximate a quadratic assignment problem to cluster vectors. Given p groups of k vectors, construct k clusters, each cluster containing a single vector from each of the p groups. This clustering approximation uses cosine distances and mean centroids.

Parameters
  • A_all (ndarray) -- Order three tensor of shape m by k by p, where m is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.

  • R_all (ndarray) -- Order three tensor of shape n by k by p, where n is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.

  • params (class) -- Class object with communication parameters, comprising the grid information (p_r, p_c), the communicator (comm) and epsilon (eps).

change_order(tens)[source]

Changes the order of the features

dist_custom_clustering(centroids=None, vb=0)[source]

Performs the distributed custom clustering

Parameters
  • centroids (ndarray, optional) -- The m by k initialization of the centroids of the clusters. None corresponds to using the first slice, A_all[:,:,0], as the initial centroids. Defaults to None.

  • vb (bool, optional) -- Verbose to display intermediate results

Returns

  • centroids (ndarray) -- The m by k centroids of the clusters

  • A_all (ndarray) -- Clustered organization of the vectors A_all

  • R_all (ndarray) -- Clustered organization of the vectors R_all

  • permute_order (list) -- Indices of the permuted features

dist_feature_ordering(centroids, W_sub)[source]

Returns the features in the proper order

dist_silhouettes()[source]

Computes the cosine distances silhouettes of a distributed clustering of vectors.

Returns

sils -- The k by p array of silhouettes where sils[i,j] is the silhouette measure for the vector A_all[:,i,j]

Return type

ndarray
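
To make the silhouette measure concrete, here is a serial sketch using cosine distance. It stores the clustering as a list of clusters, each holding p vectors, rather than as the m by k by p tensor used above; the functions cos_dist and silhouettes are hypothetical and do not reproduce the distributed computation.

```python
import math


def cos_dist(u, v):
    """Cosine distance 1 - cos(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def silhouettes(clusters):
    """clusters[i] is the list of p vectors assigned to cluster i.
    Returns sils with sils[i][j] the silhouette of clusters[i][j]."""
    sils = []
    for i, members in enumerate(clusters):
        row = []
        for j, v in enumerate(members):
            others = [w for jj, w in enumerate(members) if jj != j]
            # a: mean distance to the other members of the same cluster.
            a = sum(cos_dist(v, w) for w in others) / len(others)
            # b: smallest mean distance to the members of any other cluster.
            b = min(sum(cos_dist(v, w) for w in other) / len(other)
                    for ii, other in enumerate(clusters) if ii != i)
            row.append((b - a) / max(a, b))
        sils.append(row)
    return sils


clusters = [
    [(1.0, 0.0), (0.9, 0.1), (1.0, 0.05)],   # vectors near the x axis
    [(0.0, 1.0), (0.1, 0.9), (0.05, 1.0)],   # vectors near the y axis
]
sils = silhouettes(clusters)
```

Values close to 1 indicate tight, well-separated clusters; values near 0 or below suggest that the chosen k does not yield stable features.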

fit()[source]

Calls the sub routines to perform distributed custom clustering and compute silhouettes

Returns

  • centroids (ndarray) -- The m by k centroids of the clusters

  • CentStd (ndarray) -- Absolute deviation of the features from the centroid

  • A_all (ndarray) -- Clustered organization of the vectors A_all

  • R_all (ndarray) -- Clustered organization of the vectors R_all

  • S_avg (ndarray) -- mean Silhouette score

  • permute_order (list) -- Indices of the permuted features

greedy_lsa(A)[source]

Return the permutation order
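
One common way to obtain such a permutation greedily is to repeatedly pick the largest remaining entry of a square similarity matrix and retire its row and column. The sketch below assumes that A is a k x k similarity matrix to be maximised, which this reference does not state; greedy_assignment is a hypothetical stand-in, not the package's routine.

```python
def greedy_assignment(sim):
    """Greedily pair rows with columns of a square similarity matrix.

    Returns perm where perm[row] is the column assigned to that row."""
    k = len(sim)
    free_rows, free_cols = set(range(k)), set(range(k))
    perm = [None] * k
    while free_rows:
        # Largest remaining similarity among unassigned rows and columns.
        r, c = max(((i, j) for i in free_rows for j in free_cols),
                   key=lambda rc: sim[rc[0]][rc[1]])
        perm[r] = c
        free_rows.remove(r)
        free_cols.remove(c)
    return perm


sim = [
    [0.10, 0.95, 0.20],
    [0.90, 0.30, 0.25],
    [0.15, 0.40, 0.80],
]
perm = greedy_assignment(sim)
```

The greedy choice is not guaranteed to be optimal for the assignment problem, but it is cheap and every row ends up with a distinct column.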

mad(data, flag=1, axis=-1)[source]

Compute the median/mean absolute deviation
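
The two statistics differ only in the aggregate used for the centre and for the deviations. The sketch below is for a flat list and uses a hypothetical center argument; how the package's flag and axis parameters select between them is not documented here and is not assumed.

```python
import statistics


def abs_deviation(values, center="median"):
    """Median absolute deviation (center='median') or mean absolute deviation (center='mean')."""
    agg = statistics.median if center == "median" else statistics.fmean
    mid = agg(values)
    return agg([abs(v - mid) for v in values])


data = [1.0, 2.0, 3.0, 4.0, 100.0]
median_ad = abs_deviation(data, "median")
mean_ad = abs_deviation(data, "mean")
```

The median variant is far less sensitive to the outlier 100.0 than the mean variant, which is the usual reason to prefer it as a robust spread estimate.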

normalize_by_A()[source]

Normalize the factors A and R

pyDRESCALk.dist_comm module

class pyDRESCALk.dist_comm.MPI_comm(comm, p_r, p_c)[source]

Bases: object

Initialization of MPI communicator to construct the cartesian topology and sub communicators

Parameters
  • comm (object) -- MPI communicator object

  • p_r (int) -- row processors count

  • p_c (int) -- column processors count

Free()[source]

Frees the sub communicators

cart_1d_column()[source]

Constructs a cartesian column communicator through construction of a sub communicator across columns

Returns

cartesian1d_column -- Sub Communicator object

Return type

object

cart_1d_row()[source]

Constructs a cartesian row communicator through construction of a sub communicator across rows

Returns

cartesian1d_row -- Sub Communicator object

Return type

object

pyDRESCALk.dist_rescal module

class pyDRESCALk.dist_rescal.rescal_algorithms_2D(X_ijk, A_i, A_j, R_ijk, params=None)[source]

Bases: object

Performs the distributed RESCAL operation along 2D cartesian grid

Parameters
  • X_ijk (ndarray) -- Distributed Data

  • A_ij (ndarray) -- Distributed factor A

  • R_ijk (ndarray) -- Distributed factor R

  • params (class) -- Class which comprises following attributes

  • params.comm1 (object) -- Global Communicator

  • params.comm (object) -- Modified communicator object

  • params.k (int) -- Rank for decomposition

  • params.m (int) -- Global dimensions m

  • params.n (int) -- Global dimensions n

  • params.p_r (int) -- Cartesian grid row count

  • params.p_c (int) -- Cartesian grid column count

  • params.row_comm (object) -- Sub communicator along row

  • params.col_comm (object) -- Sub communicator along columns

  • params.W_update (bool) -- flag to set W update True/False

  • params.norm (str) -- NMF norm to be minimized

  • params.method (str) -- NMF optimization method

  • params.eps (float) -- Epsilon value

Fro_MU_update(A_update=True)[source]

Frobenius norm based multiplicative update of the A and R parameters. The function computes the updated A and R parameters for each MPI rank.

Parameters

self (object) --

Returns

  • self.A_i (ndarray)

  • self.R_ijk (ndarray)
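
For orientation, the R half of a Frobenius multiplicative update can be sketched serially for a single slice X ≈ A R A^T with A held fixed, using the standard rule R <- R * (A^T X A) / (A^T A R A^T A + eps). This is an assumption about the form of the update: the package works on distributed blocks and also updates A, neither of which is shown, and all helper names here are hypothetical.

```python
def matmul(A, B):
    """Product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def fro_error(X, A, R):
    """Frobenius norm of X - A R A^T."""
    approx = matmul(matmul(A, R), transpose(A))
    return sum((x - y) ** 2 for rx, ry in zip(X, approx) for x, y in zip(rx, ry)) ** 0.5


def mu_update_R(X, A, R, eps=1e-16):
    """One multiplicative update R <- R * (A^T X A) / (A^T A R A^T A + eps)."""
    At = transpose(A)
    num = matmul(matmul(At, X), A)
    gram = matmul(At, A)
    den = matmul(matmul(gram, R), gram)
    return [[r * n / (d + eps) for r, n, d in zip(rr, rn, rd)]
            for rr, rn, rd in zip(R, num, den)]


A = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
R_true = [[2.0, 1.0], [0.0, 3.0]]
X = matmul(matmul(A, R_true), transpose(A))   # exact rank-2 slice

R = [[1.0, 1.0], [1.0, 1.0]]
err_start = fro_error(X, A, R)
for _ in range(200):
    R = mu_update_R(X, A, R)
err_end = fro_error(X, A, R)
```

Because the update only multiplies by non-negative ratios, R stays non-negative throughout, and eps guards against division by zero.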

column_broadcast(A)[source]

Performs broadcast along column sub communicator

column_mm(A, B)[source]

Distributed matrix multiplication along column of matrix

Computes the matrix product AB of matrices A and B along the column sub communicator.

Parameters
  • A (ndarray) --

  • B (ndarray) --

Returns

AB_glob

Return type

ndarray

column_reduce(A)[source]

Performs all reduce along column sub communicator

element_op(A, B, operation)[source]

Performs Element operations between A and B

global_gram(A)[source]

Distributed gram computation

Computes the global Gram matrix A^TA of matrix A.

Parameters

A (ndarray) --

Returns

A_TA_glob

Return type

ndarray
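
The distributed Gram computation rests on the identity A^TA = Sum_i A_i^T A_i over row blocks A_i. The serial sketch below simulates two ranks and uses a plain sum in place of an MPI all-reduce; gram and add are hypothetical helpers.

```python
def gram(A):
    """Local Gram matrix A^T A of a nested-list matrix."""
    cols = list(zip(*A))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def add(M, N):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rm, rn)] for rm, rn in zip(M, N)]


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
blocks = [A[:2], A[2:]]            # row blocks held by two hypothetical ranks

# Each rank forms its local Gram matrix; the sum stands in for an all-reduce.
local = [gram(b) for b in blocks]
global_gram = add(local[0], local[1])
```

Only k x k matrices are exchanged, so the communication cost is independent of the number of rows held by each rank.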

gram_mul(A)[source]

Computes the gram operation of matrix A

matrix_mul(A, B)[source]

Computes the matrix multiplication of matrix A and B

row_broadcast(A)[source]

Performs broadcast along row sub communicator

row_mm(A, B)[source]

Distributed matrix multiplication along row of matrix

Computes the matrix product AB of matrices A and B along the row sub communicator.

Parameters
  • A (ndarray) --

  • B (ndarray) --

Returns

AB_glob

Return type

ndarray

row_reduce(A)[source]

Performs all reduce along row sub communicator

update()[source]

Performs a one-step update of factors W and H based on the NMF method and the corresponding norm minimization

Returns

  • W_ij (ndarray) -- The m/p X k distributed factor W

  • H_ij (ndarray) -- The k X n/p distributed factor H

pyDRESCALk.main module

pyDRESCALk.main.parser_pyRescal(parser)[source]

pyDRESCALk.main.parser_pyRescalk(parser)[source]

pyDRESCALk.plot_results module

pyDRESCALk.plot_results.box_plot(dat, respath)[source]

Plots the boxplot from the given data and saves the results

pyDRESCALk.plot_results.plot_W(W)[source]

Reads a factor and plots into subplots for each component

pyDRESCALk.plot_results.plot_err(err)[source]

Plots the relative error for NMF decomposition as a function of number of iterations

pyDRESCALk.plot_results.plot_results(startProcess, endProcess, stepProcess, RECON, SILL_AVG, SILL_MIN, out_put, name)[source]

Plots the relative error and Silhouette results for estimation of k

pyDRESCALk.plot_results.plot_results_paper(startProcess, endProcess, stepProcess, RECON, SILL_AVG, SILL_MIN, out_put, name, k=-1)[source]

pyDRESCALk.plot_results.plot_timing_stats(fpath, respath)[source]

Plots the timing stats for the MPI operation.

Parameters
  • fpath -- Stats data path

  • respath -- Path to save graph

pyDRESCALk.plot_results.read_plot_factors(factors_path, pgrid)[source]

Reads the factors W and H and Plots them

pyDRESCALk.plot_results.timing_stats(fpath)[source]

Reads the timing stats dictionary from the stored file and parses the data.

pyDRESCALk.pyDRESCAL module

class pyDRESCALk.pyDRESCAL.pyDRESCAL(X_ijk, factors=None, save_factors=False, params=None)[source]

Bases: object

Performs the distributed NMF decomposition of given matrix X into factors W and H

Parameters
  • A_ij (ndarray) -- Distributed Data

  • factors (tuple) -- Distributed factors W and H

  • params (class) -- Class which comprises following attributes

  • params.init (str) -- NMF initialization (rand/nnsvd)

  • params.comm1 (object) -- Global Communicator

  • params.comm (object) -- Modified communicator object

  • params.k (int) -- Rank for decomposition

  • params.m (int) -- Global dimensions m

  • params.n (int) -- Global dimensions n

  • params.p_r (int) -- Cartesian grid row count

  • params.p_c (int) -- Cartesian grid column count

  • params.row_comm (object) -- Sub communicator along row

  • params.col_comm (object) -- Sub communicator along columns

  • params.W_update (bool) -- flag to set W update True/False

  • params.norm (str) -- NMF norm to be minimized

  • params.method (str) -- NMF optimization method

  • params.eps (float) -- Epsilon value

  • params.verbose (bool) -- Flag to enable/disable display results

  • params.save_factors (bool) -- Flag to enable/disable saving computed factors

compute_global_dim()[source]

Computes global dimensions m and n from given chunk sizes for any grid configuration

dist_norm(X, proc=-1, norm='fro', axis=None)[source]

Computes the distributed norm
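
For the Frobenius case, a distributed norm reduces to summing local squared entries and taking one square root at the end. This serial sketch simulates two ranks, with a plain sum standing in for an MPI all-reduce; local_sq_sum is a hypothetical helper, and other norm types or the proc and axis arguments are not covered.

```python
import math


def local_sq_sum(block):
    """Sum of squared entries of one local block."""
    return sum(x * x for row in block for x in row)


blocks = [                      # one block per hypothetical rank
    [[3.0, 0.0]],
    [[0.0, 4.0]],
]
# Summing the local values stands in for an MPI all-reduce.
global_norm = math.sqrt(sum(local_sq_sum(b) for b in blocks))
```

Taking the square root before the reduction would give the wrong result, since local norms do not add.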

fit()[source]

Calls the sub routines to perform distributed NMF decomposition with initialization for a given norm minimization and update method

Returns

  • W_i (ndarray) -- Factor W of shape m/p_r * k

  • H_j (ndarray) -- Factor H of shape k * n/p_c

  • recon_err (float) -- Reconstruction error for NMF decomposition

init_factors()[source]

Initializes Rescal factors with rand/nnsvd method

normalize_features(Wall, Wall1, Hall)[source]

Normalizes features Wall and Hall

relative_err()[source]

Computes the relative error for NMF decomposition

pyDRESCALk.pyDRESCALk module

class pyDRESCALk.pyDRESCALk.pyDRESCALk(X_ijk, factors=None, params=None)[source]

Bases: object

Performs the distributed RESCAL decomposition with custom clustering for estimating hidden factors k

Parameters
  • A_ij (ndarray) -- Distributed Data

  • factors (tuple) -- Distributed factors W and H

  • params (class) -- Class which comprises following attributes

  • params.init (str) -- RESCAL initialization (rand/nnsvd)

  • params.comm1 (object) -- Global Communicator

  • params.comm (object) -- Modified communicator object

  • params.k (int) -- Rank for decomposition

  • params.m (int) -- Global dimensions m

  • params.n (int) -- Global dimensions n

  • params.p_r (int) -- Cartesian grid row count

  • params.p_c (int) -- Cartesian grid column count

  • params.row_comm (object) -- Sub communicator along row

  • params.col_comm (object) -- Sub communicator along columns

  • params.A_update (bool) -- flag to set A update True/False

  • params.norm (str) -- RESCAL norm to be minimized

  • params.method (str) -- RESCAL optimization method

  • params.eps (float) -- Epsilon value

  • params.verbose (bool) -- Flag to enable/disable display results

  • params.save_factors (bool) -- Flag to enable/disable saving computed factors

  • params.perturbations (int) -- Number of Perturbations for clustering

  • params.noise_var (float) -- Set noise variance for perturbing the data

  • params.sill_thr (float) -- Set the silhouette threshold for estimating K with p-test

  • params.start_k (int) -- Starting range for Feature search K

  • params.end_k (int) -- Ending range for Feature search K

fit()[source]

Calls the sub routines to perform distributed RESCAL decomposition and then custom clustering to estimate k

Returns

nopt -- Estimated value of latent features

Return type

int

pyrescalk_per_k()[source]

Performs RESCAL decomposition and clustering for each k to estimate silhouette statistics

class pyDRESCALk.pyDRESCALk.sample(data, noise_var, method, params, seed=None)[source]

Bases: object

Generates perturbed version of data based on sampling distribution.

Parameters
  • data (ndarray, sparse matrix) -- Array of which to find a perturbation.

  • noise_var (float) -- The perturbation amount.

  • method (str) -- Method for sampling (uniform/poisson)

  • seed (float) -- Set seed for random data generation

fit()[source]

Calls the sub routines to perform resampling on data

Returns

X_per -- Perturbed version of data

Return type

ndarray

poisson()[source]

Resamples each element of a matrix from a Poisson distribution with the mean set by that element: Y_{i,j} = Poisson(X_{i,j}).

randM()[source]

Multiplies each element of X by a uniform random number in (1-epsilon, 1+epsilon).
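
A minimal sketch of the uniform perturbation on a nested list, using the standard library only. It assumes that epsilon plays the role of the noise amount and that a seed makes the draw reproducible; perturb_uniform is a hypothetical name and not the package's method.

```python
import random


def perturb_uniform(X, epsilon, seed=None):
    """Multiply every entry of X by an independent draw from U(1 - epsilon, 1 + epsilon)."""
    rng = random.Random(seed)
    return [[x * rng.uniform(1.0 - epsilon, 1.0 + epsilon) for x in row] for row in X]


X = [[1.0, 2.0], [3.0, 4.0]]
X_per = perturb_uniform(X, epsilon=0.03, seed=0)
```

Multiplicative noise keeps non-negative data non-negative and leaves zero entries at zero, which suits non-negative factorizations.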

pyDRESCALk.utils module

class pyDRESCALk.utils.comm_timing[source]

Bases: object

Decorator class for computing timing for MPI operations. The class uses the global variables flag and time initialized in config file and updates them for each call dynamically.

Parameters
  • flag (bool) -- if set to True, enables the decorator to compute the timings.

  • time (dict) -- Dictionary to store the timing of each function call
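
The pattern of a decorator class driven by a global flag and a global dictionary can be sketched as follows. TIMING_ENABLED, TIMINGS and the class timed are hypothetical stand-ins for the flag and time globals of the config module and for comm_timing itself; the real class may record timings differently.

```python
import time
from functools import wraps

TIMING_ENABLED = True   # stands in for the global flag
TIMINGS = {}            # stands in for the global time dictionary


class timed:
    """Decorator class that accumulates wall-clock time per function name."""

    def __call__(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not TIMING_ENABLED:
                return func(*args, **kwargs)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            TIMINGS[func.__name__] = TIMINGS.get(func.__name__, 0.0) + elapsed
            return result
        return wrapper


@timed()
def reduce_sum(values):
    return sum(values)


total = reduce_sum(range(1000))
```

Checking the flag at call time rather than at decoration time lets timing be switched on or off after the decorated functions have been imported.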

class pyDRESCALk.utils.count_flops[source]

Bases: object

class pyDRESCALk.utils.count_memory[source]

Bases: object

class pyDRESCALk.utils.data_operations(data)[source]

Bases: object

Performs various operations on the data

Parameters

data (ndarray) -- Data to operate on

commonFactors(intList)[source]

cutZero(thresh=1e-08)[source]

Prunes zero columns from the data

desampleT(factor, axis=0)[source]

matSplit(name, p_r, p_c, format='npy')[source]

primeFactors(n)[source]

recZero(indexList)[source]

remove_bad_factors(Wall, Hall, ErrTol, features_k)[source]

class pyDRESCALk.utils.determine_block_params(comm, pgrid, shape)[source]

Bases: object

Computes the parameters for each chunk to be read by MPI process

Parameters
  • comm (object) -- MPI communicator object

  • pgrid (tuple) -- Cartesian grid configuration

  • shape (tuple) -- Data shape

determine_block_index_range_asymm()[source]

Determines the start and end indices for the Data block for each rank

determine_block_shape_asymm()[source]

Determines the shape for the Data block for each rank

pyDRESCALk.utils.norm(X, comm, norm=2, axis=None, p=-1)[source]

Compute the data norm

Parameters
  • X (ndarray) -- Data to operate on

  • comm (object) -- MPI communicator object

  • norm (int) -- type of norm to be computed

  • axis (int) -- axis of array for the norm to be computed along

  • p (int) -- Processor count

Returns

norm -- Norm of the given data X

Return type

float

class pyDRESCALk.utils.parse[source]

Bases: object

Define a class parse which is used for adding attributes

pyDRESCALk.utils.str2bool(v)[source]

Converts a string parameter to a bool
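
Converters of this kind are typically passed as the type of a command-line argument. The sketch below shows one plausible behaviour; the accepted spellings are an assumption, and to_bool is a hypothetical name rather than the package's function.

```python
def to_bool(v):
    """Interpret common textual spellings of true/false as a bool."""
    if isinstance(v, bool):
        return v
    text = str(v).strip().lower()
    if text in ("yes", "true", "t", "y", "1"):
        return True
    if text in ("no", "false", "f", "n", "0"):
        return False
    raise ValueError(f"cannot interpret {v!r} as a boolean")


verbose = to_bool("True")
save = to_bool("no")
```

Raising on unknown input, rather than silently returning False, surfaces typos in command-line flags early.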

class pyDRESCALk.utils.transform_H_index(grid)[source]

Bases: object

Collected H factors after MPI operation aren't aligned. This operation performs careful reordering of H factors such that the collected factors are aligned

rankidx2blkidx()[source]

Transforms the column index to the rank index for H

transform_H_idx(rank)[source]

Transforms H based on the new index

pyDRESCALk.utils.var_init(clas, var, default)[source]

Checks if the class attribute is present and, if not, initializes the attribute with the given default value
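
The behaviour described amounts to a hasattr/setattr guard. In the sketch below, Params is an empty container standing in for the package's parse class, and init_attr is a hypothetical helper; whether the real function returns the value is an assumption.

```python
class Params:
    """Empty attribute container, standing in for the package's parse class."""


def init_attr(obj, name, default):
    """Set obj.name to default only when the attribute is missing, then return its value."""
    if not hasattr(obj, name):
        setattr(obj, name, default)
    return getattr(obj, name)


params = Params()
params.k = 4
k = init_attr(params, "k", 2)          # existing value is kept
eps = init_attr(params, "eps", 1e-16)  # missing attribute gets the default
```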

pyDRESCALk.version module

Module contents