pyDNMFk package

Submodules

pyDNMFk.config module

pyDNMFk.config.init(arg)[source]

Declares the global variables. Variables declared in this function are shared across the other files and functions that import this module.

pyDNMFk.data_generator module

class pyDNMFk.data_generator.data_generator(args)[source]

Bases: object

Generates synthetic data in a distributed manner, where each MPI process generates its chunk of the data in parallel. The W matrix is generated with a Gaussian distribution, whereas the H matrix is random.

args : class

Class which comprises following attributes

fpath : str

Directory path of file to be stored

p_r : int

Count of row processors in the cartesian grid

p_c : int

Count of column processors in the cartesian grid

m : int

Row dimension of the data

n : int

Column dimension of the data

k : int

Feature count

create_folder_dir(fpath)[source]

Create a folder if it doesn't exist

determine_block_index_range_asymm()[source]

Determines the start and end indices for the Data block for each rank

determine_block_shape_asymm()[source]

Determines the shape for the Data block for each rank

dist_fromfunction(func, shape, pgrid, *args, unravel_index=<function unravel_index>, **kwargs)[source]

Produces X_{i,j} = func(i,j) in a distributed manner, so that each processor holds an array_split section of X according to the grid.
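The idea can be sketched on a single process with NumPy. This is only an illustration, not the pyDNMFk implementation: the helper name `local_block_fromfunction` and the explicit `coords` argument (the grid position of one rank) are assumptions made for the example.

```python
import numpy as np

def local_block_fromfunction(func, shape, pgrid, coords):
    """Build only the block of X[i, j] = func(i, j) owned by grid position `coords`."""
    # Splitting the index ranges mirrors what np.array_split would do to X itself.
    row_idx = np.array_split(np.arange(shape[0]), pgrid[0])[coords[0]]
    col_idx = np.array_split(np.arange(shape[1]), pgrid[1])[coords[1]]
    ii, jj = np.meshgrid(row_idx, col_idx, indexing="ij")
    return func(ii, jj)

shape, pgrid = (5, 4), (2, 2)
func = lambda i, j: 10 * i + j
full = np.fromfunction(func, shape)          # reference: the whole matrix at once
blocks = [[local_block_fromfunction(func, shape, pgrid, (r, c))
           for c in range(pgrid[1])] for r in range(pgrid[0])]
reassembled = np.block(blocks)               # stitch the per-rank blocks back together
```

Each block is computed from its own index ranges only, so no process needs to hold the full matrix.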

fit()[source]

Generates and saves the factors

gauss_matrix_generator(n, k)[source]

Construct a matrix of dimensions n by k where the ith column is a Gaussian kernel corresponding to approximately N(i*n/k, 0.01*n^2)

n : int

The ambient space dimension

k : int

The latent space dimension

W : ndarray

A matrix with Gaussian kernel columns of size n x k.
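A minimal NumPy sketch of such a matrix follows. It assumes zero-based column indices and unnormalized kernels (peak value 1); the library's exact scaling is not stated here, and `gauss_matrix_sketch` is a hypothetical name.

```python
import numpy as np

def gauss_matrix_sketch(n, k):
    """Return an n x k matrix whose i-th column is a Gaussian bump centered at i*n/k."""
    rows = np.arange(n, dtype=float)[:, None]     # row index, shape (n, 1)
    centers = (np.arange(k) * n / k)[None, :]     # column centers, shape (1, k)
    variance = 0.01 * n ** 2                      # variance from N(i*n/k, 0.01*n^2)
    return np.exp(-((rows - centers) ** 2) / (2.0 * variance))

W = gauss_matrix_sketch(100, 4)
```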

generate_factors_data()[source]

Generates the chunk of factors W,H and data X for each MPI process

random_matrix_generator(n, k, seed)[source]

Generator for a random matrix with the given seed

unravel_column()[source]

finds the column rank for 2d grid

unravel_row()[source]

finds the row rank for 2d grid

pyDNMFk.data_generator.parser()[source]

Reads the input arguments from the user and parses the parameters to the data generator module.

pyDNMFk.data_io module

class pyDNMFk.data_io.data_read(args)[source]

Bases: object

Class for reading data.

args : class

Class which comprises following attributes

fpath : str

Directory path of file to be read

pgrid : tuple

Cartesian grid configuration

ftype : str

Type of data to read (mat/npy/csv/folder)

fname : str

Name of the file to read

comm : object

comm object for distributed read

data_partition()[source]

This function divides the input matrix into chunks as specified by grid configuration.

Returns n arrays of shape (nrows_i, ncols_i), where i is the index of each chunk, such that Sum_i^n (nrows_i * ncols_i) = arr.size

If arr is a 2D array, the returned array should look like n subblocks with each subblock preserving the "physical" layout of arr.
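The partitioning can be illustrated with `np.array_split`. This is a sketch under the assumption that blocks are ordered row-major over the grid; `partition_blocks` is a hypothetical helper, not the library method.

```python
import numpy as np

def partition_blocks(arr, p_r, p_c):
    """Split a 2D array into p_r * p_c sub-blocks that keep the physical layout."""
    return [col_block
            for row_block in np.array_split(arr, p_r, axis=0)
            for col_block in np.array_split(row_block, p_c, axis=1)]

arr = np.arange(35).reshape(5, 7)
chunks = partition_blocks(arr, 2, 3)   # uneven sizes are handled by array_split
```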

read()[source]

Data read function

read_dat()[source]

Function for reading the data and splitting it into chunks to be read by each MPI rank

read_file_csv()[source]

CSV data read function

read_file_mat()[source]

mat file read function

read_file_npy()[source]

Numpy data read function

save_data_to_file(fpath)[source]

This function saves the split data to numpy arrays indexed by chunk number

class pyDNMFk.data_io.data_write(args)[source]

Bases: object

Class for writing data/results.

args : class

Class which comprises following attributes

results_path : str

Directory path of file to write

pgrid : tuple

Cartesian grid configuration

ftype : str

Type of data to read (mat/npy/csv/folder)

comm : object

comm object for distributed read

create_folder_dir(fpath)[source]

Create directory if not present

save_cluster_results(params)[source]

Save cluster results to a h5 file with rank 0

save_factors(factors, reg=False)[source]

Save the W and H factors for each MPI process

class pyDNMFk.data_io.read_factors(factors_path, pgrid)[source]

Bases: object

Class for reading saved factors.

factors_path : str

Directory path of factors to read from

pgrid : tuple

Cartesian grid configuration

custom_read_npy(fpath)[source]

Read numpy files

load_factors()[source]

Load the final stacked factors for visualization

read_factor(fpath)[source]

Read factors as chunks and stack them

class pyDNMFk.data_io.split_files_save(data, pgrid, fpath)[source]

Bases: object

Rank 0 based data read, split and save

save_data_to_file()[source]

Function to save the chunks into numpy files

split_files()[source]

Compute the index range for each block and partition the data as per the chunk

pyDNMFk.dist_clustering module

class pyDNMFk.dist_clustering.custom_clustering(Wall, Hall, params)[source]

Bases: object

Greedy algorithm to approximate a quadratic assignment problem to cluster vectors. Given p groups of k vectors, construct k clusters, each cluster containing a single vector from each of the p groups. This clustering approximation uses cos distances and mean centroids.

W_all : ndarray

Order three tensor of shape m by k by p, where m is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.

H_all : ndarray

Order three tensor of shape n by k by p, where n is the ambient dimension of the vectors, k is the number of vectors in each group, and p is the number of groups of vectors.

params : class

Class object with communication parameters, comprising the grid information (p_r, p_c), the communicator (comm) and epsilon (eps).

change_order(tens)[source]

change the order of features

dist_custom_clustering(centroids=None, vb=0)[source]

Performs the distributed custom clustering

centroids : ndarray, optional

The m by k initialization of the centroids of the clusters. None corresponds to using the first slice, W_all[:,:,0], as the initial centroids. Defaults to None.

vb : bool, optional

Verbose to display intermediate results

centroids : ndarray

The m by k centroids of the clusters

W_all : ndarray

Clustered organization of the vectors W_all

H_all : ndarray

Clustered organization of the vectors H_all

permute_order : list

Indices of the permuted features

dist_feature_ordering(centroids, W_sub)[source]

return the features in proper order

dist_silhouettes()[source]

Computes the cosine distances silhouettes of a distributed clustering of vectors.

sils : ndarray

The k by p array of silhouettes where sils[i,j] is the silhouette measure for the vector W_all[:,i,j]

fit()[source]

Calls the sub routines to perform distributed custom clustering and compute silhouettes

centroids : ndarray

The m by k centroids of the clusters

CentStd : ndarray

Absolute deviation of the features from the centroid

W_all : ndarray

Clustered organization of the vectors W_all

H_all : ndarray

Clustered organization of the vectors H_all

S_avg : ndarray

Mean Silhouette score

permute_order : list

Indices of the permuted features

greedy_lsa(A)[source]

Return the permutation order
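A greedy assignment of this kind can be sketched as follows. The example assumes a square similarity matrix where larger entries mean better matches; whether the library works with similarities or distances is not stated here, and `greedy_assignment` is a hypothetical name.

```python
import numpy as np

def greedy_assignment(similarity):
    """Greedily match each row to a distinct column, best remaining score first."""
    scores = np.array(similarity, dtype=float)      # work on a copy
    order = np.empty(scores.shape[0], dtype=int)
    for _ in range(scores.shape[0]):
        row, col = np.unravel_index(np.argmax(scores), scores.shape)
        order[row] = col
        scores[row, :] = -np.inf                    # this row is now matched
        scores[:, col] = -np.inf                    # this column is now taken
    return order

sim = np.array([[0.1, 0.9, 0.3],
                [0.8, 0.7, 0.2],
                [0.4, 0.6, 0.5]])
perm = greedy_assignment(sim)                       # perm[i] is the column matched to row i
```

The greedy choice is cheap but not guaranteed to be the optimal assignment, which is why the class description calls it an approximation.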

mad(data, flag=1, axis=- 1)[source]

Compute the median/mean absolute deviation
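Both variants can be sketched in a few lines of NumPy. The boolean `use_median` below is an assumption for the example; how the library's `flag` argument selects between median and mean is not documented here.

```python
import numpy as np

def absolute_deviation(data, use_median=True, axis=-1):
    """Median (or mean) of the absolute deviations from the median (or mean)."""
    center = np.median if use_median else np.mean
    deviation = np.abs(data - center(data, axis=axis, keepdims=True))
    return center(deviation, axis=axis)

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
mad_median = absolute_deviation(data, use_median=True)
mad_mean = absolute_deviation(data, use_median=False)
```

The median variant is far less sensitive to the outlier in this sample than the mean variant.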

normalize_by_W()[source]

Normalize the factors W and H
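One common way to normalize a factor pair is to rescale each column of W and apply the inverse scale to the matching row of H, so that the product W H is unchanged. The sketch below assumes column sums as the scale; the norm actually used by the library is not stated here, and the function name is hypothetical.

```python
import numpy as np

def normalize_by_W_sketch(W, H, eps=1e-16):
    """Scale the columns of W to unit sum and move the scale into the rows of H."""
    scale = W.sum(axis=0) + eps          # one scale per feature, shape (k,)
    return W / scale, H * scale[:, None]

rng = np.random.default_rng(4)
W, H = rng.random((6, 3)), rng.random((3, 5))
W_n, H_n = normalize_by_W_sketch(W, H)
```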

pyDNMFk.dist_comm module

class pyDNMFk.dist_comm.MPI_comm(comm, p_r, p_c)[source]

Bases: object

Initialization of MPI communicator to construct the cartesian topology and sub communicators

comm : object

MPI communicator object

p_r : int

Row processors count

p_c : int

Column processors count

Free()[source]

Frees the sub communicators

cart_1d_column()[source]

Constructs a cartesian column communicator through construction of a sub communicator across columns

cartesian1d_column : object

Sub Communicator object

cart_1d_row()[source]

Constructs a cartesian row communicator through construction of a sub communicator across rows

cartesian1d_row : object

Sub Communicator object

pyDNMFk.dist_nmf module

class pyDNMFk.dist_nmf.nmf_algorithms_1D(A_ij, W_i, H_j, params=None)[source]

Bases: object

Performs the distributed NMF operation along 1D cartesian grid

A_ij : ndarray

Distributed Data

W_i : ndarray

Distributed factor W

H_j : ndarray

Distributed factor H

params : class

Class which comprises following attributes

params.comm1 : object

Global Communicator

params.k : int

Rank for decomposition

params.m : int

Global dimensions m

params.n : int

Global dimensions n

params.p_r : int

Cartesian grid row count

params.p_c : int

Cartesian grid column count

params.W_update : bool

Flag to set W update True/False

params.norm : str

NMF norm to be minimized

params.method : str

NMF optimization method

params.eps : float

Epsilon value

FRO_BCD_update(W_update=True, itr=1000)[source]

Frobenius norm minimization based BCD update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

W_update : bool

Flag to enable/disable W update

self.W_i : ndarray (m/p_r X k)

self.H_j : ndarray (k X n/p_c)

FRO_HALS_update(W_update=True)[source]

Frobenius norm minimization based HALS update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

W_update : bool

Flag to enable/disable W_update

self.H_j : ndarray (k X n/p_c)

self.W_i : ndarray (m/p_r X k)
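For orientation, a serial (single-process) HALS sweep can be sketched as below. This is the textbook update, not the distributed code of this class; the function name, the clamp at `eps`, and the order of the H and W sweeps are assumptions.

```python
import numpy as np

def hals_step(A, W, H, eps=1e-16):
    """One HALS sweep for min ||A - W H||_F with W, H >= 0 (updates W and H in place)."""
    WtA, WtW = W.T @ A, W.T @ W
    for j in range(H.shape[0]):          # update one row of H at a time
        H[j, :] = np.maximum(eps, H[j, :] + (WtA[j, :] - WtW[j, :] @ H) / (WtW[j, j] + eps))
    AHt, HHt = A @ H.T, H @ H.T
    for j in range(W.shape[1]):          # update one column of W at a time
        W[:, j] = np.maximum(eps, W[:, j] + (AHt[:, j] - W @ HHt[:, j]) / (HHt[j, j] + eps))
    return W, H

rng = np.random.default_rng(0)
A = rng.random((20, 15))
W, H = rng.random((20, 3)), rng.random((3, 15))
err_before = np.linalg.norm(A - W @ H)
for _ in range(30):
    W, H = hals_step(A, W, H)
err_after = np.linalg.norm(A - W @ H)
```

In a distributed setting the products W^T A, W^T W, A H^T and H H^T are the quantities that would need global reductions.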

FRO_HALS_update_H()[source]

Frobenius norm minimization based HALS update of the H parameter. Computes the updated H parameter for each MPI rank.

self : object

self.H_j : ndarray ( k X n/p_c)

FRO_HALS_update_W()[source]

Frobenius norm minimization based HALS update of the W parameter. Computes the updated W parameter for each MPI rank.

self : object

self.W_i : ndarray (m/p_r X k)

Fro_MU_update(W_update=True)[source]

Frobenius norm based multiplicative update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

self : object

self.H_ij : ndarray

self.W_ij : ndarray
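The serial form of the Frobenius multiplicative update is sketched below for reference. It is the standard Lee-Seung rule on a single process; the function name, the `eps` guard in the denominators, and updating H before W are assumptions.

```python
import numpy as np

def fro_mu_step(A, W, H, eps=1e-16, W_update=True):
    """One multiplicative-update step for min ||A - W H||_F with W, H >= 0."""
    H = H * (W.T @ A) / (W.T @ W @ H + eps)
    if W_update:
        W = W * (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

rng = np.random.default_rng(0)
A = rng.random((20, 15))
W, H = rng.random((20, 3)), rng.random((3, 15))
err_before = np.linalg.norm(A - W @ H)
for _ in range(50):
    W, H = fro_mu_step(A, W, H)
err_after = np.linalg.norm(A - W @ H)
```

Because every factor in the update is non-negative, W and H stay non-negative without any explicit projection.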

Fro_MU_update_H()[source]

Frobenius norm based multiplicative update of the H parameter. Computes the updated H parameter for each MPI rank.

self : object

self.H_j : ndarray

Fro_MU_update_W()[source]

Frobenius norm based multiplicative update of the W parameter. Computes the updated W parameter for each MPI rank.

self : object

self.W_i : ndarray

KL_MU_update(W_update=True)[source]

KL divergence based multiplicative update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

W_update : bool

Flag to enable/disable W_update

self.H_j : ndarray (k X n/p_c)

self.W_i : ndarray (m/p_r X k)
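A serial sketch of the KL multiplicative update is given below. It follows the standard Lee-Seung rule, with U = A / (W H) as an assumed name for the element-wise ratio; the helper names and the `eps` guards are also assumptions, not the library's code.

```python
import numpy as np

def kl_divergence(A, WH, eps=1e-16):
    """Generalized KL divergence D(A || WH)."""
    return np.sum(A * np.log((A + eps) / (WH + eps)) - A + WH)

def kl_mu_step(A, W, H, eps=1e-16, W_update=True):
    """One multiplicative-update step for the KL objective with W, H >= 0."""
    U = A / (W @ H + eps)                                   # element-wise ratio
    H = H * (W.T @ U) / (W.sum(axis=0)[:, None] + eps)      # divide by column sums of W
    if W_update:
        U = A / (W @ H + eps)
        W = W * (U @ H.T) / (H.sum(axis=1)[None, :] + eps)  # divide by row sums of H
    return W, H

rng = np.random.default_rng(0)
A = rng.random((20, 15)) + 0.1
W, H = rng.random((20, 3)) + 0.1, rng.random((3, 15)) + 0.1
kl_before = kl_divergence(A, W @ H)
for _ in range(50):
    W, H = kl_mu_step(A, W, H)
kl_after = kl_divergence(A, W @ H)
```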

KL_MU_update_H()[source]

KL divergence based multiplicative update of the H parameter. Computes the updated H parameter for each MPI rank.

self : object

self.H_j : ndarray

Distributed factor H of shape k X n/p_c

KL_MU_update_W()[source]

KL divergence based multiplicative update of the W parameter. Computes the updated W parameter for each MPI rank.

self : object

self.W_i : ndarray

Distributed factor W of shape m/p_r X k

glob_UX(axis)[source]

Perform a global operation UX for W and H update with KL

globalSqNorm(X, p=- 1)[source]

Calc global squared norm of any matrix

global_gram(A, p=1)[source]

Distributed gram computation

Computes the global Gram matrix of A, A^T A.

A : ndarray

p : int

Processor count

A_TA_glob : ndarray
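When A is split into row blocks, the global Gram matrix is the sum of the local ones, which is why a single sum-reduction suffices. The sketch below simulates four ranks on one process with plain NumPy; the reduction stands in for an MPI sum-allreduce and no MPI call is made.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((12, 3))
local_blocks = np.array_split(A, 4, axis=0)           # stand-in for 4 ranks holding row blocks
local_grams = [A_i.T @ A_i for A_i in local_blocks]   # each rank's k x k contribution
A_TA_glob = np.sum(local_grams, axis=0)               # plays the role of a sum-allreduce
```

Only k x k matrices are exchanged, independent of the number of rows held by each rank.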

global_mm(A, B, p=- 1)[source]

Distributed matrix multiplication

Computes the global matrix product AB of matrices A and B.

A : ndarray

B : ndarray

p : int

Processor count

AB_glob : ndarray

initWandH()[source]

Initialize the parameters for BCD updates

sum_along_axis(X, p=1, axis=0)[source]

Performs sum of the matrix along given axis

X : ndarray

Data

p : int

Processor count

axis : int

Axis along which the sum is to be performed

global_axis_sum : ndarray

Vector array after summation operation along axis

update()[source]

Performs 1 step Update for factors W and H based on NMF method and corresponding norm minimization

W_i : ndarray

The m/p_r X k distributed factor W

H_j : ndarray

The k X n/p_c distributed factor H

class pyDNMFk.dist_nmf.nmf_algorithms_2D(A_ij, W_ij, H_ij, params=None)[source]

Bases: object

Performs the distributed NMF operation along 2D cartesian grid

A_ij : ndarray

Distributed Data

W_ij : ndarray

Distributed factor W

H_ij : ndarray

Distributed factor H

params : class

Class which comprises following attributes

params.comm1 : object

Global Communicator

params.comm : object

Modified communicator object

params.k : int

Rank for decomposition

params.m : int

Global dimensions m

params.n : int

Global dimensions n

params.p_r : int

Cartesian grid row count

params.p_c : int

Cartesian grid column count

params.row_comm : object

Sub communicator along row

params.col_comm : object

Sub communicator along columns

params.W_update : bool

Flag to set W update True/False

params.norm : str

NMF norm to be minimized

params.method : str

NMF optimization method

params.eps : float

Epsilon value

AH_glob(H_ij=None)[source]

Distributed computation of AH^T

Computes the global matrix product AH^T of matrices A and H.

A : ndarray

H : ndarray

AH : ndarray

ATW_glob()[source]

Distributed computation of W^TA

Computes the global matrix product W^TA of matrices W and A.

W : ndarray

A : ndarray

Atw : ndarray

FRO_BCD_update(W_update=True, itr=1000)[source]

Frobenius norm minimization based BCD update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

self : object

self.W_ij : ndarray (m/p X k)

self.H_ij : ndarray (k X n/p)

FRO_HALS_update(W_update=True)[source]

Frobenius norm minimization based HALS update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

self : object

self.W_ij : ndarray (m/p X k)

self.H_ij : ndarray (k X n/p)

FRO_HALS_update_H()[source]

Frobenius norm minimization based HALS update of the H parameter. Computes the updated H parameter for each MPI rank.

self : object

self.H_ij : ndarray ( k X n/p)

FRO_HALS_update_W()[source]

Frobenius norm minimization based HALS update of the W parameter. Computes the updated W parameter for each MPI rank.

self : object

self.W_ij : ndarray (m/p X k)

Fro_MU_update(W_update=True)[source]

Frobenius norm based multiplicative update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

self : object

self.H_ij : ndarray

self.W_ij : ndarray

Fro_MU_update_H()[source]

Frobenius norm based multiplicative update of the H parameter. Computes the updated H parameter for each MPI rank.

self : object

self.H_ij : ndarray

Fro_MU_update_W()[source]

Frobenius norm based multiplicative update of the W parameter. Computes the updated W parameter for each MPI rank.

self : object

self.W_ij : ndarray

KL_MU_update(W_update=True)[source]

KL divergence based multiplicative update of the W and H parameters. Computes the updated W and H parameters for each MPI rank.

self : object

self.H_ij : ndarray (k X n/p)

self.W_ij : ndarray (m/p X k)

KL_MU_update_H()[source]

KL divergence based multiplicative update of the H parameter. Computes the updated H parameter for each MPI rank.

self : object

self.H_ij : ndarray

Distributed factor H of shape k X n/p

KL_MU_update_W()[source]

KL divergence based multiplicative update of the W parameter. Computes the updated W parameter for each MPI rank.

self : object

self.W_ij : ndarray

Distributed factor W of shape m/p X k

UHT_glob()[source]

Distributed computation of UH^T

Computes the global matrix product UH^T of matrices U and H for KL.

W : ndarray

H : ndarray

A : ndarray

UHT : ndarray

WTU_glob()[source]

Distributed computation of W^TU

Computes the global matrix product W^TU of matrices W and U for KL.

W : ndarray

H : ndarray

A : ndarray

WTU : ndarray

gather_W_H(gW=True, gH=True)[source]

Gathers the W and H factors across cartesian groups, i.e. H_ij -> H_j if gH=True and W_ij -> W_i if gW=True

gW : bool

gH : bool

self.H_j : ndarray

self.W_i : ndarray

globalSqNorm(comm, X)[source]

Calc global squared norm of any matrix

global_gram(A)[source]

Distributed gram computation

Computes the global Gram matrix of A, A^T A.

A : ndarray

A_TA_glob : ndarray

global_mm(A, B)[source]

Distributed matrix multiplication

Computes the global matrix product AB of matrices A and B.

A : ndarray

B : ndarray

AB_glob : ndarray

initWandH()[source]

Initialize the parameters for BCD updates

sum_axis(dat, axis)[source]

update()[source]

Performs 1 step Update for factors W and H based on NMF method and corresponding norm minimization

W_ij : ndarray

The m/p X k distributed factor W

H_ij : ndarray

The k X n/p distributed factor H

pyDNMFk.dist_svd module

class pyDNMFk.dist_svd.DistSVD(args, A)[source]

Bases: object

Distributed Computation of SVD along 1D distribution of the data. Only U or V is distributed based on data size.

A : ndarray

Distributed Data

args : class

Class which comprises following attributes

args.globalm : int

Global row dimensions of A

args.globaln : int

Global column dimension of A

args.k : int (optional)

Rank for decomposition

args.p_r : int

Cartesian grid row count

args.p_c : int

Cartesian grid column count

args.seed : int (optional)

Set the random seed

args.comm : object

comm object for distributed read

args.eps : float

Epsilon value

calc_norm(vec)[source]

Compute the norm of vector

globalGram(X, Y)[source]

Compute the global gram between X and Y

nnsvd(flag=1, verbose=1)[source]

Computes the distributed Non-Negative SVD(NNSVD) components from the computed SVD factors.

flag : bool, optional

Computes nnSVD factors with different configurations

verbose : bool, optional

Controls the returned errors. If True, returns the SVD and NNSVD reconstruction errors.

W : ndarray

Non-negative factor W of shape (m/p_r,k)

H : ndarray

Non-negative factor H of shape (k,n/p_c)

error : dictionary (optional)

Dictionary of reconstruction errors for svd and nnsvd

normalize_by_W(Wall, Hall, comm1)[source]

Normalize the factors W and H

randomUnitVector(d)[source]

Construct a random unit vector

rel_error(U, S, V)[source]

Computes the relative error between the reconstructed data with factors vs original data

svd()[source]

Computes the SVD for a given matrix

singularValues : list

List of singular values of length k

Us : ndarray

Factor Us of shape (m/p_r,k)

Vs : ndarray

Factor Vs of shape (k,n/p_c)

svd1D()[source]

One dimensional SVD

pyDNMFk.plot_results module

pyDNMFk.plot_results.box_plot(dat, respath)[source]

Plots the boxplot from the given data and saves the results

pyDNMFk.plot_results.plot_W(W)[source]

Reads a factor and plots into subplots for each component

pyDNMFk.plot_results.plot_err(err)[source]

Plots the relative error for NMF decomposition as a function of number of iterations

pyDNMFk.plot_results.plot_results(startProcess, endProcess, RECON, RECON1, SILL_MIN, out_put, name)[source]

Plots the relative error and Silhouette results for estimation of k

pyDNMFk.plot_results.plot_timing_stats(fpath, respath)[source]

Plots the timing stats for the MPI operation.

fpath: Stats data path

respath: Path to save graph

pyDNMFk.plot_results.read_plot_factors(factors_path, pgrid)[source]

Reads the factors W and H and Plots them

pyDNMFk.plot_results.timing_stats(fpath)[source]

Reads the timing stats dictionary from the stored file and parses the data.

pyDNMFk.pyDNMF module

class pyDNMFk.pyDNMF.PyNMF(A_ij, factors=None, save_factors=False, params=None)[source]

Bases: object

Performs the distributed NMF decomposition of given matrix X into factors W and H

A_ij : ndarray

Distributed Data

factors : tuple (optional)

Distributed factors W and H

params : class

Class which comprises following attributes

params.init : str

NMF initialization (rand/nnsvd)

params.comm1 : object

Global Communicator

params.comm : object

Modified communicator object

params.k : int

Rank for decomposition

params.m : int

Global dimensions m

params.n : int

Global dimensions n

params.p_r : int

Cartesian grid row count

params.p_c : int

Cartesian grid column count

params.row_comm : object

Sub communicator along row

params.col_comm : object

Sub communicator along columns

params.W_update : bool

Flag to set W update True/False

params.norm : str

NMF norm to be minimized

params.method : str

NMF optimization method

params.eps : float

Epsilon value

params.verbose : bool

Flag to enable/disable display results

params.save_factors : bool

Flag to enable/disable saving computed factors

cart_2d_collect_factors()[source]

Collects factors along each sub communicators

column_err()[source]

Computes the distributed column wise norm

compute_global_dim()[source]

Computes global dimensions m and n from given chunk sizes for any grid configuration

dist_norm(X, proc=- 1, norm='fro', axis=None)[source]

Computes the distributed norm

fit()[source]

Calls the sub routines to perform distributed NMF decomposition with initialization for a given norm minimization and update method

W_i : ndarray

Factor W of shape m/p_r * k

H_j : ndarray

Factor H of shape k * n/p_c

recon_err : float

Reconstruction error for NMF decomposition

init_factors()[source]

Initializes NMF factors with rand/nnsvd method

normalize_features(Wall, Hall)[source]

Normalizes features Wall and Hall

relative_err()[source]

Computes the relative error for NMF decomposition
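A relative error of the form ||A - WH||_F / ||A||_F can be accumulated block by block, since squared Frobenius norms add across disjoint blocks. The sketch below simulates a 2 x 3 grid on one process; the exact error definition used by the library is an assumption here, and the sums stand in for MPI reductions.

```python
import numpy as np

rng = np.random.default_rng(2)
W, H = rng.random((8, 2)), rng.random((2, 6))
A = W @ H + 0.01 * rng.random((8, 6))

p_r, p_c = 2, 3
W_blocks = np.array_split(W, p_r, axis=0)             # W_i, one per grid row
H_blocks = np.array_split(H, p_c, axis=1)             # H_j, one per grid column
A_blocks = [np.array_split(rows, p_c, axis=1) for rows in np.array_split(A, p_r, axis=0)]

# Each "rank" (i, j) contributes its local squared residual and squared data norm.
res_sq = sum(np.sum((A_blocks[i][j] - W_blocks[i] @ H_blocks[j]) ** 2)
             for i in range(p_r) for j in range(p_c))
ref_sq = sum(np.sum(A_blocks[i][j] ** 2) for i in range(p_r) for j in range(p_c))
relative_err = np.sqrt(res_sq / ref_sq)
```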

pyDNMFk.pyDNMFk module

class pyDNMFk.pyDNMFk.PyNMFk(A_ij, factors=None, params=None)[source]

Bases: object

Performs the distributed NMF decomposition with custom clustering for estimating hidden factors k

A_ij : ndarray

Distributed Data

factors : tuple (optional)

Distributed factors W and H

params : class

Class which comprises following attributes

params.init : str

NMF initialization (rand/nnsvd)

params.comm1 : object

Global Communicator

params.comm : object

Modified communicator object

params.k : int

Rank for decomposition

params.m : int

Global dimensions m

params.n : int

Global dimensions n

params.p_r : int

Cartesian grid row count

params.p_c : int

Cartesian grid column count

params.row_comm : object

Sub communicator along row

params.col_comm : object

Sub communicator along columns

params.W_update : bool

Flag to set W update True/False

params.norm : str

NMF norm to be minimized

params.method : str

NMF optimization method

params.eps : float

Epsilon value

params.verbose : bool

Flag to enable/disable display results

params.save_factors : bool

Flag to enable/disable saving computed factors

params.perturbations : int

Number of perturbations for clustering

params.noise_var : float

Set noise variance for perturbing the data

params.sill_thr : float

Set the silhouette threshold for estimating K with p-test

params.start_k : int

Starting range for Feature search K

params.end_k : int

Ending range for Feature search K

fit()[source]

Calls the sub routines to perform distributed NMF decomposition and then custom clustering to estimate k

nopt : int

Estimated value of latent features

pvalueAnalysis(errRegres, SILL_MIN)[source]

Calculates nopt by analysing the errors distributions

errRegres : array

Array for storing the distributions of errors

SILL_MIN : float

Minimum of silhouette score

pynmfk_per_k()[source]

Performs NMF decomposition and clustering for each k to estimate silhouette statistics

class pyDNMFk.pyDNMFk.sample(data, noise_var, method, seed=None)[source]

Bases: object

Generates perturbed version of data based on sampling distribution.

data : ndarray

Array of which to find a perturbation.

noise_var : float

The perturbation amount.

method : str

Method for sampling (uniform/poisson)

seed : float (optional)

Set seed for random data generation

fit()[source]

Calls the sub routines to perform resampling on data

X_per : ndarray

Perturbed version of data

poisson()[source]

Resamples each element of a matrix from a Poisson distribution with the mean set by that element. Y_{i,j} = Poisson(X_{i,j})

randM()[source]

Multiplies each element of X by a uniform random number in (1-epsilon, 1+epsilon).
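The two perturbation schemes described for this class can be sketched with NumPy's random generator. The function names are hypothetical, and treating the perturbation amount as the `epsilon` of the uniform interval is an assumption.

```python
import numpy as np

def perturb_uniform(X, epsilon, rng):
    """Multiply each element by a uniform random factor in [1 - epsilon, 1 + epsilon)."""
    return X * rng.uniform(1.0 - epsilon, 1.0 + epsilon, size=X.shape)

def perturb_poisson(X, rng):
    """Resample each element from a Poisson distribution whose mean is that element."""
    return rng.poisson(X).astype(X.dtype)

rng = np.random.default_rng(3)
X = np.full((4, 5), 10.0)
X_uniform = perturb_uniform(X, 0.03, rng)
X_poisson = perturb_poisson(X, rng)
```

The uniform scheme keeps values close to the original data, while the Poisson scheme returns integer counts and therefore only makes sense for non-negative data.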

pyDNMFk.utils module

class pyDNMFk.utils.comm_timing[source]

Bases: object

Decorator class for computing timing for MPI operations. The class uses the global variables flag and time initialized in config file and updates them for each call dynamically.

flag: bool

If set to True, enables the decorator to compute the timings.

time: dict

Dictionary to store the timing of each function call
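A decorator class of this kind can be sketched as follows. The module-level names `timing_flag` and `timing_store` are hypothetical stand-ins for the shared flag and time dictionary; the real class reads them from the config module.

```python
import time
from functools import wraps

timing_flag = True      # hypothetical stand-in for the shared flag
timing_store = {}       # hypothetical stand-in for the shared time dictionary

class timing_sketch:
    """Decorator class that accumulates wall-clock time per decorated function."""

    def __call__(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not timing_flag:                      # timing disabled: plain call
                return func(*args, **kwargs)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            timing_store[func.__name__] = timing_store.get(func.__name__, 0.0) + elapsed
            return result
        return wrapper

@timing_sketch()
def add(a, b):
    return a + b

total = add(2, 3) + add(1, 1)   # both calls are accumulated under the key "add"
```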

class pyDNMFk.utils.data_operations(data)[source]

Bases: object

Performs various operations on the data

data : ndarray

Data to operate on

commonFactors(intList)[source]

cutZero(thresh=1e-08)[source]

Prunes zero columns from the data

desampleT(factor, axis=0)[source]

matSplit(name, p_r, p_c, format='npy')[source]

primeFactors(n)[source]

recZero(indexList)[source]

remove_bad_factors(Wall, Hall, ErrTol, features_k)[source]

class pyDNMFk.utils.determine_block_params(comm, pgrid, shape)[source]

Bases: object

Computes the parameters for each chunk to be read by MPI process

comm : object

MPI communicator object

pgrid : tuple

Cartesian grid configuration

shape : tuple

Data shape

determine_block_index_range_asymm()[source]

Determines the start and end indices for the Data block for each rank

determine_block_shape_asymm()[source]

Determines the shape for the Data block for each rank
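The index arithmetic for uneven ("asymmetric") blocks can be sketched without MPI. The example assumes a row-major mapping from rank to grid position and end-exclusive ranges; the library's actual rank ordering is not stated here, and `block_index_range` is a hypothetical helper.

```python
def block_index_range(rank, pgrid, shape):
    """Return ((row_start, row_end), (col_start, col_end)) for one rank, end-exclusive."""
    p_r, p_c = pgrid
    row_rank, col_rank = divmod(rank, p_c)   # assumed row-major rank ordering

    def span(length, parts, idx):
        base, extra = divmod(length, parts)  # the first `extra` parts get one more element
        start = idx * base + min(idx, extra)
        return start, start + base + (1 if idx < extra else 0)

    return span(shape[0], p_r, row_rank), span(shape[1], p_c, col_rank)

ranges = [block_index_range(r, (2, 3), (5, 7)) for r in range(6)]
```

The block shape for a rank follows directly as the difference between the end and start of each range.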

pyDNMFk.utils.norm(X, comm, norm=2, axis=None, p=- 1)[source]

Compute the data norm

X : ndarray

Data to operate on

comm : object

MPI communicator object

norm : int

Type of norm to be computed

axis : int

Axis of array for the norm to be computed along

p : int

Processor count

norm : float

Norm of the given data X

class pyDNMFk.utils.parse[source]

Bases: object

Define a class parse which is used for adding attributes

pyDNMFk.utils.str2bool(v)[source]

Converts a string parameter to the corresponding bool value
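A helper like this is typically used as an `argparse` type so that boolean options can be passed as text. The sketch below is an assumption about the accepted spellings, not the library's code; the `--verbose` option is only an illustrative name.

```python
import argparse

def str2bool_sketch(v):
    """Map common textual spellings of true/false to a bool."""
    if isinstance(v, bool):
        return v
    text = str(v).strip().lower()
    if text in ("yes", "true", "t", "y", "1"):
        return True
    if text in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {v!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--verbose", type=str2bool_sketch, default=False)  # illustrative option
args = parser.parse_args(["--verbose", "True"])
```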

class pyDNMFk.utils.transform_H_index(grid)[source]

Bases: object

Collected H factors after MPI operation aren't aligned. This operation performs careful reordering of H factors such that the collected factors are aligned

rankidx2blkidx()[source]

This is to transform the column index to rank index for H

transform_H_idx(rank)[source]

This is to transform H based on new index

pyDNMFk.utils.var_init(clas, var, default)[source]

Checks if the class attribute is present and, if not, initializes the attribute with the given default value
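The behavior described can be sketched with `hasattr` and `setattr`. The function name and the use of `SimpleNamespace` as the attribute container are assumptions for the example.

```python
from types import SimpleNamespace

def var_init_sketch(obj, name, default):
    """Set obj.<name> to `default` only when the attribute is missing, then return it."""
    if not hasattr(obj, name):
        setattr(obj, name, default)
    return getattr(obj, name)

params = SimpleNamespace(k=4)                 # stand-in for a parameter object
k = var_init_sketch(params, "k", 10)          # already present, value is kept
eps = var_init_sketch(params, "eps", 1e-8)    # missing, default is stored
```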

Module contents