Welcome to pyDNMFk's documentation!¶
pyDNMFk is a software package for applying non-negative matrix factorization in a distrubuted fashion to large datasets. It has the ability to minimize the difference between reconstructed data and the original data through various norms (Frobenious, KL-divergence). Additionally, the Custom Clustering algorithm allows for automated determination for the number of Latent features.
Features¶
Utilization of MPI4py for distributed operation.
Distributed NNSVD and SVD initiaizations.
Distributed Custom Clustering algorithm for estimating automated latent feature number (k) determination.
Objective of minimization of KL divergence/Frobenius norm.
Optimization with multiplicative updates, BCD, and HALS.
Scalability¶
pyDNMFk Scales from laptops to clusters. The library is convenient on a laptop. It can be installed easily with conda or pip and extends the matrix decomposition from a single core to numerous cores across nodes. pyDNMFk is efficient and has been tested on powerful servers across LANL and Oakridge scaling beyond 1000+ nodes. This library facilitates the transition between single-machine to large scale cluster so as to enable users to both start simple and scale up when necessary.
Installation¶
git clone https://github.com/lanl/pyDNMFk.git
cd pyDNMFk
conda create --name pyDNMFk python=3.7.1 openmpi mpi4py
source activate pyDNMFk
python setup.py install
Usage Example¶
We provide a sample dataset that can be used for estimation of k:
'''Imports block'''
import sys
import pyDNMFk.config as config
config.init(0)
from pyDNMFk.pyDNMFk import *
from pyDNMFk.data_io import *
from pyDNMFk.dist_comm import *
from scipy.io import loadmat
from mpi4py import MPI
comm = MPI.COMM_WORLD
args = parse()
'''parameters initialization block'''
# Data Read here
args.fpath = 'data/'
args.fname = 'wtsi'
args.ftype = 'mat'
args.precision = np.float32
#Distributed Comm config block
p_r, p_c = 4, 1
#NMF config block
args.norm = 'kl'
args.method = 'mu'
args.init = 'nnsvd'
args.itr = 5000
args.verbose = True
#Cluster config block
args.start_k = 2
args.end_k = 5
args.sill_thr = 0.9
#Data Write
args.results_path = 'results/'
'''Parameters prep block'''
comms = MPI_comm(comm, p_r, p_c)
comm1 = comms.comm
rank = comm.rank
size = comm.size
args.size, args.rank, args.comm, args.p_r, args.p_c = size, rank, comms, p_r, p_c
args.row_comm, args.col_comm, args.comm1 = comms.cart_1d_row(), comms.cart_1d_column(), comm1
A_ij = data_read(args).read().astype(args.precision)
nopt = PyNMFk(A_ij, factors=None, params=args).fit()
print('Estimated k with NMFk is ',nopt)