Welcome to pyDRESCALk's documentation!

pyDRESCALk is a software package for applying non-negative RESCAL decomposition in a distributed fashion to large datasets. It can be utilized for decomposing relational datasets. It can minimize the difference between reconstructed data and the original data through Frobenius norm. Additionally, the Custom Clustering algorithm allows for automated determination for the number of Latent features.

Features

  • Ability to decompose relational datasets.

  • Utilization of MPI4py for distributed operation.

  • Distributed random initializations.

  • Distributed Custom Clustering algorithm for estimating automated latent feature number (k) determination.

  • Objective of minimization of Frobenius norm.

  • Support for distributed CPUs/GPUs.

  • Support for Dense/Sparse data.

  • Demonstrated scaling performance upto 10TB of dense and 9Exabytes of Sparse data.

Scalability

pyDRESCALk Scales from laptops to clusters. The library is convenient on a laptop. It can be installed easily with conda or pip and extends the matrix decomposition from a single core to numerous cores across nodes. pyDRESCALk is efficient and has been tested on powerful servers across LANL and Oakridge scaling beyond 1000+ nodes. This library facilitates the transition between single-machine to large scale cluster so as to enable users to both start simple and scale up when necessary.

Installation

git clone https://github.com/lanl/pyDRESCALk.git
cd pyDRESCALk
conda create --name pyDRESCALk python=3.7.1 openmpi mpi4py
source activate pyDRESCALk
python setup.py install

Usage Example

We provide a sample dataset that can be used for estimation of k:

'''Imports block'''

import sys
import pyDRESCALk.config as config
config.init(0)
from pyDRESCALk.pyDRESCALk import *
from pyDRESCALk.data_io import *
from pyDRESCALk.dist_comm import *
from scipy.io import loadmat
from mpi4py import MPI
comm = MPI.COMM_WORLD
args = parse()
comm = MPI.COMM_WORLD
p_r, p_c = 2, 2
comms = MPI_comm(comm, p_r, p_c)
comm1 = comms.comm
rank = comm.rank
size = comm.size
args = parse()
args.size, args.rank, args.comm, args.p_r, args.p_c = size, rank, comms, p_r, p_c
args.row_comm, args.col_comm, args.comm1 = comms.cart_1d_row(), comms.cart_1d_column(), comm1
rank = comms.rank
args.fpath = '../data/'
args.fname = 'dnations'
args.ftype = 'mat'
args.start_k = 2
args.end_k = 5
args.itr = 200
args.init = 'rand'
args.noise_var = 0.005
args.verbose = True
args.norm = 'fro'
args.method = 'mu'
args.np = np
args.precision = np.float32
args.key = 'R'
A_ij = np.moveaxis(data_read(args).read().astype(args.precision),-1,0) #Always make data of dimension mxnxn.
print('Data dimension for rank=',rank,'=',A_ij.shape)
args.results_path = '../Results/'
pyDRESCALk(A_ij, factors=None, params=args).fit()

Indices and tables

Indices and tables