TELF.factorization.HNMFk: Hierarchical Non-negative Matrix Factorization with Automatic Model Determination

Hierarchical non-negative matrix factorization with automatic model determination and custom settings, including missing value prediction. HNMFk has HPC and

Available Functions

`HNMFk.__init__`([nmfk_params, cluster_on, ...])	HNMFk is a Hierarchical Non-negative Matrix Factorization module with the capability to do automatic model determination.
`HNMFk.fit`(X, Ks[, from_checkpoint, ...])	Factorize the input matrix `X` for each given K value in `Ks`.
`HNMFk.traverse_nodes`()	Graph iterator.
`HNMFk.go_to_root`()	Graph iterator.
`HNMFk.get_node`()	Graph iterator.
`HNMFk.go_to_parent`()	Graph iterator.
`HNMFk.go_to_children`(idx)	Graph iterator.
`HNMFk.traverse_tiny_leaf_topics`([threshold])	Graph iterator with thresholding on number of documents.
`HNMFk.process_tiny_leaf_topics`([threshold])	Graph post-processing with thresholding on number of documents.
`HNMFk.get_tiny_leaf_topics`()	Graph iterator for tiny documents if processed already with self.process_tiny_leaf_topics(threshold:int).

Module Contents

job-schedule: 200, job-complete: 300, signal-exit: 400

class TELF.factorization.HNMFk.HNMFk(nmfk_params=[{}], cluster_on='H', depth=1, sample_thresh=-1, Ks_deep_min=1, Ks_deep_max=None, Ks_deep_step=1, K2=False, experiment_name='HNMFk_Output', generate_X_callback=None, n_nodes=1, verbose=True, comm_buff_size=10000000, random_identifiers=False, root_node_name='Root')

Bases: object

HNMFk is a Hierarchical Non-negative Matrix Factorization module with the capability to do automatic model determination.

Parameters:

nmfk_params (list of dicts, optional) –
NMFk parameters can be specified separately for each depth, or shared across all depths.

If nmfk_params contains a single item, HNMFk uses the same NMFk parameters for all depths.

To set parameters for each depth, append one dictionary per depth to the list. For example, [nmfk_params0, nmfk_params1, nmfk_params2] for a depth of 2. The default is [{}], which uses the NMFk defaults together with the required settings params["collect_output"] = False, params["save_output"] = True, and params["predict_k"] = True when K2=False.
cluster_on (str, optional) – Where to perform clustering, either W or H. If W, rows of X should be samples; if H, columns of X should be samples. The default is “H”.
depth (int, optional) – How deep to go in each topic after the root node. If -1, it goes until samples cannot be separated further. The default is 1.
sample_thresh (int, optional) – Stopping criterion for the number of samples in a cluster. When -1, this criterion is not used. The default is -1.
Ks_deep_min (int, optional) – After first nmfk, when selecting Ks search range, minimum k to start. The default is 1.
Ks_deep_max (int, optional) –
After first nmfk, when selecting Ks search range, maximum k to try.

When None, maximum k will be same as k selected for parent node.

The default is None.
Ks_deep_step (int, optional) – After first nmfk, when selecting Ks search range, k step size. The default is 1.
K2 (bool, optional) – If K2=True, decomposition is done only for k=2 instead of finding and predicting the number of stable latent features. The default is False.
experiment_name (str, optional) – Where to save the results.
generate_X_callback (object, optional) –
This can be used to re-generate the data matrix X before each NMFk operation. When not used, a slice of the original X is taken, which is equivalent to serial decomposition.

The generate_X_callback object should be an instance of a class that defines __call__(original_indices), so that new_X, save_at_node = generate_X_callback(original_indices) can be called.

The original_indices argument holds the indices of the samples (columns of the original X when clustering on H).

Here save_at_node is a dictionary that can be used to save additional information in each node’s user_node_data variable. The default is None.
n_nodes (int, optional) – Number of HPC nodes. The default is 1.
verbose (bool, optional) – If True, it prints progress. The default is True.
random_identifiers (bool, optional) – If True, the model uses randomly generated strings as the identifiers of the nodes. Otherwise, it uses the k for ancestry naming convention. The default is False.
root_node_name (str, optional) – Name to use for the root node when saving. The default is “Root”.
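
The per-depth rule for nmfk_params and the Ks_deep_* search range described above can be sketched in plain Python. This illustrates the documented behavior and is not HNMFk's internal code: the helper names params_for_depth and deep_ks are hypothetical, the dictionary keys are placeholders rather than real NMFk options, and treating the maximum k as inclusive is an assumption.

```python
def params_for_depth(nmfk_params, depth):
    """Pick the NMFk parameter dict for a given depth (0 = root).

    Hypothetical helper: a single-item list is reused at every depth,
    otherwise the entry at index `depth` is used.
    """
    if len(nmfk_params) == 1:
        return nmfk_params[0]
    return nmfk_params[depth]


def deep_ks(parent_k, Ks_deep_min=1, Ks_deep_max=None, Ks_deep_step=1):
    """Build the k search range for a node below the root.

    Assumption: the upper bound is inclusive. When Ks_deep_max is None,
    the k selected for the parent node is used as the maximum.
    """
    k_max = parent_k if Ks_deep_max is None else Ks_deep_max
    return list(range(Ks_deep_min, k_max + 1, Ks_deep_step))


# Placeholder dictionaries; "tag" is not a real NMFk option.
shared = [{"tag": "all-depths"}]
per_depth = [{"tag": "depth0"}, {"tag": "depth1"}, {"tag": "depth2"}]

print(params_for_depth(shared, 2))     # same dict at any depth
print(params_for_depth(per_depth, 1))  # entry for depth 1
print(deep_ks(parent_k=4))             # maximum falls back to the parent's k
```

With three entries in the list, index 0 corresponds to the root and indices 1 and 2 to the two levels below it, which matches the "depth of 2" example.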

Return type:

None.
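
As a minimal sketch of the generate_X_callback contract described above, the class below rebuilds X for one node and returns the (new_X, save_at_node) pair. The class name DropEmptyRowsCallback, the row-dropping step, and the "kept_rows" key are illustrative assumptions; only the __call__(original_indices) signature and the two return values come from the parameter description. It assumes clustering on H, so samples are columns.

```python
import numpy as np


class DropEmptyRowsCallback:
    """Hypothetical generate_X_callback: rebuild X for the samples in one node."""

    def __init__(self, X):
        # Keep a reference to the original data matrix.
        self.X = X

    def __call__(self, original_indices):
        # Take the columns (samples) that belong to this node.
        new_X = self.X[:, original_indices]
        # Drop rows (features) that are all zero within this subset.
        keep_rows = np.flatnonzero(new_X.sum(axis=1) > 0)
        new_X = new_X[keep_rows, :]
        # Extra information to be stored in the node's user_node_data.
        save_at_node = {"kept_rows": keep_rows}
        return new_X, save_at_node


X = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [4.0, 0.0, 5.0, 6.0]])
callback = DropEmptyRowsCallback(X)
new_X, save_at_node = callback(np.array([0, 2]))
print(new_X.shape, save_at_node["kept_rows"])
```

An instance like callback would then be passed as generate_X_callback=callback when constructing HNMFk.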

fit(X, Ks, from_checkpoint=False, save_checkpoint=True)

Factorize the input matrix X for each given K value in Ks.

Parameters:

X (np.ndarray or scipy.sparse.csr_matrix) – Input matrix to be factorized.
Ks (list) –
List of K values to factorize the input matrix.

Example: Ks=range(1, 10, 1).
from_checkpoint (bool, optional) – If True, it continues the process from the checkpoint. The default is False.
save_checkpoint (bool, optional) – If True, it saves checkpoints. The default is True.

Return type:

None
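
A hedged usage sketch of the constructor and fit, using only the arguments documented on this page. The wrapper name run_hnmfk and the chosen values (depth=2, sample_thresh=5) are illustrative assumptions, and the import path follows the class path shown above. The import is placed inside the function so the snippet can be loaded without TELF installed.

```python
def run_hnmfk(X, Ks, experiment_name="HNMFk_Output"):
    """Build an HNMFk model with documented constructor arguments and fit it."""
    # Lazy import: requires the TELF package at call time.
    from TELF.factorization.HNMFk import HNMFk

    model = HNMFk(
        nmfk_params=[{}],   # one entry: same NMFk settings at every depth
        cluster_on="H",     # columns of X are samples
        depth=2,            # illustrative value
        sample_thresh=5,    # illustrative value
        experiment_name=experiment_name,
    )
    # fit returns None; results are kept on the model and in checkpoints.
    model.fit(X, Ks, from_checkpoint=False, save_checkpoint=True)
    return model


# Candidate K values for the root node, as in the Ks example above.
Ks = list(range(1, 10, 1))
```

Calling run_hnmfk(X, Ks) with a non-negative matrix X would run the hierarchical factorization; this is not executed here because it depends on TELF and on real data.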

get_node()

Graph iterator. Returns the current node.

This operation is online: only one node is kept in memory at a time.

Returns: data – Dictionary format of node.
Return type: dict

get_tiny_leaf_topics()

Graph iterator for tiny documents, available once they have been processed with self.process_tiny_leaf_topics(threshold:int).

Returns: tiny_leafs – List of dictionaries, each in the node dictionary format.
Return type: list

go_to_children(idx: int)

Graph iterator. Goes to the child node specified by index.

This operation is online: only one node is kept in memory at a time.

Parameters: idx (int) – Child index.
Returns: data – Dictionary format of node.
Return type: dict

go_to_node(name: str)

Graph iterator. Goes to the node specified by name (node.node_name).

This operation is online: only one node is kept in memory at a time.

Parameters: name (str) – Name of the node.
Returns: data – Dictionary format of node.
Return type: dict

go_to_parent()

Graph iterator. Goes to the parent of the current node.

This operation is online: only one node is kept in memory at a time.

Returns: data – Dictionary format of node.
Return type: dict

go_to_root()

Graph iterator. Goes to the root node.

This operation is online: only one node is kept in memory at a time.

Returns: data – Dictionary format of node.
Return type: dict
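
The iterator methods go_to_root, go_to_children, and go_to_parent can be combined into a depth-first walk. The sketch below collects leaf names under two assumptions that this page does not confirm: that each returned node dictionary exposes "node_name" and "child_node_names" keys (mirroring the Node fields), and that go_to_parent returns the walk to the node it came from. FakeTree is a hypothetical stand-in so the example runs without a fitted model; a real HNMFk instance would be passed in its place.

```python
def collect_leaf_names(model):
    """Depth-first walk using only the documented iterator methods."""
    leaves = []

    def visit(node):
        children = node["child_node_names"]
        if not children:
            leaves.append(node["node_name"])
            return
        for idx in range(len(children)):
            visit(model.go_to_children(idx))
            model.go_to_parent()  # step back up before trying the next child

    visit(model.go_to_root())
    return leaves


class FakeTree:
    """Hypothetical stand-in exposing the documented iterator methods."""

    def __init__(self, nodes, root="Root"):
        self.nodes = nodes  # node name -> node dictionary
        self.root = root
        self.current = root

    def get_node(self):
        return self.nodes[self.current]

    def go_to_root(self):
        self.current = self.root
        return self.get_node()

    def go_to_parent(self):
        self.current = self.get_node()["parent_node_name"]
        return self.get_node()

    def go_to_children(self, idx):
        self.current = self.get_node()["child_node_names"][idx]
        return self.get_node()


nodes = {
    "Root": {"node_name": "Root", "parent_node_name": None, "child_node_names": ["A", "B"]},
    "A": {"node_name": "A", "parent_node_name": "Root", "child_node_names": []},
    "B": {"node_name": "B", "parent_node_name": "Root", "child_node_names": ["C"]},
    "C": {"node_name": "C", "parent_node_name": "B", "child_node_names": []},
}
leaves = collect_leaf_names(FakeTree(nodes))
print(leaves)
```

Because each step replaces the current node, this style of walk holds one node at a time, in contrast to traverse_nodes, which returns every node in a list.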

load_model()

Loads an existing model from the checkpoint file located at the self.experiment_name path. This checkpoint file exists if fit(save_checkpoint=True) was used. Use this to leverage the graph iterator for an existing model.

process_tiny_leaf_topics(threshold=5)

Graph post-processing with thresholding on number of documents.

Returns a list of all tiny nodes, that is, all nodes with fewer documents than the threshold.

Removes these outlier nodes from their parents’ child-node lists in the original graph.

The graph is reset each time this function is called, so that the original child nodes are re-assigned.

If threshold=None, this function only re-assigns the original child indices and returns None.

Parameters: threshold (int) – Minimum number of documents each node should have.
Returns: tiny_leafs – List of dictionaries, each in the node dictionary format.
Return type: list
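
The thresholding rule can be illustrated on plain dictionaries. This is a sketch of the described behavior, not the library's implementation: the function name split_tiny_leaves is hypothetical, the keys "leaf", "num_samples", "node_name", and "child_node_names" are assumed to mirror the Node fields, and restricting the rule to leaf nodes is an assumption based on the method name.

```python
def split_tiny_leaves(nodes, threshold=5):
    """Partition node dicts into (kept, tiny) by sample count.

    A node counts as tiny when it is a leaf with fewer than `threshold`
    samples. With threshold=None nothing is removed and tiny is None.
    """
    if threshold is None:
        return list(nodes), None

    tiny = [n for n in nodes if n["leaf"] and n["num_samples"] < threshold]
    tiny_names = {n["node_name"] for n in tiny}

    kept = []
    for n in nodes:
        if n["node_name"] in tiny_names:
            continue
        # Copy the node so the original child list stays available for a reset.
        pruned = dict(n)
        pruned["child_node_names"] = [
            c for c in n["child_node_names"] if c not in tiny_names
        ]
        kept.append(pruned)
    return kept, tiny


nodes = [
    {"node_name": "Root", "leaf": False, "num_samples": 100, "child_node_names": ["A", "B"]},
    {"node_name": "A", "leaf": True, "num_samples": 97, "child_node_names": []},
    {"node_name": "B", "leaf": True, "num_samples": 3, "child_node_names": []},
]
kept, tiny = split_tiny_leaves(nodes, threshold=5)
print([n["node_name"] for n in tiny])
print(kept[0]["child_node_names"])
```

Working on copies keeps the original child lists intact, which mirrors the documented reset of the graph on each call.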

traverse_nodes()

Graph iterator. Returns all nodes in list format.

This operation loads every node into memory and keeps it there.

Returns: data – List of all nodes where each node is a dictionary.
Return type: list

traverse_tiny_leaf_topics(threshold=5)

Graph iterator with thresholding on number of documents. Returns a list of nodes whose number of documents is less than the threshold.

This operation is online: only the nodes that are outliers based on the number of documents are kept in memory.

Parameters: threshold (int) – Minimum number of documents each node should have.
Returns: data – List of dictionaries, each in the node dictionary format.
Return type: list

class TELF.factorization.HNMFk.Node(node_name: str, depth: int, parent_topic: int, parent_node_k: int, W: ndarray, H: ndarray, k: int, parent_node_name: str, child_node_names: list, original_indices: ndarray, num_samples: int, leaf: bool, user_node_data: dict, cluster_indices_in_parent: list, node_save_path: str, parent_node_save_path: str, parent_node_factors_path: str, exception: bool, signature: array, probabilities: array, centroids: array, factors_path: str)

Bases: object

class TELF.factorization.HNMFk.OnlineNode(node_path: str, node_name: str, parent_node: object, child_nodes: list)

Bases: object