TELF.factorization.HNMFk: Hierarchical Non-negative Matrix Factorization with Automatic Model Determination

Hierarchical Non-negative matrix factorization with automatic model determination with custom settings including missing value prediction. HNMFk has HPC and

Available Functions

HNMFk.__init__([nmfk_params, cluster_on, ...])

HNMFk is a Hierarchical Non-negative Matrix Factorization module with the capability to do automatic model determination.

HNMFk.fit(X, Ks[, from_checkpoint, ...])

Factorize the input matrix X for each given K value in Ks.

Module Contents

job-schedule - 200
job-complete - 300
signal-exit - 400

class TELF.factorization.HNMFk.HNMFk(nmfk_params=[{}], cluster_on='H', depth=1, sample_thresh=-1, Ks_deep_min=1, Ks_deep_max=None, Ks_deep_step=1, K2=False, experiment_name='HNMFk_Output', generate_X_callback=None, n_nodes=1, verbose=True, comm_buff_size=10000000, random_identifiers=False)

Bases: object

HNMFk is a Hierarchical Non-negative Matrix Factorization module with the capability to do automatic model determination.

Parameters:
  • nmfk_params (list of dicts, optional) –

    NMFk parameters can be specified separately for each depth, or one set can be shared across all depths.

    If nmfk_params contains a single item, HNMFk uses the same NMFk parameters for all depths.

    To use different parameters at each depth, append one dict per depth to the list. For example, [nmfk_params0, nmfk_params1, nmfk_params2] for a depth of 2. The default is [{}], which uses the NMFk defaults together with the required settings params["collect_output"] = False, params["save_output"] = True, and params["predict_k"] = True when K2=False.

  • cluster_on (str, optional) – Where to perform clustering; can be W or H. If W, the rows of X should be samples; if H, the columns of X should be samples. The default is “H”.

  • depth (int, optional) – How deep to go in each topic after the root node. If -1, it continues until the samples cannot be separated further. The default is 1.

  • sample_thresh (int, optional) – Stopping criterion for the number of samples in a cluster. When -1, this criterion is not used. The default is -1.

  • Ks_deep_min (int, optional) – After the first NMFk run, the minimum k at which the Ks search range starts. The default is 1.

  • Ks_deep_max (int, optional) –

    After the first NMFk run, the maximum k to try in the Ks search range.

    When None, the maximum k is the same as the k selected for the parent node.

    The default is None.

  • Ks_deep_step (int, optional) – After the first NMFk run, the step size of the Ks search range. The default is 1.

  • K2 (bool, optional) – If K2=True, decomposition is done only for k=2 instead of finding and predicting the number of stable latent features. The default is False.

  • experiment_name (str, optional) – Where to save the results. The default is “HNMFk_Output”.

  • generate_X_callback (object, optional) –

    This can be used to re-generate the data matrix X before each NMFk operation. When not used, a slice of the original X is taken, which is equivalent to serial decomposition.

    generate_X_callback should be an instance of a class that defines __call__(original_indices), so that new_X, save_at_node = generate_X_callback(original_indices) can be called.

    The original_indices argument holds the indices of the samples (columns of the original X when clustering on H).

    Here save_at_node is a dictionary that can be used to save additional information in each node’s user_node_data variable. The default is None.

  • n_nodes (int, optional) – Number of HPC nodes. The default is 1.

  • verbose (bool, optional) – If True, it prints progress. The default is True.

  • random_identifiers (bool, optional) – If True, the model uses randomly generated strings as the node identifiers. Otherwise, it uses the k-based ancestry naming convention. The default is False.
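The per-depth rules described for nmfk_params and the Ks_deep_* settings can be pictured with a small sketch. The helpers params_for_depth and deep_ks_range are hypothetical names used only for illustration and are not part of TELF, and the key "example_setting" is a placeholder. Treating the maximum k as an inclusive upper bound is an assumption, since the descriptions above do not say how the bound is handled.

```python
def params_for_depth(nmfk_params, depth):
    """Pick the NMFk settings for a depth (hypothetical helper, not TELF API)."""
    if len(nmfk_params) == 1:
        # A single entry is shared by every depth.
        return nmfk_params[0]
    # Otherwise entry i is used at depth i, with 0 being the root.
    return nmfk_params[depth]


def deep_ks_range(parent_k, Ks_deep_min=1, Ks_deep_max=None, Ks_deep_step=1):
    """Ks search range below the root (inclusive upper bound is an assumption)."""
    # When Ks_deep_max is None, fall back to the k selected for the parent node.
    upper = parent_k if Ks_deep_max is None else Ks_deep_max
    return range(Ks_deep_min, upper + 1, Ks_deep_step)


# One shared setting versus one entry per depth (root, depth 1, depth 2).
shared = [{"example_setting": 0}]
per_depth = [{"example_setting": 0}, {"example_setting": 1}, {"example_setting": 2}]

print(params_for_depth(shared, 2))
print(params_for_depth(per_depth, 1))
print(list(deep_ks_range(parent_k=4)))
```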

Return type:

None.
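The generate_X_callback interface described above can be sketched as a small callable class. This is an illustration under stated assumptions, not TELF code: the class name DropEmptyRowsCallback and the "kept_rows" key are hypothetical, and X is held as a plain list of rows so the sketch runs without dependencies, whereas a real callback would return an np.ndarray or sparse matrix. It assumes clustering on H, so original_indices selects columns.

```python
class DropEmptyRowsCallback:
    """Hypothetical callback: rebuild X for a node from its sample columns."""

    def __init__(self, X):
        # Full data matrix as a list of rows; columns are samples.
        self.X = X

    def __call__(self, original_indices):
        cols = list(original_indices)
        # Keep only the columns (samples) that belong to this node.
        sliced = [[row[j] for j in cols] for row in self.X]
        # Drop rows that became all-zero after slicing.
        kept_rows = [i for i, row in enumerate(sliced) if any(v != 0 for v in row)]
        new_X = [sliced[i] for i in kept_rows]
        # Extra information to be stored with the node (user_node_data).
        save_at_node = {"kept_rows": kept_rows}
        return new_X, save_at_node


callback = DropEmptyRowsCallback([[1, 0, 2], [0, 0, 3], [4, 5, 0]])
new_X, save_at_node = callback([0, 1])
print(new_X, save_at_node)
```

Returning the kept row indices in save_at_node is one way to map a node's factors back to the original rows later; whether that bookkeeping is needed depends on how X is re-generated.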

fit(X, Ks, from_checkpoint=False, save_checkpoint=True)

Factorize the input matrix X for each given K value in Ks.

Parameters:
  • X (np.ndarray or scipy.sparse.csr_matrix) – Input matrix to be factorized.

  • Ks (list) –

    List of K values to factorize the input matrix.

    Example: Ks=range(1, 10, 1).

  • from_checkpoint (bool, optional) – If True, it continues the process from the checkpoint. The default is False.

  • save_checkpoint (bool, optional) – If True, it saves checkpoints. The default is True.

Return type:

None
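A minimal usage sketch, assuming TELF is installed and importable under the module path shown in this reference. Only arguments documented above are passed. The wrapper name run_hnmfk is hypothetical, and the import sits inside the function so that defining it does not itself require TELF.

```python
def run_hnmfk(X, Ks=range(1, 10, 1)):
    """Fit a one-level hierarchy on X and return the fitted model (sketch)."""
    # Deferred import: TELF is only needed when the function is called.
    from TELF.factorization.HNMFk import HNMFk

    model = HNMFk(
        nmfk_params=[{}],            # NMFk defaults, shared by all depths
        cluster_on="H",              # columns of X are the samples
        depth=1,
        experiment_name="HNMFk_Output",
    )
    # Checkpoints are written so the run can later be resumed or inspected.
    model.fit(X, Ks, from_checkpoint=False, save_checkpoint=True)
    return model
```

Calling fit again with from_checkpoint=True should, per the parameter description above, continue from the saved checkpoint rather than starting over.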

get_node()

Graph iterator. Returns the current node.

This operation is online: only one node is kept in memory at a time.

Returns:

data – Dictionary format of node.

Return type:

dict

go_to_children(idx: int)

Graph iterator. Goes to the child node specified by index.

This operation is online: only one node is kept in memory at a time.

Parameters:

idx (int) – Child index.

Returns:

data – Dictionary format of node.

Return type:

dict

go_to_node(name: str)

Graph iterator. Goes to the node specified by name (node.node_name).

This operation is online: only one node is kept in memory at a time.

Parameters:

name (str) – Name of the node.

Returns:

data – Dictionary format of node.

Return type:

dict

go_to_parent()

Graph iterator. Goes to the parent of the current node.

This operation is online: only one node is kept in memory at a time.

Returns:

data – Dictionary format of node.

Return type:

dict

go_to_root()

Graph iterator. Goes to the root node.

This operation is online: only one node is kept in memory at a time.

Returns:

data – Dictionary format of node.

Return type:

dict
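The iterator methods above can be combined into a depth-first walk that keeps to the one-node-at-a-time pattern. The sketch below is hedged: it assumes the node dictionaries expose keys named after the Node fields (node_name, depth, child_node_names, parent_node_name), which this page does not confirm. StubModel is a hypothetical in-memory stand-in so the example runs without TELF; a fitted or loaded HNMFk object would take its place.

```python
def walk(model):
    """Yield (depth, node_name) pairs depth-first via the iterator methods."""
    def visit(node):
        yield node["depth"], node["node_name"]
        for idx in range(len(node["child_node_names"])):
            child = model.go_to_children(idx)
            yield from visit(child)
            # Step back up so the next sibling is reached from the parent.
            model.go_to_parent()

    yield from visit(model.go_to_root())


class StubModel:
    """Hypothetical stand-in exposing the documented iterator methods."""

    def __init__(self, nodes, root_name):
        self.nodes = nodes
        self.root_name = root_name
        self.current = root_name

    def get_node(self):
        return self.nodes[self.current]

    def go_to_root(self):
        self.current = self.root_name
        return self.get_node()

    def go_to_children(self, idx):
        self.current = self.get_node()["child_node_names"][idx]
        return self.get_node()

    def go_to_parent(self):
        self.current = self.get_node()["parent_node_name"]
        return self.get_node()


def make_node(name, depth, parent, children):
    return {"node_name": name, "depth": depth,
            "parent_node_name": parent, "child_node_names": children}


nodes = {
    "root": make_node("root", 0, None, ["a", "b"]),
    "a": make_node("a", 1, "root", []),
    "b": make_node("b", 1, "root", ["c"]),
    "c": make_node("c", 2, "b", []),
}
visited = list(walk(StubModel(nodes, "root")))
print(visited)
```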

load_model()

Loads an existing model from the checkpoint file located at the self.experiment_name path.

This checkpoint file exists if fit was called with save_checkpoint=True when the model was run.

Use this function to leverage the graph iterator for an existing model.

Return type:

None
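A sketch of reopening a finished run for inspection, assuming TELF is installed and that a checkpoint was saved under the given experiment_name. The function name inspect_saved_model is hypothetical; only methods documented on this page are called, and the import is deferred so the definition runs without TELF.

```python
def inspect_saved_model(experiment_name="HNMFk_Output"):
    """Load a saved hierarchy and return its root node and node count (sketch)."""
    # Deferred import: TELF is only needed when the function is called.
    from TELF.factorization.HNMFk import HNMFk

    # experiment_name must match the path used when fit() saved its checkpoint.
    model = HNMFk(experiment_name=experiment_name)
    model.load_model()

    root = model.go_to_root()            # one node in memory at a time
    all_nodes = model.traverse_nodes()   # loads every node into memory
    return root, len(all_nodes)
```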

traverse_nodes()

Graph iterator. Returns all nodes in list format.

This operation loads every node into memory and keeps it there.

Returns:

data – List of all nodes where each node is a dictionary.

Return type:

list

class TELF.factorization.HNMFk.Node(node_name: str, depth: int, parent_topic: int, parent_node_k: int, W: ndarray, H: ndarray, k: int, parent_node_name: str, child_node_names: list, original_indices: ndarray, num_samples: int, leaf: bool, user_node_data: dict, cluster_indices_in_parent: list, node_save_path: str, parent_node_save_path: str, parent_node_factors_path: str, exception: bool, signature: array, probabilities: array, centroids: array, factors_path: str)

Bases: object

class TELF.factorization.HNMFk.OnlineNode(node_path: str, node_name: str, parent_node: object, child_nodes: list)

Bases: object