TELF.factorization.utilities package#

Submodules#

TELF.factorization.utilities.clustering module#

TELF.factorization.utilities.clustering.H_clustering(H, verbose=False) -> (<class 'dict'>, <class 'dict'>)[source]#

Performs H-clustering, and gathers cluster information.

Parameters:
  • H (np.ndarray or scipy.sparse.csr_matrix) – H matrix from NMF

  • verbose (bool, default is False) – If True, shows the progress.

Returns:

  • (clusters_information, centroid_similarities) (tuple of dict and dict) – Dictionary carrying information for each cluster, and dictionary carrying information for each document.
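One common way to cluster with an NMF H matrix is to assign each column (document) to the row (latent factor) where it carries the largest weight. The sketch below assumes that convention and a plain list-of-lists H. The name h_clustering_sketch is hypothetical, and the sketch does not reproduce the centroid information that H_clustering gathers.

```python
def h_clustering_sketch(H):
    """Group column indices of H by the row holding each column's maximum.

    Assumption: H is a list of rows (latent factors), each a list of
    per-document weights. Ties go to the lowest row index.
    """
    n_docs = len(H[0])
    clusters = {}
    for doc in range(n_docs):
        column = [row[doc] for row in H]
        label = max(range(len(column)), key=column.__getitem__)
        clusters.setdefault(label, []).append(doc)
    return clusters


H = [
    [0.9, 0.1, 0.4],  # factor 0
    [0.1, 0.8, 0.6],  # factor 1
]
clusters = h_clustering_sketch(H)
```

Here document 0 lands in cluster 0, and documents 1 and 2 land in cluster 1.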

TELF.factorization.utilities.clustering.plot_H_clustering(H, name='filename')[source]#

Plots the centroids of the H-clusters.

Parameters:
  • H (np.ndarray or scipy.sparse.csr_matrix) – H matrix from NMF

  • name (str, default is 'filename') – File name to save.

Return type:

Matplotlib plots

TELF.factorization.utilities.co_occurance_matrix module#

Created on Mon Nov 22 18:38:26 2021

@author: maksimekineren

TELF.factorization.utilities.co_occurance_matrix.co_occurrence(documents, vocabulary, window_size=20, dense=True, verbose=True, sentences=False)[source]#

Forms the co-occurrence matrix.

Parameters:
  • documents (list) – List of documents, each containing text. If sentences=True, documents is instead a list of lists, where each inner list holds the sentences of one document.

  • window_size (int, optional) – Number of consecutive words over which co-occurrences are counted. For sentence-based analysis, the window size is the number of sentences.

  • vocabulary (list) – List of unique words present in all documents.

  • dense (bool, optional) – If True, a dense NumPy array is built. If False, a sparse CSR matrix is built.

  • verbose (bool, optional) – If True, prints progress.

  • sentences (bool, optional) – If True, documents is a list of lists, where each entry is the list of sentences from one document; the window size then counts sentences, and the matrix is built from sentences. If False, documents is a list of text documents and the window size counts words.

Returns:

M – Co-occurrence matrix.

Return type:

np.ndarray or sparse CSR matrix
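To make the word-window case concrete, here is a minimal standard-library sketch of sliding-window co-occurrence counting. It is not the TELF implementation: the name cooccurrence_sketch is hypothetical, tokenization is a plain whitespace split, the window is assumed to include the current word plus the next window_size - 1 words, and counts are assumed to be symmetric.

```python
from collections import Counter


def cooccurrence_sketch(documents, vocabulary, window_size=2):
    """Return a dense, symmetric co-occurrence matrix as nested lists."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = Counter()
    for text in documents:
        # Out-of-vocabulary tokens become None but still occupy a position.
        ids = [index.get(token) for token in text.split()]
        for pos, left in enumerate(ids):
            if left is None:
                continue
            # Pair the current word with the words that follow it in the window.
            for right in ids[pos + 1 : pos + window_size]:
                if right is not None:
                    counts[(left, right)] += 1
                    counts[(right, left)] += 1
    n = len(vocabulary)
    return [[counts[(i, j)] for j in range(n)] for i in range(n)]


M = cooccurrence_sketch(["a b c", "a b"], ["a", "b", "c"], window_size=2)
```

With a window of two words, only adjacent words are paired, so "a" and "c" never co-occur in this example.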

TELF.factorization.utilities.organize_n_jobs module#

TELF.factorization.utilities.organize_n_jobs.organize_n_jobs(use_gpu, n_jobs)[source]#

TELF.factorization.utilities.plot_NMFk module#

TELF.factorization.utilities.plot_NMFk.plot_BNMFk(Ks, sils, bool_err, path=None, name=None)[source]#
TELF.factorization.utilities.plot_NMFk.plot_NMFk(data, k_predict, name, path, plot_predict=False, plot_final=False, simple_plot=False, calculate_error=True, Ks_not_computed=[])[source]#
TELF.factorization.utilities.plot_NMFk.plot_RESCALk(data, k_predict, name, path, plot_predict=False, plot_final=False, simple_plot=False, calculate_error=True)[source]#
TELF.factorization.utilities.plot_NMFk.plot_SymNMFk(data, name, path, plot_final=False)[source]#
TELF.factorization.utilities.plot_NMFk.plot_consensus_mat(C, figname)[source]#
TELF.factorization.utilities.plot_NMFk.plot_cophenetic_coeff(Ks, coeff, figname)[source]#

TELF.factorization.utilities.pvalue_analysis module#

TELF.factorization.utilities.pvalue_analysis.pvalue_analysis(errRegres, Ks, SILL_MIN, SILL_thr=0.9)[source]#
Parameters:
  • errRegres (TYPE) – DESCRIPTION.

  • Ks (TYPE) – DESCRIPTION.

  • SILL_MIN (TYPE) – DESCRIPTION.

  • SILL_thr (TYPE, optional) – DESCRIPTION. The default is 0.9.

Returns:

  • TYPE – DESCRIPTION.

  • TYPE – DESCRIPTION.

TELF.factorization.utilities.sppmi_matrix module#

TELF.factorization.utilities.sppmi_matrix.sppmi(cooc, shift=4)[source]#

Computes the shifted positive pointwise mutual information (SPPMI) from the co-occurrence matrix.

Parameters:
  • cooc – Sparse co-occurrence matrix.

  • shift – The shift. Default is 4.

Returns:

sppmi_matrix – Sparse shifted positive pointwise mutual information matrix.

author: Erik Skau
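As a rough illustration of the quantity involved, the sketch below uses the usual definition: PMI(i, j) = log(C_ij * N / (r_i * c_j)), where N is the sum of all counts and r_i, c_j are the row and column sums, and SPPMI(i, j) = max(PMI(i, j) - log(shift), 0). The name sppmi_sketch and the dictionary-based sparse layout are assumptions made for the example; the TELF function operates on sparse matrices and may differ in detail.

```python
import math
from collections import defaultdict


def sppmi_sketch(cooc, shift=4):
    """cooc maps (row, col) -> positive count; returns the same sparse layout."""
    row_sum = defaultdict(float)
    col_sum = defaultdict(float)
    for (i, j), count in cooc.items():
        row_sum[i] += count
        col_sum[j] += count
    total = sum(row_sum.values())

    result = {}
    for (i, j), count in cooc.items():
        pmi = math.log(count * total / (row_sum[i] * col_sum[j]))
        value = pmi - math.log(shift)
        if value > 0:  # keep only the positive part, so the result stays sparse
            result[(i, j)] = value
    return result


cooc = {(0, 1): 2, (1, 0): 2}
unshifted = sppmi_sketch(cooc, shift=1)
shifted = sppmi_sketch(cooc, shift=4)
```

A larger shift raises the bar an entry must clear to stay nonzero, which is why the example with shift=4 returns no entries at all.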

TELF.factorization.utilities.take_note module#

TELF.factorization.utilities.take_note.append_to_note(notes, path, lock, name='experiment')[source]#

Writes each string in the list notes to a log file, one string per line.

This function is safe for use with multiple threads, processes (or nodes in a DFS application). Each write operation is protected by a file lock, ensuring that multiple threads or processes do not write to the file simultaneously.

Parameters:#

notes: list

A list of strings to write to the log file.

path: str

The directory path where the log file is located.

name: str, optional

The base name of the log and lock files. Default is “experiment”.

Return type:

None

TELF.factorization.utilities.take_note.format_note(kwargs, spacing=16)[source]#

Formats the values from kwargs dictionary into a single string. The first element in kwargs is left aligned while the rest are right aligned. Each element is spaced according to the spacing parameter.

Parameters:#

kwargs: dict

The dictionary whose values are to be formatted.

spacing: int

The number of spaces between each element in the output string. Default is 16 (4 tabs)

Returns:#

str:

A string with the formatted values from the dictionary, followed by a newline.
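The alignment rule can be sketched with str.ljust and str.rjust. This assumes that spacing is the width of each column, which is one plausible reading of the description; the name format_note_sketch is hypothetical.

```python
def format_note_sketch(kwargs, spacing=16):
    """Left-align the first value, right-align the rest, end with a newline."""
    values = [str(value) for value in kwargs.values()]
    if not values:
        return "\n"
    first, rest = values[0], values[1:]
    return first.ljust(spacing) + "".join(v.rjust(spacing) for v in rest) + "\n"


line = format_note_sketch({"k": 2, "sil": 0.9}, spacing=6)
```

Only the dictionary values appear in the output; the keys serve as labels on the caller's side.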

TELF.factorization.utilities.take_note.take_note(notes, path, lock, name='experiment')[source]#

Writes key-value pairs from a dictionary into a log file in the format ‘key = value’, one pair per line.

This function is safe for use with multiple threads, processes (or nodes in a DFS application). Each write operation is protected by a file lock, ensuring that multiple threads or processes do not write to the file simultaneously.

Parameters:#

notes: dict

Dictionary containing key-value pairs to be written to the log file.

path: str

The directory path where the log file is located.

name: str, optional

The base name of the log and lock files. Default is “experiment”.

Return type:

None
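The lock-protected write pattern described above can be sketched as follows. This is an illustration only: it uses an in-process threading.Lock, whereas coordinating separate processes or nodes would need a file-based lock. The function name take_note_sketch and the ".log" file suffix are assumptions made for the example.

```python
import os
import tempfile
import threading


def take_note_sketch(notes, path, lock, name="experiment"):
    """Append 'key = value' lines to <path>/<name>.log while holding the lock."""
    lines = "".join(f"{key} = {value}\n" for key, value in notes.items())
    with lock:  # only one writer touches the file at a time
        with open(os.path.join(path, name + ".log"), "a") as handle:
            handle.write(lines)


def demo():
    """Write from four threads, then return the lines found in the log."""
    lock = threading.Lock()
    with tempfile.TemporaryDirectory() as path:
        workers = [
            threading.Thread(target=take_note_sketch, args=({"k": k}, path, lock))
            for k in range(4)
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
        with open(os.path.join(path, "experiment.log")) as handle:
            return handle.read().splitlines()
```

The order of the lines depends on thread scheduling, but each line arrives intact because writes never interleave.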

TELF.factorization.utilities.take_note.take_note_fmat(path, lock, name='experiment', sort_index=0, spacing=16, **kwargs)[source]#

Records stats in the log file in a formatted manner. The stats are the values in kwargs: each key names the type of stat, and each value is the entry written to the log file. The first stat in kwargs is left-aligned and the rest are right-aligned. The function also sorts the current batch of recorded stats by the value at index sort_index; this matters because multiple processes may record stats out of order.

This function is safe for use with multiple threads, processes (or nodes in a DFS application). Each write operation is protected by a file lock, ensuring that multiple threads or processes do not write to the file simultaneously.

Parameters:#

path: str

The directory path where the log file is located.

name: str, optional

The base name of the log and lock files. Default is “experiment”.

sort_index: int, optional

The index of the stat value on which to sort the stats.

spacing: int, optional

The number of spaces between each element in the output string. Default is 16 (4 tabs)

kwargs: dict

The stats to record.

Return type:

None
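The sorting step can be pictured as re-ordering rows that parallel workers recorded out of order. The sketch below assumes each row is a whitespace-separated string and that the field at sort_index is numeric; both are assumptions, and sort_rows_sketch is a hypothetical name, not part of TELF.

```python
def sort_rows_sketch(rows, sort_index=0):
    """Sort whitespace-separated log rows by the numeric field at sort_index."""
    return sorted(rows, key=lambda row: float(row.split()[sort_index]))


# Rows as they might arrive when workers finish in arbitrary order.
recorded = ["3   0.50", "1   0.90", "2   0.70"]
ordered = sort_rows_sketch(recorded, sort_index=0)
```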

TELF.factorization.utilities.vectorize module#

TELF.factorization.utilities.vectorize.count(documents, options)[source]#

Count vectorizer.

Parameters:
  • documents (list) – List of documents. Each entry in the list contains text.

  • options (dict) – Parameters for sklearn library.

Returns:

  • X (list) – Sparse bag-of-words matrix.

  • vocabulary (list) – List of unique words present in all documents.

TELF.factorization.utilities.vectorize.tfidf(documents, options)[source]#

TF-IDF vectorizer.

Parameters:
  • documents (list) – List of documents. Each entry in the list contains text.

  • options (dict) – Parameters for sklearn library.

Returns:

  • X (list) – Sparse TF-IDF matrix.

  • vocabulary (list) – List of unique words present in all documents.
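To show what the two return values look like, here is a standard-library sketch of both vectorizers. It does not call scikit-learn and ignores the options dictionary. Tokenization is a lowercase whitespace split, and the TF-IDF weighting assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization, which is the default scikit-learn documents for its TF-IDF vectorizer. The names count_sketch and tfidf_sketch are hypothetical.

```python
import math
from collections import Counter


def count_sketch(documents):
    """Return (rows, vocabulary); each row maps a column index to a term count."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = [dict(Counter(index[t] for t in tokens)) for tokens in tokenized]
    return rows, vocabulary


def tfidf_sketch(documents):
    """Return (rows, vocabulary) with L2-normalized TF-IDF weights per row."""
    rows, vocabulary = count_sketch(documents)
    n_docs = len(rows)
    doc_freq = Counter(col for row in rows for col in row)
    weighted = []
    for row in rows:
        scores = {
            col: tf * (math.log((1 + n_docs) / (1 + doc_freq[col])) + 1)
            for col, tf in row.items()
        }
        norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
        weighted.append({col: s / norm for col, s in scores.items()})
    return weighted, vocabulary


docs = ["apple banana apple", "banana cherry"]
X_count, vocab = count_sketch(docs)
X_tfidf, _ = tfidf_sketch(docs)
```

In the second document, "cherry" receives more weight than "banana" because "banana" appears in every document and so has the lowest idf.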

Module contents#