TELF.factorization.utilities package#
Submodules#
TELF.factorization.utilities.clustering module#
- TELF.factorization.utilities.clustering.H_clustering(H, verbose=False) -> (<class 'dict'>, <class 'dict'>)[source]#
Performs H-clustering, and gathers cluster information.
- Parameters:
H (np.ndarray or scipy.sparse.csr_matrix) – H matrix from NMF
verbose (bool, default is False) – If True, shows the progress.
- Returns:
(clusters_information, centroid_similarities) (tuple of dict and dict)
A dictionary carrying information for each cluster,
and a dictionary carrying information for each document.
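As a rough illustration of the idea, the sketch below assumes that H-clustering assigns each document (a column of H) to the latent factor (a row of H) with the largest weight in that column. That assumption, the helper name `argmax_clusters`, and the nested-list input are illustrative only and are not taken from TELF's source.

```python
from collections import defaultdict


def argmax_clusters(H):
    """Group document indices by the factor with the largest weight.

    H is a list of rows (factors); each row holds one weight per document.
    This is a hypothetical helper, not TELF's H_clustering.
    """
    n_factors = len(H)
    n_docs = len(H[0])
    clusters = defaultdict(list)
    for doc in range(n_docs):
        # Pick the row index whose entry in this column is largest.
        best = max(range(n_factors), key=lambda k: H[k][doc])
        clusters[best].append(doc)
    return dict(clusters)


# Two factors, three documents.
H = [
    [0.9, 0.1, 0.2],
    [0.1, 0.8, 0.7],
]
clusters = argmax_clusters(H)
```

Here document 0 falls into cluster 0 and documents 1 and 2 fall into cluster 1. The real function also returns per-document information, which this sketch does not attempt to reproduce.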
TELF.factorization.utilities.co_occurance_matrix module#
Created on Mon Nov 22 18:38:26 2021
@author: maksimekineren
- TELF.factorization.utilities.co_occurance_matrix.co_occurrence(documents, vocabulary, window_size=20, dense=True, verbose=True, sentences=False)[source]#
Forms the co-occurrence matrix.
- Parameters:
documents (list) – List of documents. Each entry in the list contains text. If sentences=True, documents is a list of lists, where each entry is the list of sentences from one document.
window_size (int, optional) – Number of consecutive words over which co-occurrences are counted. If sentence-based analysis is used, the window size is the number of sentences. The default is 20.
vocabulary (list) – List of unique words present in all documents.
dense (bool, optional) – If True, a dense NumPy array is built. If False, a sparse CSR matrix is built.
verbose (bool, optional) – If True, prints progress.
sentences (bool, optional) – If True, documents is a list of lists, where each entry is the list of sentences from that document; window_size is then counted in sentences, and the matrix is built from sentences. If False, documents is a list of text documents and window_size is counted in words.
- Returns:
M – Co-occurrence matrix.
- Return type:
np.ndarray or sparse CSR matrix
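To make the window-based counting concrete, here is a minimal, dependency-free sketch. It assumes whitespace tokenization, symmetric counts, and that two words co-occur when they fall within the same run of `window_size` consecutive tokens. The helper name `window_cooccurrence` and these counting rules are assumptions for illustration; TELF's `co_occurrence` may tokenize and count differently.

```python
def window_cooccurrence(documents, vocabulary, window_size=2):
    """Count symmetric co-occurrences of vocabulary words within a sliding window.

    Hypothetical helper for illustration; returns a dense list-of-lists matrix.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    n = len(vocabulary)
    M = [[0] * n for _ in range(n)]
    for text in documents:
        tokens = text.split()
        for i, left in enumerate(tokens):
            if left not in index:
                continue
            # Pair the token at position i with the tokens that follow it
            # inside a window of window_size consecutive tokens.
            for right in tokens[i + 1 : i + window_size]:
                if right in index and right != left:
                    M[index[left]][index[right]] += 1
                    M[index[right]][index[left]] += 1
    return M


docs = ["a b c", "a b"]
vocab = ["a", "b", "c"]
M = window_cooccurrence(docs, vocab, window_size=2)
```

With a window of two tokens, only adjacent words are paired, so "a" and "b" co-occur twice, "b" and "c" once, and "a" and "c" never.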
TELF.factorization.utilities.organize_n_jobs module#
TELF.factorization.utilities.plot_NMFk module#
- TELF.factorization.utilities.plot_NMFk.plot_BNMFk(Ks, sils, bool_err, path=None, name=None)[source]#
- TELF.factorization.utilities.plot_NMFk.plot_NMFk(data, k_predict, name, path, plot_predict=False, plot_final=False, simple_plot=False, calculate_error=True, Ks_not_computed=[])[source]#
TELF.factorization.utilities.pvalue_analysis module#
- TELF.factorization.utilities.pvalue_analysis.pvalue_analysis(errRegres, Ks, SILL_MIN, SILL_thr=0.9)[source]#
- Parameters:
errRegres (TYPE) – DESCRIPTION.
Ks (TYPE) – DESCRIPTION.
SILL_MIN (TYPE) – DESCRIPTION.
SILL_thr (TYPE, optional) – DESCRIPTION. The default is 0.9.
- Returns:
TYPE – DESCRIPTION.
TYPE – DESCRIPTION.
TELF.factorization.utilities.sppmi_matrix module#
TELF.factorization.utilities.take_note module#
- TELF.factorization.utilities.take_note.append_to_note(notes, path, lock, name='experiment')[source]#
Writes each string in the list notes to a log file. The strings are separated by newline characters.
This function is safe for use with multiple threads, processes (or nodes in a DFS application). Each write operation is protected by a file lock, ensuring that multiple threads or processes do not write to the file simultaneously.
Parameters:#
- notes: list
A list of strings to write to the log file.
- path: str
The directory path where the log file is located.
- name: str, optional
The base name of the log and lock files. Default is “experiment”
- Return type:
None
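The lock-protected append described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses `threading.Lock`, which only protects threads within one process, whereas the documentation describes a file lock that also covers multiple processes. The helper name `append_lines` and the `<name>.log` file name are hypothetical.

```python
import os
import tempfile
import threading


def append_lines(notes, path, lock, name="experiment"):
    """Append each string in notes to <path>/<name>.log while holding the lock."""
    log_file = os.path.join(path, f"{name}.log")  # assumed file naming
    with lock:
        # Only one holder of the lock writes at a time, so lines never interleave.
        with open(log_file, "a") as handle:
            for note in notes:
                handle.write(f"{note}\n")


lock = threading.Lock()
workdir = tempfile.mkdtemp()
workers = [
    threading.Thread(target=append_lines, args=([f"worker {i}"], workdir, lock))
    for i in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

with open(os.path.join(workdir, "experiment.log")) as handle:
    lines = handle.read().splitlines()
```

The order of the lines depends on thread scheduling, but each line arrives intact because every write happens under the lock.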
- TELF.factorization.utilities.take_note.format_note(kwargs, spacing=16)[source]#
Formats the values from kwargs dictionary into a single string. The first element in kwargs is left aligned while the rest are right aligned. Each element is spaced according to the spacing parameter.
Parameters:#
- kwargs: dict
The dictionary whose values are to be formatted.
- spacing: int
The number of spaces between each element in the output string. Default is 16 (4 tabs)
Returns:#
- str:
A string with the formatted values from the dictionary, followed by a newline.
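The alignment rule can be sketched with Python's format specification. The sketch assumes that each value is padded to a field of `spacing` characters, with the first field left-aligned and the others right-aligned. The helper name `format_values` is hypothetical, and TELF's `format_note` may pad differently.

```python
def format_values(kwargs, spacing=16):
    """Join the dictionary values into one line of fixed-width fields."""
    values = [str(value) for value in kwargs.values()]
    if not values:
        return "\n"
    first, rest = values[0], values[1:]
    # "<" pads on the right (left-aligned); ">" pads on the left (right-aligned).
    left = f"{first:<{spacing}}"
    right = "".join(f"{value:>{spacing}}" for value in rest)
    return left + right + "\n"


line = format_values({"k": 2, "sil": 0.95}, spacing=6)
```

With `spacing=6`, the first field is "2" followed by five spaces and the second is "0.95" preceded by two spaces.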
- TELF.factorization.utilities.take_note.take_note(notes, path, lock, name='experiment')[source]#
Writes key-value pairs from a dictionary to a log file in the format ‘key = value’. Each pair is written on its own line.
This function is safe for use with multiple threads, processes (or nodes in a DFS application). Each write operation is protected by a file lock, ensuring that multiple threads or processes do not write to the file simultaneously.
Parameters:#
- notes: dict
Dictionary containing key-value pairs to be written to the log file.
- path: str
The directory path where the log file is located.
- name: str, optional
The base name of the log and lock files. Default is “experiment”
- Return type:
None
- TELF.factorization.utilities.take_note.take_note_fmat(path, lock, name='experiment', sort_index=0, spacing=16, **kwargs)[source]#
Records stats in the log file in a formatted manner. The stats are the values stored in the kwargs dictionary: each key names the type of stat being recorded, and each value is the entry written to the log file. The first stat in kwargs is left-aligned and the rest are right-aligned. The function also sorts the current batch of stats being recorded, using the value at index sort_index. This sorting matters because stats recorded from multiple processes may otherwise appear out of order.
This function is safe for use with multiple threads, processes (or nodes in a DFS application). Each write operation is protected by a file lock, ensuring that multiple threads or processes do not write to the file simultaneously.
Parameters:#
- path: str
The directory path where the log file is located.
- name: str, optional
The base name of the log and lock files. Default is “experiment”
- sort_index: int, optional
The index of the stat value on which to sort the stats
- spacing: int, optional
The number of spaces between each element in the output string. Default is 16 (4 tabs)
- kwargs: dict
The stats to record
- Return type:
None
TELF.factorization.utilities.vectorize module#
- TELF.factorization.utilities.vectorize.count(documents, options)[source]#
Count vectorizer.
- Parameters:
documents (list) – List of documents. Each entry in the list contains text.
options (dict) – Parameters for sklearn library.
- Returns:
X (list) – Sparse bag-of-words matrix.
vocabulary (list) – List of unique words present in all documents.
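To show what a count vectorizer produces, here is a small pure-Python sketch. It assumes lowercase whitespace tokenization and an alphabetically sorted vocabulary, and it returns dense rows rather than a sparse matrix. The helper name `bag_of_words` is hypothetical; it does not call TELF or scikit-learn and ignores the `options` argument entirely.

```python
from collections import Counter


def bag_of_words(documents):
    """Return per-document word counts and the vocabulary that indexes them."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    X = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        # Counter tallies how many times each token appears in this document.
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        X.append(row)
    return X, vocabulary


X, vocabulary = bag_of_words(["the cat sat", "the cat and the dog"])
```

Each row of `X` corresponds to one document and each column to one vocabulary word, so the second row records that "the" appears twice.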
- TELF.factorization.utilities.vectorize.tfidf(documents, options)[source]#
TF-IDF vectorizer.
- Parameters:
documents (list) – List of documents. Each entry in the list contains text.
options (dict) – Parameters for sklearn library.
- Returns:
X (list) – Sparse TF-IDF matrix.
vocabulary (list) – List of unique words present in all documents.
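The weighting behind a TF-IDF matrix can be sketched in pure Python. The sketch assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which matches scikit-learn's default settings. Whether TELF's `tfidf` uses those defaults depends on the `options` passed, so treat the formula, the whitespace tokenization, and the helper name `tfidf_rows` as assumptions.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return L2-normalized TF-IDF rows (dense lists) and the vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1.0 for word in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[word] * idf[word] for word in vocabulary]
        # Guard against division by zero for empty documents.
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return rows, vocabulary


X, vocabulary = tfidf_rows(["cat sat", "cat ran"])
```

In this example "cat" appears in both documents and "sat" in only one, so "sat" receives the larger weight in the first row.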