TELF.pre_processing.Beaver package#
Submodules#
TELF.pre_processing.Beaver.beaver module#
© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.
- class TELF.pre_processing.Beaver.beaver.Beaver(n_nodes=1, n_jobs=1)[source]#
Bases:
object
- SUPPORTED_OUTPUT_FORMATS = {'pydata', 'scipy'}#
- citation_tensor(dataset: DataFrame, target_columns: tuple = ('author_ids', 'paper_id', 'references', 'year'), dimension_order: list = [0, 1, 2], split_authors_with: str = ';', split_references_with: str = ';', save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'loky', verbose: bool = False, return_object: bool = True, output_mode: str = 'pydata') tuple [source]#
Creates an Authors by Papers by Time tensor. A non-zero entry x at author i, paper j, year k means that author i cited paper j x times in year k.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“author_ids”, “paper_id”, “references”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes last.
dimension_order (list, optional) – Order in which the dimensions appear. For example, [0,1,2] means it is Authors, Papers, Time and [1,0,2] means it is Papers, Authors, Time.
split_authors_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.
split_references_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[2]. The default is “;”.
save_path (str, optional) – If not None, saves the outputs. The default is None.
n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.
n_jobs (int, optional) – Number of jobs to use. Default is the Beaver default.
joblib_backend (str, optional) – Joblib parallel backend. The default is “loky”.
verbose (bool, optional) – Verbosity flag. The default is False.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.
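The entry semantics described above can be illustrated with a small, library-independent sketch. The helper `citation_counts` and the toy records are hypothetical and only mirror the default column names; this is not TELF's implementation, which returns a sparse tensor rather than a dictionary.

```python
from collections import Counter


def citation_counts(records, split_authors_with=";", split_references_with=";"):
    """Hypothetical helper: count how often each author cites each paper per year.

    Returns a mapping {(author, cited_paper, year): count}.
    """
    counts = Counter()
    for row in records:
        authors = row["author_ids"].split(split_authors_with)
        # Drop empty strings produced by papers without references.
        references = [r for r in row["references"].split(split_references_with) if r]
        for author in authors:
            for cited in references:
                counts[(author, cited, row["year"])] += 1
    return counts


# Toy data (assumed layout): one row per paper.
records = [
    {"author_ids": "a1;a2", "paper_id": "p3", "references": "p1;p2", "year": 2020},
    {"author_ids": "a1", "paper_id": "p4", "references": "p1", "year": 2020},
]
counts = citation_counts(records)
print(counts[("a1", "p1", 2020)])  # a1 cited p1 from two papers published in 2020
```

Each key of the mapping corresponds to one non-zero coordinate of the Authors by Papers by Time tensor.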
- coauthor_tensor(dataset: DataFrame, target_columns: tuple = ('authorIDs', 'year'), split_authors_with: str = ';', verbose: bool = False, save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'multiprocessing', authors_idx_map: dict = {}, time_idx_map: dict = {}, return_object: bool = True, output_mode: str = 'pydata') tuple [source]#
Create co-author tensor. Returns tuple of tensor, authors, and time.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes second.
split_authors_with (str, optional) – What symbol to use to get list of individual authors from string. The default is “;”.
verbose (bool, optional) – Verbosity flag. The default is False.
save_path (str, optional) – If not None, saves the outputs. The default is None.
n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.
n_jobs (int, optional) – Number of jobs to use. Default is the Beaver default.
joblib_backend (str, optional) – Joblib parallel backend. Default is multiprocessing.
authors_idx_map (dict, optional) – Author to tensor dimension index mapping. Default is {}. If not passed, it is created.
time_idx_map (dict, optional) – Time to tensor dimension index mapping. Default is {}. If not passed, it is created.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.
- Returns:
Tuple of tensor, author vocabulary, and time vocabulary.
- Return type:
tuple
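A rough sketch of how such a tensor and its index maps could be assembled is given below. It assumes that entry (i, j, k) counts the papers that authors i and j wrote together in year k, which the description above does not state explicitly. The function `coauthor_coords` is hypothetical and returns sparse coordinates rather than a tensor object.

```python
from collections import Counter
from itertools import permutations


def coauthor_coords(records, split_authors_with=";", authors_idx_map=None, time_idx_map=None):
    """Hypothetical helper: sparse coordinates of an Authors x Authors x Time tensor."""
    authors_idx_map = dict(authors_idx_map or {})
    time_idx_map = dict(time_idx_map or {})
    # Build the index maps when they are not supplied by the caller.
    if not authors_idx_map:
        authors = sorted({a for r in records for a in r["authorIDs"].split(split_authors_with)})
        authors_idx_map = {a: i for i, a in enumerate(authors)}
    if not time_idx_map:
        years = sorted({r["year"] for r in records})
        time_idx_map = {y: k for k, y in enumerate(years)}

    coords = Counter()
    for r in records:
        authors = r["authorIDs"].split(split_authors_with)
        k = time_idx_map[r["year"]]
        # Ordered pairs make the resulting slices symmetric.
        for a, b in permutations(authors, 2):
            coords[(authors_idx_map[a], authors_idx_map[b], k)] += 1
    return coords, authors_idx_map, time_idx_map


records = [
    {"authorIDs": "a1;a2", "year": 2019},
    {"authorIDs": "a1;a2;a3", "year": 2020},
]
coords, authors_idx_map, time_idx_map = coauthor_coords(records)
```

Passing precomputed maps keeps the author and time indices aligned across several tensors built from different subsets of the same data.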
- cocitation_tensor(dataset: DataFrame, target_columns: tuple = ('authorIDs', 'year', 'paper_id', 'references'), split_authors_with: str = ';', split_references_with: str = ';', verbose: bool = False, save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'multiprocessing', authors_idx_map: dict = {}, time_idx_map: dict = {}, return_object: bool = True, output_mode: str = 'pydata') tuple [source]#
Creates an Authors by Authors by Time tensor. A non-zero entry x at author i, author j, year k means that author j cited author i x times in year k. Note that x is normalized: if paper a has n authors and cites paper b, each author of b receives 1/n citations from each author of a.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “year”, “paper_id”, “references”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes second.
split_authors_with (str, optional) – What symbol to use to get list of individual authors from string. The default is “;”.
split_references_with (str, optional) – What symbol to use to get list of individual references from string. The default is “;”.
verbose (bool, optional) – Verbosity flag. The default is False.
save_path (str, optional) – If not None, saves the outputs. The default is None.
n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.
n_jobs (int, optional) – Number of parallel jobs. The default is the Beaver default.
authors_idx_map (dict, optional) – Author to tensor dimension index mapping. Default is {}. If not passed, it is created.
time_idx_map (dict, optional) – Time to tensor dimension index mapping. Default is {}. If not passed, it is created.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.
- Returns:
Tuple of tensor, author vocabulary, and time vocabulary.
- Return type:
tuple
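The 1/n normalization can be checked with a short worked example. The sketch below is not TELF's code; `cocitation_weights` is a hypothetical helper, and it assumes the year of the citing paper is used for the time dimension.

```python
from collections import defaultdict
from fractions import Fraction


def cocitation_weights(records, split_authors_with=";", split_references_with=";"):
    """Hypothetical helper: {(cited_author, citing_author, year): normalized citations}."""
    authors_of = {r["paper_id"]: r["authorIDs"].split(split_authors_with) for r in records}
    weights = defaultdict(Fraction)
    for r in records:
        citing_authors = authors_of[r["paper_id"]]
        share = Fraction(1, len(citing_authors))  # each citing author contributes 1/n
        for ref in filter(None, r["references"].split(split_references_with)):
            # References to papers outside the dataset are skipped.
            for cited_author in authors_of.get(ref, []):
                for citing_author in citing_authors:
                    # Assumption: the citing paper's year indexes the time dimension.
                    weights[(cited_author, citing_author, r["year"])] += share
    return weights


records = [
    {"authorIDs": "x", "year": 2018, "paper_id": "b", "references": ""},
    {"authorIDs": "u;v", "year": 2020, "paper_id": "a", "references": "b"},
]
weights = cocitation_weights(records)
print(weights[("x", "u", 2020)])  # 1/2
```

Summing the entries for author x over the two citing authors gives 1, so one citing paper contributes exactly one citation to each author of the cited paper.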
- cooccurrence_matrix(dataset: DataFrame, target_column: str = 'abstracts', cooccurrence_settings: dict = {}, sppmi_settings: dict = {}, save_path: str = None, return_object: bool = True, output_mode: str = 'scipy') tuple [source]#
Generates co-occurrence and SPPMI matrices.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_column (str, optional) – Target column name in dataset DataFrame. The default is “abstracts”. The target column should contain text data; tokens are obtained by splitting on whitespace.
cooccurrence_settings (dict, optional) – Settings for the co-occurrence matrix. The default is {}. Options are: vocabulary, window_size=20, dense=True, verbose=True, sentences=False.
sppmi_settings (dict, optional) – Settings for the SPPMI matrix. The default is {}. Options are: shift=4.
save_path (str, optional) – If not None, saves the outputs. The default is None.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘scipy’.
- Returns:
Tuple of co-occurrence and SPPMI matrices.
- Return type:
tuple
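For readers unfamiliar with SPPMI, the sketch below applies the commonly used definition, max(PMI(w, c) - log(shift), 0), to a small co-occurrence matrix. Whether TELF's sppmi module follows exactly this formula is an assumption, and the function `sppmi_from_counts` is hypothetical.

```python
import math


def sppmi_from_counts(cooc, shift=4):
    """Hypothetical helper: shifted positive PMI from a square list-of-lists of counts."""
    total = sum(sum(row) for row in cooc)
    row_sums = [sum(row) for row in cooc]
    col_sums = [sum(col) for col in zip(*cooc)]
    out = [[0.0] * len(row) for row in cooc]
    for i, row in enumerate(cooc):
        for j, count in enumerate(row):
            if count > 0:  # zero counts stay zero, which keeps the result sparse
                pmi = math.log(count * total / (row_sums[i] * col_sums[j]))
                out[i][j] = max(pmi - math.log(shift), 0.0)
    return out


# Two pairs of words that only co-occur with each other.
cooc = [
    [0, 5, 0, 0],
    [5, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 5, 0],
]
weak_shift = sppmi_from_counts(cooc, shift=2)
strong_shift = sppmi_from_counts(cooc, shift=4)
```

A larger shift removes weaker associations: in this toy matrix every non-zero pair has a PMI of log(4), so a shift of 2 leaves log(2) while a shift of 4 zeroes the matrix.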
- documents_words(dataset: DataFrame, target_column: str = 'abstracts', options: dict = {'max_df': 0.5, 'min_df': 5}, highlighting: list = [], weights: list = [], matrix_type: str = 'tfidf', verbose: bool = False, return_object: bool = True, output_mode: str = 'scipy', save_path: str = None) tuple [source]#
Creates document-words matrix.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_column (str, optional) – Target column name in dataset DataFrame. The default is “abstracts”. The target column should contain text data; tokens are obtained by splitting on whitespace.
options (dict, optional) – Settings for when doing vectorization. The default is {“min_df”: 5, “max_df”: 0.5}.
matrix_type (str, optional) – TF-IDF or Count vectorization. The default is “tfidf”. Other option is “count”.
verbose (bool, optional) – Verbosity flag. The default is False.
highlighting (list, optional) – The vocabulary or list of tokens to highlight. The default is [].
weights (list or float or int, optional) – Weights of the highlighted words. The default is [].
save_path (str, optional) – If not None, saves the outputs. The default is None.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘scipy’.
- Returns:
Tuple of matrix and vocabulary.
- Return type:
tuple
- get_ngrams(dataset: DataFrame, target_column: str = None, n: int = 1, limit: int = None, save_path: str = None) list [source]#
Generates n-grams from a column in a dataset
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_column (str, optional) – Target column name in dataset DataFrame. The default is None. The target column should contain text data; tokens are obtained by splitting on whitespace.
n (int, optional) – Number of tokens per n-gram. The default is 1.
limit (int, optional) – Maximum number of top n-grams to return. The default is None.
save_path (str, optional) – If not None, saves the outputs as csv using the column names ‘Ngram’, ‘Count’. The default save_path is None.
- Returns:
Top ngrams as a list of tuples containing the ngram then count.
- Return type:
list
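The return format, a list of (n-gram, count) tuples ordered by count, can be illustrated with the standard library alone. The helper `top_ngrams` below is hypothetical, and representing each n-gram as a space-joined string is an assumption about the output.

```python
from collections import Counter


def top_ngrams(documents, n=1, limit=None):
    """Hypothetical helper: count whitespace-token n-grams across documents."""
    counts = Counter()
    for text in documents:
        tokens = text.split()
        # Slide a window of n tokens over the document.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # most_common(None) returns every n-gram, sorted by decreasing count.
    return counts.most_common(limit)


documents = ["tensor factorization of sparse data", "sparse tensor factorization"]
print(top_ngrams(documents, n=2, limit=1))  # [('tensor factorization', 2)]
```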
- get_vocabulary(dataset: DataFrame, target_column: str = None, max_df: int | float = 1.0, min_df: int = 1, max_features: int = None, save_path: str = None, **kwargs) list [source]#
Builds the vocabulary
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_column (str, optional) – Target column name in dataset DataFrame.
max_df (int or float, optional) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. The default is 1.0.
min_df (int or float, optional) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float in range of [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None. The default is 1.
max_features (int, optional) – If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. The default is None.
save_path (str, optional) – If not None, saves the outputs. The default is None.
- Returns:
List of tokens in the vocabulary.
- Return type:
list
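The interaction of max_df, min_df, and max_features can be sketched as follows. This is a simplified stand-in that tokenizes on whitespace; the function `build_vocabulary` is hypothetical and is not the library's implementation.

```python
from collections import Counter


def build_vocabulary(documents, max_df=1.0, min_df=1, max_features=None):
    """Hypothetical helper: filter tokens by document frequency."""
    n_docs = len(documents)
    doc_freq = Counter()   # number of documents containing each token
    term_freq = Counter()  # total occurrences of each token
    for text in documents:
        tokens = text.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Floats are proportions of documents, integers are absolute counts.
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    kept = [t for t in doc_freq if min_count <= doc_freq[t] <= max_count]
    if max_features is not None:
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)


documents = ["the cat sat", "the dog sat", "the bird flew"]
print(build_vocabulary(documents, max_df=0.9, min_df=2))  # ['sat']
```

Here “the” appears in all three documents and exceeds the 90% ceiling, while every token other than “sat” appears in fewer than two documents.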
- property n_jobs#
- property n_nodes#
- participation_tensor(dataset: DataFrame, target_columns: tuple = ('author_ids', 'paper_id', 'year'), dimension_order: list = [0, 1, 2], split_authors_with: str = ';', save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'multiprocessing', verbose: bool = False, return_object: bool = True, output_mode: str = 'pydata') tuple [source]#
Creates a boolean Authors by Papers by Time tensor. A non-zero entry at author i, paper j, year k means that author i published paper j in year k.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“author_ids”, “paper_id”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes last.
dimension_order (list, optional) – Order in which the dimensions appear. For example, [0,1,2] means it is Authors, Papers, Time and [1,0,2] means it is Papers, Authors, Time.
split_authors_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.
save_path (str, optional) – If not None, saves the outputs. The default is None.
n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.
n_jobs (int, optional) – Number of jobs to use. Default is the Beaver default.
joblib_backend (str, optional) – Joblib parallel backend. Default is multiprocessing.
verbose (bool, optional) – Verbosity flag. The default is False.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.
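The effect of dimension_order can be seen in a minimal sketch that works on coordinate tuples instead of a tensor object. The helper `participation_coords` is hypothetical and uses labels rather than integer indices for readability.

```python
def participation_coords(records, dimension_order=(0, 1, 2), split_authors_with=";"):
    """Hypothetical helper: set of non-zero coordinates of the boolean tensor."""
    coords = set()
    for r in records:
        for author in r["author_ids"].split(split_authors_with):
            base = (author, r["paper_id"], r["year"])  # Authors, Papers, Time
            # Reorder the axes as requested by dimension_order.
            coords.add(tuple(base[d] for d in dimension_order))
    return coords


records = [{"author_ids": "a1;a2", "paper_id": "p1", "year": 2021}]
default_order = participation_coords(records)
papers_first = participation_coords(records, dimension_order=[1, 0, 2])
```

With [1, 0, 2] the same two non-zero entries are stored, only with the paper label in the first position.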
- something_words(dataset: DataFrame, target_columns: tuple = ('authorIDs', 'abstracts'), split_something_with: str = ';', options: dict = {'max_df': 0.5, 'min_df': 5}, highlighting: list = [], weights: list = [], verbose: bool = False, matrix_type: str = 'tfidf', return_object: bool = True, output_mode: str = 'scipy', save_path: str = None) tuple [source]#
Creates a Something by Words matrix, for example Authors-Words. Here, “something” is specified by the first entry of target_columns. Individual elements of target_columns[0] are separated by split_something_with, for example “author1;author2” when split_something_with=“;”.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “abstracts”). When assigning names in this tuple, type order should be preserved, e.g. text data column name comes second.
split_something_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.
options (dict, optional) – Settings for when doing vectorization. The default is {“min_df”: 5, “max_df”: 0.5}.
highlighting (list, optional) – The vocabulary or list of tokens to highlight. The default is [].
weights (list or float or int, optional) – Weights of the highlighted words. The default is [].
verbose (bool, optional) – Verbosity flag. The default is False.
matrix_type (str, optional) – TF-IDF or Count vectorization. The default is “tfidf”. Other option is “count”.
save_path (str, optional) – If not None, saves the outputs. The default is None.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘scipy’.
- Returns:
Tuple of matrix, vocabulary for somethings (target information specified in target_columns[0]), and the vocabulary for words.
- Return type:
tuple
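The three returned objects can be pictured with a small count-based sketch. It assumes that the text of each row is credited in full to every element listed in target_columns[0], which the description above does not spell out. The helper `something_word_counts` is hypothetical and ignores the TF-IDF, highlighting, and weights options.

```python
from collections import Counter, defaultdict


def something_word_counts(records, target_columns=("authorIDs", "abstracts"),
                          split_something_with=";"):
    """Hypothetical helper: dense Something x Words count matrix with both vocabularies."""
    something_col, text_col = target_columns
    counts = defaultdict(Counter)
    for r in records:
        tokens = r[text_col].split()
        # Assumption: every listed element receives the full token counts of the row.
        for something in r[something_col].split(split_something_with):
            counts[something].update(tokens)
    somethings = sorted(counts)
    words = sorted({w for c in counts.values() for w in c})
    matrix = [[counts[s][w] for w in words] for s in somethings]
    return matrix, somethings, words


records = [
    {"authorIDs": "author1;author2", "abstracts": "sparse tensor"},
    {"authorIDs": "author1", "abstracts": "tensor rank"},
]
matrix, somethings, words = something_word_counts(records)
```

Row i of the matrix belongs to somethings[i] and column j to words[j], matching the order of the returned tuple.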
- something_words_time(dataset: DataFrame, vocabulary: list, target_columns: tuple = ('authorIDs', 'abstracts', 'year'), split_something_with: str = ';', save_path: str = None, tfidf_transformer: bool = False, unfold_at=1, verbose: bool = False, dimension_order: list = [0, 1, 2], return_object: bool = True, output_mode: str = 'pydata') tuple [source]#
Creates a Something by Words by Time tensor, for example Authors-Words-Time. Here, “something” is specified by the first entry of target_columns. Individual elements of target_columns[0] are separated by split_something_with, for example “author1;author2” when split_something_with=“;”.
- Parameters:
dataset (pd.DataFrame) – Dataframe containing the target columns.
vocabulary (list) – Token vocabulary to use.
target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “abstracts”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes last.
split_something_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.
save_path (str, optional) – If not None, saves the outputs. The default is None.
tfidf_transformer (bool, optional) – If True, performs TF-IDF normalization via unfolding over dimension unfold_at. The default is False.
unfold_at (int, optional) – Which dimension to unfold the tensor for TF-IDF normalization, when tfidf_transformer=True. The default is 1.
verbose (bool, optional) – Verbosity flag. The default is False.
dimension_order (list, optional) – Order in which the dimensions appear. For example, [0,1,2] means it is Something, Words, Time. and [1,0,2] means it is Words, Something, Time.
return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.
output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.
- Returns:
Tuple of tensor, vocabulary for somethings (target information specified in target_columns[0]), the vocabulary for words, and the vocabulary for time.
- Return type:
tuple
TELF.pre_processing.Beaver.cooccurrence module#
Created on Mon Nov 22 18:38:26 2021
@author: maksimekineren
- TELF.pre_processing.Beaver.cooccurrence.co_occurrence(documents, vocabulary, window_size=20, verbose=True, sentences=False, n_jobs=-1, n_nodes=1, parallel_backend='multiprocessing')[source]#
Forms the co-occurrence matrix.
- Parameters:
documents (list) – List of documents. Each entry in the list contains text. If sentences=True, then documents is a list of lists, where each entry in documents is a list of sentences.
window_size (int, optional) – Number of consecutive words to perform counting. If sentence-based analysis is used, window size indicates the number of sentences.
vocabulary (list) – List of unique words present in all documents, used as the vocabulary.
verbose (bool, optional) – Print progress or not.
sentences (bool, optional) – If True, documents is a list of lists, where each entry in documents is a list of sentences from that document. In this case, window size is used as the number of sentences, and the matrix is built based on the sentences. When False, window size is the number of words, and documents is a list of documents.
- Returns:
M – Co-occurrence matrix.
- Return type:
np.ndarray or sparse CSR matrix
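Window-based counting can be sketched as below. The exact window convention used by co_occurrence is not documented here, so the sketch assumes that two tokens co-occur when they fall within the same span of window_size consecutive tokens. The function `window_cooccurrence` is hypothetical and returns a dense list of lists.

```python
def window_cooccurrence(documents, vocabulary, window_size=2):
    """Hypothetical helper: symmetric co-occurrence counts within a token window."""
    index = {w: i for i, w in enumerate(vocabulary)}
    M = [[0] * len(vocabulary) for _ in vocabulary]
    for text in documents:
        tokens = text.split()
        for i, left in enumerate(tokens):
            if left not in index:
                continue
            # Pair the token with the next window_size - 1 tokens.
            for right in tokens[i + 1:i + window_size]:
                if right not in index:
                    continue
                a, b = index[left], index[right]
                M[a][b] += 1
                if a != b:
                    M[b][a] += 1  # keep the matrix symmetric
    return M


M = window_cooccurrence(["a b c a"], ["a", "b", "c"], window_size=2)
print(M)  # [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```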
TELF.pre_processing.Beaver.sppmi module#
TELF.pre_processing.Beaver.tenmat module#
tenmat.py creates a matricized tensor.
References
[1] General software, latest release: Brett W. Bader, Tamara G. Kolda and others, Tensor Toolbox for MATLAB, Version 3.2.1, www.tensortoolbox.org, April 5, 2021.
[2] Dense tensors: B. W. Bader and T. G. Kolda, Algorithm 862: MATLAB Tensor Classes for Fast Algorithm Prototyping, ACM Trans. Mathematical Software, 32(4):635-653, 2006, http://dx.doi.org/10.1145/1186785.1186794.
[3] Sparse, Kruskal, and Tucker tensors: B. W. Bader and T. G. Kolda, Efficient MATLAB Computations with Sparse and Factored Tensors, SIAM J. Scientific Computing, 30(1):205-231, 2007, http://dx.doi.org/10.1137/060676489.
[4] Chi, E.C. and Kolda, T.G., 2012. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4), pp.1272-1299.
@author: Maksim Ekin Eren
- TELF.pre_processing.Beaver.tenmat.fold(X, axis, shape)[source]#
Creates a tensor from a matrix.
- Parameters:
X (ndarray or sparse array) – An unfolded array.
axis (int) – Dimension number to fold on.
shape – Target tensor shape.
- TELF.pre_processing.Beaver.tenmat.unfold(X, mode)[source]#
Creates a matricized tensor, i.e., unfolds a tensor along a mode.
- Parameters:
X (array) – N-dimensional array to be unfolded.
mode (int) – Dimension number to unfold on.
- Returns:
X – Matricized version of the sparse tensor as a dense matrix.
- Return type:
np.ndarray
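The index arithmetic behind unfolding and folding can be sketched on sparse coordinates. The sketch uses the column ordering of the Tensor Toolbox convention, in which the earlier remaining modes vary fastest; whether tenmat uses this same column order is an assumption. The helpers `unfold_coords` and `fold_coords` are hypothetical and operate on dictionaries, not arrays.

```python
def unfold_index(index, shape, mode):
    """Map a tensor index to (row, col) in the mode-`mode` unfolding."""
    col, stride = 0, 1
    for k, (i_k, size) in enumerate(zip(index, shape)):
        if k == mode:
            continue
        col += i_k * stride  # earlier remaining modes vary fastest
        stride *= size
    return index[mode], col


def fold_index(row, col, shape, mode):
    """Inverse of unfold_index: recover the tensor index from (row, col)."""
    index = [0] * len(shape)
    index[mode] = row
    for k, size in enumerate(shape):
        if k == mode:
            continue
        col, index[k] = divmod(col, size)
    return tuple(index)


def unfold_coords(tensor, shape, mode):
    """Hypothetical helper: {(i, j, k): v} -> {(row, col): v}."""
    return {unfold_index(idx, shape, mode): v for idx, v in tensor.items()}


def fold_coords(matrix, shape, mode):
    """Hypothetical helper: {(row, col): v} -> {(i, j, k): v}."""
    return {fold_index(r, c, shape, mode): v for (r, c), v in matrix.items()}


shape = (2, 3, 4)
tensor = {(1, 2, 3): 5.0, (0, 0, 1): 2.0}
matrix = unfold_coords(tensor, shape, mode=1)   # a 3 x 8 matrix in coordinate form
restored = fold_coords(matrix, shape, mode=1)
```

Folding the unfolded coordinates with the same shape and mode recovers the original tensor, which is the property the two functions above are meant to provide.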
TELF.pre_processing.Beaver.vectorize module#
- TELF.pre_processing.Beaver.vectorize.count(documents, options)[source]#
Count vectorizer.
- Parameters:
documents (list) – List of documents. Each entry in the list contains text.
options (dict) – Parameters for sklearn library.
- Returns:
X (list) – sparse bag of words.
vocabulary (list) – List of unique words present in all documents, used as the vocabulary.
- TELF.pre_processing.Beaver.vectorize.tfidf(documents, options)[source]#
TF-IDF
- Parameters:
documents (list) – List of documents. Each entry in the list contains text.
options (dict) – Parameters for sklearn library.
- Returns:
X (list) – sparse tf-idf matrix.
vocabulary (list) – List of unique words present in all documents, used as the vocabulary.
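As a point of reference for the weighting, the sketch below reproduces the default scikit-learn TF-IDF formula, tf * (ln((1 + n) / (1 + df)) + 1) followed by L2 row normalization. It assumes that the options are forwarded to scikit-learn's TfidfVectorizer, which the parameter description suggests but does not confirm. The function `tfidf_rows` is hypothetical and tokenizes on whitespace only.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Hypothetical helper: dense TF-IDF rows plus the sorted vocabulary."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid division by zero
        rows.append([v / norm for v in row])
    return rows, vocabulary


rows, vocabulary = tfidf_rows(["sparse tensor", "tensor rank"])
```

In the first row, “sparse” receives a larger weight than “tensor” because it appears in only one of the two documents.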