TELF.pre_processing.Beaver: Fast matrix and tensor building tool

Beaver is a matrix and tensor building tool.

Example

Several examples for Beaver can be found here. Here we will show example usage for creating a Document-Words matrix. First let’s load the example dataset (example dataset can be found here):

import pandas as pd

df = pd.read_csv("../../data/sample.csv")
df.info()

Next, let’s build the vocabulary:

from TELF.pre_processing import Beaver

beaver = Beaver()
settings = {
   "dataset":df,
   "target_column":"clean_abstract",
   "min_df":10,
   "max_df":0.5,
}

vocabulary = beaver.get_vocabulary(**settings)

Next we can build the matrix:

settings = {
   "dataset":df,
   "target_column":"clean_abstract",
   "options":{"min_df": 5, "max_df": 0.5, "vocabulary":vocabulary},
   "matrix_type":"tfidf",
   "highlighting":['aberration', 'ability', 'ablation', 'ablator', 'able'],
   "weights":2,
   "save_path":"./"
}

beaver.documents_words(**settings)

We can then load the matrix:

import scipy.sparse as ss

# Load the saved sparse matrix; load_npz restores the sparse format the file was saved in
X_csr_sparse = ss.load_npz("documents_words.npz")

Available Functions

Beaver.__init__([n_nodes, n_jobs])

Beaver.get_vocabulary(dataset[, ...])

Builds the vocabulary

Beaver.coauthor_tensor(dataset[, ...])

Create co-author tensor.

Beaver.cocitation_tensor(dataset[, ...])

Creates an Authors by Authors by Time tensor.

Beaver.participation_tensor(dataset[, ...])

Creates a boolean Authors by Papers by Time tensor.

Beaver.citation_tensor(dataset[, ...])

Creates an Authors by Papers by Time tensor.

Beaver.cooccurrence_matrix(dataset[, ...])

Generates co-occurrence and SPPMI matrices.

Beaver.documents_words(dataset[, ...])

Creates document-words matrix.

Beaver.something_words(dataset[, ...])

Creates a Something by Words matrix.

Beaver.something_words_time(dataset, vocabulary)

Creates a Something by Words by Time tensor.

Beaver.get_ngrams(dataset[, target_column, ...])

Generates n-grams from a column in a dataset

Module Contents

© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

class TELF.pre_processing.Beaver.beaver.Beaver(n_nodes=1, n_jobs=1)

Bases: object

SUPPORTED_OUTPUT_FORMATS = {'pydata', 'scipy'}
citation_tensor(dataset: DataFrame, target_columns: tuple = ('author_ids', 'paper_id', 'references', 'year'), dimension_order: list = [0, 1, 2], split_authors_with: str = ';', split_references_with: str = ';', save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'loky', verbose: bool = False, return_object: bool = True, output_mode: str = 'pydata') → tuple

Creates an Authors by Papers by Time tensor. A non-zero entry x at author i, paper j, year k means that author i cited paper j x times in year k.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“author_ids”, “paper_id”, “references”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes last.

  • dimension_order (list, optional) – Order in which the dimensions appear. For example, [0,1,2] means it is Authors, Papers, Time and [1,0,2] means it is Papers, Authors, Time.

  • split_authors_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.

  • split_references_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[2]. The default is “;”.

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.

  • n_jobs (int, optional) – Number of jobs to use. Default is the Beaver default.

  • joblib_backend (str, optional) – Joblib parallel backend. Default is loky.

  • verbose (bool, optional) – Verbosity flag. The default is False.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.
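
The entry semantics and the effect of dimension_order can be illustrated with a small pure-Python sketch. This is not Beaver's implementation: the helper citation_counts and the toy records are hypothetical, and the sketch assumes that the year of an entry is the year of the citing paper.

```python
from collections import Counter

# Toy records using the default column names from target_columns.
papers = [
    {"author_ids": "a1;a2", "paper_id": "p2", "references": "p1", "year": 2020},
    {"author_ids": "a1", "paper_id": "p3", "references": "p1;p2", "year": 2020},
    {"author_ids": "a2", "paper_id": "p4", "references": "p1", "year": 2021},
]


def citation_counts(rows, dimension_order=(0, 1, 2)):
    """Hypothetical helper: count (author, cited paper, year) triples."""
    counts = Counter()
    for row in rows:
        authors = row["author_ids"].split(";")
        references = row["references"].split(";")
        for author in authors:
            for cited in references:
                key = (author, cited, row["year"])
                # Reorder the coordinate according to dimension_order.
                counts[tuple(key[d] for d in dimension_order)] += 1
    return counts


counts = citation_counts(papers)
swapped = citation_counts(papers, dimension_order=(1, 0, 2))

print(counts[("a1", "p1", 2020)])   # a1 cited p1 from two papers in 2020
print(swapped[("p1", "a1", 2020)])  # same entry with Papers first
```

With dimension_order=(1, 0, 2) the same counts are stored under (paper, author, year) keys, which mirrors the Papers, Authors, Time ordering described above.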

coauthor_tensor(dataset: DataFrame, target_columns: tuple = ('authorIDs', 'year'), split_authors_with: str = ';', verbose: bool = False, save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'multiprocessing', authors_idx_map: dict = {}, time_idx_map: dict = {}, return_object: bool = True, output_mode: str = 'pydata') → tuple

Create co-author tensor. Returns tuple of tensor, authors, and time.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes second.

  • split_authors_with (str, optional) – What symbol to use to get list of individual authors from string. The default is “;”.

  • verbose (bool, optional) – Verbosity flag. The default is False.

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.

  • n_jobs (int, optional) – Number of jobs to use. Default is the Beaver default.

  • joblib_backend (str, optional) – Joblib parallel backend. Default is multiprocessing.

  • authors_idx_map (dict, optional) – Author to tensor dimension index mapping. Default is {}. If not passed, it is created.

  • time_idx_map (dict, optional) – Time to tensor dimension index mapping. Default is {}. If not passed, it is created.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.

Returns:

Tuple of tensor, author vocabulary, and time vocabulary.

Return type:

tuple
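
A rough sketch of how the index maps could be built when authors_idx_map and time_idx_map are not passed. This is a pure-Python illustration, not Beaver's code: the helper coauthor_entries is hypothetical, and it assumes that an entry counts the papers an author pair wrote together in a given year, which the documentation above does not state explicitly.

```python
from collections import Counter
from itertools import combinations

rows = [
    {"authorIDs": "a;b", "year": 2020},
    {"authorIDs": "a;b;c", "year": 2020},
    {"authorIDs": "c", "year": 2021},
]


def coauthor_entries(rows, authors_idx_map=None, time_idx_map=None):
    """Hypothetical helper: symmetric (author, author, time) pair counts."""
    # Copy the supplied maps so the caller's dictionaries are not modified.
    authors_idx_map = dict(authors_idx_map or {})
    time_idx_map = dict(time_idx_map or {})
    entries = Counter()
    for row in rows:
        authors = sorted(set(row["authorIDs"].split(";")))
        # Assign the next free index the first time a year or author is seen.
        k = time_idx_map.setdefault(row["year"], len(time_idx_map))
        ids = [authors_idx_map.setdefault(a, len(authors_idx_map)) for a in authors]
        for i, j in combinations(ids, 2):
            entries[(i, j, k)] += 1
            entries[(j, i, k)] += 1  # keep the author-author slice symmetric
    return entries, authors_idx_map, time_idx_map


entries, authors_idx_map, time_idx_map = coauthor_entries(rows)
print(authors_idx_map, time_idx_map)
```

Passing the returned maps into a second call keeps author and time indices consistent across datasets, which is the usual reason to supply them.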

cocitation_tensor(dataset: DataFrame, target_columns: tuple = ('authorIDs', 'year', 'paper_id', 'references'), split_authors_with: str = ';', split_references_with: str = ';', verbose: bool = False, save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'multiprocessing', authors_idx_map: dict = {}, time_idx_map: dict = {}, return_object: bool = True, output_mode: str = 'pydata') → tuple

Creates an Authors by Authors by Time tensor. A non-zero entry x at author i, author j, year k means that author j cited author i x times in year k. Note that x is normalized: if paper a, which has n authors, cites paper b, then each author of b receives 1/n citations from each author of a.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “year”, “paper_id”, “references”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes second.

  • split_authors_with (str, optional) – What symbol to use to get list of individual authors from string. The default is “;”.

  • split_references_with (str, optional) – What symbol to use to get list of individual references from string. The default is “;”.

  • verbose (bool, optional) – Verbosity flag. The default is False.

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.

  • n_jobs (int, optional) – Number of parallel jobs. The default is the Beaver default.

  • authors_idx_map (dict, optional) – Author to tensor dimension index mapping. Default is {}. If not passed, it is created.

  • time_idx_map (dict, optional) – Time to tensor dimension index mapping. Default is {}. If not passed, it is created.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.

Returns:

Tuple of tensor, author vocabulary, and time vocabulary.

Return type:

tuple
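
The 1/n normalization can be checked with a small worked example. The sketch below is pure Python and is not Beaver's implementation: the helper cocitation_entries is hypothetical, and it assumes the entry's year is the year of the citing paper and that references outside the dataset are ignored.

```python
from collections import defaultdict
from fractions import Fraction

# Paper "a" (authors x and y) cites paper "b" (author z).
papers = {
    "b": {"authorIDs": "z", "year": 2019, "references": ""},
    "a": {"authorIDs": "x;y", "year": 2020, "references": "b"},
}


def cocitation_entries(papers):
    """Hypothetical helper: (cited author, citing author, year) -> weight."""
    entries = defaultdict(Fraction)
    for paper in papers.values():
        citing_authors = paper["authorIDs"].split(";")
        # Each citing author contributes 1/n, where n is the author count.
        share = Fraction(1, len(citing_authors))
        for ref in filter(None, paper["references"].split(";")):
            if ref not in papers:
                continue  # assumption: unknown references are skipped
            for cited in papers[ref]["authorIDs"].split(";"):
                for citing in citing_authors:
                    entries[(cited, citing, paper["year"])] += share
    return dict(entries)


entries = cocitation_entries(papers)
print(entries)
```

Author z receives 1/2 from x and 1/2 from y, so the single citation from paper a to paper b adds up to exactly one citation for z.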

cooccurrence_matrix(dataset: DataFrame, target_column: str = 'abstracts', cooccurrence_settings: dict = {}, sppmi_settings: dict = {}, save_path: str = None, return_object: bool = True, output_mode: str = 'scipy') → tuple

Generates co-occurrence and SPPMI matrices.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_column (str, optional) – Target column name in dataset DataFrame. The default is “abstracts”. Target column should contain text data, where tokens are retrieved by splitting on whitespace.

  • cooccurrence_settings (dict, optional) – Settings for the co-occurrence matrix. The default is {}. Options are: vocabulary, window_size=20, dense=True, verbose=True, sentences=False

  • sppmi_settings (dict, optional) – Settings for the SPPMI matrix. The default is {}. Options are: shift=4

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘scipy’.

Returns:

Tuple of co-occurrence and SPPMI matrices.

Return type:

tuple
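
For orientation, the sketch below computes windowed co-occurrence counts and a shifted positive PMI (SPPMI) score in pure Python, using the common definition max(PMI - log(shift), 0). It is an illustration under stated assumptions, not Beaver's implementation: the helpers cooccurrence_counts and sppmi are hypothetical, and Beaver's windowing and normalization details may differ.

```python
import math
from collections import Counter


def cooccurrence_counts(documents, window_size=2):
    """Hypothetical helper: symmetric co-occurrence counts within a window."""
    counts = Counter()
    for text in documents:
        tokens = text.split()
        for i, word in enumerate(tokens):
            # Pair the word with the tokens that follow it inside the window.
            for context in tokens[i + 1:i + 1 + window_size]:
                counts[(word, context)] += 1
                counts[(context, word)] += 1
    return counts


def sppmi(counts, shift=4):
    """Hypothetical helper: keep only entries with PMI - log(shift) > 0."""
    total = sum(counts.values())
    row_sums = Counter()
    for (word, _), value in counts.items():
        row_sums[word] += value
    scores = {}
    for (word, context), value in counts.items():
        pmi = math.log(value * total / (row_sums[word] * row_sums[context]))
        shifted = pmi - math.log(shift)
        if shifted > 0:
            scores[(word, context)] = shifted
    return scores


counts = cooccurrence_counts(["a b", "a b", "c d"], window_size=1)
scores = sppmi(counts, shift=4)
print(scores)
```

In this toy corpus the pair (a, b) has a PMI of log 3, which falls below log 4 and is dropped, while (c, d) has a PMI of log 6 and survives with score log 1.5. A larger shift therefore yields a sparser matrix.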

documents_words(dataset: DataFrame, target_column: str = 'abstracts', options: dict = {'max_df': 0.5, 'min_df': 5}, highlighting: list = [], weights: list = [], matrix_type: str = 'tfidf', verbose: bool = False, return_object: bool = True, output_mode: str = 'scipy', save_path: str = None) → tuple

Creates document-words matrix.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_column (str, optional) – Target column name in dataset DataFrame. The default is “abstracts”. Target column should contain text data, where tokens are retrieved by splitting on whitespace.

  • options (dict, optional) – Settings for when doing vectorization. The default is {“min_df”: 5, “max_df”: 0.5}.

  • matrix_type (str, optional) – TF-IDF or Count vectorization. The default is “tfidf”. Other option is “count”.

  • verbose (bool, optional) – Verbosity flag. The default is False.

  • highlighting (list, optional) – The vocabulary or list of tokens to highlight. The default is [].

  • weights (list or float or int, optional) – Weights of the highlighted words. The default is [].

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘scipy’.

Returns:

Tuple of matrix and vocabulary.

Return type:

tuple
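
One plausible reading of highlighting and weights is that the columns of the highlighted tokens are scaled by their weights. The pure-Python sketch below illustrates that reading on a count matrix. It is an assumption rather than a description of Beaver's internals, and the helper count_matrix is hypothetical.

```python
from collections import Counter


def count_matrix(documents, vocabulary, highlighting=(), weights=1):
    """Hypothetical helper: dense count matrix with boosted columns."""
    # A single number applies to every highlighted token;
    # a list supplies one weight per highlighted token.
    if isinstance(weights, (int, float)):
        weights = [weights] * len(highlighting)
    boost = dict(zip(highlighting, weights))
    rows = []
    for text in documents:
        counts = Counter(text.split())
        rows.append([counts[tok] * boost.get(tok, 1) for tok in vocabulary])
    return rows


vocabulary = ["ablation", "laser", "tensor"]
docs = ["laser ablation of laser targets", "tensor methods"]
X = count_matrix(docs, vocabulary, highlighting=["ablation"], weights=2)
print(X)
```

Here the "ablation" column is doubled while the other columns keep their raw counts, and tokens outside the vocabulary ("of", "targets", "methods") do not appear in the matrix.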

get_ngrams(dataset: DataFrame, target_column: str = None, n: int = 1, limit: int = None, save_path: str = None) → list

Generates n-grams from a column in a dataset

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_column (str, optional) – Target column name in dataset DataFrame. The default is None. Target column should contain text data, where tokens are retrieved by splitting on whitespace.

  • n (int, optional) – Number of tokens in each n-gram. The default is 1.

  • limit (int, optional) – Restricts the number of top n-grams returned. The default is None.

  • save_path (str, optional) – If not None, saves the outputs as csv using the column names ‘Ngram’, ‘Count’. The default save_path is None.

Returns:

Top ngrams as a list of tuples containing the ngram then count.

Return type:

list
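
A minimal sketch of n-gram counting over whitespace tokens, returning (n-gram, count) pairs as in the return description above. The helper top_ngrams is hypothetical, and joining the tokens of an n-gram with a space is an assumption about the representation.

```python
from collections import Counter


def top_ngrams(documents, n=1, limit=None):
    """Hypothetical helper: most frequent n-grams as (ngram, count) tuples."""
    counts = Counter()
    for text in documents:
        tokens = text.split()
        # Slide a window of n tokens over the document.
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    # most_common(None) returns every n-gram, ordered by count.
    return counts.most_common(limit)


docs = [
    "sparse tensor factorization",
    "sparse tensor decomposition",
    "tensor factorization",
]
bigrams = top_ngrams(docs, n=2, limit=2)
print(bigrams)
```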

get_vocabulary(dataset: DataFrame, target_column: str = None, max_df: int | float = 1.0, min_df: int = 1, max_features: int = None, save_path: str = None, **kwargs) → list

Builds the vocabulary

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_column (str, optional) – Target column name in dataset DataFrame.

  • max_df (int or float, optional) – When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). A float in the range [0.0, 1.0] represents a proportion of documents; an integer represents an absolute count. The default is 1.0.

  • min_df (int or float, optional) – When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. A float in the range [0.0, 1.0] represents a proportion of documents; an integer represents an absolute count. This parameter is ignored if vocabulary is not None. The default is 1.

  • max_features (int, optional) – If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. The default is None.

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

Returns:

List of tokens in the vocabulary.

Return type:

list
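
The min_df and max_df thresholds can be illustrated with a small pure-Python sketch of document-frequency filtering. This is a simplified illustration, not Beaver's implementation: the helper build_vocabulary is hypothetical and ignores max_features.

```python
from collections import Counter


def build_vocabulary(documents, min_df=1, max_df=1.0):
    """Hypothetical helper: keep tokens with min_df <= doc. frequency <= max_df."""
    n_docs = len(documents)
    # Document frequency: number of documents containing the token at least once.
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.split()))

    def as_count(threshold):
        # A float is read as a proportion of documents, an int as an absolute count.
        return threshold * n_docs if isinstance(threshold, float) else threshold

    low, high = as_count(min_df), as_count(max_df)
    return sorted(tok for tok, df in doc_freq.items() if low <= df <= high)


docs = [
    "tensor factorization of sparse data",
    "sparse tensor decomposition",
    "matrix factorization of text data",
    "text mining of abstracts",
]
vocab = build_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)
```

With four documents, max_df=0.5 corresponds to a document frequency of 2, so "of" (present in three documents) is dropped as a corpus-specific stop word, and tokens that appear in only one document fall below min_df=2.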

property n_jobs
property n_nodes
participation_tensor(dataset: DataFrame, target_columns: tuple = ('author_ids', 'paper_id', 'year'), dimension_order: list = [0, 1, 2], split_authors_with: str = ';', save_path: str = None, n_nodes: int = None, n_jobs: int = None, joblib_backend: str = 'multiprocessing', verbose: bool = False, return_object: bool = True, output_mode: str = 'pydata') → tuple

Creates a boolean Authors by Papers by Time tensor. A non-zero entry at author i, paper j, year k means that author i published paper j in year k.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“author_ids”, “paper_id”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes last.

  • dimension_order (list, optional) – Order in which the dimensions appear. For example, [0,1,2] means it is Authors, Papers, Time and [1,0,2] means it is Papers, Authors, Time.

  • split_authors_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • n_nodes (int, optional) – Number of nodes to use. Default is the Beaver default.

  • n_jobs (int, optional) – Number of jobs to use. Default is the Beaver default.

  • joblib_backend (str, optional) – Joblib parallel backend. Default is multiprocessing.

  • verbose (bool, optional) – Verbosity flag. The default is False.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.
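
Because the tensor is boolean, it can be thought of as a set of (author, paper, year) coordinates. The sketch below shows that view in pure Python with hypothetical toy records. It illustrates the entry semantics only and is not Beaver's implementation.

```python
# Toy records using the default column names from target_columns.
papers = [
    {"author_ids": "a1;a2", "paper_id": "p1", "year": 2020},
    {"author_ids": "a1", "paper_id": "p2", "year": 2021},
]

# Each coordinate present in the set corresponds to a True entry.
nonzero = {
    (author, row["paper_id"], row["year"])
    for row in papers
    for author in row["author_ids"].split(";")
}

print(("a2", "p1", 2020) in nonzero)  # a2 is an author of p1
print(("a2", "p2", 2021) in nonzero)  # a2 did not write p2
```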

something_words(dataset: DataFrame, target_columns: tuple = ('authorIDs', 'abstracts'), split_something_with: str = ';', options: dict = {'max_df': 0.5, 'min_df': 5}, highlighting: list = [], weights: list = [], verbose: bool = False, matrix_type: str = 'tfidf', return_object: bool = True, output_mode: str = 'scipy', save_path: str = None) → tuple

Creates a Something by Words matrix. For example, Authors-Words. Here something is specified by the first index of the variable target_columns. Individual elements of target_columns[0] are separated by split_something_with. For example, “author1;author2” when split_something_with=“;”.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “abstracts”). When assigning names in this tuple, type order should be preserved, e.g. text data column name comes second.

  • split_something_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.

  • options (dict, optional) – Settings for when doing vectorization. The default is {“min_df”: 5, “max_df”: 0.5}.

  • highlighting (list, optional) – The vocabulary or list of tokens to highlight. The default is [].

  • weights (list or float or int, optional) – Weights of the highlighted words. The default is [].

  • verbose (bool, optional) – Verbosity flag. The default is False.

  • matrix_type (str, optional) – TF-IDF or Count vectorization. The default is “tfidf”. Other option is “count”.

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘scipy’.

Returns:

Tuple of matrix, vocabulary for somethings (target information specified in target_columns[0]), and the vocabulary for words.

Return type:

tuple
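
A minimal sketch of the Something by Words idea with authors as the "something". It assumes that the tokens of a document are attributed to every element obtained by splitting target_columns[0], and it uses raw counts instead of TF-IDF. The helper something_word_counts is hypothetical and is not Beaver's implementation.

```python
from collections import Counter, defaultdict

rows = [
    {"authorIDs": "author1;author2", "abstracts": "sparse tensor"},
    {"authorIDs": "author1", "abstracts": "tensor rank"},
]


def something_word_counts(rows, target_columns=("authorIDs", "abstracts"),
                          split_something_with=";"):
    """Hypothetical helper: per-something token counts."""
    something_col, text_col = target_columns
    counts = defaultdict(Counter)
    for row in rows:
        tokens = row[text_col].split()
        # Credit the document's tokens to each listed something.
        for something in row[something_col].split(split_something_with):
            counts[something].update(tokens)
    return counts


counts = something_word_counts(rows)
somethings = sorted(counts)                           # row labels
words = sorted({w for c in counts.values() for w in c})  # column labels
matrix = [[counts[s][w] for w in words] for s in somethings]
print(somethings, words, matrix)
```

The two sorted lists play the role of the vocabulary for somethings and the vocabulary for words that accompany the matrix in the returned tuple.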

something_words_time(dataset: DataFrame, vocabulary: list, target_columns: tuple = ('authorIDs', 'abstracts', 'year'), split_something_with: str = ';', save_path: str = None, tfidf_transformer: bool = False, unfold_at=1, verbose: bool = False, dimension_order: list = [0, 1, 2], return_object: bool = True, output_mode: str = 'pydata') → tuple

Creates a Something by Words by Time tensor. For example, Authors-Words-Time. Here something is specified by the first index of the variable target_columns. Individual elements of target_columns[0] are separated by split_something_with. For example, “author1;author2” when split_something_with=“;”.

Parameters:
  • dataset (pd.DataFrame) – Dataframe containing the target columns.

  • vocabulary (list) – Token vocabulary to use.

  • target_columns (tuple, optional) – Target column names in dataset DataFrame. The default is (“authorIDs”, “abstracts”, “year”). When assigning names in this tuple, type order should be preserved, e.g. time column name comes last.

  • split_something_with (str, optional) – What symbol to use to get list of individual elements from string of target_columns[0]. The default is “;”.

  • save_path (str, optional) – If not None, saves the outputs. The default is None.

  • tfidf_transformer (bool, optional) – If True, performs TF-IDF normalization via unfolding over dimension unfold_at. The default is False.

  • unfold_at (int, optional) – Which dimension to unfold the tensor for TF-IDF normalization, when tfidf_transformer=True. The default is 1.

  • verbose (bool, optional) – Verbosity flag. The default is False.

  • dimension_order (list, optional) – Order in which the dimensions appear. For example, [0,1,2] means it is Something, Words, Time and [1,0,2] means it is Words, Something, Time.

  • return_object (bool, optional) – Flag that determines whether the generated object is returned from this function. In the case of large tensors it may be better to save to disk without returning. Default is True.

  • output_mode (str, optional) – The type of object returned in the output. See supported options in Beaver.SUPPORTED_OUTPUT_FORMATS. Default is ‘pydata’.

Returns:

Tuple of tensor, vocabulary for somethings (target information specified in target_columns[0]), the vocabulary for words, and the vocabulary for time.

Return type:

tuple
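
The sketch below shows how a fixed token vocabulary could restrict the word dimension while the something and time vocabularies are derived from the data. It uses raw counts in the default Something, Words, Time order. The helper something_words_time_entries is hypothetical, and skipping out-of-vocabulary tokens is an assumption rather than documented behavior.

```python
from collections import Counter

rows = [
    {"authorIDs": "author1;author2", "abstracts": "sparse tensor rank", "year": 2020},
    {"authorIDs": "author1", "abstracts": "tensor tensor", "year": 2021},
]


def something_words_time_entries(rows, vocabulary):
    """Hypothetical helper: sparse (something, word, time) index -> count."""
    word_idx = {tok: i for i, tok in enumerate(vocabulary)}
    somethings = sorted({s for row in rows for s in row["authorIDs"].split(";")})
    times = sorted({row["year"] for row in rows})
    something_idx = {s: i for i, s in enumerate(somethings)}
    time_idx = {t: i for i, t in enumerate(times)}
    entries = Counter()
    for row in rows:
        for something in row["authorIDs"].split(";"):
            for tok in row["abstracts"].split():
                if tok in word_idx:  # tokens outside the vocabulary are skipped
                    key = (something_idx[something], word_idx[tok], time_idx[row["year"]])
                    entries[key] += 1
    return entries, somethings, times


entries, somethings, times = something_words_time_entries(rows, ["sparse", "tensor"])
print(entries[(0, 1, 1)])  # author1, "tensor", 2021
```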