TELF.applications.Cheetah: Advanced search by keywords and phrases#

Cheetah is a tool for performing custom fast searches for keywords and phrases in text.

Available Functions#

Cheetah.__init__(verbose)

Init an empty Cheetah object

Cheetah.index(data[, columns, index_file, ...])

Creates indices for selected columns in data for Cheetah search.

Cheetah.search([query, and_search, ...])

Search a dataset indexed by this Cheetah object.

Module Contents#

Created on Tue Feb 15 17:05:54 2022

@author: maksimekineren

class TELF.applications.Cheetah.cheetah.Cheetah(verbose: bool)[source]#

Bases: object

Init an empty Cheetah object

Parameters:

verbose (bool, optional) – Verbosity flag. The default is False.

Return type:

None

COLUMNS = {'abstract': 'abstract', 'affiliations': 'affiliations', 'author_ids': 'author_ids', 'title': 'title', 'year': 'year'}#
property columns: dict#

Retrieve the columns.

Parameters:

None

Returns:

A dictionary containing column names as keys and column values as values.

Return type:

dict

classmethod find_ngram(text: str, query: list, window_size: int = 5, ordered: bool = True) bool[source]#

Determine if the tokens in the list query are contained within the string text using a sliding window algorithm with a specified window_size. If ordered is True, the tokens must appear in text in the same order as in query for a positive match. Returns True if such a match is found.

Parameters:
  • text (str) – A string of multiple tokens that are separated by whitespace

  • query (list) – A list of tokens that should be checked for in text. Duplicate values in query are allowed and order will be maintained if ordered=True.

  • window_size (int, optional) –

    Set the size of the sliding window. NOTE: if window_size < len(query), no matches can ever be found as the query cannot fit in the window. Default=5.

  • ordered (bool, optional) – If True, preserve the order of tokens in query when searching for match. Default=True.

Returns:

True if ngram query was found in text. False otherwise.

Return type:

bool
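The sliding-window check described above can be sketched in plain Python. This is a hypothetical re-implementation for illustration only, not the TELF source: the name find_ngram_sketch is invented, and it assumes that ordered=True means the query tokens appear as an in-order (not necessarily contiguous) subsequence of some window.

```python
from collections import Counter


def find_ngram_sketch(text, query, window_size=5, ordered=True):
    """Illustrative sliding-window match; hypothetical, not TELF's code."""
    tokens = text.split()
    if window_size < len(query):
        # The query cannot fit in the window, so no match is possible.
        return False
    # Slide a window of up to window_size tokens across the text.
    last_start = max(len(tokens) - window_size, 0)
    for start in range(last_start + 1):
        window = tokens[start:start + window_size]
        if ordered:
            # Subsequence test: each query token must be found after
            # the previous one, because the iterator is consumed.
            remaining = iter(window)
            if all(token in remaining for token in query):
                return True
        else:
            # Multiset test: every query token, counting duplicates,
            # must be present somewhere in the window.
            if not (Counter(query) - Counter(window)):
                return True
    return False


text = "aa bb bb cc cc dd"
print(find_ngram_sketch(text, ["aa", "cc"], window_size=4))  # True
print(find_ngram_sketch(text, ["aa", "cc"], window_size=3))  # False
print(find_ngram_sketch(text, ["cc", "aa"], ordered=False))  # True
```

With a window of 4 the tokens "aa" and "cc" fall inside the first window, while a window of 3 never covers both.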

index(data: DataFrame, columns: dict = None, index_file: str = None, reindex: bool = False, verbose: bool = True) None[source]#

Creates indices for selected columns in data for Cheetah search. author_ids and affiliations are expected to use their respective SLIC data structures. See an example notebook for a sample of these data structures. Text data such as ‘title’ and ‘abstract’ should be pre-processed using Vulture simple clean. The text in these columns is expected to be lowercase with special characters removed, and tokens are expected to be delimited by a single whitespace.

Parameters:
  • data (pd.DataFrame) – Pandas DataFrame of papers

  • columns (dict, optional) – Dictionary where the keys are categories that can be mapped by Cheetah and the values are the corresponding column names for these categories in data. See Cheetah.COLUMNS for an example of the structure and all currently supported keys. If columns is None, Cheetah will default to the Cheetah.COLUMNS values.

  • index_file (str, optional) – Path to a previously generated Cheetah index file. If no path is passed, Cheetah will generate indices for one-time use. If index_file is passed but the path does not exist, Cheetah will generate indices and save them for future use at the index_file path. If a path is passed and reindex=True, new indices will be generated and saved at index_file, overwriting the current contents of index_file if it exists.

  • reindex (bool, optional) – If True, overwrite the index_file if it exists. Default=False.

  • verbose (bool, optional) – Verbosity flag. The default is True.

Return type:

None
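The interaction between index_file and reindex described above can be summarized as a small decision function. This is a minimal sketch under stated assumptions: resolve_index_action and its return labels are hypothetical names, not part of TELF, and the "load" branch assumes that an existing file is reused when reindex=False.

```python
import os


def resolve_index_action(index_file=None, reindex=False, exists=os.path.exists):
    """Return which indexing path the documented rules imply.

    Hypothetical helper for illustration; it is not part of TELF.
    The exists argument is injectable so the logic can be tried
    without touching the file system.
    """
    if index_file is None:
        # No path given: build indices in memory for one-time use.
        return "build"
    if reindex or not exists(index_file):
        # Missing file, or an explicit reindex: build and write the file,
        # overwriting any previous contents.
        return "build_and_save"
    # Existing file and reindex=False: assumed to reuse the saved indices.
    return "load"


always = lambda path: True   # pretend the file exists
never = lambda path: False   # pretend the file is missing

print(resolve_index_action())                                        # build
print(resolve_index_action("cheetah.idx", exists=never))             # build_and_save
print(resolve_index_action("cheetah.idx", exists=always))            # load
print(resolve_index_action("cheetah.idx", True, exists=always))      # build_and_save
```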

property ngram_ordered: bool#

Get the status of ngram_ordered.

Parameters:

None

Returns:

Status of ngram ordering

Return type:

bool

property ngram_window_size: int#

Get the numeric size of the ngram window.

Parameters:

None

Returns:

ngram window size

Return type:

int

property query: list | str | None#

Get the last query of the object.

Parameters:

None

Returns:

The last query of the object, which can be either a list or a string.

Return type:

Union[list, str, None]

search(query: list = None, and_search: bool = True, in_title: bool = True, in_abstract: bool = True, save_path: bool = None, author_filter: list = [], affiliation_filter: list = [], country_filter: list = [], year_filter: list = [], ngram_window_size: int = 5, ngram_ordered: bool = True, do_results_table=False, link_search=False) DataFrame[source]#

Search a dataset indexed by this Cheetah object. Text can be searched using query and properties of the data can be filtered using year_filter, country_filter, author_filter, affiliation_filter. If both query and filter(s) are used, the results of the search are intersected. Note that trying to use a filter that was never indexed by Cheetah will result in an error.

Parameters:
  • query (str, list, dict, NoneType) –

    A string or a list of strings to look up. n-grams for n>1 should be split with whitespace. Note that query will be pre-processed by converting all characters to lowercase and stripping all extra whitespace.

    >>> query = 'laser'                        # a single word to lookup
    >>> query = {'laser': 'node'}              # a single word with negative query
    >>> query = ['laser', 'angle deflection']  # a word and bigram to lookup
    >>> query = [{'laser': ['blue', 'green']},  # a word and bigram to lookup with multiple negative
    ...          'angle deflection']            #  search terms for the unigram
    >>> query = None                           # no query to lookup (using filters only)
    

  • and_search (bool, optional) – This option applies when multiple queries are being looked up simultaneously. If True, the intersection of documents that match all queries is returned. Otherwise, the union is returned. Default=True.

  • in_title (bool, optional) –

    If True, searches for queries in the indexed title text. Default=True. NOTE: If in_title and in_abstract are both True, the union between these two query searches is returned.

  • in_abstract (bool, optional) –

    If True, searches for queries in the indexed abstract text. Default=True. NOTE: If in_title and in_abstract are both True, the union between these two query searches is returned.

  • save_path (str, optional) – The path at which to save the resulting subset DataFrame. If the path is not defined, the result of the search is returned. The default is None.

  • author_filter (list, optional) – List of author ids that papers should be affiliated with. The default is [].

  • affiliation_filter (list, optional) – List of affiliation ids that papers should be affiliated with. The default is [].

  • country_filter (list, optional) – List of countries that papers should be affiliated with. The default is [].

  • year_filter (list, optional) – List of years that papers should be published in. The default is [].

  • ngram_window_size (int, optional) – The size of the window used in Cheetah.find_ngram(). This function is called if one or more entries in query are n-grams for n>1. ngram_window_size determines how many tokens can be examined at a time. For example for the text [‘aa bb bb cc cc dd’], the query ‘aa cc’ will be found if the window size is >= 4. Default=5. This value should be greater than the length of the n-gram.

  • ngram_ordered (bool) – The order used in Cheetah.find_ngram(). This function is called if one or more entries in query are n-grams for n>1. ngram_ordered determines if the order of tokens in ngram should be preserved while looking for a match. Default=True.

  • do_results_table (bool, optional) – Flag that determines if a results table should be generated for the search. If True, this table will provide explainability for why certain documents were selected by Cheetah. If False, None is returned as the second argument. Default=False

  • link_search (bool, optional) – A flag that controls if the queries should be linked in the positive/negative inclusion step. For example, take a document that contains the queried text “A” and “B”, where a positive or negative inclusion term partnered with “B” would override the selection. If this flag is set to True, the inclusion step is ignored because another query, “A”, has already selected the document as on-topic (hence linking the search). Default=False

Returns:

  • return_data (None, pd.DataFrame) – If save_path is not defined, return the search result (pd.DataFrame object). However, if save_path is defined, return None and save result at save_path as a CSV file.

  • results_table (None, pd.DataFrame) – If do_results_table is True then this argument will provide explainability for Cheetah filtering. Otherwise this argument is None
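The query pre-processing, the and_search behavior, and the intersection with filters described above can be illustrated with plain Python sets. This is a rough sketch, not Cheetah's implementation: normalize_query, combine_matches and apply_filters are hypothetical helpers, and the integer document ids are made up for the example.

```python
def normalize_query(query):
    """Lowercase a query and collapse extra whitespace (illustrative)."""
    return " ".join(query.lower().split())


def combine_matches(match_sets, and_search=True):
    """Combine per-query document sets: intersection if and_search, else union."""
    match_sets = list(match_sets)
    if not match_sets:
        return set()
    if and_search:
        return set.intersection(*match_sets)
    return set.union(*match_sets)


def apply_filters(query_hits, filter_hits):
    """Intersect the query result with every filter result."""
    result = set(query_hits)
    for hits in filter_hits:
        result &= hits
    return result


# Made-up document ids matched by each query.
hits = {
    normalize_query("  Laser "): {0, 1, 2},
    normalize_query("Angle   Deflection"): {0, 3},
}
print(sorted(hits))                                              # ['angle deflection', 'laser']
print(combine_matches(hits.values(), and_search=True))           # {0}
print(sorted(combine_matches(hits.values(), and_search=False)))  # [0, 1, 2, 3]

# A hypothetical year filter that matched documents 1, 3 and 5.
either = combine_matches(hits.values(), and_search=False)
print(sorted(apply_filters(either, [{1, 3, 5}])))                # [1, 3]
```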

TELF.applications.Cheetah.cheetah.add_with_union_of_others(d, s, key)[source]#

Compute the addition of a set associated with a given key and the union of all other sets in a dictionary.

Parameters:#

d: dict

A dictionary where keys are strings and values are sets.

s: set

The set (associated with key) to be added to

key: str

The key in the dictionary whose set is to be modified with the union of all other sets.

Returns:#

set:

The intersection set of the specified key’s set and the union of all other sets in the dictionary.

Raises:#

ValueError

If the specified key is not found in the dictionary.