TELF.applications.Cheetah: Advanced search by keywords and phrases
Cheetah is a tool for performing custom fast searches for keywords and phrases in text.
Available Functions
Cheetah(verbose) | Init an empty Cheetah object
Cheetah.index(data, ...) | Creates indices for selected columns in data for Cheetah search.
Cheetah.search(...) | Search a dataset indexed by this Cheetah object.
Module Contents
Created on Tue Feb 15 17:05:54 2022
@author: maksimekineren
- class TELF.applications.Cheetah.cheetah.Cheetah(verbose: bool)
Bases: object
Init an empty Cheetah object
- Parameters:
verbose (bool, optional) – Verbosity flag. The default is False.
- Return type:
None
- COLUMNS = {'abstract': 'abstract', 'affiliations': 'affiliations', 'author_ids': 'author_ids', 'title': 'title', 'year': 'year'}
- property columns: dict
Retrieve the columns.
- Parameters:
None –
- Returns:
A dictionary containing column names as keys and column values as values.
- Return type:
dict
- classmethod find_ngram(text: str, query: list, window_size: int = 5, ordered: bool = True) bool
Determine whether the tokens in the list query are contained within the string text, using a sliding window algorithm with the specified window_size. If ordered is True, the tokens must appear in text in the same order as in query for a positive match. Returns True if such a match is found.
- Parameters:
text (str) – A string of multiple tokens that are separated by whitespace
query (list) – A list of tokens that should be checked for in text. Duplicate values in query are allowed and order will be maintained if ordered=True.
window_size (int, optional) – Set the size of the sliding window. NOTE: if window_size < len(query), no matches can ever be found as the query cannot fit in the window. Default=5.
ordered (bool, optional) – If True, preserve the order of tokens in query when searching for match. Default=True.
- Returns:
True if ngram query was found in text. False otherwise.
- Return type:
bool
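The sliding-window matching described for find_ngram can be sketched in plain Python. This is an illustrative re-implementation written from the description above, not Cheetah's actual code; the function name find_ngram_sketch is hypothetical. It assumes that an ordered match means query is a subsequence of the tokens in some window, and that an unordered match means every query token (counting duplicates) occurs in the window.

```python
from collections import Counter


def find_ngram_sketch(text, query, window_size=5, ordered=True):
    """Hypothetical sketch of sliding-window n-gram matching (not Cheetah's code)."""
    tokens = text.split()
    if window_size < len(query):
        return False  # the query cannot fit inside the window
    # Slide a window of at most `window_size` tokens across the text.
    last_start = max(len(tokens) - window_size, 0)
    for start in range(last_start + 1):
        window = tokens[start:start + window_size]
        if ordered:
            # Subsequence test: the shared iterator is consumed left to right,
            # so each query token must be found after the previous one.
            remaining = iter(window)
            if all(token in remaining for token in query):
                return True
        else:
            # Multiset test: every query token must occur often enough.
            if not Counter(query) - Counter(window):
                return True
    return False


found_wide = find_ngram_sketch("aa bb bb cc cc dd", ["aa", "cc"], window_size=4)
found_narrow = find_ngram_sketch("aa bb bb cc cc dd", ["aa", "cc"], window_size=3)
```

With the text 'aa bb bb cc cc dd', the query ['aa', 'cc'] matches for window_size=4 but not for window_size=3, because 'aa' and the first 'cc' are four tokens apart inclusive.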
- index(data: DataFrame, columns: dict = None, index_file: str = None, reindex: bool = False, verbose: bool = True) None
Creates indices for selected columns in data for Cheetah search. author_ids and affiliations are expected to use their respective SLIC data structures. See an example notebook for a sample of these data structures. Text data such as ‘title’ and ‘abstract’ should be pre-processed using Vulture simple clean. The text in these columns is expected to be lowercase with special characters removed. Tokens are delimited with a single whitespace.
- Parameters:
data (pd.DataFrame) – Pandas DataFrame of papers
columns (dict, optional) – Dictionary where the keys are categories that can be mapped by Cheetah and the values are the corresponding column names for these categories in data. See Cheetah.COLUMNS for an example of the structure and all currently supported keys. If columns is None, Cheetah will default to the Cheetah.COLUMNS values.
index_file (str, optional) – Path to a previously generated Cheetah index file. If no path is passed, Cheetah will generate indices for one-time use. If index_file is passed but the path does not exist, Cheetah will generate indices and save them for future use at the index_file path. If a path is passed and reindex=True, new indices will be generated and saved at index_file, overwriting the current contents of index_file if it exists.
reindex (bool, optional) – If True, overwrite the index_file if it exists. The default is False.
verbose (bool, optional) – Verbosity flag. The default is True.
- Return type:
None
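To make the idea of indexing a text column concrete, here is a minimal sketch of a token-level inverted index over whitespace-delimited, pre-cleaned text. It is an assumption about how such an index could be organized, not a description of Cheetah's internal index format; build_token_index and the sample papers are hypothetical, and plain dictionaries stand in for the DataFrame so the example has no dependencies.

```python
from collections import defaultdict


def build_token_index(rows, column):
    """Map each whitespace-delimited token in `column` to the row positions containing it.

    Hypothetical helper for illustration only (not part of TELF).
    """
    index = defaultdict(set)
    for position, row in enumerate(rows):
        # Text is assumed to be lowercase with single-space token delimiters.
        for token in row[column].split():
            index[token].add(position)
    return dict(index)


# Made-up sample records standing in for rows of a papers DataFrame.
papers = [
    {"title": "laser angle deflection", "year": 2019},
    {"title": "blue laser diode", "year": 2021},
]
title_index = build_token_index(papers, "title")
```

A lookup such as title_index["laser"] then returns the set of row positions whose title contains that token, which is why the input text needs to be lowercased and cleaned before indexing.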
- property ngram_ordered: bool
Get the status of ngram_ordered.
- Parameters:
None –
- Returns:
Status of ngram ordering
- Return type:
bool
- property ngram_window_size: int
Get the numeric size of the ngram window.
- Parameters:
None –
- Returns:
ngram window size
- Return type:
int
- property query: list | str | None
Get the last query of the object.
- Parameters:
None –
- Returns:
The last query of the object, which can be either a list or a string.
- Return type:
Union[list, str, None]
- search(query: list = None, and_search: bool = True, in_title: bool = True, in_abstract: bool = True, save_path: bool = None, author_filter: list = [], affiliation_filter: list = [], country_filter: list = [], year_filter: list = [], ngram_window_size: int = 5, ngram_ordered: bool = True, do_results_table=False, link_search=False) DataFrame
Search a dataset indexed by this Cheetah object. Text can be searched using query and properties of the data can be filtered using year_filter, country_filter, author_filter, affiliation_filter. If both query and filter(s) are used, the results of the search are intersected. Note that trying to use a filter that was never indexed by Cheetah will result in an error.
- Parameters:
query (str, list, dict, NoneType) –
A string or a list of strings to look up. n-grams for n>1 should be split with whitespace. Note that query will be pre-processed by converting all characters to lowercase and stripping all extra whitespace.
>>> query = 'laser' # a single word to look up
>>> query = {'laser': 'node'} # a single word with a negative query
>>> query = ['laser', 'angle deflection'] # a word and a bigram to look up
>>> query = [{'laser': ['blue', 'green']}, 'angle deflection'] # a word and a bigram to look up, with multiple negative search terms for the unigram
>>> query = None # no query to look up (using filters only)
and_search (bool, optional) – This option applies when multiple queries are being looked up simultaneously. If True, the intersection of documents that match all queries is returned. Otherwise, the union is returned. Default=True.
in_title (bool, optional) – If True, searches for queries in the indexed title text. Default=True. NOTE: If in_title and in_abstract are both True, the union between these two query searches is returned.
in_abstract (bool, optional) – If True, searches for queries in the indexed abstract text. Default=True. NOTE: If in_title and in_abstract are both True, the union between these two query searches is returned.
save_path (str, optional) – The path at which to save the resulting subset DataFrame. If the path is not defined, the result of the search is returned. The default is None.
author_filter (list, optional) – List of author ids that papers should be affiliated with. The default is [].
affiliation_filter (list, optional) – List of affiliation ids that papers should be affiliated with. The default is [].
country_filter (list, optional) – List of countries that papers should be affiliated with. The default is [].
year_filter (list, optional) – List of years that papers should be published in. The default is [].
ngram_window_size (int, optional) – The size of the window used in Cheetah.find_ngram(). This function is called if one or more entries in query are n-grams for n>1. ngram_window_size determines how many tokens can be examined at a time. For example, for the text ‘aa bb bb cc cc dd’, the query ‘aa cc’ will be found if the window size is >= 4. Default=5. This value should be at least the length of the n-gram, since a query longer than the window can never match.
ngram_ordered (bool) – The order used in Cheetah.find_ngram(). This function is called if one or more entries in query are n-grams for n>1. ngram_ordered determines if the order of tokens in ngram should be preserved while looking for a match. Default=True.
do_results_table (bool, optional) – Flag that determines if a results table should be generated for the search. If True, this table will provide explainability for why certain documents were selected by Cheetah. If False, None is returned as the second return value. Default=False
link_search (bool, optional) – A flag that controls whether the queries are linked in the positive/negative inclusion step. For example, take a document that contains the queried text “A” and “B”, where the positive or negative inclusion terms partnered with “B” would otherwise override the selection of the document. If this flag is set to True, the inclusion step is ignored for that document because another query, “A”, has already selected it as being on-topic (hence linking the search). Default=False
- Returns:
return_data (None, pd.DataFrame) – If save_path is not defined, return the search result (pd.DataFrame object). However, if save_path is defined, return None and save result at save_path as a CSV file.
results_table (None, pd.DataFrame) – If do_results_table is True, this return value provides explainability for Cheetah filtering. Otherwise, it is None.
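The way and_search, in_title/in_abstract, and the filters combine can be pictured with plain set operations. The sketch below is one reading of the description above, not Cheetah's implementation: it assumes each field is searched separately (intersection across queries when and_search is True, union otherwise), that the title and abstract results are then unioned, and that the filter result is intersected last. The names search_field and combine_matches, and the document ids, are hypothetical.

```python
def search_field(hits, and_search=True):
    """Combine per-query match sets for one field (hypothetical helper).

    hits maps each query string to the set of document ids matching it.
    """
    match_sets = list(hits.values())
    if not match_sets:
        return set()
    if and_search:
        return set.intersection(*match_sets)  # documents matching every query
    return set.union(*match_sets)  # documents matching any query


def combine_matches(title_hits=None, abstract_hits=None, and_search=True, filter_hits=None):
    """Union the field results, then intersect with the filter result (assumed order)."""
    fields = [hits for hits in (title_hits, abstract_hits) if hits is not None]
    if not fields:
        # No text query: the filters alone decide the result.
        return set(filter_hits) if filter_hits is not None else set()
    text_result = set().union(*(search_field(hits, and_search) for hits in fields))
    if filter_hits is None:
        return text_result
    return text_result & filter_hits


# Made-up document ids for two queries in two fields.
title_hits = {"laser": {1, 2}, "angle deflection": {2}}
abstract_hits = {"laser": {3}, "angle deflection": {3, 4}}
both_and = combine_matches(title_hits, abstract_hits, and_search=True)
both_or = combine_matches(title_hits, abstract_hits, and_search=False)
```

Under these assumptions, passing only filter_hits corresponds to calling search with query=None, where the filters alone select the documents.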
- TELF.applications.Cheetah.cheetah.add_with_union_of_others(d, s, key)
Compute the addition of a set associated with a given key and the union of all other sets in a dictionary.
Parameters:
- d: dict
A dictionary where keys are strings and values are sets.
- s: set
The set (associated with key) to be added to
- key: str
The key in the dictionary whose set is to be modified with the union of all other sets.
Returns:
- set:
The intersection set of the specified key’s set and the union of all other sets in the dictionary.
Raises:
- ValueError
If the specified key is not found in the dictionary.