TELF.applications.Bunny: Dataset generation tool for documents and their citations/references

TELF.applications.Bunny: Dataset generation tool for documents and their citations/references#

Dataset generation tool for documents and their citations/references.

Available Functions#

`Bunny.__init__`([s2_key, scopus_keys, ...])
`Bunny.estimate_hop`(df[, col])	Predict the maximum number of papers that will be in the DataFrame if another hop is performed.
`Bunny.suggest_filter_values`(df[, options])	Generate the possible Bunny filtering values for some options.
`Bunny.form_core_scopus`(data, data_type, keys)
`Bunny.form_core`(data, data_type[, s2_dir])
`Bunny.hop`(df, hops, modes[, use_scopus, ...])	Perform one or more hops along the citation network
`Bunny.get_affiliations`(df, scopus_keys, filters)
`Bunny.add_to_penguin`(source, path)	Get the downloaded papers for a given source collection from Penguin, represented as a fixed memory-size Bloom filter.
`Bunny.get_penguin_cache`(source[, max_items, ...])	Get the downloaded papers for a given source collection from Penguin, represented as a fixed memory-size Bloom filter.

Available Functions#

`AutoBunny.__init__`(core[, s2_key, ...])
`AutoBunny.run`(steps, *[, s2_key, ...])

Module Contents#

class TELF.applications.Bunny.bunny.Bunny(s2_key=None, scopus_keys=None, penguin_settings=None, output_dir='.', verbose=False)[source]#

Bases: object

FILTERS = {'AF-ID': 'Affiliation ID', 'AFFILCOUNTRY': 'Country', 'AFFILORG': 'Affiliation', 'AU-ID': 'Scopus Author ID', 'KEY': 'Keyword', 'PUBYEAR': 'Publication Year'}#

MODES = {'citations', 'references', 's2_author_ids'}#

add_to_penguin(source, path)[source]#

Get the downloaded papers for a given source collection from Penguin, represented as a fixed memory-size Bloom filter. This can then be given to iPenguin child objects to prevent re-downloading previously acquired papers.

Parameters:#

source: str: The name of the source collection from which to retrieve IDs. It should correspond to either SCOPUS_COL or S2_COL.
path: str, pathlike: The path to the directory where the downloaded files are cached, ready to be added into Penguin.

apply_filter(df, filters, filter_in_core=True, do_author_match=True)[source]#

classmethod estimate_hop(df, col='citations')[source]#

Predict the maximum number of papers that will be in the DataFrame if another hop is performed. The numbers of papers in the next hop is guaranteed to be less than or equal to this number

Parameters:: df (pandas.DataFrame) – The target Bunny DataFrame
Returns:: Maximum possible number of papers contained in the next hop
Return type:: int

form_core(data, data_type, s2_dir='s2')[source]#

form_core_scopus(data, data_type, keys, s2_dir='s2', scopus_dir='scopus')[source]#

get_affiliations(df, scopus_keys, filters, save_dir='scopus', filter_in_core=True, do_author_match=True)[source]#

get_penguin_cache(source, max_items=1.25, false_positive_rate=0.001)[source]#

Parameters:#

source: str: The name of the source collection from which to retrieve IDs. It should correspond to either SCOPUS_COL or S2_COL.
max_items: float, int, (optional): The maximum number of items expected to be stored in the Bloom filter. This can be a fixed integer or a float representing a multiplier of the current document count in the collection. Default is 1.25.
false_positive_rate: float, (optional): The desired false positive probability for the Bloom filter. Default is 0.001.

Returns:#

rbloom.Bloom: An instance of a Bloom filter populated with IDs from the specified collection.

Raises:#

ValueError

If ‘source’ does not match any of the predefined collection names, or if ‘max_items’ is not a float or an int.

If attempting to use this function without Penguin connection string details provided

hop(df, hops, modes, use_scopus=False, filters=None, hop_focus=None, scopus_keys=None, s2_dir='s2', scopus_dir='scopus', filter_in_core=True, max_papers=0, hop_priority='random')[source]#

Perform one or more hops along the citation network

This function allows the user to perform one or more hops on the citation network, using either references or citations as the method of expansion. The user can also optionally specify filters (using BunnyFilter format) to limit the scope of the search to some area. Note that if filters are being used, the user must specify Scopus API keys as only the Scopus API can provide filtering in the Bunny implementation.

Parameters:

df (pandas.DataFrame) – The target Bunny DataFrame
hops (int) – How many hops to perform. Must be >= 1
modes ((list, set, tuple)) – Which mode(s) to use Bunny in. Can either be ‘citations’, ‘references’, ‘s2_author_ids’
use_scopus (boolean) – Flag that determines if Scopus API should be called to fill in more detailed information such as affiliation, country of origin, keywords (PACs), etc. scopus_keys must be provided if this flag is set to True.
filters (BunnyFilter or BunnyQuery) – Specialized dataclass used for representing a boolean query at any level of nesting. See example notebooks to see how these filters can be created. use_scopus must be True and scopus_keys need to be provided to use Bunny filters. Default=None.
scopus_keys (list) – List of Scopus API keys which are used to call on the Scopus API to enrich hop-expanded DataFrame. Default=None.
scopus_dir (str, Path) – The directory at which to create a Scopus paper archive. This will cache previosly downloaded papers to save time and API call limits. The directory is expectd to exist inside Bunny.output_dir and a new directory will be created at Bunny.output_dir/scopus_dir if it cannot be found. Default=scopus.
filter_in_core (bool) – Flag that determines whether any filters should be applied to the core. This option is only needed if filters are specified. If True, core papers can be filtered and removed from the Bunny DataFrame. Default=True.
scopus_batch_size (int) – The maximum batch size for looking up papers using the Scopus API. Note that Scopus sets a maximum Boolean query at 10,000. The size of the Boolean query can be calculated approximately using (num filters + 1) * scopus_batch_size. Care should be taken to decrease the batch size if many filters are being used.
max_papers (int) – This variable is used to set an upper limit on how many papers can be featured in a hop. If set to 0, no upper limit will be used. Default is 0.
hop_priority (str) – If max_papers is not 0, this variable is used to prioritize which papers are selected for the hop. Options are in [‘random’, ‘frequency’]. Default is ‘random’.

Returns:

Hop-expanded result DataFrame

Return type:

pd.DataFrame

property penguin_settings#

property s2_key#

classmethod suggest_filter_values(df, options=['AFFILCOUNTRY', 'AFFILORG'])[source]#

Generate the possible Bunny filtering values for some options. Note that these values do not conclusively map all possible filter values used by Scopus but instead produce all filter values currently used by the given DataFrame df

Parameters:: df (pandas.DataFrame) – The target Bunny DataFrame
Returns:: suggestions – A dictionary where the keys are the filter types and the values are a set of suggested filter values
Return type:: dict

class TELF.applications.Bunny.bunny.BunnyFilter(filter_type: str, filter_value: str)[source]#

Bases: object

filter_type: str = <dataclasses._MISSING_TYPE object>#

filter_value: str = <dataclasses._MISSING_TYPE object>#

class TELF.applications.Bunny.bunny.BunnyOperation(operator: str, operands: List[TELF.applications.Bunny.bunny.BunnyFilter | ForwardRef('BunnyOperation')])[source]#

Bases: object

operands: List[BunnyFilter | BunnyOperation] = <dataclasses._MISSING_TYPE object>#

operator: str = <dataclasses._MISSING_TYPE object>#

TELF.applications.Bunny.bunny.evaluate_query_to_string(query)[source]#

Convert a Bunny query into a string that can be accepted by Scopus.

Parameters:: query (BunnyFilter, BunnyOperation) – The query object. This can be an instance of BunnyFilter or BunnyOperation.
Returns:: A string representation of the query that can be processed by Scopus
Return type:: str

TELF.applications.Bunny.bunny.find_doi(f)[source]#

Helper function that attempts to extract the DOI from a Scopus paper XML file

Parameters:: f (str) – Path to the Scopus XML file
Returns:: doi – Returns DOI if found, else None
Return type:: str, None

TELF.applications.Bunny.bunny.is_valid_query(query)[source]#

Validates the structure of a Bunny filter query.

The function checks whether a given input follows the intended structure of Bunny filter queries. Each element in the query is evaluated recursively, making sure that only BunnyFilter and BunnyOperation objects are present.

Parameters:: query ((Filter/Operation)) – The query to be validated. The query can be an instance of BunnyFilter or BunnyOperation dataclass.
Returns:: True if the query is valid, False otherwise.
Return type:: bool

Module Contents#

class TELF.applications.Bunny.auto_bunny.AutoBunny(core, s2_key=None, scopus_keys=None, output_dir=None, cache_dir=None, cheetah_index=None, verbose=False, use_vulture_cheetah=True)[source]#

Bases: object

CHEETAH_INDEX = {'abstract': 'clean_title_abstract', 'affiliations': 'affiliations', 'author_ids': 'author_ids', 'country': 'affiliations', 'title': None, 'year': 'year'}#

property cache_dir#

property cheetah_index#

property core#

property output_dir#

run(steps, *, s2_key=None, scopus_keys=None, cheetah_index=None, max_papers=250000, checkpoint=True, filter_type: str = None, filter_value=None)[source]#

property s2_key#

property scopus_keys#

class TELF.applications.Bunny.auto_bunny.AutoBunnyStep(modes: list, max_papers: int = 2000, hop_priority: str = 'random', cheetah_settings: dict = <factory>, vulture_settings: list = <factory>)[source]#

Bases: object

Class for keeping track of AutoBunny args

cheetah_settings: dict = <dataclasses._MISSING_TYPE object>#

hop_priority: str = 'random'#

max_papers: int = 2000#

modes: list = <dataclasses._MISSING_TYPE object>#

vulture_settings: list = <dataclasses._MISSING_TYPE object>#