TELF.pre_processing.iPenguin: Online information retrieval tool for Scopus, SemanticScholar, and OSTI#

Online information retrieval tool for Scopus, SemanticScholar, and OSTI

Available Functions (OSTI)#

OSTI.__init__([key, mode, name, n_jobs, verbose])

OSTI.count(data, mode[, verbose])

OSTI.search(data, mode, *[, n])

OSTI.get_df(path[, targets, ...])

Available Functions (SemanticScholar)#

SemanticScholar.__init__([key, mode, name, ...])

Initializes the iPenguin.SemanticScholar instance with the specified key, mode, and optional settings.

SemanticScholar.count(data, mode[, verbose])

SemanticScholar.search(data, mode, *[, n])

SemanticScholar.get_df(path[, targets, ...])

Available Functions (Scopus)#

Scopus.__init__(keys[, mode, name, n_jobs, ...])

Initializes the iPenguin.Scopus instance with the specified keys, mode, and optional settings.

Scopus.count(query[, verbose])

Scopus.search(query, *[, n])

Scopus.get_df(path[, targets, ...])

Module Contents (OSTI)#

class TELF.pre_processing.iPenguin.OSTI.osti.OSTI(key: str | None = None, mode: str = 'fs', name: str = 'osti', n_jobs: int = -1, verbose: bool = False)[source]#

Bases: object

MODES = {'fs'}#
count(data, mode, verbose=None)[source]#
async count_coroutine(data, mode)[source]#
classmethod get_df(path, targets=None, *, backend='threading', n_jobs=1, verbose=False)[source]#

Parallelized function for generating a pandas DataFrame from OSTI JSON data.

Parameters:
  • path (str, pathlib.Path) – The path to the directory where the JSON files are stored

  • targets (set, list; optional) – A collection of OSTI_ids that need to be present in the DataFrame. A warning will be provided if all targets cannot be found at the given path. If targets is None, no filtering is performed and the DataFrame is formed from all JSON data stored at path.

  • n_jobs (int; optional) – How many parallel processes to use for this function. Default is 1 (single core)

Returns:

Resulting DataFrame. If no valid files can be processed to form the DataFrame, this function will return None

Return type:

pd.DataFrame
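
As a quick illustration, here is a minimal sketch of calling get_df on a directory of previously downloaded OSTI JSON data. The directory name and target IDs are hypothetical placeholders.

    from TELF.pre_processing.iPenguin.OSTI.osti import OSTI

    # Build a DataFrame from all JSON files saved under 'osti_results'
    # (hypothetical path). Returns None if no valid files are found.
    df = OSTI.get_df("osti_results", n_jobs=4, verbose=True)

    # Restrict the DataFrame to specific OSTI IDs (hypothetical values);
    # a warning is issued if any target is missing from the directory.
    subset = OSTI.get_df("osti_results", targets={"1234567", "2345678"})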

property n_jobs#
search(data, mode, *, n=100)[source]#
async search_coroutine(data, mode, ignore, n, num_papers)[source]#
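
The snippet below is a hypothetical end-to-end sketch of the OSTI downloader. The mode string passed to count() and search() is an assumption, since the valid values are not enumerated in this reference.

    from TELF.pre_processing.iPenguin.OSTI.osti import OSTI

    # 'fs' (the only entry in MODES) saves downloaded records as JSON files
    # under the directory given by `name`. key=None is allowed by the
    # signature; if a key is given, it must be a valid OSTI API key.
    osti = OSTI(key=None, mode="fs", name="osti_results", n_jobs=4)

    query = "tensor decomposition"           # hypothetical search text
    hits = osti.count(query, mode="query")   # "query" mode is an assumption
    osti.search(query, mode="query", n=100)  # download up to 100 records

    df = OSTI.get_df("osti_results")         # assemble results into a DataFrame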

Module Contents (SemanticScholar)#

class TELF.pre_processing.iPenguin.SemanticScholar.s2.SemanticScholar(key: str | None = None, mode: str = 'fs', name: str = 's2', n_jobs: int = -1, ignore=None, verbose: bool = False)[source]#

Bases: object

Initializes the iPenguin.SemanticScholar instance with the specified key, mode, and optional settings.

This constructor method sets up the SemanticScholar instance by initializing the key, mode, name, papers to ignore, verbosity of output, and number of jobs for parallel processing (if applicable). It then attempts to establish a connection to the S2 API to validate the key.

Parameters:#

key: str, (optional)

The API key to be used for this instance. If defined, this must be a valid API key. Can also be None to download papers at a slower rate with no key.

mode: str, (optional)

The mode in which the SemanticScholar instance should operate. Currently one mode is supported, ‘fs’. This is the file system mode and will download papers to the directory path provided with name.

name: str, (optional)

The name associated with the instance. In the case of ‘fs’ mode (the default), this parameter is expected to be the path to the directory where the downloaded files will be saved.

ignore: set, list, Iterable, None, (optional)

This parameter allows certain papers to be skipped rather than downloaded. This is useful for speeding up downloads by skipping previously downloaded papers and conserving API usage. If None, the set of papers to ignore is determined by the mode in which this instance is operating. If defined, this parameter must be a data structure that implements the __contains__ method. Default is None.

n_jobs: int, (optional)

The number of jobs for parallel processing. Default is -1 (use all available cores).

verbose: bool, int (optional)

If set to True, the class will print additional output for debugging or information purposes. Can also be an int, where verbose >= 1 means True and higher integers increase the level of verbosity. Default is False.
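
Putting these parameters together, a minimal construction sketch (the directory name and the ignored paper ID are hypothetical):

    from TELF.pre_processing.iPenguin.SemanticScholar.s2 import SemanticScholar

    s2 = SemanticScholar(
        key=None,           # no key: papers download at a slower rate
        mode="fs",          # file system mode, the only supported mode
        name="s2_results",  # directory where downloaded files are saved
        ignore={"649def34f8be52c8b66281af98ae884c09aef38b"},  # hypothetical ID to skip
        n_jobs=-1,          # use all available cores
        verbose=True,
    )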

BULK_SEARCH_LIMIT = 1000000#
MODES = {'fs'}#
SEARCH_LIMIT = 1000#
async cleanup_coroutine()[source]#
count(data, mode, verbose=None)[source]#
async count_coroutine(data, mode)[source]#
classmethod get_df(path, targets=None, *, backend='threading', n_jobs=1, verbose=False)[source]#
property n_jobs#
search(data, mode, *, n=0)[source]#
async search_coroutine(data, mode, ignore, n, num_papers)[source]#
TELF.pre_processing.iPenguin.SemanticScholar.s2.get_df_helper(files)[source]#
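
Continuing the construction sketch above, a hypothetical search-and-collect workflow; the "query" mode string is an assumption, and note that search() defaults to n=0 for this class.

    # Count matches, download up to 100 papers, then build a DataFrame
    # from the saved JSON files. Mode string and path are assumptions.
    hits = s2.count("tensor decomposition", mode="query")
    s2.search("tensor decomposition", mode="query", n=100)

    df = SemanticScholar.get_df("s2_results", n_jobs=4)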

Module Contents (Scopus)#

class TELF.pre_processing.iPenguin.Scopus.scopus.Scopus(keys: list, mode: str = 'fs', name: str = 'scopus', n_jobs: int = -1, ignore=None, verbose: bool = False)[source]#

Bases: object

Initializes the iPenguin.Scopus instance with the specified keys, mode, and optional settings.

This constructor method sets up the Scopus instance by initializing the keys, mode, name, papers to ignore, verbosity of output, and number of jobs for parallel processing (if applicable). It then attempts to establish a connection to the Scopus API to validate the keys.

Parameters:#

keys: list

The list of API keys to be used for this instance. The list must contain one or more valid API keys. An error is raised if any of the keys fail to validate.

mode: str, (optional)

The mode in which the Scopus instance should operate. Currently one mode is supported, ‘fs’. This is the file system mode and will download papers to the directory path provided with name.

name: str, (optional)

The name associated with the instance. In the case of ‘fs’ mode (the default), this parameter is expected to be the path to the directory where the downloaded files will be saved.

ignore: set, list, Iterable, None, (optional)

This parameter allows certain papers to be skipped rather than downloaded. This is useful for speeding up downloads by skipping previously downloaded papers and conserving API usage. If None, the set of papers to ignore is determined by the mode in which this instance is operating. If defined, this parameter must be a data structure that implements the __contains__ method. Default is None.

n_jobs: int, (optional)

The number of jobs for parallel processing. Default is -1 (use all available cores).

verbose: bool, int (optional)

If set to True, the class will print additional output for debugging or information purposes. Can also be an int, where verbose >= 1 means True and higher integers increase the level of verbosity. Values above 10 activate debug output for this higher-level class. Values above 100 also activate debug printing for the lower-level ScopusAPI class. Values above 1000 provide a full debug print and should only be used for testing. Default is False.
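
To make the verbosity tiers concrete, a hypothetical construction sketch (the key string and directory name are placeholders):

    from TELF.pre_processing.iPenguin.Scopus.scopus import Scopus

    scopus = Scopus(
        keys=["0123456789abcdef0123456789abcdef"],  # placeholder; real keys required
        mode="fs",
        name="scopus_results",  # directory for downloaded JSON files
        n_jobs=-1,
        verbose=11,  # > 10: debug output for this class;
                     # > 100 would also enable ScopusAPI debug printing;
                     # > 1000 would give a full debug print (testing only)
    )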

DEBUG_MODE = 10#
MODES = {'fs'}#
QUERY_LIMIT = 10000#
SEARCH_LIMIT = 2000#
async cleanup_coroutine()[source]#
count(query, verbose=None)[source]#
classmethod get_df(path, targets=None, *, backend='threading', n_jobs=1, verbose=False)[source]#

Parallelized function for generating a pandas DataFrame from Scopus JSON data. The data is also processed to adhere to SLIC standards.

Parameters:
  • path (str, pathlib.Path) – The path to the directory where the JSON files are stored

  • targets (set, list; optional) – A collection of eids (Scopus paper unique identifiers) that need to be present in the DataFrame. A warning will be provided if all targets cannot be found at the given path. If targets is None, no filtering is performed and the DataFrame is formed from all JSON data stored at path.

  • n_jobs (int; optional) – How many parallel processes to use for this function. Default is 1 (single core)

Returns:

  • pd.DataFrame – Resulting DataFrame. If no valid files can be processed to form the DataFrame, this function will return None

  • targets – The list of Scopus paper IDs that were expected to be downloaded
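
Since get_df is documented to return both the DataFrame and the expected targets, a call unpacks two values; the path and eid below are hypothetical.

    df, targets = Scopus.get_df(
        "scopus_results",
        targets={"2-s2.0-85000000000"},  # hypothetical Scopus eid
        n_jobs=4,
    )
    if df is None:
        print("No valid JSON files could be processed")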

property keys#
property n_jobs#
async num_query_coroutine(query)[source]#
search(query, *, n=100)[source]#
async search_coroutine(data, ignore, n, num_papers)[source]#
split_query(expression: str) → list[source]#
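
Finally, a hypothetical query sketch tying the search members together. The query string uses standard Scopus advanced-search syntax, and scopus is the instance constructed in the sketch above.

    query = 'TITLE-ABS-KEY("tensor decomposition") AND PUBYEAR > 2019'
    print(scopus.count(query))   # number of matching documents
    scopus.search(query, n=100)  # download up to 100 matching papers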