TELF.pre_processing.Orca: Duplicate author detector for text mining and information retrieval

Duplicate author detector for text mining and information retrieval.

Available Functions

Orca.__init__([duplicates, s2_duplicates, ...])

Orca.run(df[, scopus_duplicates, ...])

Run Orca and form SLIC ids for a given dataset

Orca.apply(df[, slic_df])

Apply the SLIC id mapping to a SLIC papers dataframe

Module Contents

class TELF.pre_processing.Orca.orca.Orca(duplicates=None, s2_duplicates=None, verbose=False)

Bases: object

apply(df, slic_df=None)

Apply the SLIC id mapping to a SLIC papers dataframe

Parameters:
  • df (pandas.DataFrame) – The SLIC dataframe for which author SLIC ids need to be created

  • slic_df (pandas.DataFrame, optional) – A pre-computed DataFrame of SLIC id mappings. This parameter exists for the rare case where one SLIC map is shared across multiple datasets (e.g. dataset B is a subset of A and slic_df was computed for A). Setting slic_df is not recommended. If you do use it, verify that every desired Scopus/S2 author already has a SLIC id. To be sure your results are valid, call Orca.run() before Orca.apply() and do not pass a value for this parameter. Default is None.

Returns:

orca_df – df with standardized author information (columns for ‘SLIC_ids’ and ‘SLIC_affiliations’)

Return type:

pandas.DataFrame
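The recommended order (Orca.run() first, then Orca.apply() without slic_df) can be sketched as below. This is a minimal illustration, not part of TELF: the helper name form_slic_ids and its orca argument are hypothetical, and the import path is taken from the class path shown in Module Contents. The columns that df must contain are not described here, so the sketch takes df as given.

```python
def form_slic_ids(df, orca=None):
    """Run Orca on df, then apply the resulting SLIC id mapping to the same df.

    `form_slic_ids` and the `orca` argument are hypothetical conveniences for
    this sketch; only Orca, run() and apply() come from the documentation.
    """
    if orca is None:
        # Imported lazily so the sketch can be read (and stubbed) without TELF installed.
        from TELF.pre_processing.Orca.orca import Orca
        orca = Orca(verbose=False)

    # run() is documented as returning None; the sketch assumes it leaves the
    # SLIC id mapping on the Orca instance for apply() to use.
    orca.run(df)

    # slic_df is deliberately not passed, following the recommendation above.
    return orca.apply(df)
```

Calling form_slic_ids(df) on a SLIC papers dataframe should return the dataframe described under Returns, assuming TELF is installed and df has the columns Orca expects.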

property duplicates

run(df, scopus_duplicates=None, s2_duplicates=None, known_matches=None, n_jobs=-1)

Run Orca and form SLIC ids for a given dataset

Parameters:
  • df (pandas.DataFrame) – The SLIC dataframe for which author SLIC ids need to be created

  • scopus_duplicates (list(set), optional) – A list of sets where each set contains Scopus author ids that refer to the same person. Ideally, each author has exactly one Scopus id. In practice, some authors are represented by two or more Scopus ids. Duplicate authors can be found using the Orca.DuplicateAuthorFinder tool. If not provided, a pre-computed Scopus duplicate map is used (computed from 1 million Scopus papers). If provided, the user input overrides this map. Default is None.

  • s2_duplicates (list(set), optional) – A list of sets where each set contains S2 author ids that refer to the same person. If not provided, S2 author ids are not scanned for duplicates on their own; duplicates are detected only by comparison against Scopus matches. Default is None.

  • known_matches (dict, optional) – A dict mapping S2 ids (keys) to Scopus ids (values). It overrides the author matching where the ground truth is known, which helps the tool work around edge cases. Default is None.

Return type:

None
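The shapes of scopus_duplicates (a list of sets) and known_matches (a dict) can be illustrated with a short sketch. The helper merge_duplicate_groups is hypothetical and not part of TELF, and all ids are made-up placeholders; whether Orca expects string or integer ids, or requires the sets to be disjoint, is not stated here, so merging overlapping sets is only a cautious assumption.

```python
def merge_duplicate_groups(groups):
    """Merge sets that share any id so that each id appears in at most one set.

    Hypothetical helper for illustration; not part of TELF.
    """
    merged = []
    for group in groups:
        group = set(group)
        # Collect every already-merged set that overlaps the incoming group.
        overlapping = [m for m in merged if m & group]
        for m in overlapping:
            merged.remove(m)
            group |= m
        merged.append(group)
    # A set with a single id carries no duplicate information, so drop it.
    return [g for g in merged if len(g) > 1]


# Placeholder ids only; real Scopus and S2 ids will look different.
raw_groups = [{"A1", "A2"}, {"A3"}, {"A2", "A4"}, {"A5", "A6"}]
scopus_duplicates = merge_duplicate_groups(raw_groups)

# S2 id (key) -> Scopus id (value), for pairs where the ground truth is known.
known_matches = {"S2-placeholder-1": "A1"}

# With TELF installed, these would be passed using the keyword names from the
# signature above:
# orca.run(df, scopus_duplicates=scopus_duplicates, known_matches=known_matches)
```

Here "A1", "A2" and "A4" end up in one set because the first and third input groups share "A2".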

property s2_duplicates