TELF.pre_processing.Orca: Duplicate author detector for text mining and information retrieval

Duplicate author detector for text mining and information retrieval.

Available Functions

Orca.__init__([duplicates, s2_duplicates, ...])

Orca.run(df[, scopus_duplicates, ...])

Form the SLIC map from Scopus-only, S2-only, or hybrid dataframes.

Orca.apply(df[, slic_df])

Apply the SLIC id mapping to a SLIC papers dataframe.

Module Contents

class TELF.pre_processing.Orca.orca.AuthorMatcher(df: DataFrame, n_jobs: int = -1, verbose: bool = False)

Bases: object

Fallback stub used when the real AuthorMatcher isn’t available. Produces an empty matches DataFrame (or builds rows from known_matches if you provide them). This is enough for Orca to proceed via the Scopus-only/S2-only residual logic.
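The way a `known_matches` mapping could be turned into rows can be pictured with a small standard-library sketch. This is not TELF's implementation: the function name `expand_known_matches` and the output keys `key_id` and `matched_id` are hypothetical, and it returns plain dictionaries rather than a DataFrame. It only illustrates flattening a `Dict[str, str | Iterable[str]]` into one row per pair, with an empty result when nothing is supplied.

```python
from typing import Dict, Iterable, List, Optional, Union

# Same shape as the documented known_matches argument.
KnownMatches = Dict[str, Union[str, Iterable[str]]]


def expand_known_matches(known_matches: Optional[KnownMatches] = None) -> List[dict]:
    """Flatten a known_matches mapping into one row per (key, value) pair.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    rows: List[dict] = []
    if not known_matches:
        # No known matches supplied: mirror the "empty matches" case.
        return rows
    for key_id, values in known_matches.items():
        # A single string is treated as a one-element collection.
        if isinstance(values, str):
            values = [values]
        for matched_id in values:
            rows.append({"key_id": key_id, "matched_id": matched_id})
    return rows
```

Calling `expand_known_matches()` with no argument returns an empty list, while a mapping such as `{"a": "x", "b": ["y", "z"]}` yields three rows, one per pair.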

match(known_matches: Dict[str, str | Iterable[str]] | None = None) → DataFrame

class TELF.pre_processing.Orca.orca.Orca(duplicates=None, s2_duplicates=None, verbose=False)

Bases: object

Construct SLIC author ids + apply them to a SLIC-style paper dataframe.

apply(df, slic_df=None)

Apply the SLIC id mapping to a SLIC papers dataframe. Keeps papers even when SLIC author ids are missing (warns only).
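The "keep the paper, warn only" behavior can be sketched without TELF or pandas. The example below is an assumption-laden illustration, not the library's code: the function `apply_id_map`, the record keys `title`, `author_ids`, and `slic_author_ids`, and the dictionary-based id map are all hypothetical stand-ins for the dataframes `Orca.apply` works with.

```python
import warnings
from typing import Dict, List


def apply_id_map(papers: List[dict], id_map: Dict[str, str]) -> List[dict]:
    """Attach mapped author ids to each paper, keeping every paper.

    Hypothetical sketch: papers with unmapped authors trigger a warning
    but are never dropped from the output.
    """
    out: List[dict] = []
    for paper in papers:
        source_ids = paper.get("author_ids", [])
        mapped = [id_map[a] for a in source_ids if a in id_map]
        missing = [a for a in source_ids if a not in id_map]
        if missing:
            # Warn only; the paper is still appended below.
            warnings.warn(
                f"no mapped id for authors {missing} in paper {paper.get('title')!r}"
            )
        out.append({**paper, "slic_author_ids": mapped})
    return out
```

With this design the output always has as many records as the input, so downstream steps can decide for themselves how to treat papers whose author ids could not be resolved.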

property duplicates

run(df, scopus_duplicates=None, s2_duplicates=None, known_matches=None, n_jobs=-1)

Form the SLIC map from Scopus-only, S2-only, or hybrid dataframes.

property s2_duplicates