TELF.applications.Penguin: Text storage tool#

Text storage tool.

Available Functions#

Penguin.__init__(uri, db_name[, username, ...])

Initializes the Penguin instance with the specified URI, database name, and optional settings.

Penguin.add_many_documents(directory, source)

Processes a directory of document files that need to be added to the database.

Penguin.add_single_document(file_path, source)

Processes a single data file and adds / updates its content in the database.

Penguin.count_documents()

Returns the number of documents in the Scopus and S2 collections.

Penguin.text_search(target[, join, scopus, ...])

Search the Penguin database by text matching in either Scopus documents, S2 documents or both.

Penguin.id_search(ids[, as_pandas, n_jobs])

Search the Penguin database by document IDs in either Scopus documents, S2 documents, or both.

Penguin.citation_search(target[, scopus, ...])

Searches for documents based on citation or reference targets within a specified collection.

Penguin.query_by_author(collection_name, doc_id)

Query a MongoDB collection for a document by a direct match and then locate all of the other papers from the authors of the target paper that are present in the DB.

Penguin.resolve_document_id(pid)

Resolves the collection and unique identifier (uid) for a given document_id.

Penguin.add_tag(document_id, tag)

Adds a tag to the specified document.

Penguin.remove_tag(document_id, tag)

Removes a tag from the specified document.

Penguin.update_tags(document_id, new_tags)

Updates the tags for the specified document with a new set of tags.

Penguin.find_by_tag(tag[, as_pandas])

Finds and returns documents that have the specified tag.

Penguin.get_id_bloom(source[, max_items, ...])

Initializes a Bloom filter with IDs from a specified source collection.

Module Contents#

class TELF.applications.Penguin.penguin.Penguin(uri, db_name, username=None, password=None, n_jobs=-1, verbose=False)[source]#

Bases: object

A handler class for managing connections and operations with a documents MongoDB database.

The purpose of this class is to create a software layer to easily store and query Scopus and S2 documents. Penguin provides functionality to connect to a MongoDB database, perform read and write operations, and handle authentication for secure database interactions. It supports operations such as adding data, retrieving data, and verifying database integrity.

Initializes the Penguin instance with the specified URI, database name, and optional settings.

This constructor method sets up the Penguin instance by initializing the MongoDB URI, database name, verbosity of output, and number of jobs for parallel processing (if applicable). It then attempts to establish a connection to the MongoDB database by calling Penguin._connect().

Parameters:#

uri: str

The MongoDB URI used to establish a connection.

db_name: str

The name of the database to which the connection will be established.

username: str, (optional)

The username for the Mongo database. If None, will try to use DB without authentication. Default is None.

password: str, (optional)

The password for the Mongo database. If None, will try to use DB without authentication. Default is None.

n_jobs: int, (optional)

The number of jobs for parallel processing. Note that this setting is only used for adding new data to the database and converting documents to SLIC DataFrame format for output. Default is -1 (use all available cores).

verbose: bool, int (optional)

If set to True, the class will print additional output for debugging or information purposes. Can also be an int, where verbose >= 1 means True, with a higher integer controlling the level of verbosity. Default is False.

Raises:#

ValueError:

Attribute was given an invalid value

TypeError:

Attribute has an invalid type

ConnectionFailure:

If the connection to the MongoDB instance fails during initialization.
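
Example:#

A minimal connection sketch; the URI and database name are placeholders for your own MongoDB deployment, and the import follows the module path documented above:

>>> from TELF.applications.Penguin.penguin import Penguin
>>> penguin = Penguin(uri='mongodb://localhost:27017', db_name='[db name here]', verbose=True)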

DEFAULT_ATTRIBUTES = {'S2': {'author_ids': ('authors', 'authorId'), 'citations': 'citations', 'doi': 'externalIds.DOI', 'id': 'paperId', 'references': 'references', 'text': ['title', 'abstract']}, 'Scopus': {'author_ids': ('bibrecord.head.author-group.author', '@auid'), 'citations': None, 'doi': 'doi', 'id': 'eid', 'references': None, 'text': ['title', 'abstract']}}#
S2_COL = 'S2'#
SCOPUS_COL = 'Scopus'#
add_many_documents(directory, source, overwrite=True, n_jobs=None)[source]#

Processes a directory of document files that need to be added to the database. The function can add both Scopus and S2 documents depending on the source argument. Documents will be added in parallel.

Parameters:#

directory: str, pathlib.Path

The directory containing files to be added

source: str

The data source (either ‘Scopus’ or ‘S2’)

overwrite: bool, (optional)

If True and paper id already exists in the collection then the associated data will be updated/overwritten by the new data. Otherwise, this paper id is skipped. If paper id does not already exist in collection, this flag has no effect. Default is True.

n_jobs: int, (optional)

The number of jobs to run in parallel. If None, the class default for n_jobs will be used. Default is None.
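
Example:#

A short sketch, assuming penguin is an initialized Penguin instance; the directory path is a placeholder:

>>> penguin.add_many_documents('[directory here]', source='S2', overwrite=True)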

add_single_document(file_path, source, overwrite=True)[source]#

Processes a single data file and adds / updates its content in the database.

This function handles the addition of a Scopus or S2 data file. If the file is from Scopus it will be added to the Scopus collection. If S2, it will be added to the S2 collection. In the case of Scopus, the function can handle JSON input (as expected from the Scopus API or iPenguin.Scopus) or XML input (the format used by the purchased data).

Parameters:#

file_path: str

The path to the data file to be processed.

source: str

The data source (either ‘Scopus’ or ‘S2’)

overwrite: bool, (optional)

If True and paper id already exists in the collection then the associated data will be updated/overwritten by the new data. Otherwise, this paper id is skipped. If paper id does not already exist in collection, this flag has no effect. Default is True.

Returns:#

None

Document is added or updated in the database

Raises:#

ValueError

If source does not match an expected value
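
Example:#

A short sketch, assuming penguin is an initialized Penguin instance; the file path is a placeholder:

>>> penguin.add_single_document('[file path here]', source='Scopus')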

add_tag(document_id, tag)[source]#

Adds a tag to the specified document.

Parameters:#

document_id: str

The identifier for the document to which the tag will be added. This should be a document id (either S2 or Scopus). The id should be prepended by either ‘eid:’ to signify a Scopus document or ‘s2id:’ to signify an S2 document.

tag: str

The tag to be added to the document.

Returns:#

None

Example:#

>>> penguin.add_tag('eid:[eid here]', 'tensor')
>>> penguin.add_tag('s2id:[s2id here]', 'tensor')

citation_search(target, scopus=False, s2=True, as_pandas=True, n_jobs=None)[source]#

Searches for documents based on citation or reference targets within a specified collection.

This method allows for searching citations or references within S2 documents. It accepts a target paper ID or a list of target paper IDs and retrieves documents citing/referencing these targets. The method can return results as a Pandas DataFrame or a dictionary of cursors, depending on the as_pandas flag. The scopus parameter is currently not supported and will trigger a warning if set to True.

Parameters:#

target: str, [str]

The document paper IDs for which to look up citations/references

scopus: bool, (optional)

Currently not supported. Using this parameter will trigger a warning. This argument is added to the function to maintain a similar design structure as the other search functions. Default is False.

s2: bool, str, (optional)

Determines the behavior for searching within the S2 collection. If True, uses the default citation attribute. If a string is provided, it is used as the citation attribute. Since citations and references use the same data structure, passing the argument ‘references’ for this parameter will trigger a reference search. Default is True.

as_pandas: bool, (optional)

If True, returns the search results as a SLIC DataFrame. If False, returns a dictionary of MongoDB cursors. Default is True.

n_jobs: int, (optional)

The number of parallel jobs to use for the query. If specified, overrides the instance’s default n_jobs setting.

Returns:#

pandas.DataFrame or dict

If ‘as_pandas’ is True, returns a Pandas DataFrame containing the combined search results from both collections, if applicable. If False, returns a dictionary with keys ‘scopus’ and ‘s2’ containing the cursors to the search results from the respective collections.

Raises:#

ValueError

If s2 is set to False, indicating that no valid collection is selected for the search.
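
Examples:#

A sketch of the two modes described above: a citation search returning a SLIC DataFrame, and a reference search triggered by passing ‘references’ for s2. The S2 paper IDs are placeholders:

>>> citing_df = penguin.citation_search('[s2id here]', as_pandas=True)
>>> reference_df = penguin.citation_search('[s2id here]', s2='references')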

count_documents()[source]#

Returns the number of documents in the Scopus and S2 collections.

Returns:#

dict:

A dictionary with keys ‘scopus’ and ‘s2’ containing the number of documents in each respective collection.

Raises:#

Exception

If there is an error accessing the database collections.
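
Example:#

A short sketch; the returned dictionary uses the ‘scopus’ and ‘s2’ keys described above:

>>> counts = penguin.count_documents()
>>> total = counts['scopus'] + counts['s2']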

property db_name#
find_by_tag(tag, as_pandas=True)[source]#

Finds and returns documents that have the specified tag.

Parameters:#

tag: str

The tag to filter documents by.

as_pandas: bool, (optional)

If True, returns the search results as a SLIC DataFrame. If False, returns a dictionary of MongoDB cursors. Default is True.

Returns:#

pandas.DataFrame or dict

If ‘as_pandas’ is True, returns a Pandas DataFrame containing the combined search results for the tag from the Scopus and S2 collections. If False, returns a dictionary with keys ‘scopus’ and ‘s2’ containing the cursors to the search results from the respective collections.
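
Example:#

A short sketch using the same example tag as the tagging methods above:

>>> tagged_df = penguin.find_by_tag('tensor', as_pandas=True)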

get_id_bloom(source, max_items=1.25, false_positive_rate=0.001)[source]#

Initializes a Bloom filter with IDs from a specified source collection.

This method selects a collection based on the ‘source’ parameter, which should match one of the predefined S2 or Scopus collection names (S2_COL or SCOPUS_COL). It then retrieves IDs from the selected collection and adds them to a Bloom filter. The Bloom filter is configured based on the estimated number of items (‘max_items’) and the desired false positive rate.

Parameters:#

source: str

The name of the source collection from which to retrieve IDs. It should correspond to either SCOPUS_COL or S2_COL.

max_items: float, int, (optional)

The maximum number of items expected to be stored in the Bloom filter. This can be a fixed integer or a float representing a multiplier of the current document count in the collection. Default is 1.25.

false_positive_rate: float, (optional)

The desired false positive probability for the Bloom filter. Default is 0.001.

Returns:#

rbloom.Bloom

An instance of a Bloom filter populated with IDs from the specified collection.

Raises:#

ValueError

If ‘source’ does not match any of the predefined collection names, or if ‘max_items’ is not a float or an int.

Notes:#

  • The Bloom filter is a probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

  • The ‘max_items’ parameter impacts the size and the false positive rate of the Bloom filter. Adjusting this parameter can optimize the performance and accuracy based on the expected dataset size.

  • The function retrieves only the ID attribute from the documents in the collection, excluding the MongoDB ‘_id’ field, to populate the Bloom filter.
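
Example:#

A short sketch, assuming membership in the returned rbloom.Bloom filter can be tested with the in operator; the S2 paper ID is a placeholder:

>>> bloom = penguin.get_id_bloom('S2')
>>> '[s2id here]' in bloom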

id_search(ids, as_pandas=True, n_jobs=None)[source]#

Search the Penguin database by document IDs in either Scopus documents, S2 documents, or both.

This method allows for searching documents across specified collections based on their IDs. Each id in ids needs to be prefixed with the type of id being evaluated. Currently supported prefixes are [‘eid’, ‘s2id’, ‘doi’], corresponding to Scopus ids, SemanticScholar ids, and DOIs, respectively. The results can be returned as a Pandas DataFrame if ‘as_pandas’ is True, or as MongoDB cursors for each collection if False.

Parameters:#

ids: str, [str]

The ID or list of IDs to search for within the specified collections.

as_pandas: bool, optional

If True, returns the search results as a Pandas DataFrame. If False, returns a dictionary of MongoDB cursors. Default is True.

n_jobs: int, optional

The number of parallel jobs to use for the query. If specified, overrides the instance’s default n_jobs setting.

Returns:#

pandas.DataFrame or dict

If ‘as_pandas’ is True, returns a Pandas DataFrame containing the combined search results from both collections, if applicable. If False, returns a dictionary with keys ‘scopus’ and ‘s2’ containing the cursors to the search results from the respective collections.

Raises:#

ValueError

If the arguments passed to this function are invalid

Examples:#

>>> penguin.id_search(
        ids = [
            'doi:[doi here]',
            'doi:[doi here]',
            's2id:[s2id here]',
            'eid:[eid here]',
            'eid:[eid here]',
        ],
        as_pandas=True)
property n_jobs#
property password#
query_by_author(collection_name, doc_id, id_attribute='paperId', author_attribute_list='authors', author_attribute='authorId')[source]#

Query a MongoDB collection for a document by a direct match and then locate all of the other papers from the authors of the target paper that are present in the DB.

Parameters:#

collection_name: str

The name of the MongoDB collection to query.

doc_id: str

The document ID which to look up.

id_attribute: str

The name of the attribute where the document ID is stored. Defaults to ‘paperId’ for S2 papers.

author_attribute_list: str

The name of the attribute in the documents that contains the list of authors. Defaults to ‘authors’ for S2 papers.

author_attribute: str

The name of the attribute within each author entry that contains the author ID. Defaults to ‘authorId’ for S2 papers.

Returns:#

list:

A list of papers by the authors of the target paper that are present in the database.
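
Example:#

A short sketch using the default S2 attribute names; the document ID is a placeholder:

>>> author_papers = penguin.query_by_author('S2', '[s2id here]')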

remove_tag(document_id, tag)[source]#

Removes a tag from the specified document.

Parameters:#

document_id: str

The identifier for the document from which the tag will be removed. This should be a document id (either S2 or Scopus). The id should be prepended by either ‘eid:’ to signify a Scopus document or ‘s2id:’ to signify an S2 document.

tag: str

The tag to be removed from the document.

Returns:#

None

Example:#

>>> penguin.remove_tag('eid:[eid here]', 'tensor')
>>> penguin.remove_tag('s2id:[s2id here]', 'tensor')
resolve_document_id(pid)[source]#

Resolves the collection and unique identifier (uid) for a given document id. The document id can either be a Scopus id or a SemanticScholar id.

Parameters:#

pid: str

The identifier for the document that needs to be resolved. This should be a document id (either S2 or Scopus). The id should be prepended by either ‘eid:’ to signify a Scopus document or ‘s2id:’ to signify an S2 document.

Returns:#

tuple

A tuple containing the unique identifier (uid) and the associated collection.
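
Example:#

A short sketch that unpacks the (uid, collection) tuple described above; the Scopus EID is a placeholder:

>>> uid, collection = penguin.resolve_document_id('eid:[eid here]')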

property s2_attributes#
property scopus_attributes#

text_search(target, join='OR', scopus=True, s2=True, as_pandas=True, n_jobs=None)[source]#

Search the Penguin database by text matching in either Scopus documents, S2 documents, or both.

This method allows for searching text across specified fields in the Scopus and S2 collections. The scopus and s2 parameters specify which text fields within the documents should be used. By default these fields are the title and abstract attributes for each document. The search objective is defined by the target parameter. This can be a single string or a list of strings to be found in the text. If this is a list of strings then the relationship between them can be defined using the join parameter. This determines whether the results should be joined with a logical ‘AND’ or ‘OR’. When searching for text, a document will be returned if the target string is seen in one or more of the text fields being searched. This means that this function supports substring matching and is case insensitive. The results can be returned as a Pandas DataFrame if ‘as_pandas’ is True, or as MongoDB cursors for each collection if False.

Parameters:#

target: str, [str]

The text or list of text terms to search for within the specified fields.

join: str, (optional)

The logical join to use across target terms. Can be ‘OR’ or ‘AND’. Default is ‘OR’.

scopus: Iterable, bool, (optional)

The fields to search within the Scopus document collection. Expected to be an iterable of valid text fields found in Scopus. For simplicity, can also be a bool. If True, uses the default text attributes defined by Penguin. If False or an empty list, the collection is not searched. Default is True.

s2: Iterable, bool, (optional)

Same behavior as scopus but for the S2 documents.

as_pandas: bool, (optional)

If True, returns the search results as a SLIC DataFrame. If False, returns a dictionary of MongoDB cursors. Default is True.

n_jobs: int, (optional)

The number of parallel jobs to use for the query. If specified, overrides the instance’s default n_jobs setting.

Returns:#

pandas.DataFrame or dict

If ‘as_pandas’ is True, returns a Pandas DataFrame containing the combined search results from both collections, if applicable. If False, returns a dictionary with keys ‘scopus’ and ‘s2’ containing the cursors to the search results from the respective collections.

Raises:#

ValueError

If the arguments passed to this function are invalid

Notes:#

  • The search is case-insensitive.

  • The method can search across multiple fields and terms. Fields are always joined by OR. You cannot perform a search where a target term is required to be in both the title AND abstract, for example. The relationship between different target terms, however, can be specified by the operator used for the join parameter.

  • When ‘as_pandas’ is False, the method returns cursors, which can be iterated over to access the individual documents. These cursors do not store the data but can be thought of as pointers to where the data resides on the server (in the database). As a result, they are time sensitive and become stale if not dereferenced within 10 minutes. They are also useless if the connection to the server is lost.
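
Examples:#

A sketch of two searches: one that requires both terms and returns a SLIC DataFrame, and an S2-only search that returns raw cursors. The search terms are illustrative only:

>>> df = penguin.text_search(['tensor', 'decomposition'], join='AND')
>>> cursors = penguin.text_search('tensor', scopus=False, s2=True, as_pandas=False)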

update_tags(document_id, new_tags)[source]#

Updates the tags for the specified document with a new set of tags. This is used to replace the entire list of tags at once.

Parameters:#

document_id: str

The identifier for the document to which the tags will be modified. This should be a document id (either S2 or Scopus). The id should be prepended by either ‘eid:’ to signify a Scopus document or ‘s2id:’ to signify an S2 document.

new_tags: list

The new set of tags to be assigned to the document.

Returns:#

None

Example:#

>>> penguin.update_tags('eid:[eid here]', ['tensor', 'PDE'])
>>> penguin.update_tags('s2id:[s2id here]', ['tensor', 'PDE'])
property uri#
property username#
property verbose#