TELF.pre_processing.Vulture.tokens_analysis package

Submodules

TELF.pre_processing.Vulture.tokens_analysis.top_words module

TELF.pre_processing.Vulture.tokens_analysis.top_words.get_top_words(documents, top_n=10, n_gram=1, verbose=True, filename=None) → DataFrame

Collects statistics for the top words or n-grams. Returns a table with columns word, tf, df, df_fraction, and tf_fraction.

  • word lists the top_n words or n-grams.

  • tf is the term frequency: how many times the given word occurred in documents.

  • df is the document frequency: in how many documents the given word occurred.

  • df_fraction is df / len(documents)

  • tf_fraction is tf / (total number of unique tokens or n-grams)
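The column definitions above can be illustrated with a minimal sketch using only the standard library. This is not TELF's implementation: top_word_stats is a hypothetical helper, whitespace tokenization is an assumption, and the result is a list of dictionaries rather than a DataFrame. The tf_fraction denominator follows the definition exactly as stated above.

```python
from collections import Counter


def top_word_stats(documents, top_n=10):
    """Hypothetical helper (not part of TELF): compute the columns described above."""
    tf = Counter()  # total occurrences of each token across all documents
    df = Counter()  # number of documents that contain each token
    for text in documents:
        tokens = text.split()  # assumption: whitespace tokenization
        tf.update(tokens)
        df.update(set(tokens))  # count each token at most once per document
    n_unique = len(tf)  # total number of unique tokens
    rows = []
    for word, count in tf.most_common(top_n):
        rows.append({
            "word": word,
            "tf": count,
            "df": df[word],
            "df_fraction": df[word] / len(documents),
            "tf_fraction": count / n_unique,
        })
    return rows


docs = ["apple apple banana", "banana cherry date egg", "apple banana apple"]
rows = top_word_stats(docs, top_n=2)
for row in rows:
    print(row)
```

Here "apple" occurs four times in two of the three documents, and "banana" occurs three times in all three, so the two rows differ in both tf and df.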

Parameters:
  • documents (list or dict) – list or dictionary of documents. If dictionary, keys are the document IDs, values are the text.

  • top_n (int, optional) – Top n words or n-grams to report. The default is 10.

  • n_gram (int, optional) – Size of the n-grams to count. 1 counts single words; values greater than 1 count n-grams of that size. The default is 1.

  • verbose (bool, optional) – Verbosity flag. The default is True.

  • filename (str, optional) – If not None, saves the table to the given location.
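To make the documents and n_gram parameters concrete, the following sketch shows one plausible way to accept either input form and to build n-grams. Both iter_texts and ngrams are hypothetical helpers written for illustration; how TELF tokenizes text and joins n-grams internally is not stated here, so whitespace splitting and space-joined n-grams are assumptions.

```python
def iter_texts(documents):
    """Hypothetical helper: return the document texts from a list or a {id: text} dict."""
    if isinstance(documents, dict):
        return list(documents.values())
    return list(documents)


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


texts = iter_texts({"doc1": "the quick brown fox", "doc2": "the quick dog"})
bigrams = [ngrams(text.split(), 2) for text in texts]
print(bigrams)
```

With n equal to 1 the helper returns the individual words, and a document shorter than n tokens yields no n-grams.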

Returns:

Table for the statistics.

Return type:

pd.DataFrame

Module contents