TELF.pre_processing.Vulture.tokens_analysis package

Submodules

TELF.pre_processing.Vulture.tokens_analysis.top_words module

TELF.pre_processing.Vulture.tokens_analysis.top_words.get_top_words(documents, top_n=10, n_gram=1, verbose=True, filename=None) → DataFrame

Collects statistics for the top words or n-grams. Returns a table with columns word, tf, df, df_fraction, and tf_fraction.

word lists the top_n words or n-grams.
tf is the term frequency: how many times the given word occurred across all documents.
df is the document frequency: in how many documents the given word occurred.
df_fraction is df / len(documents).
tf_fraction is tf / (total number of unique tokens or n-grams)
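
The column definitions above can be illustrated with a small standard-library sketch. This is not TELF's implementation: the helper name `top_word_stats`, the whitespace tokenization, and the restriction to single words are assumptions made for illustration, and `tf_fraction` follows the wording above literally. The rows are plain dictionaries rather than a DataFrame to keep the example dependency-free.

```python
from collections import Counter


def top_word_stats(documents, top_n=10):
    """Hypothetical helper: per-word statistics for a list of texts."""
    tf = Counter()  # total occurrences of each token across all documents
    df = Counter()  # number of documents in which each token appears
    for text in documents:
        tokens = text.split()   # assumption: whitespace tokenization
        tf.update(tokens)
        df.update(set(tokens))  # count each token at most once per document

    n_unique = len(tf)          # number of unique tokens
    rows = []
    for word, count in tf.most_common(top_n):
        rows.append({
            "word": word,
            "tf": count,
            "df": df[word],
            "df_fraction": df[word] / len(documents),
            "tf_fraction": count / n_unique,  # denominator as described above
        })
    return rows


docs = ["cat sat mat", "cat cat dog", "dog sat"]
stats = top_word_stats(docs, top_n=2)
for row in stats:
    print(row)
```

In this toy corpus "cat" appears three times but only in two of the three documents, so its tf and df differ; that distinction is what the two columns are meant to capture.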

Parameters:

documents (list or dict) – list or dictionary of documents. If dictionary, keys are the document IDs, values are the text.
top_n (int, optional) – Top n words or n-grams to report. The default is 10.
n_gram (int, optional) – 1 is words, or n-grams when > 1. The default is 1.
verbose (bool, optional) – Verbosity flag. The default is True.
filename (str, optional) – If not None, saves the table to the given location. The default is None.
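
To make the documents and n_gram parameters concrete, the sketch below shows one way list or dictionary input could be normalized and how n-grams could be formed from a token list. The helper names `as_id_text_pairs` and `ngrams` are hypothetical, and the choices to use list positions as IDs and to join n-grams with a space are assumptions; TELF may handle both differently.

```python
def as_id_text_pairs(documents):
    """Hypothetical helper: accept a list or dict, return (id, text) pairs."""
    if isinstance(documents, dict):
        return list(documents.items())  # keys are the document IDs
    return list(enumerate(documents))   # assumption: use list positions as IDs


def ngrams(tokens, n=1):
    """Hypothetical helper: contiguous n-grams, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


pairs = as_id_text_pairs({"doc-a": "new york city", "doc-b": "york city hall"})
bigrams = {doc_id: ngrams(text.split(), n=2) for doc_id, text in pairs}
print(bigrams)
```

With n=1 the same helper returns the individual words, which matches the parameter description: 1 means words, and values greater than 1 mean n-grams.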

Returns:

Table for the statistics.

Return type:

pd.DataFrame
