TELF.pre_processing.Vulture.tokens_analysis package
Submodules
TELF.pre_processing.Vulture.tokens_analysis.top_words module
- TELF.pre_processing.Vulture.tokens_analysis.top_words.get_top_words(documents, top_n=10, n_gram=1, verbose=True, filename=None) → DataFrame
Collects statistics for the top words or n-grams. Returns a table with columns word, tf, df, df_fraction, and tf_fraction.
word lists the top_n words or n-grams.
tf is the term frequency: how many times a given word occurred in the documents.
df is the document frequency: in how many documents a given word occurred.
df_fraction is df / len(documents)
tf_fraction is tf / (total number of unique tokens or n-grams)
- Parameters:
documents (list or dict) – list or dictionary of documents. If dictionary, keys are the document IDs, values are the text.
top_n (int, optional) – Top n words or n-grams to report. The default is 10.
n_gram (int, optional) – Size of the n-grams to count. 1 counts single words; values greater than 1 count n-grams. The default is 1.
verbose (bool, optional) – Verbosity flag. The default is True.
filename (str, optional) – If not None, saves the table to the given location. The default is None.
- Returns:
Table for the statistics.
- Return type:
pd.DataFrame
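The tf, df, and df_fraction columns described above can be illustrated with a small standard-library sketch. This is not TELF's implementation: the helper name top_word_stats is hypothetical, whitespace tokenization is an assumption, and the result is a list of dictionaries rather than a pd.DataFrame. The tf_fraction column is left out; it would follow the same pattern, using the denominator given in the description.

```python
from collections import Counter


def top_word_stats(documents, top_n=10, n_gram=1):
    """Hypothetical helper: tf/df statistics for the top_n words or n-grams."""
    # Accept a list of texts or a dict mapping document IDs to texts.
    if isinstance(documents, dict):
        texts = list(documents.values())
    else:
        texts = list(documents)

    tf = Counter()  # total occurrences across all documents
    df = Counter()  # number of documents that contain the n-gram
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        grams = [
            " ".join(tokens[i:i + n_gram])
            for i in range(len(tokens) - n_gram + 1)
        ]
        tf.update(grams)       # every occurrence counts toward tf
        df.update(set(grams))  # each document counts at most once toward df

    rows = []
    for gram, count in tf.most_common(top_n):
        rows.append({
            "word": gram,
            "tf": count,
            "df": df[gram],
            "df_fraction": df[gram] / len(texts),
        })
    return rows


docs = {"d1": "the cat and the dog", "d2": "the cat", "d3": "a bird"}
stats = top_word_stats(docs, top_n=2)
for row in stats:
    print(row)
```

In this toy corpus "the" occurs three times but in only two of the three documents, so its tf and df differ. With TELF installed, the corresponding call would be get_top_words(docs, top_n=2, n_gram=1), which returns the same kind of table as a pd.DataFrame.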