TELF.pre_processing.Vulture package#

Subpackages#

Submodules#

TELF.pre_processing.Vulture.modules module#

TELF.pre_processing.Vulture.vulture module#

© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

class TELF.pre_processing.Vulture.vulture.Vulture(*, n_jobs=-1, n_nodes=1, parallel_backend='multiprocessing', cache='/tmp', verbose=False)[source]#

Bases: object

Vulture is a parallel, multi-node, and distributed document pre-processing tool. It is designed to be simple and fast.

Vultures are nature's cleaners!
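
A minimal usage sketch. The input documents, their format, and the printed result are illustrative assumptions; only the constructor arguments and the clean() signature come from this page:

    from TELF.pre_processing.Vulture.vulture import Vulture

    # Assumed input format: a small collection of raw documents keyed by ID.
    documents = {
        0: "Tensor factorization <b>results</b> were e-mailed to author@example.com in 2022!",
        1: "Vultures are nature's cleaners; they remove what is not needed.",
    }

    # Constructor arguments as documented above (keyword-only).
    vulture = Vulture(
        n_jobs=-1,                           # use all available cores
        n_nodes=1,                           # single-node run
        parallel_backend="multiprocessing",  # one of PARALLEL_BACKEND_OPTIONS
        cache="/tmp",
        verbose=True,
    )

    # With steps=None, clean() presumably falls back to DEFAULT_PIPELINE;
    # save_path can be used to write the cleaned documents to disk.
    cleaned = vulture.clean(documents, steps=None, substitutions=None, save_path=None)
    print(cleaned)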

DEFAULT_OPERATOR_PIPELINE = [NEDetector(module_type='OPERATOR', backend=None)]#
DEFAULT_PIPELINE = [SimpleCleaner(
    module_type='CLEANER',
    effective_stop_words=['characteristics', 'acknowledgment', 'characteristic', 'approximately', 'investigation', 'unfortunately', 'corresponding', 'substantially', 'significantly', 'automatically', 'predominantly', 'successfully', 'demonstrates', 'nevertheless', 'particularly', 'applications', 'specifically', 'consequently', 'respectively', 'representing', 'demonstrated', 'introduction', 'sufficiently', 'application', 'conclusions', ... (+1359 more)],
    patterns={
        'standardize_hyphens': (re.compile('[\\u002D\\u2010\\u2011\\u2012\\u2013\\u2014\\u2015\\u2212\\u2E3A\\u2E3B]'), '-'),
        'remove_copyright_statement': None,
        'remove_stop_phrases': None,
        'make_lower_case': None,
        'normalize': None,
        'remove_trailing_dash': ('(?<!\\w)-|-(?!\\w)', ''),
        'make_hyphens_words': ('([a-z])\\-([a-z])', ''),
        'remove_next_line': ('\\n+', ' '),
        'remove_email': ('\\S*@\\S*\\s?', ''),
        'remove_formulas': ('\\b\\w*[\\=\\≈\\/\\\\\\±]\\w*\\b', ''),
        'remove_dash': ('-', ''),
        'remove_between_[]': ('\\[.*?\\]', ' '),
        'remove_between_()': ('\\(.*?\\)', ' '),
        'remove_[]': ('[\\[\\]]', ' '),
        'remove_()': ('[()]', ' '),
        'remove_\\': ('\\\\', ' '),
        'remove_numbers': ('\\d+', ''),
        'remove_standalone_numbers': ('\\b\\d+\\b', ''),
        'remove_nonASCII_boundary': ('\\b[^\\x00-\\x7F]+\\b', ''),
        'remove_nonASCII': ('[^\\x00-\\x7F]+', ''),
        'remove_tags': ('</?.*?>', ''),
        'remove_special_characters': ('[!|"|#|$|%|&|\\|\\\'|(|)|*|+|,|.|/|:|;|<|=|>|?|@|[|\\|]|^|_|`|{|\\||}|~]', ''),
        'isolate_frozen': None,
        'remove_extra_whitespace': ('\\s+', ' '),
        'remove_stop_words': None,
        'min_characters': None},
    exclude_hyphenated_stopwords=False,
    sw_pattern=re.compile('\\b[\\w-]+\\b'))]#
PARALLEL_BACKEND_OPTIONS = {'loky', 'multiprocessing', 'threading'}#
property cache#
clean(documents, steps=None, substitutions=None, save_path=None)[source]#
clean_dataframe(df, columns, steps=None, substitutions=None, append_to_original_df=False, concat_cleaned_cols=False)[source]#
property n_jobs#
property n_nodes#
operate(documents, steps=None, save_path=None, file_name='')[source]#
property parallel_backend#
property save_path#
use_mpi()[source]#
property verbose#
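
The DataFrame entry point listed above can be sketched the same way. This is a hedged example that assumes clean_dataframe() accepts a pandas DataFrame plus the names of the text columns to clean; the column names and data are illustrative, and only the parameter names come from the signature on this page:

    import pandas as pd
    from TELF.pre_processing.Vulture.vulture import Vulture

    # Illustrative corpus; the column names are assumptions, not part of the API.
    df = pd.DataFrame({
        "title":    ["A study of tensor factorization", "Vulture: fast text cleaning"],
        "abstract": ["We propose a NEW method (2022).", "Pre-processing at scale [1]."],
    })

    vulture = Vulture(n_jobs=-1, verbose=False)

    # steps and substitutions default to None; append_to_original_df=True is
    # presumably how the raw columns are kept next to the cleaned ones, and
    # concat_cleaned_cols=True presumably joins the cleaned columns into one.
    cleaned_df = vulture.clean_dataframe(
        df,
        columns=["title", "abstract"],
        append_to_original_df=True,
        concat_cleaned_cols=True,
    )
    print(cleaned_df.head())
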
TELF.pre_processing.Vulture.vulture.chunk_tuple_list(l, n_chunks)[source]#

Splits the given list of (key, value) tuples into sub-lists.

Parameters:
  • l (list of tuple) – List of (key, value) tuples to split.

  • n_chunks (int) – Number of sub-lists (chunks) to split l into.

Yields:

list – Sub-list containing (key, value) tuples.
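
A short usage sketch of chunk_tuple_list(). The exact split sizes and ordering are implementation details not stated here, so the output is only indicative:

    from TELF.pre_processing.Vulture.vulture import chunk_tuple_list

    pairs = [("doc0", "alpha"), ("doc1", "beta"), ("doc2", "gamma"), ("doc3", "delta")]

    # Split the (key, value) pairs into 2 sub-lists, e.g. to distribute work
    # across processes or MPI ranks.
    for chunk in chunk_tuple_list(pairs, n_chunks=2):
        print(chunk)
    # Together the yielded sub-lists cover all four pairs.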

Module contents#