TELF.pre_processing.Squirrel: Dataset pruning tool

Dataset pruning tool.

Available Functions

Squirrel.__init__(data_source, output_dir, ...)

Squirrel.__call__()

Run each pruner in sequence, passing the DataFrame returned by one pruner to the next. Before each pruner runs, copy the most recent “*_accept” column into prev_accept.

Module Contents

class TELF.pre_processing.Squirrel.squirrel.Squirrel(data_source: str | Path | DataFrame, output_dir: str | Path, pipeline: List, label_column='type', reference_label=0, aggregrate_prune=True, data_column='title_abstract')

Bases: object

Orchestrate a sequence of pruners that each take a DataFrame and return a DataFrame.

Parameters:
  • data_source (str | Path | pd.DataFrame) – CSV path or initial DataFrame to process.

  • output_dir (str | Path) – Base directory for pruner outputs.

  • pipeline (list) – List of pruner instances; each __call__ must return a DataFrame.
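The chaining behaviour described for Squirrel.__call__ can be illustrated with a minimal sketch. This is not the TELF implementation: run_pipeline, length_pruner, and keyword_pruner are hypothetical stand-ins, and the stand-in pruners take only a DataFrame, whereas the real pruners also receive output_dir and further arguments. The sketch also assumes that the "latest" *_accept column is the rightmost such column in the DataFrame.

```python
import pandas as pd


def run_pipeline(df, pipeline):
    """Chain pruners; each one takes a DataFrame and returns a DataFrame."""
    for pruner in pipeline:
        # Assumption: the most recent "*_accept" column is the rightmost one.
        accept_cols = [
            c for c in df.columns
            if c.endswith("_accept") and c != "prev_accept"
        ]
        if accept_cols:
            df = df.copy()  # avoid mutating the caller's DataFrame
            df["prev_accept"] = df[accept_cols[-1]]
        df = pruner(df)
    return df


def length_pruner(df):
    # Hypothetical pruner: accept rows whose text has at least 10 characters.
    out = df.copy()
    out["length_accept"] = out["title_abstract"].str.len() >= 10
    return out


def keyword_pruner(df):
    # Hypothetical pruner: accept rows whose text mentions "tensor".
    out = df.copy()
    out["keyword_accept"] = out["title_abstract"].str.contains("tensor")
    return out


df = pd.DataFrame({
    "title_abstract": [
        "tensor factorization methods",
        "short",
        "a long unrelated abstract",
    ],
    "type": [0, 1, 1],
})
result = run_pipeline(df, [length_pruner, keyword_pruner])
print(result[["length_accept", "prev_accept", "keyword_accept"]])
```

After the run, prev_accept holds the length_accept values, because that was the latest accept column when the second pruner started.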

Available Functions

EmbeddingPruner.__init__(*[, ...])

Initialize the EmbeddingPruner.

EmbeddingPruner.__call__(df, output_dir, ...)

Execute pruning: annotate each row with an 'embed_accept' flag indicating whether it is an inlier.

EmbeddingPruner.load_or_compute_embeddings(df, ...)

Load or compute embeddings for the specified column in the DataFrame.

EmbeddingPruner.select_inliers(df, emb, ...)

Compute which rows are within the threshold distance of the reference centroid.

Module Contents

class TELF.pre_processing.Squirrel.pruners.embed_prune.EmbeddingPruner(*, embedding_model: str = 'SCINCL', distance_std_factor: float = 3.0, overwrite_embeddings: bool = False, use_gpu: bool | None = None, verbose: bool = True)

Bases: object

Prune documents by their distance from a reference-class centroid in embedding space.

Initialize the EmbeddingPruner.

Parameters:
  • embedding_model (str) – Name of the embedding model to use.

  • distance_std_factor (float) – Multiplier applied to the standard deviation to set the distance threshold.

  • overwrite_embeddings (bool) – If True, always recompute embeddings even if cache exists.

  • use_gpu (bool or None) – Whether to use GPU for embedding. If None, auto-detect.

  • verbose (bool) – Whether to display progress bars during embedding.

load_or_compute_embeddings(df, output_dir, data_column) → ndarray

Load or compute embeddings for the specified column in the DataFrame.

Parameters:
  • df (pd.DataFrame) – The input DataFrame to be processed.

  • output_dir (str or Path) – Directory to save the output files.

  • data_column (str) – Name of the column containing the text to embed.
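The load-or-compute pattern, together with the overwrite_embeddings flag, can be sketched as follows. This is an illustration under assumptions, not the library's code: load_or_compute, toy_embed, and the cache file name embeddings.npy are hypothetical, and the toy embedding function replaces a real embedding model.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(texts, output_dir, embed_fn, overwrite=False,
                    cache_name="embeddings.npy"):
    """Return cached embeddings if present, otherwise compute and cache them."""
    cache_path = Path(output_dir) / cache_name  # hypothetical cache location
    if cache_path.exists() and not overwrite:
        return np.load(cache_path)
    emb = np.asarray(embed_fn(texts), dtype=float)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_path, emb)
    return emb


calls = []  # records every time the (toy) model is actually invoked


def toy_embed(texts):
    # Stand-in for a real model: two simple numeric features per text.
    calls.append(len(texts))
    return [[len(t), t.count(" ")] for t in texts]


texts = ["a b", "cde"]
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute(texts, tmp, toy_embed)    # computes and caches
    second = load_or_compute(texts, tmp, toy_embed)   # served from the cache
    third = load_or_compute(texts, tmp, toy_embed, overwrite=True)  # recomputes

print(len(calls))
```

The embedding function runs only on the first call and on the call with overwrite=True; the second call reads the cached array.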

select_inliers(df, emb: ndarray, label_column: str, reference_label: int | str) → ndarray

Compute which rows are within the threshold distance of the reference centroid.

Parameters:
  • df (pd.DataFrame) – DataFrame containing the dataset.

  • emb (np.ndarray) – Embedding matrix of shape (n_samples, embedding_dim).

  • label_column (str) – Column name indicating class labels.

  • reference_label (int | str) – Label used as the reference class for centroid.

Returns:

inliers_mask – Boolean mask indicating which rows lie within the distance threshold.

Return type:

np.ndarray of bool
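One plausible reading of the centroid-distance rule is sketched below. The exact formula is not stated here, so the sketch makes two assumptions: distances are Euclidean, and the threshold is the mean reference-class distance plus distance_std_factor times the standard deviation of those distances. The function name select_inliers_sketch is hypothetical, and it takes the labels directly rather than a DataFrame and a column name.

```python
import numpy as np


def select_inliers_sketch(emb, labels, reference_label, distance_std_factor=3.0):
    """Return a boolean mask of rows close to the reference-class centroid."""
    labels = np.asarray(labels)
    ref = emb[labels == reference_label]

    # Centroid of the reference class in embedding space.
    centroid = ref.mean(axis=0)

    # Assumption: threshold = mean + factor * std of reference-class distances.
    ref_dist = np.linalg.norm(ref - centroid, axis=1)
    threshold = ref_dist.mean() + distance_std_factor * ref_dist.std()

    # Every row, whatever its label, is compared against the same threshold.
    all_dist = np.linalg.norm(emb - centroid, axis=1)
    return all_dist <= threshold


emb = np.array([
    [0.0, 0.0], [4.0, 0.0], [1.0, 0.0], [3.0, 0.0],  # reference class (label 0)
    [2.0, 1.0],                                      # near the centroid
    [10.0, 10.0],                                    # far from the centroid
])
labels = [0, 0, 0, 0, 1, 1]
mask = select_inliers_sketch(emb, labels, reference_label=0,
                             distance_std_factor=2.0)
print(mask)
```

In this toy data the centroid is (2, 0), the reference distances are 2, 2, 1, 1, and the threshold is 1.5 + 2.0 × 0.5 = 2.5, so only the last row is rejected.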

Available Functions

LLMPruner.__init__(llm_model_name, ...[, ...])

Perform LLM-based refinement on an embedding-pruned dataset, annotating each document with an llm_accept boolean column.

LLMPruner.__call__(df, output_dir, ...)

Run LLM voting across all rows, annotate llm_accept, save vote records and the annotated DataFrame, and return it.

Module Contents

class TELF.pre_processing.Squirrel.pruners.llm_prune.LLMPruner(llm_model_name: str, llm_api_url: str, llm_vote_trials: int, llm_promote_threshold: float, llm_temperature: float, verbose: bool = True)

Bases: object

Perform LLM-based refinement on an embedding-pruned dataset, annotating each document with an llm_accept boolean column.

Parameters:
  • llm_model_name (str) – Ollama model identifier (e.g. “llama3.1:405b”).

  • llm_api_url (str) – Base URL for the Ollama API.

  • llm_vote_trials (int) – Number of independent votes per document.

  • llm_promote_threshold (float) – Fraction of “yes” votes required to accept a previously rejected document.

  • llm_temperature (float) – Sampling temperature for the LLM.

  • verbose (bool) – Whether to show tqdm progress bars.
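How llm_vote_trials and llm_promote_threshold might interact can be shown with a small sketch. It is an assumption-laden illustration rather than the LLMPruner code: llm_accept_sketch and scripted_votes are hypothetical names, the LLM call is replaced by a scripted stand-in so that no Ollama server is needed, and the comparison is assumed to be "greater than or equal". How previously accepted documents are treated is not covered.

```python
def llm_accept_sketch(text, ask_llm, vote_trials, promote_threshold):
    """Collect independent yes/no votes and compare the yes-fraction to a threshold."""
    votes = [bool(ask_llm(text)) for _ in range(vote_trials)]
    yes_fraction = sum(votes) / vote_trials
    # Assumption: a document is promoted when the fraction meets the threshold.
    return yes_fraction >= promote_threshold


def scripted_votes(answers):
    # Stand-in for an LLM call: replays a fixed sequence of yes/no answers.
    it = iter(answers)
    return lambda text: next(it)


mostly_yes = llm_accept_sketch(
    "tensor decomposition for topic modeling",
    scripted_votes([True, True, False, True, False]),
    vote_trials=5,
    promote_threshold=0.5,
)
mostly_no = llm_accept_sketch(
    "a recipe for sourdough bread",
    scripted_votes([False, True, False, False, False]),
    vote_trials=5,
    promote_threshold=0.5,
)
print(mostly_yes, mostly_no)
```

With three of five "yes" votes the first document reaches a fraction of 0.6 and is promoted; the second reaches only 0.2 and stays rejected.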