TELF.pre_processing.Squirrel: Dataset pruning tool

Dataset pruning tool.

Available Functions

`Squirrel.__init__`(data_source, output_dir, ...)
`Squirrel.__call__`()	Run each pruner in sequence, passing the DataFrame result from one into the next. Before running each pruner, copy the latest “*_accept” column into prev_accept.

Module Contents

class TELF.pre_processing.Squirrel.squirrel.Squirrel(data_source: str | Path | DataFrame, output_dir: str | Path, pipeline: List, label_column='type', reference_label=0, aggregrate_prune=True, data_column='title_abstract')

Bases: object

Orchestrate a sequence of pruners that each take a DataFrame and return a DataFrame.

Parameters:

data_source (str | Path | pd.DataFrame) – CSV path or initial DataFrame to process.
output_dir (str | Path) – Base directory for pruner outputs.
pipeline (list) – List of pruner instances; each __call__ must return a DataFrame.
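
The chaining behaviour described for `Squirrel.__call__` can be illustrated with a dependency-free sketch. It is not TELF's implementation: rows are plain dicts standing in for a DataFrame, and `run_pipeline`, `latest_accept_column`, and the two toy pruners are hypothetical names. How Squirrel decides which "*_accept" column is the latest is not documented here; the sketch assumes it is the one added most recently.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Pruner = Callable[[List[Row]], List[Row]]


def latest_accept_column(rows: List[Row]) -> Optional[str]:
    """Return the most recently added '*_accept' key, ignoring 'prev_accept'."""
    if not rows:
        return None
    # Dicts preserve insertion order, so the last matching key is the newest.
    keys = [k for k in rows[0] if k.endswith("_accept") and k != "prev_accept"]
    return keys[-1] if keys else None


def run_pipeline(rows: List[Row], pipeline: List[Pruner]) -> List[Row]:
    """Run each pruner in order, feeding its output into the next one."""
    for pruner in pipeline:
        col = latest_accept_column(rows)
        if col is not None:
            # Copy the latest decision so the next pruner can consult it.
            for row in rows:
                row["prev_accept"] = row[col]
        rows = pruner(rows)
    return rows


def length_pruner(rows: List[Row]) -> List[Row]:
    # Toy pruner (hypothetical): accept texts longer than five characters.
    return [{**r, "len_accept": len(str(r["text"])) > 5} for r in rows]


def keyword_pruner(rows: List[Row]) -> List[Row]:
    # Toy pruner (hypothetical): accept texts that mention "tensor".
    return [{**r, "kw_accept": "tensor" in str(r["text"])} for r in rows]


rows = [{"text": "tensor factorization"}, {"text": "cats"}, {"text": "gardening tips"}]
result = run_pipeline(rows, [length_pruner, keyword_pruner])
for row in result:
    print(row)
```

In this sketch the first pruner sees no `prev_accept` column, while the second one receives the first pruner's `len_accept` values copied into `prev_accept`.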

Available Functions

`EmbeddingPruner.__init__`(*[, ...])	Initialize the EmbeddingPruner.
`EmbeddingPruner.__call__`(df, output_dir, ...)	Execute pruning: annotate 'embed_accept' for rows that were inliers.
`EmbeddingPruner.load_or_compute_embeddings`(df, ...)	Load or compute embeddings for the specified column in the DataFrame.
`EmbeddingPruner.select_inliers`(df, emb, ...)	Compute which rows are within the threshold distance of the reference centroid.

Module Contents

class TELF.pre_processing.Squirrel.pruners.embed_prune.EmbeddingPruner(*, embedding_model: str = 'SCINCL', distance_std_factor: float = 3.0, overwrite_embeddings: bool = False, use_gpu: bool | None = None, verbose: bool = True)

Bases: object

Prune documents by their distance from a reference-class centroid in embedding space.

Initialize the EmbeddingPruner.

Parameters:

embedding_model (str) – Name of the embedding model to use.
distance_std_factor (float) – Multiplier on standard deviation to set distance threshold.
overwrite_embeddings (bool) – If True, always recompute embeddings even if cache exists.
use_gpu (bool or None) – Whether to use GPU for embedding. If None, auto-detect.
verbose (bool) – Whether to display progress bars during embedding.

load_or_compute_embeddings(df, output_dir, data_column) → ndarray

Load or compute embeddings for the specified column in the DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame to be processed.
output_dir (str or Path) – Directory to save the output files.
data_column (str) – Column name containing the text to be embedded.
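
The load-or-compute pattern, together with the `overwrite_embeddings` flag, can be sketched as follows. This is only an illustration under stated assumptions: the cache file name `embeddings.json`, the JSON format, and the `embed_fn` callable are hypothetical, and TELF's actual cache layout is not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List

Vector = List[float]


def load_or_compute(
    texts: List[str],
    output_dir: str,
    embed_fn: Callable[[str], Vector],
    overwrite: bool = False,
) -> List[Vector]:
    """Return cached vectors if present, otherwise compute and cache them."""
    cache = Path(output_dir) / "embeddings.json"  # hypothetical file name
    if cache.exists() and not overwrite:
        return json.loads(cache.read_text())
    vectors = [embed_fn(t) for t in texts]
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(vectors))
    return vectors


calls: List[str] = []


def toy_embed(text: str) -> Vector:
    # Stand-in for a real embedding model; records each call for inspection.
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute(["a b", "cd"], tmp, toy_embed)    # computes
    second = load_or_compute(["a b", "cd"], tmp, toy_embed)   # reads the cache
    third = load_or_compute(["a b", "cd"], tmp, toy_embed, overwrite=True)  # recomputes

print(len(calls))  # number of times the toy model was invoked
```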

select_inliers(df, emb: ndarray, label_column: str, reference_label: int | str) → ndarray

Compute which rows are within the threshold distance of the reference centroid.

Parameters:

df (pd.DataFrame) – DataFrame containing the dataset.
emb (np.ndarray) – Embedding matrix of shape (n_samples, embedding_dim).
label_column (str) – Column name indicating class labels.
reference_label (int | str) – Label used as the reference class for centroid.

Returns:

inliers_mask – Mask indicating rows within distance threshold.

Return type:

np.ndarray of bool
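
One plausible reading of the centroid test is sketched below using only the standard library. The exact rule is an assumption: the sketch uses Euclidean distance and sets the threshold to the mean reference-class distance plus `distance_std_factor` times its standard deviation, which may differ from what TELF does. `select_inliers_sketch` is a hypothetical name, and plain lists stand in for the DataFrame and NumPy array.

```python
import math
import statistics
from typing import List, Sequence, Union

Label = Union[int, str]


def select_inliers_sketch(
    emb: Sequence[Sequence[float]],
    labels: Sequence[Label],
    reference_label: Label,
    distance_std_factor: float = 3.0,
) -> List[bool]:
    """Flag rows whose distance to the reference centroid is within the threshold."""
    ref = [v for v, lab in zip(emb, labels) if lab == reference_label]
    if not ref:
        raise ValueError("no rows carry the reference label")
    dim = len(ref[0])
    # Centroid of the reference class, computed per dimension.
    centroid = [sum(v[i] for v in ref) / len(ref) for i in range(dim)]
    dists = [math.dist(v, centroid) for v in emb]
    ref_dists = [d for d, lab in zip(dists, labels) if lab == reference_label]
    # Assumed threshold: mean + factor * std of the reference-class distances.
    threshold = statistics.fmean(ref_dists) + distance_std_factor * statistics.pstdev(ref_dists)
    return [d <= threshold for d in dists]


emb = [[0, 0], [2, 0], [0, 2], [2, 2], [1, 1], [1.5, 1.5], [10, 10]]
labels = [0, 0, 0, 0, 0, 1, 1]
mask = select_inliers_sketch(emb, labels, reference_label=0)
print(mask)
```

Here the candidate near the reference cluster is kept while the distant one is flagged as an outlier; lowering `distance_std_factor` tightens the acceptance radius.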

Available Functions

`LLMPruner.__init__`(llm_model_name, ...[, ...])	Perform LLM-based refinement on an embedding-pruned dataset, annotating each document with an llm_accept boolean column.
`LLMPruner.__call__`(df, output_dir, ...)	Run LLM voting across all rows, annotate llm_accept, save vote records and the annotated DataFrame, and return it.

Module Contents

class TELF.pre_processing.Squirrel.pruners.llm_prune.LLMPruner(llm_model_name: str, llm_api_url: str, llm_vote_trials: int, llm_promote_threshold: float, llm_temperature: float, verbose: bool = True)

Bases: object

Perform LLM-based refinement on an embedding-pruned dataset, annotating each document with an llm_accept boolean column.

Parameters:

llm_model_name (str) – Ollama model identifier (e.g. “llama3.1:405b”).
llm_api_url (str) – Base URL for the Ollama API.
llm_vote_trials (int) – Number of independent votes per document.
llm_promote_threshold (float) – Fraction of “yes” votes required to accept a previously rejected document.
llm_temperature (float) – Sampling temperature for the LLM.
verbose (bool) – Whether to show tqdm progress bars.
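
How `llm_vote_trials` and `llm_promote_threshold` might interact can be sketched without calling a real model. The function below is an assumption-laden illustration, not TELF's code: `should_promote` and `make_scripted_llm` are hypothetical names, the scripted answers replace actual Ollama requests, and the use of `>=` for the threshold comparison is a guess.

```python
from typing import Callable, Iterable, Tuple


def should_promote(
    text: str,
    ask_fn: Callable[[str], bool],
    vote_trials: int,
    promote_threshold: float,
) -> Tuple[bool, float]:
    """Collect independent yes/no votes and compare the yes-fraction to the threshold."""
    if vote_trials <= 0:
        raise ValueError("vote_trials must be positive")
    yes_votes = sum(1 for _ in range(vote_trials) if ask_fn(text))
    fraction = yes_votes / vote_trials
    # Assumed rule: promote when the yes-fraction reaches the threshold.
    return fraction >= promote_threshold, fraction


def make_scripted_llm(answers: Iterable[bool]) -> Callable[[str], bool]:
    # Stand-in for an LLM call: replays a fixed sequence of votes.
    it = iter(answers)
    return lambda text: next(it)


promoted, frac = should_promote(
    "doc about tensors", make_scripted_llm([True, True, False, True, True]), 5, 0.6
)
rejected, low_frac = should_promote(
    "doc about cats", make_scripted_llm([False, True, False, False, False]), 5, 0.6
)
print(promoted, frac)
print(rejected, low_frac)
```

With a non-zero `llm_temperature`, repeated calls on the same document can disagree, which is what makes multiple trials informative.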