TELF.pre_processing.Squirrel: Dataset pruning tool
Dataset pruning tool.
Available Functions
Run each pruner in sequence, passing the DataFrame result from one into the next. Before running each pruner, copy the latest “*_accept” column into prev_accept.
Module Contents
- class TELF.pre_processing.Squirrel.squirrel.Squirrel(data_source: str | Path | DataFrame, output_dir: str | Path, pipeline: List, label_column='type', reference_label=0, aggregrate_prune=True, data_column='title_abstract')
Bases: object
Orchestrate a sequence of pruners that each take a DataFrame and return a DataFrame.
- Parameters:
data_source (str | Path | pd.DataFrame) – CSV path or initial DataFrame to process.
output_dir (str | Path) – Base directory for pruner outputs.
pipeline (list) – List of pruner instances; each __call__ must return a DataFrame.
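The chaining contract described above (each pruner is a callable that takes a DataFrame and returns a DataFrame, and the latest "*_accept" column is copied into prev_accept before each step) can be sketched in a few lines of pandas. This is a minimal illustration, not Squirrel's implementation: `run_pipeline`, `latest_accept_column`, `LengthPruner`, and `KeywordPruner` are hypothetical names, and treating the right-most "*_accept" column as the latest one is an assumption.

```python
import pandas as pd


def latest_accept_column(df):
    """Return the right-most column ending in '_accept' (excluding prev_accept), or None."""
    candidates = [c for c in df.columns if c.endswith("_accept") and c != "prev_accept"]
    return candidates[-1] if candidates else None


def run_pipeline(df, pipeline):
    """Run each pruner in order, feeding each result into the next pruner."""
    df = df.copy()  # leave the caller's DataFrame untouched
    for pruner in pipeline:
        latest = latest_accept_column(df)
        if latest is not None:
            # Assumption: the previous decision is kept under the name prev_accept.
            df["prev_accept"] = df[latest]
        df = pruner(df)
    return df


class LengthPruner:
    """Toy pruner (hypothetical): accept rows with at least `min_len` characters."""

    def __init__(self, min_len):
        self.min_len = min_len

    def __call__(self, df):
        out = df.copy()
        out["length_accept"] = out["title_abstract"].str.len() >= self.min_len
        return out


class KeywordPruner:
    """Toy pruner (hypothetical): accept rows that mention `keyword`."""

    def __init__(self, keyword):
        self.keyword = keyword

    def __call__(self, df):
        out = df.copy()
        out["keyword_accept"] = out["title_abstract"].str.contains(self.keyword, regex=False)
        return out


data = pd.DataFrame(
    {"title_abstract": ["short", "a long text about tensors", "a long text about cats"]}
)
result = run_pipeline(data, [LengthPruner(10), KeywordPruner("tensors")])
print(result)
```

After the run, prev_accept mirrors length_accept, because that was the latest accept column when the second pruner started.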
Available Functions
Initialize the EmbeddingPruner.
Execute pruning: annotate 'embed_accept' for rows that were inliers.
Load or compute embeddings for the specified column in the DataFrame.
Compute which rows are within the threshold distance of the reference centroid.
Module Contents
- class TELF.pre_processing.Squirrel.pruners.embed_prune.EmbeddingPruner(*, embedding_model: str = 'SCINCL', distance_std_factor: float = 3.0, overwrite_embeddings: bool = False, use_gpu: bool | None = None, verbose: bool = True)
Bases: object
Prune documents by distance from a reference-class centroid in embedding space.
Initialize the EmbeddingPruner.
- Parameters:
embedding_model (str) – Name of the embedding model to use.
distance_std_factor (float) – Multiplier on the standard deviation used to set the distance threshold.
overwrite_embeddings (bool) – If True, always recompute embeddings even if cache exists.
use_gpu (bool or None) – Whether to use GPU for embedding. If None, auto-detect.
verbose (bool) – Whether to display progress bars during embedding.
- load_or_compute_embeddings(df, output_dir, data_column) → ndarray
Load or compute embeddings for the specified column in the DataFrame.
- Parameters:
df (pd.DataFrame) – The input DataFrame to be processed.
output_dir (str or Path) – Directory to save the output files.
data_column (str) – Column name containing the text to embed.
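The load-or-compute pattern, together with the overwrite_embeddings flag, can be illustrated as follows. This is a rough sketch under stated assumptions rather than the library's code: the cache file name `embeddings.npy`, the helper `load_or_compute`, and the stand-in embedder `toy_embed` are all hypothetical, and a real embedding model would replace `toy_embed`.

```python
from pathlib import Path
import tempfile

import numpy as np


def toy_embed(texts):
    """Stand-in embedder (hypothetical): map each text to [char count, word count]."""
    return np.array([[len(t), len(t.split())] for t in texts], dtype=float)


def load_or_compute(texts, output_dir, overwrite=False, embed_fn=toy_embed):
    """Return cached embeddings if present; otherwise compute and cache them."""
    cache_path = Path(output_dir) / "embeddings.npy"  # hypothetical file name
    if cache_path.exists() and not overwrite:
        return np.load(cache_path)
    emb = embed_fn(texts)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_path, emb)
    return emb


calls = []  # records how many times the embedder actually ran


def counting_embed(texts):
    calls.append(len(texts))
    return toy_embed(texts)


texts = ["tensor factorization", "a survey of cats"]
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute(texts, tmp, embed_fn=counting_embed)   # computes and caches
    second = load_or_compute(texts, tmp, embed_fn=counting_embed)  # served from cache
    third = load_or_compute(texts, tmp, overwrite=True, embed_fn=counting_embed)  # recomputes
```

Only the first and third calls invoke the embedder. A production cache would also need to check that the cached array still matches the current rows.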
- select_inliers(df, emb: ndarray, label_column: str, reference_label: int | str) → ndarray
Compute which rows are within the threshold distance of the reference centroid.
- Parameters:
df (pd.DataFrame) – DataFrame containing the dataset.
emb (np.ndarray) – Embedding matrix of shape (n_samples, embedding_dim).
label_column (str) – Column name indicating class labels.
reference_label (int | str) – Label used as the reference class for centroid.
- Returns:
inliers_mask – Boolean mask indicating rows within the distance threshold.
- Return type:
np.ndarray of bool
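One plausible reading of the centroid-distance rule is sketched below with NumPy. The exact formula is not stated here, so the sketch assumes Euclidean distance and a threshold of mean plus distance_std_factor times the standard deviation of the reference-class distances. The function name `select_inliers_sketch` is hypothetical, and it takes a plain label array instead of a DataFrame and column name.

```python
import numpy as np


def select_inliers_sketch(emb, labels, reference_label, distance_std_factor=3.0):
    """Return a boolean mask of rows close to the reference-class centroid."""
    labels = np.asarray(labels)
    ref_mask = labels == reference_label

    # Centroid of the reference class in embedding space.
    centroid = emb[ref_mask].mean(axis=0)

    # Euclidean distance of every row to that centroid (assumed metric).
    dist = np.linalg.norm(emb - centroid, axis=1)

    # Assumed threshold: mean + factor * std of the reference-class distances.
    ref_dist = dist[ref_mask]
    threshold = ref_dist.mean() + distance_std_factor * ref_dist.std()
    return dist <= threshold


emb = np.array(
    [
        [0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0],  # reference class
        [1.5, 1.5],  # other class, near the centroid
        [5.0, 5.0],  # other class, far away
    ]
)
labels = [0, 0, 0, 0, 0, 1, 1]
mask = select_inliers_sketch(emb, labels, reference_label=0)
print(mask)
```

Here the centroid is (1, 1), so the point at (1.5, 1.5) falls inside the threshold and the point at (5, 5) does not.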
Available Functions
Perform LLM-based refinement on an embedding-pruned dataset, annotating each document with an llm_accept boolean column.
Run LLM voting across all rows, annotate llm_accept, save vote records and the annotated DataFrame, and return it.
Module Contents
- class TELF.pre_processing.Squirrel.pruners.llm_prune.LLMPruner(llm_model_name: str, llm_api_url: str, llm_vote_trials: int, llm_promote_threshold: float, llm_temperature: float, verbose: bool = True)
Bases: object
Perform LLM-based refinement on an embedding-pruned dataset, annotating each document with an llm_accept boolean column.
- Parameters:
llm_model_name (str) – Ollama model identifier (e.g. “llama3.1:405b”).
llm_api_url (str) – Base URL for the Ollama API.
llm_vote_trials (int) – Number of independent votes per document.
llm_promote_threshold (float) – Fraction of “yes” votes required to accept a previously rejected document.
llm_temperature (float) – Sampling temperature for the LLM.
verbose (bool) – Whether to show tqdm progress bars.
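How llm_vote_trials and llm_promote_threshold interact can be shown with a small standard-library sketch. It is illustrative only: `collect_votes`, `should_promote`, and `make_canned_ask` are hypothetical helpers, the LLM call is replaced by canned answers, and using "greater than or equal to" for the threshold comparison is an assumption.

```python
from typing import Callable, Iterable, List


def collect_votes(ask: Callable[[str], str], text: str, trials: int) -> List[bool]:
    """Ask the same yes/no question `trials` times; record each answer as a bool."""
    return [ask(text).strip().lower().startswith("yes") for _ in range(trials)]


def should_promote(votes: List[bool], promote_threshold: float) -> bool:
    """Accept a previously rejected document when enough votes are 'yes'."""
    if not votes:
        return False
    # Assumption: the comparison is inclusive (>=).
    return sum(votes) / len(votes) >= promote_threshold


def make_canned_ask(answers: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for an LLM call (hypothetical): replay fixed answers in order."""
    it = iter(answers)
    return lambda _text: next(it)


ask = make_canned_ask(["Yes", "no", "yes.", "YES", "No"])
votes = collect_votes(ask, "some title and abstract", trials=5)

promoted_lenient = should_promote(votes, promote_threshold=0.5)
promoted_strict = should_promote(votes, promote_threshold=0.8)
print(votes, promoted_lenient, promoted_strict)
```

With three "yes" votes out of five, the document is promoted at a threshold of 0.5 but not at 0.8. A non-zero llm_temperature is what would make repeated votes from a real model differ from one another.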