hippynn
Full documentation for the hippynn package.
The hippynn python package.
- settings
Values for the current hippynn settings. See Library Settings for a description.
- class Database(arr_dict: dict[str, ~numpy.ndarray], inputs: list[str], targets: list[str], seed: [<class 'int'>, <class 'numpy.random.mtrand.RandomState'>, <class 'tuple'>], test_size: float | int = None, valid_size: float | int = None, num_workers: int = 0, pin_memory: bool = True, allow_unfound: bool = False, auto_split: bool = False, device: ~torch.device = None, dataloader_kwargs: dict[str, object] = None, quiet=False)[source]
Bases:
object
Class for holding a pytorch dataset, splitting it, generating dataloaders, etc.
- add_split_masks(dict_to_add_to=None, split_prefix=None)[source]
Add split masks to the dataset. This function is used internally before writing databases.
When using the dict_to_add_to parameter, this function writes numpy arrays. When adding to self.splits, this function writes tensors.
- Parameters:
dict_to_add_to – where to put the split masks. Defaults to self.splits.
split_prefix – prefix for mask names
- Returns:
- get_device() device [source]
Determine what device the database resides on. Raises ValueError if multiple devices are encountered.
- Returns:
device.
- make_automatic_splits(split_prefix=None, dry_run=False)[source]
Split the database automatically. Since the user specifies this routine, it fails pretty strictly.
- Parameters:
split_prefix – If None, use the default prefix. Otherwise, use this prefix to determine which arrays are masks.
dry_run – Only validate that existing split masks are correct; don’t perform splitting.
- Returns:
- make_database_cache(file: str = './hippynn_db_cache.npz', overwrite: bool = False, **override_kwargs) Database [source]
Cache the database as-is, and re-open it.
Useful for creating an easy restart script if the storage space is available. By default, the new database will inherit the properties of this database.
usage: >>> database = database.make_database_cache()
- Parameters:
file – where to store the database
overwrite – whether to overwrite an existing cache file with this name.
override_kwargs – passed to NPZDictionary instead of the current database settings.
- Returns:
The new database created from the cache.
- make_explicit_split(split_name: str, split_indices: ndarray)[source]
- Parameters:
split_name – name for split, typically ‘train’, ‘valid’, ‘test’
split_indices – the indices of the items for the split
- Returns:
- make_explicit_split_bool(split_name: str, split_mask: ndarray | tensor)[source]
- Parameters:
split_name – name for split, typically ‘train’, ‘valid’, ‘test’
split_mask – a boolean array for where to split
- Returns:
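The two explicit-split methods above accept equivalent descriptions of a split: a list of row indices, or a boolean mask over all rows. A minimal plain-Python sketch of that correspondence follows. The helper names are hypothetical and not part of hippynn, and the real methods take numpy arrays or tensors rather than lists.

```python
def indices_to_mask(n_items, split_indices):
    """Return a boolean mask that is True at each listed row index."""
    chosen = set(split_indices)
    return [i in chosen for i in range(n_items)]


def mask_to_indices(split_mask):
    """Return the row indices at which the mask is True."""
    return [i for i, flag in enumerate(split_mask) if flag]


# Rows 1 and 4 of a six-row dataset form the hypothetical 'test' split.
test_indices = [1, 4]
test_mask = indices_to_mask(6, test_indices)
round_trip = mask_to_indices(test_mask)
```

Either form identifies the same rows, so the choice between them is a matter of convenience.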
- make_generator(split_name: str, evaluation_mode: str, batch_size: int | None = None, subsample: float | bool = False)[source]
Makes a dataloader for the given type of split and evaluation mode of the model.
In most cases, you do not need to call this function directly as a user.
- Parameters:
split_name – str; “train”, “valid”, or “test” ; selects data to use
evaluation_mode – str; “train” or “eval”. Used for whether to shuffle.
batch_size – passed to pytorch
subsample – fraction to subsample
- Returns:
dataloader containing relevant data
- make_random_split(split_name: str, split_size: int | float)[source]
Make a random split using self.random_state to select items.
- Parameters:
split_name – String naming the split, can be anything, but ‘train’, ‘valid’, and ‘test’ are special.
split_size – int (number of items) or float<1, fraction of samples.
- Returns:
- make_trainvalidtest_split(*, test_size: int | float, valid_size: int | float)[source]
Make a split for train, valid, and test out of any remaining unsplit entries in the database. The size is specified in terms of test and valid splits; the train split will be the remainder.
If you wish to specify precise rows for each split, see make_explicit_split or make_explicit_split_bool.
This function takes keyword-arguments only in order to prevent confusion over which size is which.
The types of both test_size and valid_size parameters must match.
- Parameters:
test_size – int (count) or float (fraction) of data to assign to test split
valid_size – int (count) or float (fraction) of data to assign to valid split
- Returns:
None
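The size rules described above (counts or fractions, matching types, train as the remainder) can be illustrated with a short standalone sketch. This is not hippynn's implementation: the function names, the rounding rule for fractions, and the exception types are assumptions made for illustration.

```python
import random


def plan_split_sizes(n_items, *, test_size, valid_size):
    """Return (n_train, n_valid, n_test) for a dataset with n_items rows."""
    if type(test_size) is not type(valid_size):
        raise TypeError("test_size and valid_size must have the same type")
    if isinstance(test_size, float):
        # Fractions of the data; truncation is an assumption of this sketch.
        n_test = int(n_items * test_size)
        n_valid = int(n_items * valid_size)
    else:
        n_test, n_valid = test_size, valid_size
    n_train = n_items - n_test - n_valid  # train receives the remainder
    if n_train < 0:
        raise ValueError("requested splits exceed the dataset size")
    return n_train, n_valid, n_test


def assign_rows(n_items, sizes, seed=0):
    """Shuffle row indices and cut them into test, valid, and train."""
    n_train, n_valid, n_test = sizes
    rows = list(range(n_items))
    random.Random(seed).shuffle(rows)
    return {
        "test": sorted(rows[:n_test]),
        "valid": sorted(rows[n_test:n_test + n_valid]),
        "train": sorted(rows[n_test + n_valid:]),
    }


sizes = plan_split_sizes(1000, test_size=0.1, valid_size=0.1)
splits = assign_rows(1000, sizes)
```

Keyword-only arguments, as in the documented signature, make it impossible to swap the test and valid sizes by position.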
- remove_high_property(key: str, atomwise: bool, norm_per_atom: bool = False, species_key: str = None, cut: float | None = None, std_factor: float | None = 10, norm_axis: int | None = None)[source]
For removing outliers from a dataset. Use with caution; do not inadvertently remove outliers from benchmarks!
The parameters cut and std_factor can each be set to None to skip the corresponding step. The per_atom and atom_var properties are exclusive; they cannot both be true.
- Parameters:
key – The property key in the dataset to check for high values
atomwise – True if the property is defined per atom in axis 1, otherwise property is treated as whole-system value
norm_per_atom – True if the property should be normalized by atom counts
species_key – Which array represents the atom presence; required if per_atom is True
cut – If values > mu + cut, the system is removed. This step is done first.
std_factor – If (value-mu)/std > std_factor, the system is trimmed. This step is done second.
norm_axis – if not None, the property array is normed on the axis. Useful for vector properties like force.
- Returns:
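The two filtering steps described by cut and std_factor can be sketched on a plain list of per-system values. This is an illustration only, not hippynn's code: the function name is hypothetical, and recomputing the mean and standard deviation on the surviving systems before the second step, as well as using the population standard deviation, are assumptions of the sketch.

```python
import statistics


def keep_mask_high_values(values, cut=None, std_factor=10):
    """Return a list of booleans: True for systems that are kept."""
    keep = [True] * len(values)
    if cut is not None:
        # Step 1: drop systems whose value exceeds mean + cut.
        mu = statistics.fmean(values)
        keep = [v <= mu + cut for v in values]
    if std_factor is not None:
        # Step 2: drop systems more than std_factor deviations above the mean.
        survivors = [v for v, k in zip(values, keep) if k]
        mu = statistics.fmean(survivors)
        std = statistics.pstdev(survivors)
        keep = [
            k and (std == 0 or (v - mu) / std <= std_factor)
            for v, k in zip(values, keep)
        ]
    return keep


energies = [1.0, 1.1, 0.9, 1.0, 50.0]
mask = keep_mask_high_values(energies, cut=10.0, std_factor=10)
```

In this toy data only the last system is flagged, because its value lies far above the mean plus the cut.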
- send_to_device(device: device = None)[source]
Move the database to an accelerator device if possible. In some circumstances this can accelerate training.
Note
If the database is moved to a GPU, pin_memory will be set to False and num_workers will be set to 0.
- Parameters:
device – device to move to, if None, try to auto-detect.
- Returns:
- sort_by_index(index_name: str = 'indices')[source]
Sort arrays in each split of the database by an index key.
The default is ‘indices’, also possible is ‘split_indices’, or any other variable name in the database.
- Parameters:
index_name
- Returns:
None
- trim_by_species(species_key: str, keep_splits_same_size: bool = True)[source]
Remove any excess padding in a database.
- Parameters:
species_key – what array to use to mark atom presence.
keep_splits_same_size – true: trim by the minimum amount across splits, false: trim by the maximum amount for each split.
- Returns:
None
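The effect of keep_splits_same_size can be pictured with padded species rows. The sketch below is a plain-Python illustration under stated assumptions: a species value of 0 marks padding, real atoms come first in each row, and the helper names are hypothetical rather than part of hippynn.

```python
def atoms_needed(species_rows):
    """Largest number of real (non-zero) atoms in any row of a split."""
    return max(sum(1 for z in row if z != 0) for row in species_rows)


def trim_padding(splits, keep_splits_same_size=True):
    """Cut trailing padding columns from each split's species rows."""
    width = {name: atoms_needed(rows) for name, rows in splits.items()}
    if keep_splits_same_size:
        # Every split keeps the width required by the largest system overall.
        widest = max(width.values())
        width = {name: widest for name in splits}
    return {
        name: [row[:width[name]] for row in rows]
        for name, rows in splits.items()
    }


padded = {
    "train": [[8, 1, 1, 0, 0], [1, 1, 0, 0, 0]],
    "test": [[1, 1, 0, 0, 0]],
}
same_size = trim_padding(padded, keep_splits_same_size=True)
per_split = trim_padding(padded, keep_splits_same_size=False)
```

With the same-size option, every split is trimmed by the smallest amount that is safe for all of them; without it, each split is trimmed as far as its own data allows.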
- write_h5(split: str | None = None, h5path: str | None = None, species_key: str = 'species', overwrite: bool = False)[source]
Write this database to the pyanitools h5 format. See hippynn.databases.h5_pyanitools.write_h5() for details.
Note: This function will error if h5py is not installed.
- Parameters:
split
h5path
species_key
overwrite
- Returns:
- write_npz(file: str, record_split_masks: bool = True, compressed: bool = True, overwrite: bool = False, split_prefix: str | None = None, return_only: bool = False)[source]
- Parameters:
file – str, Path, or file object compatible with np.save
record_split_masks – whether to generate and place masks for the splits into the saved database.
compressed – whether to use np.savez_compressed (True) or np.savez
overwrite – Whether to accept an existing path. Only used if file is a str or path.
split_prefix – optionally override the prefix for the masks computed by the splits.
return_only – if True, ignore the file string and just return the resulting dictionary of numpy arrays.
- Returns:
- property var_list
- class DirectoryDatabase(directory, name, inputs, targets, *args, quiet=False, allow_unfound=False, **kwargs)[source]
Bases: Database, Restartable
Database stored as NPY files in a directory.
- Parameters:
directory – directory path where the files are stored
name – prefix for the arrays.
This function loads arrays of the format f"{name}{db_name}.npy" for each variable db_name in inputs and targets.
Other arguments: See Database.
Note
This database loader does not support the allow_unfound setting in the base Database. The variables to load must be set explicitly in the inputs and targets.
- class GraphModule(required_inputs, nodes_to_compute)[source]
Bases:
Module
- extra_repr()[source]
Set the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(*input_values)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class IdxType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
- Atoms = 'Atoms'
- MolAtom = 'MolAtom'
- MolAtomAtom = 'MolAtomAtom'
- Molecules = 'Molecules'
- NotFound = 'NOT FOUND'
- Pair = 'Pair'
- QuadMol = 'QuadMol'
- QuadPack = 'QuadPack'
- Scalar = 'Scalar'
- class NPZDatabase(file, inputs, targets, *args, allow_unfound=False, quiet=False, **kwargs)[source]
Bases: Database, Restartable
- class Predictor(inputs, outputs, return_device=device(type='cpu'), model_device=None, requires_grad=False, name=None)[source]
Bases:
object
The predictor is a dressed-up GraphModule which gives access to the outputs of individual nodes.
In many cases you may simply want to use the from_graph method to generate a predictor.
The predictor will take the model graph, convert the output nodes into a padded index state, and build a new graph for these operations.
- apply_to_database(db, **kwargs)[source]
Note: kwargs are passed to self.__call__, e.g. the batch_size parameter.
- classmethod from_graph(graph, additional_outputs=None, **kwargs)[source]
Construct a new predictor from an existing GraphModule.
- Parameters:
graph – graph to create predictor for. The predictor makes a shallow copy of this graph; for example, it may move parameters from that graph to the model_device.
additional_outputs – List of additional nodes to include in outputs
kwargs – passed to __init__
- Returns:
predictor instance
- property inputs
- property model_device
- property outputs
- active_directory(dirname, create=None)[source]
Context manager for temporarily switching the current working directory.
If create is None, always succeed. If create is True, only succeed if the directory does not exist, and create one. If create is False, only succeed if the directory does exist, and switch to it.
In other words, use create=True if you want to force that it’s a new directory. Use create=False if you want to switch to an existing directory. Use create=None if you are okay with either alternative.
- Parameters:
dirname – directory to enter
create – (None,True,False)
- Returns:
None
- Raises:
If directory status not compatible with create constraints.
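The three create modes can be sketched as a small standalone context manager. This is an illustration of the documented behavior, not hippynn's source: the function name enter_directory and the specific exception types are assumptions, since the documentation does not name the exception raised.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def enter_directory(dirname, create=None):
    """Temporarily switch the working directory, honoring the create mode."""
    exists = os.path.exists(dirname)
    if create is True and exists:
        raise FileExistsError(f"{dirname} already exists")
    if create is False and not exists:
        raise FileNotFoundError(f"{dirname} does not exist")
    if not exists:
        os.makedirs(dirname)
    previous = os.getcwd()
    os.chdir(dirname)
    try:
        yield
    finally:
        # Always return to the starting directory, even after an error.
        os.chdir(previous)


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "experiment_1")
    start = os.getcwd()
    with enter_directory(target, create=True):
        inside_name = os.path.basename(os.getcwd())
    back = os.getcwd()
    try:
        with enter_directory(target, create=True):
            pass
        second_create_failed = False
    except FileExistsError:
        second_create_failed = True
```

The second attempt with create=True fails because the directory now exists, which is the guard against accidentally reusing an old experiment directory.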
- hierarchical_energy_initialization(energy_module, database=None, trainable_after=False, decay_factor=0.01, encoder=None, energy_name=None, species_name=None, peratom=False)[source]
Computes values for the non-interacting energy using the training data.
- Parameters:
energy_module – HEnergyNode or torch module for energy prediction
database – InterfaceDB object to get training data, required if model contains E0 term
trainable_after – Determines if it should change .requires_grad attribute for the E0 parameters
decay_factor – change initialized weights of further energy layers by df**N for layer N
encoder – species encoder, can be auto-identified from energy node
energy_name – name for the energy variable, can be auto-identified from energy node
species_name – name for the species variable, can be auto-identified from energy node
peratom
- Returns:
None
- load_checkpoint(structure_fname: str, state_fname: str, restart_db=False, map_location=None, model_device=None, **kwargs) dict [source]
Load checkpoint file from given filename.
For more information on how to use this function, see Restarting training.
- Parameters:
structure_fname – name of the structure file
state_fname – name of the state file
restart_db – restore database or not, defaults to False
map_location – device mapping argument for torch.load, defaults to None
model_device – automatically handle device mapping, defaults to None
- Returns:
experiment structure
- load_checkpoint_from_cwd(map_location=None, model_device=None, **kwargs) dict [source]
Same as load_checkpoint, but using default filenames.
- Parameters:
map_location (Union[str, dict, torch.device, Callable], optional) – device mapping argument for torch.load, defaults to None
model_device (Union[int, str, torch.device], optional) – automatically handle device mapping, defaults to None
- Returns:
experiment structure
- Return type:
dict
- load_model_from_cwd(map_location=None, model_device=None, **kwargs) GraphModule [source]
Only load model from current working directory.
- Parameters:
map_location (Union[str, dict, torch.device, Callable], optional) – device mapping argument for torch.load, defaults to None
model_device (Union[int, str, torch.device], optional) – automatically handle device mapping, defaults to None
- Returns:
model with reloaded parameters
- log_terminal(file, *args, **kwargs)[source]
Context manager where stdout and stderr are redirected to the specified file in addition to the usual stdout and stderr. The manager yields the file. Writes to the opened file object with “with log_terminal(…) as <file>” will not automatically be piped into the terminal.
- Parameters:
file – filename or string
args – piped to open(file,*args,**kwargs) if file is a string
kwargs – piped to open(file,*args,**kwargs) if file is a string
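The tee behavior described for log_terminal can be sketched with the standard library alone. The names Tee and tee_terminal are hypothetical, and this is a simplified illustration of the idea rather than hippynn's implementation.

```python
import contextlib
import os
import sys
import tempfile


class Tee:
    """File-like object that forwards every write to several streams."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()


@contextlib.contextmanager
def tee_terminal(file, *args, **kwargs):
    # Open the target only if a filename was given; otherwise use it as-is.
    opened_here = isinstance(file, str)
    handle = open(file, *args, **kwargs) if opened_here else file
    old_out, old_err = sys.stdout, sys.stderr
    sys.stdout, sys.stderr = Tee(old_out, handle), Tee(old_err, handle)
    try:
        yield handle
    finally:
        # Restore the original streams before closing the log file.
        sys.stdout, sys.stderr = old_out, old_err
        if opened_here:
            handle.close()


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "training_log.txt")
    with tee_terminal(log_path, "w") as log_file:
        print("epoch 0 finished")           # goes to terminal and file
        log_file.write("file-only note\n")  # goes to the file only
    with open(log_path) as f:
        logged = f.read()
```

Writing directly to the yielded file object bypasses the tee, which matches the documented caveat that such writes do not appear in the terminal.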
- make_ensemble(models, *, targets: List[str] = 'auto', inputs: List[str] = 'auto', prefix: str = 'ensemble_', quiet=False) Tuple[GraphModule, Tuple[Dict[str, int], Dict[str, int]]] [source]
Make an ensemble out of a set of models. The ensemble graph can then be used with a predictor, ase graph, or etc.
The selected nodes to ensemble are classed by the db_name associated with the nodes.
When using “auto” mode for inputs and outputs:
The input to the ensemble will be the combined inputs for all models in the ensemble.
The output of the ensemble will be the combined outputs for all models.
Otherwise, the set of ensemble inputs is explicitly specified, and errors may occur if the requested set of inputs and outputs is not available.
The result ensemble graph has several outputs, each of which has .mean, .std, and .all attributes which reflect the statistics of the models in the ensemble.
Note that it is not required that all models have the same sets of inputs and outputs.
If a desired node is not automatically ensembled, it probably does not have a db_name. A remedy for this is to load the graphs with hippynn.graphs.ensemble.get_graphs, then find the requested nodes in the graphs and assign them the db_name. Then pass these graphs to make_ensemble.
For more information on the models parameter, see the get_graphs() function.
- Parameters:
models – list containing str, node, or graphmodule, or str to glob for model directories.
targets – list of db_name strings or the string ‘auto’, which will attempt to infer.
inputs – list of db_name strings or the string ‘auto’, which will attempt to infer.
prefix – specifies the prefix for the db_name of created ensemble nodes.
quiet – whether to print information about the constructed ensemble.
- Returns:
ensemble GraphModule, (input_info, output_info)
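The .mean, .std, and .all attributes of an ensemble output can be pictured with a small standalone sketch. The EnsembleOutput class and combine function are hypothetical stand-ins, and the use of the population standard deviation is an assumption; the real ensemble nodes operate on tensors inside the graph.

```python
import statistics
from dataclasses import dataclass


@dataclass
class EnsembleOutput:
    mean: float  # average prediction over the models
    std: float   # spread of the predictions over the models
    all: list    # every individual model prediction


def combine(predictions):
    """Summarize one quantity predicted by several models."""
    return EnsembleOutput(
        mean=statistics.fmean(predictions),
        std=statistics.pstdev(predictions),
        all=list(predictions),
    )


# Three hypothetical models predict the energy of the same system.
energy = combine([-1.0, -1.2, -0.8])
```

A large std relative to the mean signals that the models disagree on that input.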
- reload_settings(**kwargs)[source]
Attempt to reload the hippynn library settings.
- Settings sources are, in order from least to greatest priority:
- Default values.
- The file ~/.hippynnrc, which is a standard python config file which contains variables under the section name [GLOBALS].
- A file specified by the environment variable HIPPYNN_LOCAL_RC_FILE, which is treated the same as the user rc file.
- Environment variables prefixed by HIPPYNN_, e.g. HIPPYNN_DEFAULT_PLOT_FILETYPE.
- Keyword arguments passed to this function.
- Parameters:
kwargs – explicit settings to change.
- Returns:
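The priority order of the settings sources can be sketched as successive dictionary updates, where each later source overrides the earlier ones. This is a simplified illustration, not hippynn's code: resolve_settings and read_globals are hypothetical names, the rc files are passed as text to keep the sketch self-contained, values are left as strings, and the ".pdf" default is a placeholder.

```python
import configparser


def read_globals(rc_text):
    """Parse rc-file text and return the [GLOBALS] section as a dict."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep setting names case-sensitive
    parser.read_string(rc_text)
    return dict(parser["GLOBALS"]) if parser.has_section("GLOBALS") else {}


def resolve_settings(defaults, user_rc="", local_rc="", environ=None, **kwargs):
    """Merge settings sources from least to greatest priority."""
    settings = dict(defaults)               # 1. default values
    settings.update(read_globals(user_rc))  # 2. user rc file
    settings.update(read_globals(local_rc)) # 3. local rc file
    prefix = "HIPPYNN_"
    for key, value in (environ or {}).items():
        # 4. prefixed environment variables, except the rc-file pointer
        if key.startswith(prefix) and key != "HIPPYNN_LOCAL_RC_FILE":
            settings[key[len(prefix):]] = value
    settings.update(kwargs)                 # 5. explicit keyword arguments
    return settings


defaults = {"DEFAULT_PLOT_FILETYPE": ".pdf"}
rc_text = "[GLOBALS]\nDEFAULT_PLOT_FILETYPE = .png\n"
env = {"HIPPYNN_DEFAULT_PLOT_FILETYPE": ".svg"}

from_rc = resolve_settings(defaults, user_rc=rc_text)
from_env = resolve_settings(defaults, user_rc=rc_text, environ=env)
from_kwargs = resolve_settings(
    defaults, user_rc=rc_text, environ=env, DEFAULT_PLOT_FILETYPE=".eps"
)
```

Each call adds one higher-priority source, so the resolved file type changes from the rc value to the environment value to the keyword value.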
- set_custom_kernels(active: bool | str = True) str [source]
Activate or deactivate custom kernels for interaction.
- This function changes the global variables:
- Special non-implementation-name values are:
True – Use the best GPU kernel from recommended implementations; error if none are available.
False – Equivalent to “pytorch”.
“auto” – Equivalent to True if a recommended implementation is available, else equivalent to “pytorch”.
- Parameters:
active – implementation name to activate
- Returns:
active, actual implementation selected.
- setup_and_train(training_modules: TrainingModules, database: Database, setup_params: SetupParams, store_all_better=False, store_best=True, store_every=0)[source]
Shortcut for setup_training followed by train_model.
- Parameters:
training_modules – see setup_training()
database – see train_model()
setup_params – see setup_training()
store_all_better – Save the state dict for each model doing better than a previous one
store_best – Save a checkpoint for the best model
store_every – Save a checkpoint every store_every epochs
- Returns:
See train_model()
Note
The training loop will capture KeyboardInterrupt exceptions to abort the experiment early. If you would like to gracefully kill training programmatically, see train_model() with callbacks argument.
Note
Saves files in the current running directory; we recommend switching to a fresh directory with a descriptive name for your experiment.
- setup_training(training_modules: TrainingModules, setup_params: SetupParams)[source]
Prepares training_modules for training with setup_params.
- Parameters:
training_modules – Tuple of model, training loss, and evaluation losses (can be built from graph using graphs.assemble_training_modules)
setup_params – parameters controlling how training is performed (see SetupParams)
Roughly:
sets devices for training modules
if no controller given:
instantiates and links optimizer to the learnable params on the model
instantiates and links scheduler to optimizer
builds a default controller with setup params
creates a MetricTracker for storing the training metrics
- Returns:
(optimizer,evaluator,controller,metrics,callbacks)
- test_model(database, evaluator, batch_size, when, metric_tracker=None)[source]
Tests the model on the database according to the model_evaluator metrics. If a plot_maker is attached to the model evaluator, it will make plots. The plots will go in a sub-folder specified by when the testing is taking place. The results are then printed.
- Parameters:
database – The database to test the model on.
evaluator – The evaluator containing model and evaluation losses to measure.
when – A string to specify what plots are currently to be used.
metric_tracker – (Optional) metric tracker to save metrics on. If not provided, a blank one will be constructed.
- Returns:
metric tracker
- train_model(training_modules, database, controller, metric_tracker, callbacks, batch_callbacks, store_all_better=False, store_best=True, store_every=0, store_structure_file=True, store_metrics=True, quiet=False)[source]
Performs training loop, allows keyboard interrupt. When done, reinstate the best model, make plots and metrics over time, and test the model.
- Parameters:
training_modules – tuple-like of model, loss, and evaluator
database – Database
controller – Controller
metric_tracker – MetricTracker for storing model performance
callbacks – callbacks to perform after every epoch.
batch_callbacks – callbacks to perform after every batch
store_best – Save a checkpoint for the best model
store_all_better – Save the state dict for each model doing better than a previous one
store_every – Save a checkpoint every store_every epochs
store_structure_file – Save the structure file for this experiment
store_metrics – Save the metric tracker for this experiment.
quiet – If True, disable printing during training (still prints testing results).
- Returns:
metric_tracker
Note
callbacks take the form of an iterable of callables and will be called with cb(epoch,new_best)
epoch indicates the epoch number
new_best indicates if the model is a new best model
Note
batch_callbacks take the form of an iterable of callables and will each be called with cb(batch_inputs, batch_model_outputs, batch_targets)
Note
You may want to make your callbacks store other state. If so, an easy way is to make them a callable object.
Note
Callback state is not managed by hippynn. If you wish to save or load callback state, you will have to manage that manually (possibly with a callback itself).
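A callback that keeps its own state can be written as a class with a __call__ method, matching the cb(epoch, new_best) form described above. The class name and the stand-in loop below are hypothetical; in practice the instance would be passed in the callbacks iterable of train_model.

```python
class BestEpochRecorder:
    """Callable callback that remembers which epochs set a new best model."""

    def __init__(self):
        self.best_epochs = []
        self.n_calls = 0

    def __call__(self, epoch, new_best):
        self.n_calls += 1
        if new_best:
            self.best_epochs.append(epoch)


recorder = BestEpochRecorder()
callbacks = [recorder]

# Stand-in for the training loop: each callback is called once per epoch.
for epoch, new_best in [(0, True), (1, False), (2, True)]:
    for cb in callbacks:
        cb(epoch, new_best)
```

Because the state lives on the object, it can be inspected after training or saved manually, which is consistent with the note that hippynn does not manage callback state.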
Subpackages
Submodules