Hyperparameter optimization with Ax and Ray
===========================================

Here is an example of how you can perform hyperparameter optimization
sequentially (with Ax) or in parallel (with Ax and Ray).

Prerequisites
-------------

The packages required to perform this task are `Ax`_ and `ray`_. ::

    conda install -c conda-forge "ray < 2.7.0"
    pip install "ax-platform!=0.4.1"

.. note::

    The scripts have been tested with `ax-platform 0.4.0` and `ray 2.6.3`,
    and with many previous versions of the two packages. Unfortunately,
    several changes made in recent versions of `ray` will break this script,
    so you should install `ray < 2.7.0`.

    ``pip install`` is recommended by the Ax developers even if a conda
    environment is used. As of now (Sep 2024), `ax-platform 0.4.1` is broken;
    see the `issue`_. Please avoid this version in your setup.

.. note::

    If you can update this example and the scripts to accommodate the changes
    in the latest Ray package, feel free to submit a pull request.

Typical workflow
----------------

Ax is a package that performs Bayesian optimization. From the given parameter
ranges, a set of initial trials is generated. Then, based on the metrics
returned from these trials, new trial parameters are proposed. By default,
this Ax workflow can only be performed sequentially.

We can combine Ray and Ax to utilize multiple GPUs on the same node. Ray
interfaces with Ax to pull trial parameters and then automatically
distributes the trials to the available resources. With this, we can perform
an asynchronous, parallelized hyperparameter optimization.

Create an Ax experiment
^^^^^^^^^^^^^^^^^^^^^^^

You can create a basic Ax experiment this way

.. code-block:: python

    from ax.service.ax_client import AxClient
    from ax.service.utils.instantiation import ObjectiveProperties

    ax_client = AxClient()
    ax_client.create_experiment(
        name="hyper_opt",
        parameters=[
            {
                "name": "parameter_a",
                "type": "fixed",
                "value_type": "float",
                "value": 0.6,
            },
            {
                "name": "parameter_b",
                "type": "range",
                "value_type": "int",
                "bounds": [20, 40],
            },
            {
                "name": "parameter_c",
                "type": "choice",
                "values": [30, 40, 50, 60, 70],
            },
            {
                "name": "parameter_d",
                "type": "range",
                "value_type": "float",
                "bounds": [0.001, 1],
                "log_scale": True,
            },
        ],
        objectives={
            "Metric": ObjectiveProperties(minimize=True),
        },
        parameter_constraints=[
            "parameter_b <= parameter_c",
        ],
    )

Here we create an Ax experiment called "hyper_opt" with 4 parameters,
`parameter_a`, `parameter_b`, `parameter_c`, and `parameter_d`. Our goal is
to minimize a metric called "Metric". A few crucial things to note:

* You can give each parameter a range, a set of choices, or a fixed value,
  and you might want to specify the data type as well. Note that a choice
  parameter takes a `values` key, while a range parameter takes a `bounds`
  key. A fixed parameter is useful because it lets you optimize only a
  subset of the parameters without modifying your training function.
* Constraints can be applied to the search space as the example shows, but
  there is no easy way to express a constraint that contains mathematical
  expressions (for example, `parameter_a < 2 * parameter_b`).
* For each trial, Ax will generate a dictionary as the input of the training
  function. The dictionary will look like::

      {
          "parameter_a": 0.6,
          "parameter_b": 30,
          "parameter_c": 40,
          "parameter_d": 0.2
      }

  As such, the training function must be able to take a dictionary as the
  input (as a single dictionary or as keyword arguments) and use these values
  to set up the training.
* The `objectives` keyword argument takes a dictionary of variables. The keys
  of this dictionary **MUST** exist in the dictionary returned from the
  training function. In this example, the training function must return a
  dictionary like::

      return {
          ...
          "Metric": metric,
          ...
      }

The last two points will become clearer when we go through the training
function.
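Before writing the training function, you can sanity-check the search space
by pulling one trial from the client and inspecting the generated dictionary.
This is a minimal sketch, assuming the `ax_client` created above; the printed
values are illustrative:

.. code-block:: python

    # pull one trial to see the parameter dictionary Ax generates
    parameters, trial_index = ax_client.get_next_trial()
    print(trial_index)  # 0
    print(parameters)   # e.g. {"parameter_a": 0.6, "parameter_b": 27, ...}

    # mark the throwaway trial abandoned so it is not left in a running state
    ax_client.get_trial(trial_index).mark_abandoned()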
Training function
^^^^^^^^^^^^^^^^^

You only need a minimal change to your existing training script to use it
with Ax. In most cases, you just have to wrap the whole script into a
function

.. code-block:: python

    def training(trial_index, parameter_a, parameter_b, parameter_c, parameter_d):
        # set up the network with the parameters
        ...
        network_params = {
            ...
            "parameter_a": parameter_a,
            ...
        }
        network = networks.Hipnn(
            "hipnn_model", (species, positions), module_kwargs=network_params
        )

        # train the network
        # `metric_tracker` contains the losses from hippynn
        with hippynn.tools.active_directory(str(trial_index)):
            metric_tracker = train_model(
                training_modules,
                database,
                controller,
                metric_tracker,
                callbacks=None,
                batch_callbacks=None,
            )

        # return the desired metric to Ax, for example, the validation loss
        return {
            "Metric": metric_tracker.best_metric_values["valid"]["Loss"]
        }

Note how we use the parameters passed in and return **Metric** at the end. We
are free to choose a different metric to return here, and we can even combine
several metrics with a mathematical expression.

.. note::

    Ax does NOT create a directory for each trial. If your training function
    does not handle the working directory, all results will be saved into the
    same folder, i.e., the current working directory. To avoid this, the
    training function needs to create a unique path for each trial. In this
    example, we use `trial_index` for this purpose. With Ray, this step is
    NOT necessary.

.. _run-sequential-experiments:

Run sequential experiments
^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, we can run the experiments

.. code-block:: python

    for k in range(30):
        parameter, trial_index = ax_client.get_next_trial()
        ax_client.complete_trial(
            trial_index=trial_index, raw_data=training(trial_index, **parameter)
        )

    # save the experiment as a JSON file
    ax_client.save_to_json_file(filepath="hyperopt.json")
    data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
    data_frame.to_csv("hyperopt.csv", header=True)

Here we run 30 trials, and the results are saved to a JSON file and a CSV
file. The JSON file contains all the details of the trials and can be used to
restart the experiment or add trials to it. As it contains too many details
to be human-friendly, we also save a CSV file that only contains the trial
indices, parameters, and metrics.
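Once the loop finishes, the best parameters found so far can also be queried
directly from the client. A minimal sketch, assuming the experiment above
(`get_best_parameters` fits a model to the observed trials, so it can take a
moment):

.. code-block:: python

    # returns the best parameter dict and the model's prediction
    # (mean and covariance) of the objective at that point
    best_parameters, values = ax_client.get_best_parameters()
    means, covariances = values
    print(best_parameters)  # e.g. {"parameter_a": 0.6, "parameter_b": 24, ...}
    print(means["Metric"])  # model-predicted mean of the objective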
Asynchronous parallelized optimization with Ray
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use Ray to distribute the trials across GPUs in parallel, a small update
to the training function is needed

.. code-block:: python

    from ray.air import session

    def training(config):
        # Ray passes all trial parameters in as a single dictionary
        parameter_a = config["parameter_a"]
        ...
        # the setup and training are the same as before
        # the `with hippynn.tools.active_directory()` line is not needed
        ...
        # instead of a return, use `session.report` to communicate with ray
        session.report(
            {
                "Metric": metric_tracker.best_metric_values["valid"]["Loss"]
            }
        )

Instead of a simple `return`, we need the `report` method from
`ray.air.session` to report the final metric to `ray`. Also, to run the
trials, instead of the loop in :ref:`run-sequential-experiments`, we use the
interfaces between the two packages provided by `ray`

.. code-block:: python

    import ray
    from ray import air, tune
    from ray.tune.search import ConcurrencyLimiter
    from ray.tune.search.ax import AxSearch

    # to make sure ray loads local packages correctly
    ray.init(runtime_env={"working_dir": "."})

    algo = AxSearch(ax_client=ax_client)
    # 4 GPUs available
    algo = ConcurrencyLimiter(algo, max_concurrent=4)
    tuner = tune.Tuner(
        # assign 1 GPU to each trial
        tune.with_resources(training, resources={"gpu": 1}),
        # run 10 trials
        tune_config=tune.TuneConfig(search_alg=algo, num_samples=10),
        # configuration of the ray run
        run_config=air.RunConfig(
            # all results will be saved in a subfolder inside the "test"
            # folder of the current working directory
            local_dir="./test",
            verbose=0,
            log_to_file=True,
        ),
    )
    # run the trials
    tuner.fit()

    # save the results at the end
    # to save the files after each trial, a callback is needed
    # (see the advanced details below)
    ax_client.save_to_json_file(filepath="hyperopt.json")
    data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
    data_frame.to_csv("hyperopt.csv", header=True)

This is all you need. The results will be saved under
`./test/{trial_function_name}_{timestamp}`, and each trial will be saved in a
subfolder named
`{trial_function_name}_{random_id}_{index}_{truncated_parameters}`.
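The resources requested per trial also determine how many trials fit on the
node at once. If one GPU can comfortably hold several copies of your model,
you can request fractional GPUs; a minimal sketch, where the 0.5 figure is an
assumption you should tune to your model size:

.. code-block:: python

    # two trials share each physical GPU, so on a 4-GPU node
    # up to 8 trials can run concurrently
    algo = ConcurrencyLimiter(AxSearch(ax_client=ax_client), max_concurrent=8)
    trainable = tune.with_resources(training, resources={"gpu": 0.5})

Note that Ray only bookkeeps these fractions; it does not partition GPU
memory, so make sure two trials actually fit on one card.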
Advanced details
^^^^^^^^^^^^^^^^

Relative import
"""""""""""""""

If you save the training function in a separate file and import it into the
Ray script, one line has to be added before the trials start,

.. code-block:: python

    ray.init(runtime_env={"working_dir": "."})

assuming the current directory (".") contains the training and Ray scripts.
Without this line, Ray will NOT be able to find the training script and
import the training function.

Callbacks for Ray
"""""""""""""""""

When running `ray.tune`, a set of callback functions can be called during the
process. Ray has `documentation`_ on the callback functions, and you can
build your own for your convenience. Here is a callback that saves the JSON
and CSV files at the end of each trial and handles failed trials, which
should cover the most basic functionalities.

.. code-block:: python

    import shutil

    import numpy as np
    from ax.core.trial import Trial as AXTrial
    from ray.tune.experiment.trial import Trial
    from ray.tune.logger import JsonLoggerCallback, LoggerCallback

    class AxLogger(LoggerCallback):
        def __init__(self, ax_client: AxClient, json_name: str, csv_name: str):
            """
            A logger callback to save the progress to a JSON file after every
            trial ends. Similar to running `ax_client.save_to_json_file` every
            iteration in sequential searches.

            Args:
                ax_client (AxClient): ax client to save
                json_name (str): name for the JSON file. Append a path if you
                    want to save the JSON file to somewhere other than cwd.
                csv_name (str): name for the CSV file. Append a path if you
                    want to save the CSV file to somewhere other than cwd.
            """
            self.ax_client = ax_client
            self.json = json_name
            self.csv = csv_name

        def log_trial_end(
            self, trial: Trial, id: int, metric: float, runtime: int, failed: bool = False
        ):
            # save the experiment state and copy it into the trial folder
            self.ax_client.save_to_json_file(filepath=self.json)
            shutil.copy(self.json, f"{trial.local_dir}/{self.json}")
            try:
                data_frame = self.ax_client.get_trials_data_frame().sort_values("Metric")
                data_frame.to_csv(self.csv, header=True)
            except KeyError:
                pass
            shutil.copy(self.csv, f"{trial.local_dir}/{self.csv}")
            status = "failed" if failed else "finished"
            print(
                f"AX trial {id} {status}. Final loss: {metric}. Time taken"
                f" {runtime} seconds. Location directory: {trial.logdir}."
            )

        def on_trial_error(self, iteration: int, trials: list[Trial], trial: Trial, **info):
            id = int(trial.experiment_tag.split("_")[0]) - 1
            ax_trial = self.ax_client.get_trial(id)
            ax_trial.mark_abandoned(reason="Error encountered")
            self.log_trial_end(
                trial, id + 1, "not available", self.calculate_runtime(ax_trial), True
            )

        def on_trial_complete(
            self, iteration: int, trials: list[Trial], trial: Trial, **info
        ):
            # trial.trial_id is the random id generated by ray, not ax
            # the default experiment_tag starts with ax's trial index,
            # but this workaround is fragile, as users can customize the
            # tag or folder name
            id = int(trial.experiment_tag.split("_")[0]) - 1
            ax_trial = self.ax_client.get_trial(id)
            failed = False
            try:
                loss = ax_trial.objective_mean
            except ValueError:
                failed = True
                loss = "not available"
            else:
                if np.isnan(loss) or np.isinf(loss):
                    failed = True
                    loss = "not available"
            if failed:
                ax_trial.mark_failed()
            self.log_trial_end(
                trial, id + 1, loss, self.calculate_runtime(ax_trial), failed
            )

        @classmethod
        def calculate_runtime(cls, trial: AXTrial):
            delta = trial.time_completed - trial.time_run_started
            return int(delta.total_seconds())

To use callback functions, simply add a line in ``air.RunConfig``::

    ax_logger = AxLogger(ax_client, "hyperopt_ray.json", "hyperopt.csv")
    run_config=air.RunConfig(
        local_dir="./test",
        verbose=0,
        callbacks=[ax_logger, JsonLoggerCallback()],
        log_to_file=True,
    )

Restart/extend an experiment
""""""""""""""""""""""""""""

.. note::

    Due to the complexity of handling the individual trial paths with Ray, it
    is not possible to restart unfinished trials at this moment.

Restarting an experiment and adding additional trials to an experiment share
the same workflow. The key is the JSON file saved from the experiment. To
reload the experiment state:

.. code-block:: python

    ax_client = AxClient.load_from_json_file(filepath="hyperopt_ray.json")

Then we can pull new parameters from this experiment, and these parameters
will be generated based on all finished trials. If more trials need to be
added to this experiment, simply increase `num_samples` in
`ray.tune.TuneConfig`:

.. code-block:: python

    # this will end the experiment when 20 trials are finished
    tune_config=tune.TuneConfig(search_alg=algo, num_samples=20)

Sometimes, you may want to make changes to the experiment itself when
reloading it, for example, to the search space. This can easily be achieved
by

.. code-block:: python

    ax_client.set_search_space(
        [
            {
                "name": "parameter_b",
                "type": "fixed",
                "value_type": "int",
                "value": 25,
            },
            {
                "name": "parameter_c",
                "type": "choice",
                "values": [30, 40, 50],
            },
        ]
    )

after the `ax_client` object is reloaded.

.. note::

    To use the `ax_client.set_search_space` method, the original experiment
    must be created with `immutable_search_space_and_opt_config=False`, i.e.,

    .. code-block:: python

        ax_client.create_experiment(
            ...
            immutable_search_space_and_opt_config=False,
            ...
        )

    If the original experiment was not created with this option, there is not
    much we can do.
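Putting the pieces together, a full restart script might look like the sketch
below. This assumes a previous run saved `hyperopt_ray.json` through the
`AxLogger` callback and that the training function lives in a separate module
(`training_script` is a hypothetical name):

.. code-block:: python

    import ray
    from ax.service.ax_client import AxClient
    from ray import air, tune
    from ray.tune.search import ConcurrencyLimiter
    from ray.tune.search.ax import AxSearch

    # `training_script` is a hypothetical module holding the trainable
    from training_script import training

    ray.init(runtime_env={"working_dir": "."})

    # reload the experiment state saved by the AxLogger callback
    ax_client = AxClient.load_from_json_file(filepath="hyperopt_ray.json")

    algo = ConcurrencyLimiter(AxSearch(ax_client=ax_client), max_concurrent=4)
    tuner = tune.Tuner(
        tune.with_resources(training, resources={"gpu": 1}),
        # per the note above, the experiment ends once 20 trials have
        # finished, so this adds 10 trials on top of the 10 finished earlier
        tune_config=tune.TuneConfig(search_alg=algo, num_samples=20),
        run_config=air.RunConfig(local_dir="./test", verbose=0, log_to_file=True),
    )
    tuner.fit()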
The example scripts, built around a modified QM7 training script, are
provided in `examples`_. This tutorial was contributed by `Xinyang Li`_, and
the example scripts were developed by `Sakib Matin`_ and `Xinyang Li`_.

.. _ray: https://docs.ray.io/en/latest/
.. _Ax: https://github.com/facebook/Ax
.. _issue: https://github.com/facebook/Ax/issues/2711
.. _documentation: https://docs.ray.io/en/latest/tune/tutorials/tune-metrics.html
.. _examples: https://github.com/lanl/hippynn/tree/development/examples/hyperparameter_optimization
.. _Xinyang Li: https://github.com/tautomer
.. _Sakib Matin: https://github.com/sakibmatin