Performance Tips
Devices
Using GPU resources will often speed up your training by more than a factor of ten.
In general hippynn will aim to use the “cuda” device if available, and then fall back on the default pytorch device.
To specify the device, pass a torch.device or a compatible string to the SetupParams for a training run.
This takes care of device placement during training.
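For example, the device can be set like this (a minimal sketch; the other SetupParams arguments shown are illustrative placeholders and may differ in your setup):

    import torch
    from hippynn.experiment import SetupParams

    # Sketch only: values besides `device` are placeholders.
    setup_params = SetupParams(
        stopping_key="Loss",          # name of the validation metric to monitor
        batch_size=32,
        max_epochs=100,
        learning_rate=1e-3,
        optimizer=torch.optim.Adam,
        device=torch.device("cuda"),  # or simply the string "cuda" / "cuda:0"
    )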
At application time, if you're using a GraphModule, Predictor, or HippynnCalculator
called obj, then you can use obj.to(torch.device('cuda')) to move that object to the GPU.
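For example (a sketch; obj stands in for an already-constructed GraphModule, Predictor, or HippynnCalculator from your application code):

    import torch

    # Use the GPU when available, otherwise stay on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    obj.to(device)  # moves the object's underlying torch modules to that device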
Custom Kernels
hippynn has low-level implementations of the message passing functions for hip-nn. These are almost always helpful; the exact benefit depends mostly on the batch size and the size of the network, and in some cases with large models the improvement is several-fold. The custom kernels improve both the memory usage and the speed of evaluating the network. No special setup should be needed for them to work on the GPU, since they use triton, which is included with pytorch. On the CPU they require installing numba. Custom kernels are described in more detail here.
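As a quick environment check (plain Python, not part of the hippynn API), you can confirm which custom-kernel backends are importable:

    import importlib.util

    # Triton ships with recent pytorch builds; numba must be installed
    # separately for the CPU custom kernels.
    for backend in ("triton", "numba"):
        found = importlib.util.find_spec(backend) is not None
        print(f"{backend}: {'available' if found else 'not installed'}")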
Profiling
Especially during development, it can be very insightful to produce a trace of CPU and GPU activity to find where bottlenecks exist in training.
For this, setup_and_profile() serves as a drop-in replacement for the setup_and_train() function.
It accepts the same kwargs as setup_and_train(),
so a single function call gives access to the features of the PyTorch profiler API.
This will output a JSON trace file which is compatible with the Google Chrome tracing tools (e.g. chrome://tracing).
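A sketch of the swap (the import path and variable names are assumptions based on a typical hippynn training script; your setup may differ):

    from hippynn.experiment import setup_and_profile  # assumed import location

    # Called exactly like setup_and_train().
    setup_and_profile(
        training_modules=training_modules,  # model, loss, and evaluator as usual
        database=database,
        setup_params=setup_params,
    )
    # Load the resulting JSON trace into the Chrome tracing viewer to inspect it.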
Scaling
hippynn models and loss functions are just pytorch modules, so you should be able to interface them with any accelerator package of your choice.
For scaling to multiple GPUs or multiple nodes, we recommend PyTorch Lightning; hippynn includes an interface to it. See the documentation page for more information.
There is legacy compatibility with the simple torch.nn.DataParallel by passing a list of device indices to
the SetupParams for an experiment, but this is not a strong strategy for scaling training because it is limited by
single-process CPU dispatch.
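If you do use that legacy path, a sketch of the change relative to the earlier SetupParams example (other argument values are again placeholders):

    import torch
    from hippynn.experiment import SetupParams

    # Legacy DataParallel path: the device argument is a list of GPU indices.
    setup_params = SetupParams(
        stopping_key="Loss",          # placeholder metric name
        batch_size=32,
        max_epochs=100,
        learning_rate=1e-3,
        optimizer=torch.optim.Adam,
        device=[0, 1],                # spread batches over GPUs 0 and 1
    )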