`In [5]: %timeit np.interp(2.22, x, y)`
|
|
|
|
|
The average error in magnitude of `d_inner_h` when using linear interpolation (compared with cubic) is 0.1%, and the maximum error is 3.7%.
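To illustrate how such error figures can be measured, the sketch below compares a linear and a cubic interpolant on a dense grid; the test function, grids, and error metric are illustrative stand-ins, not the actual `d_inner_h` data:

```
import numpy as np
from scipy.interpolate import CubicSpline

# Coarse grid of tabulated values (stand-in for the real data).
x = np.linspace(0.0, 10.0, 50)
y = np.exp(-0.1 * x) * (2.0 + np.sin(x))   # strictly positive test function

# Dense grid on which both interpolants are evaluated and compared.
x_fine = np.linspace(0.0, 10.0, 5000)
cubic = CubicSpline(x, y)(x_fine)
linear = np.interp(x_fine, x, y)

# Relative error of the linear interpolant with respect to the cubic one.
rel_err = np.abs(linear - cubic) / np.abs(cubic)
print(f"average error: {rel_err.mean():.2%}")
print(f"maximum error: {rel_err.max():.2%}")
```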
|
|
|
|
|
|
In the master task, the bulk of the time is spent in `sample` (in the dynesty library). Within this function, `_new_point` is called to distribute tasks to the MPI workers. The second most expensive function is `n_effective`, which accounts for 3.3% of the time spent sampling.
|
|
|
|
|
|
| `sample` (total) | `_new_point` | `n_effective` |
|------------------|--------------|---------------|
| 68.132           | 64.969       | 2.270         |
|
|
|
|
|
|
The calculation of `n_effective` (the Kish effective sample size) is used only as a stopping criterion, not for the sampling itself. The run uses the default stopping value of `n_effective = np.inf`, which means the criterion never triggers. Although the code checks whether `n_effective` is `None` before evaluating the function, it does not check for infinity, so the quantity is computed even though the result is never used.
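A minimal sketch of the missing guard (illustrative only, not dynesty's actual internals):

```
import numpy as np

def kish_ess(weights):
    # Kish effective sample size: (sum w)^2 / (sum w^2).
    w = np.asarray(weights, dtype=float)
    return np.sum(w) ** 2 / np.sum(w ** 2)

def ess_criterion_triggered(weights, target_n_effective):
    # Skip the ESS computation entirely when the criterion is disabled,
    # i.e. when the target is None *or* non-finite (the default np.inf),
    # rather than checking for None alone.
    if target_n_effective is None or not np.isfinite(target_n_effective):
        return False
    return kish_ess(weights) >= target_n_effective
```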
|
|
|
|
|
|
## MPI task overloading
|
|
|
MPI barrier time (caused by workers finishing at different times) can be reduced by overloading the tasks, i.e. setting the number of live points to be greater than the number of workers. This is done using the `queue_size` argument of `NestedSampler`. For example:
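A minimal sketch, assuming dynesty's `NestedSampler` is driven by a schwimmbad `MPIPool`; the likelihood, prior, and dimensionality are placeholders:

```
import sys
import numpy as np
from schwimmbad import MPIPool
from dynesty import NestedSampler

def log_likelihood(theta):
    # Placeholder: standard Gaussian likelihood in each dimension.
    return -0.5 * np.sum(theta ** 2)

def prior_transform(u):
    # Placeholder: uniform prior on [-10, 10] in each dimension.
    return 20.0 * u - 10.0

with MPIPool() as pool:
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    # Overload the pool: queue more proposals per iteration than there
    # are workers (here 2x), so workers spend less time idle at barriers.
    sampler = NestedSampler(
        log_likelihood, prior_transform, ndim=4,
        pool=pool, queue_size=2 * pool.size,
    )
    sampler.run_nested()
```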
|
|
|
The speedup from overloading can be taken advantage of without the later slowdown …
|
|
|
|
|
Measurements show that the overheads associated with MPI communication are insignificant relative to the computation time, so optimisations to the MPI communication pattern are not expected to improve performance. Nevertheless, since these optimisations are trivial to implement, they have been included in a modified version of the Schwimmbad library on the ADACS branch to confirm that they yield no improvement.
|
|
|
|
|
|
|
|
|
However, several other areas may benefit from optimisation. While some of these changes can be implemented directly in Parallel Bilby, others will require improvements to dynesty and Schwimmbad.
|
|
|
|
|
|
## Load balancing
|
|
|
|
A more advanced optimisation is to allow iterations to progress without waiting …
|
|
|
|
|
## Reducing serial Bilby overheads
|
|
|
|
|
|
|
|
|
Optimisations to the serial Bilby algorithm will translate directly to the parallel version. When optimisations such as ROQ are used, the cost of computing the waveform is decreased. However, other parts of the likelihood evaluation become dominant.
|
|
|
|
|
|
For example, the calibration function requires calls to a cubic interpolation routine, which is more expensive than an ROQ waveform calculation. If it can be determined that calibration has only a minor effect on the results, then users may consider disabling it. In ROQ runs, the coalescence time is calculated using another cubic interpolation function. Replacing this with a linear interpolation results in a significant speed-up, at the cost of some accuracy. Further investigation into the trade-off between result accuracy (linked to calibration and time interpolation) and solution time (for scenarios requiring low latency) is required.
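A minimal sketch of such a swap, with placeholder tabulated data standing in for Bilby's actual arrays:

```
import numpy as np
from scipy.interpolate import interp1d

# Placeholder tabulated data (not Bilby's actual coalescence-time arrays).
grid = np.linspace(-0.1, 0.1, 1001)
values = np.sin(50.0 * grid)
t_c = 0.0123

# Cubic interpolation (current behaviour): spline construction plus evaluation.
y_cubic = interp1d(grid, values, kind="cubic")(t_c)

# Linear interpolation (proposed): a single cheap call with no setup cost.
y_linear = np.interp(t_c, grid, values)
```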
|
|
|
|
|
|
## Minimising serial code
|
|
|
|
In all parallel codes, the maximum speedup that can be achieved is given by Amdahl's law …
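For reference, Amdahl's law for a code in which a fraction $p$ of the work parallelises perfectly across $N$ workers gives the speedup

$$
S(N) = \frac{1}{(1 - p) + p/N},
$$

so the serial fraction $1 - p$ caps the achievable speedup at $1/(1 - p)$ no matter how many workers are added.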
|
|
|
|
|
The first is the writing of checkpoints, which, at the time of testing, occurred every 10 minutes. This is wasteful, as it provides users no real advantage over a longer checkpointing interval. In a recent update, the checkpoint interval was made an adjustable parameter, with the default increased to 1 hour; this is expected to provide a 5-10% improvement in run time. For further optimisation, the serial bottleneck of checkpointing can be eliminated entirely by offloading the I/O to one worker task while the other tasks continue with the next iteration.
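One way to sketch the offloading idea, here using a background thread in place of a dedicated MPI worker task (the `state` object and file name are placeholders):

```
import pickle
from concurrent.futures import ThreadPoolExecutor

# A single background writer, so at most one checkpoint is in flight;
# sampling continues on the main thread while the I/O completes.
_writer = ThreadPoolExecutor(max_workers=1)

def checkpoint_async(state, path="checkpoint.pickle"):
    # Serialise synchronously so the snapshot is consistent, then hand
    # only the file write to the background thread.
    payload = pickle.dumps(state)

    def _write():
        with open(path, "wb") as f:
            f.write(payload)

    return _writer.submit(_write)
```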
|
|
|
|
|
|
|
|
The second serial portion is the processing and updating of points at the end of each iteration. Since the next iteration depends on the processing of these points, it cannot begin until this is complete. An optimisation requiring minimal effort is disabling the calculation of `n_effective` in the dynesty library, since it is not used for the stopping criterion; this will provide a speedup of roughly 3%.