|
|
|
|
|
# Proposed Optimisations
|
|
|
|
|
|
Measurements show that the overheads associated with MPI communication are insignificant relative to the computation time, so optimisations to the MPI communication pattern are not expected to improve performance. Since these optimisations are trivial to implement, they have nevertheless been included in a modified version of the Schwimmbad library on the ADACS branch to confirm that they have no measurable effect.
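As a concrete illustration, the sketch below shows a generic MPI master/worker loop of the kind Schwimmbad's `MPIPool` implements, with the master posting non-blocking sends so it never stalls waiting on a slow worker. This is a hedged sketch, not the ADACS-branch code: the round-robin assignment, tags, and function names are illustrative only.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

def master(tasks):
    """Dispatch tasks round-robin with non-blocking sends, then gather
    results in completion order, so the master never blocks on one worker."""
    pending = [comm.isend(task, dest=1 + i % (size - 1), tag=1)
               for i, task in enumerate(tasks)]
    for req in pending:                     # sends are small; cheap to finish
        req.wait()
    results = [comm.recv(source=MPI.ANY_SOURCE, tag=2) for _ in tasks]
    for dest in range(1, size):             # tell workers to shut down
        comm.send(None, dest=dest, tag=1)
    return results

def worker(fn):
    """Receive a task, evaluate it, send the result back; stop on None."""
    while True:
        task = comm.recv(source=0, tag=1)
        if task is None:
            break
        comm.send(fn(task), dest=0, tag=2)
```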
|
|
|
|
|
|
However, several other areas may benefit from optimisation.
|
|
|
|
|
|
|
|
|
|
## Reducing load imbalance
|
|
Snapshots show that the dispersion in task length is high at the beginning of each run and decays as iterations progress. Since every MPI worker must wait for the slowest task to complete before the next iteration can begin, this dispersion results in significant wasted time, and the wasted time scales superlinearly with the number of cores. For runs with high core counts and short run times, the dispersion remains high even at the end of the run.
|
|
|
|
|
|
This problem must be fixed if efficient scaling beyond 512 cores is desired. In this study, we attempted to reduce the load imbalance by overloading the workers with more tasks than cores. While this did not improve the overall runtime, the task-length dispersion was reduced at the beginning of each run. Based on these findings, adaptively varying the pool size may be an effective optimisation: start by overloading the workers, then gradually reduce the number of tasks as the completion times converge. There is currently no simple way to implement this, since modifications to the dynesty library are required.
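A minimal sketch of what such an adaptive schedule could look like is given below. The helper is hypothetical (it is not part of dynesty or Schwimmbad), and the choice of the coefficient of variation as the dispersion measure, along with the overload bounds, is illustrative only.

```python
import numpy as np

def adaptive_overload(n_workers, recent_task_times,
                      max_overload=4, min_overload=1):
    """Hypothetical helper: decide how many tasks to dispatch this
    iteration. Overload heavily while the dispersion in task times is
    high, and fall back towards one task per worker as they converge."""
    if len(recent_task_times) < 2:
        return n_workers * max_overload          # no history yet: overload
    times = np.asarray(recent_task_times, dtype=float)
    cv = times.std() / times.mean()              # dispersion proxy
    frac = min(cv, 1.0)                          # clamp to [0, 1]
    overload = min_overload + frac * (max_overload - min_overload)
    return n_workers * int(round(overload))
```

Each iteration, the sampler would then request `adaptive_overload(...)` proposals from the pool instead of a fixed one-per-worker batch.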
|
|
|
|
|
|
A more advanced optimisation is to allow iterations to progress without waiting for all workers to return, ignoring points that are not necessary for the iteration to proceed. The work done by some workers will be wasted, but it should be investigated (mathematically) whether the skipped results can still be reused in the next iteration if they lie within the reduced volume, recovering some of that wasted time. This will require modifications to both the dynesty and Schwimmbad libraries.
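The sketch below illustrates the idea with a quorum-style gather, using Python's `concurrent.futures` rather than MPI for brevity: the caller gets results back as soon as enough have arrived, and the stragglers are handed back so their eventual results could be screened for reuse. The function and parameter names are illustrative assumptions, not dynesty's API.

```python
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

def collect_quorum(pool, fn, proposals, quorum):
    """Submit every proposal but return once `quorum` results are in.
    Pending futures are returned so the caller can later decide whether
    their results still lie within the shrunken volume and can be reused."""
    pending = {pool.submit(fn, p) for p in proposals}
    done = set()
    while len(done) < quorum and pending:
        finished, pending = wait(pending, return_when=FIRST_COMPLETED)
        done |= finished
    return [f.result() for f in done], pending

# e.g.  with ProcessPoolExecutor(max_workers=8) as pool:
#           accepted, stragglers = collect_quorum(pool, loglike, proposals, 8)
```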
|
|
|
|
|
|
## Reducing serial Bilby overheads
|
|
|
|
|
|
Optimisations to the serial Bilby algorithm translate directly to the parallel version. When optimisations such as ROQ are used, the cost of computing the waveform decreases, and other parts of the likelihood evaluation become dominant. For example, the calibration function requires calls to a cubic interpolation routine that is more expensive than an ROQ waveform calculation. If it can be determined that calibration has only a minor effect on the results, then users may consider disabling it. In ROQ runs, the coalescence time is calculated using another cubic interpolation function; replacing this with linear interpolation yields a significant speed-up at the cost of some accuracy. Further investigation into the trade-off between result accuracy (linked to calibration and time interpolation) and solution time (for scenarios requiring low latency) is required.
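The snippet below illustrates the accuracy side of that trade-off on a made-up signal; the grid, function, and sampling densities are stand-ins, not Bilby's actual time-interpolant.

```python
import numpy as np
from scipy.interpolate import CubicSpline

t = np.linspace(0.0, 1.0, 64)             # coarse sample grid (stand-in)
y = np.sin(2.0 * np.pi * 3.0 * t)         # stand-in for the sampled quantity
t_fine = np.linspace(0.0, 1.0, 10_000)

cubic = CubicSpline(t, y)(t_fine)         # smoother but more expensive
linear = np.interp(t_fine, t, y)          # cheap piecewise-linear fallback

# The accuracy given up by switching to the cheaper interpolant:
print("max |cubic - linear|:", np.max(np.abs(cubic - linear)))
```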
|
|
|
|
|
|
## Minimising serial code
|
|
|
|
|
|
In all parallel codes, the maximum speedup that can be achieved is given by Amdahl's law. There are currently two significant routines that are performed in serial. During serial portions of the code, all worker tasks remain idle.
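For reference, with serial fraction $s$ of the total runtime and $N$ cores, Amdahl's law bounds the speedup as

$$
S(N) = \frac{1}{s + (1 - s)/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s},
$$

so even a 1% serial fraction caps the achievable speedup at 100, regardless of core count.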
|
|
|
|
|
|
The first is the writing of checkpoints which, at the time of testing, occurred every 10 minutes. This is wasteful, as it provides users no real advantage over a longer checkpointing interval. In a recent update, the checkpoint interval was made an adjustable parameter, with the default increased to 1 hour; this is expected to provide a 5-10% improvement in run time. For further optimisation, the serial bottleneck of checkpointing could be eliminated entirely by offloading the I/O to one worker task while the other tasks continue with the next iteration.
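A thread-based variant of that idea is sketched below (in the parallel code, a dedicated MPI worker rank would serve the same role). The state and path arguments are illustrative; dynesty's actual checkpoint format may differ.

```python
import pickle
import threading

def checkpoint_async(sampler_state, path):
    """Snapshot the state while it is consistent, then let a background
    thread do the slow disk write so the main loop can start the next
    iteration immediately. Illustrative sketch, not dynesty's own API."""
    payload = pickle.dumps(sampler_state)        # copy taken up front
    def _write():
        with open(path, "wb") as f:
            f.write(payload)
    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer        # caller should join() before the next checkpoint
```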
|
|
|
|
|
|
The second serial portion is the processing and updating of points at the end of each iteration. Since the next iteration depends on the processing of these points, it cannot begin until this is complete. Investigation into whether these calls can be parallelised or vectorised is necessary.
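As a toy illustration of the vectorisation route, the per-point loop below is replaced by a single array operation; the update rule is made up and does not reflect dynesty's actual end-of-iteration bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
logl = rng.normal(size=4096)            # stand-in for live-point likelihoods
logvol = -np.arange(4096) / 512.0       # stand-in for shrinking volumes

# Per-point Python loop (slow):
logwt_loop = np.array([l + v for l, v in zip(logl, logvol)])

# Vectorised equivalent (fast): one numpy operation over all points.
logwt_vec = logl + logvol
assert np.allclose(logwt_loop, logwt_vec)
```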
|