The speedup from overloading can be retained without the later slowdown by enabling overloading only at the beginning of each run.
# Proposed Optimisations
Measurements show that the overheads associated with MPI communication are insignificant relative to the computation time, so optimisations to the MPI communication pattern are not expected to improve performance. Because these optimisations are trivial to implement, they have nevertheless been included in a modified version of the Schwimmbad library on the ADACS branch, to confirm that there is indeed no effect.
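
As a rough way to bound the communication overhead, one can time a `pool.map` over a deliberately trivial function (almost pure communication) and compare it against a map over an expensive function. The sketch below uses Schwimmbad's documented `MPIPool` interface; the `cheap`/`expensive` task functions and the task counts are illustrative stand-ins, not the Bilby likelihood.

```python
# Minimal sketch: bounding MPI communication overhead with schwimmbad.MPIPool.
# The task functions are illustrative stand-ins, not the Bilby likelihood.
import sys
import time

from schwimmbad import MPIPool


def cheap(x):
    """Near-zero work: the map time is dominated by MPI communication."""
    return x


def expensive(x):
    """Stand-in for a likelihood evaluation (a few ms of pure Python work)."""
    total = 0.0
    for i in range(200_000):
        total += (x * i) % 7
    return total


with MPIPool() as pool:
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    tasks = list(range(1024))

    t0 = time.perf_counter()
    pool.map(cheap, tasks)
    comm_bound = time.perf_counter() - t0

    t0 = time.perf_counter()
    pool.map(expensive, tasks)
    total_time = time.perf_counter() - t0

    print(f"communication-dominated map: {comm_bound:.3f} s")
    print(f"computation-dominated map:   {total_time:.3f} s")
    print(f"communication fraction <=    {comm_bound / total_time:.1%}")
```

Run under MPI in the usual way (e.g. `mpiexec -n 8 python script.py`); the communication-dominated map gives an upper bound on the per-task MPI cost.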
However, several other areas may benefit from optimisation.
## Load balancing
Snapshots show that the dispersion in task length is high at the beginning of each run and decays as iterations progress. Since all MPI workers must wait for the slowest task before the next iteration can begin, this dispersion results in significant wasted time, and the wasted time scales superlinearly with the number of cores. For runs with high core counts and short run times, the dispersion remains high even at the end of the run.
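
The wasted time per iteration can be estimated directly from the per-task durations in the snapshots: every worker idles until the slowest task finishes. A minimal sketch of that bookkeeping, using synthetic task durations rather than measured snapshot data, is:

```python
# Minimal sketch: idle (barrier) time per iteration from per-task durations.
# The duration arrays here are synthetic examples, not measured snapshot data.
import numpy as np

rng = np.random.default_rng(0)


def wasted_time(task_durations):
    """Total worker-time spent idling while waiting for the slowest task."""
    task_durations = np.asarray(task_durations)
    return task_durations.size * task_durations.max() - task_durations.sum()


for n_workers in (16, 128, 512):
    # High dispersion (early iterations) vs low dispersion (late iterations).
    early = rng.exponential(scale=1.0, size=n_workers)
    late = 1.0 + 0.05 * rng.standard_normal(n_workers)
    print(f"{n_workers:4d} workers: "
          f"early idle = {wasted_time(early):7.1f}, "
          f"late idle = {wasted_time(late):7.1f} (worker-units)")
```

For the exponential example above, the expected maximum of n task lengths grows like log n, so the total idle time grows roughly as n log n, consistent with the superlinear scaling described above.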
This problem must be fixed if the code is to scale efficiently beyond 512 cores. Here, we have attempted to reduce the load imbalance by overloading the workers with additional tasks. While this did not improve the overall runtime, it did reduce the task-length dispersion at the beginning of each run. Based on these findings, an adaptive pool size (where the pool is initially overloaded and the overloading is gradually reduced over the course of the run) appears promising. There is currently no simple way to implement this, since it requires modifications to the dynesty library.
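
A minimal sketch of what such an adaptive schedule might look like is given below; the function name, initial factor and decay constant are all hypothetical, since dynesty currently provides no such option.

```python
# Hypothetical sketch of an adaptive overloading schedule: the number of
# tasks queued per worker starts high and decays towards 1 as the run
# progresses. The initial factor and decay constant are illustrative only.
import math


def tasks_per_worker(iteration, initial_factor=4, decay_iterations=500):
    """Return how many tasks to queue per MPI worker at a given iteration."""
    factor = 1 + (initial_factor - 1) * math.exp(-iteration / decay_iterations)
    return max(1, round(factor))


# Example: overloading decays from 4x at the start of the run to 1x later on.
for it in (0, 100, 500, 2000):
    print(it, tasks_per_worker(it))
```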
A more advanced optimisation is to allow iterations to progress without waiting for all workers to return. This may reduce the wasted time by ignoring points that are not necessary for an iteration to progress. The tasks performed by some workers will be wasted, but it should be investigated (mathematically) whether their results can still be reused in the next iteration if they lie within the reduced volume, recovering some of the wasted time. This will require modifications to both the dynesty library and the Schwimmbad library.
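
The reuse check itself is cheap: a late-returning point is only a candidate for reuse if its likelihood exceeds the constraint that defines the current, reduced volume. The sketch below shows only that filter; the class and field names are hypothetical, and whether reusing such points preserves the statistical properties of the nested-sampling run is exactly the mathematical question raised above.

```python
# Hypothetical sketch: filtering late-returning proposals against the current
# likelihood constraint. Whether reused points keep the nested-sampling
# estimates unbiased is the open mathematical question noted in the text.
from dataclasses import dataclass


@dataclass
class Proposal:
    params: tuple              # sampled parameter values
    log_likelihood: float
    drawn_at_logl_min: float   # constraint in force when the point was drawn


def reusable(proposal, current_logl_min):
    """A late point is only a candidate if it satisfies the *current*
    (tighter) constraint, i.e. it lies inside the reduced volume."""
    return proposal.log_likelihood > current_logl_min


late_points = [
    Proposal((0.1, 2.3), log_likelihood=-10.0, drawn_at_logl_min=-12.0),
    Proposal((0.4, 1.1), log_likelihood=-11.5, drawn_at_logl_min=-12.0),
]
current_logl_min = -11.0  # constraint has tightened since these were drawn
print([p.params for p in late_points if reusable(p, current_logl_min)])
```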
## Reducing serial Bilby overheads
Optimisations to the serial Bilby algorithm will translate directly to the parallel version. When optimisations such as ROQ are used, the cost of computing the waveform is decreased; however, other parts of the likelihood evaluation then become dominant. For example, the calibration function requires calls to a cubic interpolation routine, which is more expensive than an ROQ waveform calculation. If it can be determined that calibration has only a minor effect on the results, then it should be considered whether this feature should be disabled. In ROQ runs, the coalescence time is calculated using another cubic interpolation function; replacing this with a linear interpolation results in a significant speed-up, at the cost of some accuracy. Further investigation into the trade-off between result accuracy (linked to calibration and time interpolation) and solution time (for scenarios requiring low latency) is required.
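
The cost difference between the two interpolation schemes is easy to demonstrate in isolation. The sketch below times `scipy.interpolate.CubicSpline` against `numpy.interp` on an arbitrary smooth function; it is not the Bilby calibration or time-interpolation code, and the grid sizes are illustrative.

```python
# Minimal sketch: relative cost and accuracy of cubic vs linear interpolation.
# The interpolated function and grid sizes are arbitrary illustrations, not
# the Bilby calibration spline or the ROQ time-interpolation routine.
import time

import numpy as np
from scipy.interpolate import CubicSpline

x_grid = np.linspace(0.0, 10.0, 200)    # coarse grid of known values
y_grid = np.sin(x_grid)
x_eval = np.random.default_rng(1).uniform(0.0, 10.0, 100_000)

spline = CubicSpline(x_grid, y_grid)

t0 = time.perf_counter()
y_cubic = spline(x_eval)
t_cubic = time.perf_counter() - t0

t0 = time.perf_counter()
y_linear = np.interp(x_eval, x_grid, y_grid)
t_linear = time.perf_counter() - t0

truth = np.sin(x_eval)
print(f"cubic:  {t_cubic * 1e3:.2f} ms, "
      f"max error {np.abs(y_cubic - truth).max():.2e}")
print(f"linear: {t_linear * 1e3:.2f} ms, "
      f"max error {np.abs(y_linear - truth).max():.2e}")
```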
## Minimising serial code
In all parallel codes, the maximum speedup that can be achieved is given by Amdahl's law. There are currently two significant routines that are performed in serial.
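
For reference, Amdahl's law gives the maximum speedup on n cores as 1 / ((1 - p) + p/n), where p is the fraction of the run time that is parallelisable. The short sketch below evaluates this for a few illustrative serial fractions; the values are examples, not measurements of Bilby.

```python
# Amdahl's law: maximum speedup on n cores when a fraction p of the work is
# parallelisable. The serial fractions below are illustrative, not measured.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)


for serial_fraction in (0.01, 0.05, 0.10):
    p = 1.0 - serial_fraction
    speedups = {n: amdahl_speedup(p, n) for n in (64, 512, 4096)}
    print(f"{serial_fraction:.0%} serial:",
          ", ".join(f"{n} cores -> {s:.0f}x" for n, s in speedups.items()))
```

Even a 1% serial fraction caps the speedup at roughly 84x on 512 cores, which is why the two serial routines below matter.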
The first is the writing of checkpoints, which, at the time of testing, occurred every 10 minutes. This is wasteful, as such frequent checkpointing provides users no real advantage. In a recent update, the checkpoint interval was made an adjustable parameter, with the default increased to 1 hour. This is expected to provide a 5-10% improvement in run time.
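
The expected gain follows from simple bookkeeping: the fraction of wall time spent checkpointing is roughly the per-checkpoint cost divided by the checkpoint interval. The numbers in the sketch below, in particular the per-checkpoint cost, are illustrative assumptions rather than measurements.

```python
# Rough checkpoint-overhead estimate: overhead fraction ~ cost / interval.
# The per-checkpoint cost of 45 s is an illustrative assumption.
checkpoint_cost_s = 45.0

for interval_s in (10 * 60, 60 * 60):   # 10 minutes vs 1 hour
    overhead = checkpoint_cost_s / interval_s
    print(f"interval {interval_s / 60:>4.0f} min: "
          f"~{overhead:.1%} of wall time spent checkpointing")
```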
The second is the processing and updating of points at the end of each iteration. Investigation into whether these calls can be parallelised or vectorised is necessary.