|
|
|
|
|
Runs using an environment entirely provided by conda (including OpenMPI) showed no significant difference in performance, nor did they show significant system time on the job monitor. So far it has not been possible to reproduce the observed system time usage. The next time this is observed, the input parameters and Python environment should be noted for further investigation.
|
|
|
|
|
|
|
|
|
|
## Effect of nlive
|
|
|
|
The effect of increasing `nlive` from 1000 to 2000 and then 4000 was measured, to determine whether this parameter imposes a ceiling on scaling. While scaling improves slightly for `nlive=2000`, this does not appear to be the main bottleneck.
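
For reference, `nlive` is the number of live points used by the nested sampler. A minimal sketch of where it enters, assuming the sampler is dynesty (suggested by the `sample_unif` calls in the profiles below) and using a toy model in place of the real likelihood:

```python
import numpy as np
import dynesty

ndim = 4

def loglikelihood(x):
    """Toy Gaussian log-likelihood standing in for the real model."""
    return -0.5 * np.sum(x**2)

def prior_transform(u):
    """Map the unit cube to a flat prior on [-10, 10] in each dimension."""
    return 20.0 * u - 10.0

# nlive is the parameter varied in this test (1000 -> 2000 -> 4000);
# in the production runs the sampler would also be given the MPI pool.
sampler = dynesty.NestedSampler(loglikelihood, prior_transform, ndim, nlive=2000)
sampler.run_nested()
```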
|
|
|
|
|
|
|
|

|
|
|
|
|
|
# Profiling
|
|
|
|
|
|
|
|
## Modifications to the Schwimmbad library
|
|
|
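In the standard usage pattern from the schwimmbad documentation, the worker-guard lines

```python
if not pool.is_master():
    pool.wait()
    sys.exit(0)
```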
|
are never actually executed, because the `__init__` function of `MPIPool` already has these lines in it. The worker tasks wait and execute in `__init__` and then call `sys.exit(0)` once the pool is closed, and never get to any of the subsequent code. The code still behaves as expected, and this issue has no effect on performance or the results.
|
|
|
|
|
|
|
|
`sys.exit(0)` prevents most profilers from functioning properly, since the task is killed before the profiler gets a chance to write any output. Modifications have been made to allow MPI tasks to persist after the pool is closed.
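
The patch itself is not reproduced here. As an illustration only, a similar effect can be obtained without modifying the installed package, by relying on the behaviour described above (worker ranks serve the pool inside `MPIPool.__init__` and then call `sys.exit(0)`):

```python
from schwimmbad import MPIPool

def make_pool():
    """Return (pool, is_worker); worker ranks fall through instead of exiting.

    Sketch only: sys.exit(0) raises SystemExit, so catching it here lets a
    worker return to the driver script after the pool is closed, giving a
    profiler the chance to stop and write its output before the process ends.
    """
    try:
        return MPIPool(), False
    except SystemExit:
        return None, True
```

The driver script can then stop its profiler and write the report whenever `is_worker` is true, before exiting normally.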
|
|
|
|
|
|
|
|
|
|
## PyInstrument
|
|
|
|
PyInstrument is used to measure the time spent in function calls. Workers spend most of their time in the `wait` function, where they wait for a task to be assigned by the master task, compute the task, and then return the results. Within the `wait` function call, time is spent performing the computation (for example, calling `sample_unif`). These measurements are used to determine the fraction of time spent on actual computation versus communication and other overhead. PyInstrument itself imposes a small overhead, but does not significantly alter the overall picture.
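
As a sketch of how such per-rank profiles can be collected (illustrative, and assuming the workers persist after the pool closes as described above), each MPI task can run its own PyInstrument profiler and write a separate report:

```python
import time

from mpi4py import MPI
from pyinstrument import Profiler

def run_analysis():
    """Placeholder standing in for the real sampling run."""
    time.sleep(1.0)

rank = MPI.COMM_WORLD.Get_rank()

profiler = Profiler()
profiler.start()
run_analysis()
profiler.stop()

# One report per MPI rank; for workers, most of the time should show up
# under the pool's `wait` function, as described above.
with open(f"profile_rank{rank:03d}.txt", "w") as fh:
    fh.write(profiler.output_text(unicode=True, color=False))
```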
|
|
|
|
|
|
|
|
The total CPU time (i.e., summed over all cores) shows that the overhead increases significantly with the number of cores, while the compute time remains roughly constant.
|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|
## Detailed MPI timers
|
|
|
|
Time-dependent measurements of where the time is spent are necessary to pinpoint the cause of the overhead. Runs with 8 and 32 cores should show good performance, 64 cores should show any effect of inter-node communication, and 256 cores, where the efficiency falls off significantly, is where problems should reveal themselves.
|
|
|
|
|
|
|
|
The five clocks are listed below; a sketch of this style of instrumentation follows the list.
|
|
|
|
- **compute**: time spent doing the actual computations, which should ideally be 100%. For 8 and 32 cores, this gets very close.
|
|
|
|
- **mpi_recv**: time that a worker spends receiving a job from the master
|
|
|
|
- **mpi_send**: time that a worker spends sending the results back to the master. The good news is that the time spent doing sends and receives is negligible.
|
|
|
|
- **barrier**: although there are no explicit MPI barriers in the code, this measures the time a worker sits idle waiting for other workers to finish. In this case, it occurs when some workers take longer than others to finish.
|
|
|
|
- **master_serial**: time when a worker is doing nothing because the master is executing serial code. This should ideally be zero, since by Amdahl's law any serial fraction limits the achievable speedup.
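
A minimal sketch of how per-worker clocks like these can be accumulated using `MPI.Wtime()` from mpi4py (illustrative only, not the actual instrumentation added to the code; the task protocol and the `None` sentinel are assumptions):

```python
from collections import defaultdict
from contextlib import contextmanager

from mpi4py import MPI

comm = MPI.COMM_WORLD
clocks = defaultdict(float)   # accumulators: "compute", "mpi_recv", "mpi_send", ...

@contextmanager
def clock(name):
    """Add the wall time elapsed inside the block to the named clock."""
    t0 = MPI.Wtime()
    try:
        yield
    finally:
        clocks[name] += MPI.Wtime() - t0

def worker_loop(compute):
    """Hypothetical worker loop: receive a task, compute it, send the result back.

    The barrier and master_serial clocks are not shown; they need extra
    bookkeeping for the idle time between the last send and the next receive.
    """
    while True:
        with clock("mpi_recv"):
            task = comm.recv(source=0, tag=MPI.ANY_TAG)
        if task is None:                  # assumed sentinel meaning "pool closed"
            break
        with clock("compute"):
            result = compute(task)
        with clock("mpi_send"):
            comm.send(result, dest=0, tag=0)
```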
|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|
There are 2 interesting points to take away from this:
|
|
|
|
1. The serial time is dominated by writing checkpoints: a large spike is visible every 10 minutes (the high core count jobs finish sooner, so they have fewer data points). Checkpoints are insurance against system crashes, and losing an hour of walltime in such a rare event is not a big deal, so checkpointing every 10 minutes pays a large cost for very little extra benefit. At 256 cores, the code spends over 15% of its CPU time writing checkpoints. Put another way, the time saved by checkpointing less frequently could simply be used to re-compute the lost work in the rare event of a system crash; a rough cost comparison is sketched after this list. Note that Amdahl's law cannot be invoked directly here, because the snapshots are triggered by wall time, not by iterations (CPU time).
|
|
|
|
2. The barrier time (a measure of the dispersion of finishing times of the workers) starts quite high (20-40%) before decreasing to zero. This is the other reason for poor scaling. For the 256-core case, the job finishes before the dispersion drops to zero.
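
As a rough illustration of the checkpointing trade-off, compare the guaranteed overhead of frequent checkpoints with the expected cost of recomputing lost work; the run length and failure probability below are assumed numbers, and only the 15% overhead comes from the measurements above:

```python
# Back-of-the-envelope comparison of checkpointing strategies.
walltime_h = 10.0       # assumed length of a run, in hours
ckpt_overhead = 0.15    # from the profiles: >15% of time spent checkpointing at 256 cores
crash_prob = 0.01       # assumed probability that a given run is hit by a system crash

# 10-minute checkpoints: the overhead is paid on every run.
cost_frequent = ckpt_overhead * walltime_h

# Hourly checkpoints: at most ~1 hour of work is lost, and only if a crash occurs
# (the much smaller overhead of the hourly checkpoint writes is neglected here).
cost_hourly = crash_prob * 1.0

print(f"guaranteed overhead of 10-minute checkpoints: {cost_frequent:.2f} hours per run")
print(f"expected loss with hourly checkpoints:        {cost_hourly:.2f} hours per run")
```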
|
|
|