pBilby has been demonstrated to reduce the wall time of GW inference as expected from the scaling relation C*ln(1 + N_cores/C). However, it was recently discovered that a large fraction (>50%) of the total CPU time is *system time* rather than *user time*, meaning that the bulk of the run time is spent executing code in kernel space. From an algorithmic perspective this is surprising: MPI communication should be cheap relative to the expensive function calls that ought to dominate the run time. In practice, this means that the overall wall time of pBilby is a factor of a few higher than it ought to be. **The goal of the proposed project is to optimize pBilby by understanding and removing inefficiencies due to excessive kernel calls.**
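
For reference when reading the measurements below, the expected speedup can be evaluated directly from the scaling relation above. The sketch below does exactly that; the value of `C` and the core counts are illustrative placeholders, not values fitted to pBilby runs.

```python
# Minimal sketch of the expected scaling quoted above. Only the functional
# form C * ln(1 + N_cores / C) comes from the text; the value of C and the
# core counts are illustrative placeholders.

import numpy as np


def expected_speedup(n_cores, C):
    """Theoretical speedup C * ln(1 + n_cores / C)."""
    return C * np.log(1.0 + n_cores / C)


if __name__ == "__main__":
    C = 500.0                                        # placeholder value only
    cores = np.array([8, 16, 32, 64, 128, 256, 512])
    speedup = expected_speedup(cores, C)

    # Efficiency relative to the smallest run, mirroring how the scaling
    # measurements below are normalised to the 8-core case.
    efficiency = (speedup / speedup[0]) / (cores / cores[0])
    for n, e in zip(cores, efficiency):
        print(f"{n:4d} cores: predicted relative efficiency {e:.2f}")
```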
# Scaling
## Baseline case

- Using config [GW150914_config.ini](uploads/62b85731c21429dba46c8a5d68819747/GW150914_config.ini)
- 32 cores on OzSTAR Skylake nodes (one whole node)
- Modules:

```
ml purge
ml gni/2020.0
ml python/3.7.4
ml numpy/1.18.2-python-3.7.4
ml scipy/1.4.1-python-3.7.4
ml matplotlib/3.2.1-python-3.7.4
ml git/2.18.0
ml mpi4py/3.0.3-python-3.7.4
```

**Sampling wall time = 3:29:27**
## Scaling measurements
All parameters are kept the same as in the baseline case except for the number of cores. For more than 32 cores, whole nodes are used (multiples of 32 cores); for fewer than 32 cores, a fraction of one node is used. Efficiency is calculated relative to the 8-core run. Scaling is close to perfect up to 128 cores and then diminishes rapidly. Given that 0.75 is the minimum acceptable efficiency in the OzSTAR terms of service, at most 256 MPI tasks should be used.
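
For concreteness, the sketch below shows this bookkeeping, assuming the usual definition of relative efficiency (core-hours of the 8-core reference divided by core-hours of the N-core run). The wall times are placeholders, except the 32-core entry, which is the baseline value quoted above.

```python
# Sketch of the efficiency bookkeeping described above. Relative efficiency
# is taken as (T_8 * 8) / (T_N * N). All wall times are placeholders except
# the 32-core entry, which is the baseline wall time quoted above
# (3:29:27 = 12567 s).

wall_time_s = {      # cores -> sampling wall time in seconds
    8: 48000.0,      # placeholder
    32: 12567.0,     # baseline case above
    128: 3300.0,     # placeholder
    256: 1900.0,     # placeholder
    512: 1300.0,     # placeholder
}

ref_cores = 8
ref_time = wall_time_s[ref_cores]

for n_cores in sorted(wall_time_s):
    eff = (ref_time * ref_cores) / (wall_time_s[n_cores] * n_cores)
    status = "OK" if eff >= 0.75 else "below the 0.75 OzSTAR threshold"
    print(f"{n_cores:4d} cores: efficiency = {eff:.2f} ({status})")
```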

## Comparison with sstar (Sandy Bridge) nodes
Runs were performed using 14 to 448 cores. Across all core counts, the sstar runs were consistently about 0.65x the speed of the equivalent Skylake runs, which is expected for the older Sandy Bridge hardware. The efficiency scaling is the same as on Skylake.

## Comparison with conda pre-built binaries
For all of the above scaling tests, the large fraction of system time could not be reproduced on the [OzSTAR Job Monitor](https://supercomputing.swin.edu.au/monitor). One possibility is that the pre-built binaries and MPI libraries provided by conda have their function calls classified differently (e.g. as I/O or MPI communication). Aside from this, codes will generally run faster using the modules provided on OzSTAR, since these are specifically compiled and optimised for the system.
Runs using an environment entirely provided by conda (including OpenMPI) showed no significant difference in performance, nor did they show significant system time on the job monitor. So far it has not been possible to reproduce the observed system time usage. The next time this is observed, the input parameters and Python environment should be noted for further investigation.
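
A small helper along these lines is sketched below. It records the interpreter, MPI library, and key package versions so a problematic run can be compared against the module-based setup. The package list and output filename are arbitrary choices; `pkg_resources` is used because the baseline environment is Python 3.7.

```python
# Sketch of a helper for recording the Python/MPI environment of a run, as
# suggested above. The package list and output filename are arbitrary;
# pkg_resources is used since the baseline environment is Python 3.7
# (importlib.metadata requires 3.8+).

import json
import sys

import pkg_resources
from mpi4py import MPI

PACKAGES = ("numpy", "scipy", "mpi4py", "bilby", "parallel_bilby", "schwimmbad")


def snapshot_environment():
    info = {
        "python": sys.version,
        "executable": sys.executable,   # distinguishes a conda env from the OzSTAR modules
        "mpi_library": MPI.Get_library_version(),
        "packages": {},
    }
    for name in PACKAGES:
        try:
            info["packages"][name] = pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            info["packages"][name] = "not installed"
    return info


if __name__ == "__main__":
    if MPI.COMM_WORLD.Get_rank() == 0:   # only one rank needs to write the file
        with open("environment_snapshot.json", "w") as f:
            json.dump(snapshot_environment(), f, indent=2)
```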
# Profiling
## Modifications to the Schwimmbad library
There is some misunderstanding of how the library should be used; the Schwimmbad documentation itself contains incorrect usage instructions.
The lines of code for the worker tasks
```
if not pool.is_master():
    pool.wait()
```
are never actually executed, because the `__init__` method of `MPIPool` already contains these lines. The worker tasks wait for and execute work inside `__init__`, then call `sys.exit(0)` once the pool is closed, and never reach any of the subsequent code. The code still behaves as expected, and this issue has no effect on performance or on the results.
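
A minimal example of this usage pattern, in the style of the Schwimmbad documentation, is shown below; the `work` function is illustrative. The commented branch is the dead code described above.

```python
# Minimal example of the documented usage pattern. The `work` function is
# illustrative. Worker ranks never get past the MPIPool() call itself: they
# enter wait() inside __init__ and (in the stock library) exit the
# interpreter from there once the pool is closed, so the explicit
# is_master()/wait() branch below is dead code.

import sys

from schwimmbad import MPIPool


def work(x):
    return x ** 2


with MPIPool() as pool:            # worker ranks block, and later exit, inside this call
    if not pool.is_master():       # never reached by worker ranks ...
        pool.wait()                # ... so this wait() is never executed
        sys.exit(0)
    print(list(pool.map(work, range(16))))
```

Run under MPI as usual, e.g. `mpirun -n 4 python example.py`; only the master rank ever reaches `pool.map`.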
`sys.exit(0)` prevents most profilers from functioning properly, since the worker task is killed before the profiler gets a chance to write any output. Modifications have been made to allow MPI worker tasks to persist after the pool is closed.
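
The sketch below illustrates one way to achieve this effect; it is not the actual patch. It assumes, as described above, that `MPIPool.__init__` calls `sys.exit(0)` on worker ranks via the `sys` module, and simply makes that call a no-op while the constructor runs so that workers return to the caller and the interpreter can shut down normally.

```python
# Illustrative sketch only, not the actual modification: keep worker ranks
# alive after the pool closes so a profiler can write its output. Assumes
# MPIPool.__init__ calls sys.exit(0) through the sys module, as described
# above.

import sys

from schwimmbad import MPIPool


class PersistentMPIPool(MPIPool):
    """MPIPool whose worker ranks return to the caller instead of exiting."""

    def __init__(self, *args, **kwargs):
        # Make sys.exit a no-op while the parent constructor runs. Worker
        # ranks still sit in wait() inside __init__, but once the pool is
        # closed they fall through here instead of killing the interpreter
        # (and any profiler attached to it).
        real_exit = sys.exit
        sys.exit = lambda code=0: None
        try:
            super().__init__(*args, **kwargs)
        finally:
            sys.exit = real_exit


def work(x):
    return x ** 2


pool = PersistentMPIPool()
if pool.is_master():
    print(list(pool.map(work, range(16))))
    pool.close()
# Every rank reaches this point after the pool is closed and exits normally,
# giving profilers a chance to flush their results.
```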
\ No newline at end of file |