Overview of a study into the parallelization scaling of dynesty
Quick test v1
Overview
I've set up a simple `bilby_pipe` job which runs a fiducial BBH injection and recovery. To speed things up, I've reduced the model to a non-spinning system, used all available marginalizations, and reduced the sampler settings from our reviewed settings (using `nact=5` rather than the reviewed `nact=10`). The reviewed settings are highly conservative, but the `nact=5` case produces results in about half the time and is sufficient for testing purposes (the outputs are sufficiently close to what they should be).
To run the job, copy the `config.ini` and prior files (see the drop-downs below) and run
$ bilby_pipe config.ini --request-cpus 16 --outdir OUTDIR
Note that the command-line arguments override the `request-cpus=` and `outdir=` values given in the `config.ini` file itself, which makes it easy to script running several different jobs. During the setup, `--request-cpus 16` ensures that (a) the HTCondor submit files contain `request_cpus = 16` and (b) the sampler is instructed to use 16 cores (specifically, the file `OUTDIR/test_config_complete.ini` contains the line `sampler-kwargs = {npool: 16}`).
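A quick way to confirm the pool size was propagated (a minimal check of my own; `OUTDIR` is whatever output directory you passed to `bilby_pipe`):

```python
# Print the sampler-kwargs line from the complete ini written by bilby_pipe,
# to confirm that npool was set to the requested number of CPUs.
with open("OUTDIR/test_config_complete.ini") as f:
    for line in f:
        if line.strip().startswith("sampler-kwargs"):
            print(line.strip())  # should include an npool: 16 entry
```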
If you want to submit the jobs at the same time, add `--submit` to the command-line call. For further details on `bilby_pipe` ini files, see the documentation.
Files
config.ini
a modified version of the 4s review test
################################################################################
## Calibration arguments
################################################################################
calibration-model=None
spline-calibration-envelope-dict=None
spline-calibration-nodes=5
spline-calibration-amplitude-uncertainty-dict=None
spline-calibration-phase-uncertainty-dict=None
################################################################################
## Data generation arguments
################################################################################
ignore-gwpy-data-quality-check=True
gps-tuple=None
gps-file=None
timeslide-file=None
timeslide-dict=None
trigger-time=None
gaussian-noise=True
n-simulation=0
data-dict=None
data-format=None
channel-dict=None
################################################################################
## Detector arguments
################################################################################
coherence-test=False
detectors=['H1', 'L1']
duration=4
generation-seed=1010
psd-dict=None
psd-fractional-overlap=0.5
post-trigger-duration=2.0
sampling-frequency=4096
psd-length=32
psd-maximum-duration=1024
psd-method=median
psd-start-time=None
maximum-frequency=1024
minimum-frequency=20
zero-noise=False
tukey-roll-off=0.4
resampling-method=lal
################################################################################
## Injection arguments
################################################################################
injection=True
injection-dict={'chirp_mass': 40.051544979894693, 'mass_ratio': 0.9183945489993522, 'a_1': 0., 'a_2': 0., 'tilt_1': 0., 'tilt_2': 0., 'phi_12': 0., 'phi_jl': 0., 'luminosity_distance': 1000., 'dec': 0.2205292600865073, 'ra': 3.952677097361719, 'theta_jn': 1.8795187965094322, 'psi': 2.6973435044499543, 'phase': 3.686990398567503, 'geocent_time': 0.040833669551002205}
injection-file=None
injection-numbers=None
injection-waveform-approximant=None
################################################################################
## Job submission arguments
################################################################################
accounting=ligo.dev.o3.cbc.pe.lalinference
label=test
local=False
local-generation=False
local-plot=False
request-cpus=16
outdir=outdir_1
periodic-restart-time=28800
request-memory=4.0
request-memory-generation=None
singularity-image=None
scheduler=condor
scheduler-args=None
scheduler-module=None
scheduler-env=None
submit=False
transfer-files=False
log-directory=None
online-pe=False
osg=False
################################################################################
## Likelihood arguments
################################################################################
distance-marginalization=True
distance-marginalization-lookup-table=TDP.npz
phase-marginalization=True
time-marginalization=True
jitter-time=True
reference-frame=sky
time-reference=geocent
likelihood-type=GravitationalWaveTransient
roq-folder=None
roq-scale-factor=1
extra-likelihood-kwargs=None
################################################################################
## Output arguments
################################################################################
create-plots=True
plot-calibration=False
plot-corner=False
plot-marginal=False
plot-skymap=False
plot-waveform=False
plot-format=png
create-summary=True
email=gregory.ashton@ligo.org
existing-dir=None
webdir=None
summarypages-arguments=None
################################################################################
## Prior arguments
################################################################################
default-prior=BBHPriorDict
deltaT=0.2
prior-file=4s_modified.prior
prior-dict=None
convert-to-flat-in-component-mass=False
################################################################################
## Post processing arguments
################################################################################
postprocessing-executable=None
postprocessing-arguments=None
single-postprocessing-executable=None
single-postprocessing-arguments=None
################################################################################
## Sampler arguments
################################################################################
sampler=dynesty
sampling-seed=None
n-parallel=4
sampler-kwargs={nact: 5, nlive: 1000}
################################################################################
## Waveform arguments
################################################################################
waveform-generator=bilby.gw.waveform_generator.WaveformGenerator
reference-frequency=100
waveform-approximant=IMRPhenomPv2
catch-waveform-errors=True
pn-spin-order=-1
pn-tidal-order=-1
pn-phase-order=-1
pn-amplitude-order=0
mode-array=None
frequency-domain-source-model=lal_binary_black_hole
4s_modified.prior
a modified version of the 4s prior (0 spin)
chirp_mass = Uniform(name='chirp_mass', minimum=12.299703, maximum=45, unit='$M_{\odot}$')
mass_ratio = Uniform(name='mass_ratio', minimum=0.125, maximum=1)
a_1 = 0
a_2 = 0
tilt_1 = 0
tilt_2 = 0
phi_12 = 0
phi_jl = 0
luminosity_distance = PowerLaw(alpha=2, name='luminosity_distance', minimum=1e2, maximum=5e3)
dec = Cosine(name='dec')
ra = Uniform(name='ra', minimum=0, maximum=2 * np.pi, boundary='periodic')
theta_jn = Sine(name='theta_jn')
psi = Uniform(name='psi', minimum=0, maximum=np.pi, boundary='periodic')
phase = Uniform(name='phase', minimum=0, maximum=2 * np.pi, boundary='periodic')
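As a quick sanity check (not needed for the run itself), the prior file can be loaded directly with bilby and sampled:

```python
# Load the prior file with bilby and draw a sample to confirm it parses.
# The fixed parameters (a_1 = 0, etc.) are read in as delta-function priors.
import bilby

priors = bilby.gw.prior.BBHPriorDict(filename="4s_modified.prior")
print(priors.sample())
```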
Scaling test using any node on CIT
I ran this bash script using the files above
for i in {1..20}; do
    bilby_pipe config.ini --request-cpus $i --outdir outdir_${i}
done
For the cases with `request_cpus > 16`, I manually modified the `.submit` files to request only 16 CPUs. In this sense, everything above 16 is instructing the sampler to use more cores than the job requested.
This is the speed-up (the effective time per likelihood evaluation, relative to the 1-core job):
Notes
- The speed-up continues above n=16; clearly there are more cores available, and bilby is stealing resources it did not request.
- There is some variability between results, which is likely due to variance in the performance of the randomly selected nodes.
Scaling test using node 1997 on CIT
At Stuart's recommendation, I ran the job above again, but restricted the submission to node1997 only, which has 8 available cores. The run script to generate the jobs and overwrite the submit files with the requirements, etc., is:
for i in {1..8}; do
    bilby_pipe config.ini --request-cpus $i --outdir outdir_${i}
    # Pin the job to node1997 by inserting a Requirements line (keyed on a
    # classad set on that node) before the queue statement
    sed -i 's/queue/+Optimization_SkyHough = True\nRequirements = (TARGET.Optimization_SkyHough =?= True)\nqueue/g' outdir_${i}/submit/*par*submit
done
for i in {9..14}; do
    bilby_pipe config.ini --request-cpus $i --outdir outdir_${i}
    sed -i 's/queue/+Optimization_SkyHough = True\nRequirements = (TARGET.Optimization_SkyHough =?= True)\nqueue/g' outdir_${i}/submit/*par*submit
    # Cap the actual HTCondor request at the node's 8 cores; the sampler is
    # still instructed to use $i cores via npool
    sed -i "s/request_cpus = ${i}/request_cpus = 8/g" outdir_${i}/submit/*par*submit
done
Here is the timing plot (both the total timing and the per-likelihood-call timing):
Here is the speed-up relative to the 1-core job:
Notes
- This demonstrates the linear scaling
- This demonstrates that if the code is instructed to use more cores than are available, we reach a plateau
Update: 04/12/2020
After it was pointed out that the scaling was linear, but that the gradient was not close to one, I studied the behaviour in a little more detail. First, here is a re-run of the data above showing different gradients:
Here, it looks like the gradient is ~0.5. This is at odds with the pbilby paper, which demonstrated a speed-up close to the theoretically expected behaviour (Eq. 10):
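For reference, the expected speed-up for nested sampling with `n_live` live points parallelized over `n_cores` cores (quoting the pbilby result; the notation here is mine) is

```math
S(n_{\rm cores}) = n_{\rm live} \ln\left(1 + \frac{n_{\rm cores}}{n_{\rm live}}\right)
```

With `nlive=1000`, this predicts essentially linear scaling over the 1-16 core range studied here (S ≈ 15.9 at 16 cores), i.e. a gradient close to one.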
After digging in, I realized that the single-core job used far fewer likelihood evaluations than the parallelized versions (a factor of ~2.5-3 fewer). Here is a table of the number of evaluations:
| n cores | # likelihood evaluations [millions] |
|---|---|
| 1 | 0.59 |
| 4 | 1.5 |
| 8 | 1.5 |
| 12 | 1.5 |
| 16 | 1.6 |
So, this offers two ways to calculate the speed-up: the usual "total time" method, or a "per-likelihood" method. On the per-likelihood basis, things look much better! (At 16 cores, for example, the total-time speed-up is suppressed by a factor of 0.59/1.6 ≈ 0.37 relative to the per-likelihood speed-up.)
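For concreteness, here is a minimal sketch of how the two measures can be computed from the bilby result files (illustrative only, not the exact script I used; the glob pattern is an assumption and depends on the bilby_pipe labels):

```python
"""Compute total-time and per-likelihood speed-ups from bilby result files."""
import glob

import bilby


def timing(outdir):
    """Return (sampling time, number of likelihood evaluations) for one run."""
    # The filename pattern is an assumption -- point this at the *_result.json
    # that bilby_pipe writes under <outdir>/result/
    path = glob.glob(f"{outdir}/result/*_result.json")[0]
    result = bilby.core.result.read_in_result(path)
    # Note: sampling_time may be a datetime.timedelta in some bilby versions;
    # the ratios below work either way
    return result.sampling_time, result.num_likelihood_evaluations


t1, n1 = timing("outdir_1")
for i in range(2, 17):
    ti, ni = timing(f"outdir_{i}")
    print(
        f"{i} cores: total-time speed-up = {t1 / ti:.2f}, "
        f"per-likelihood speed-up = {(t1 / ti) * (ni / n1):.2f}"
    )
```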
Of course, what we really care about is "total time". So, some conclusions:
- The parallel algorithm is different from the serial algorithm.
- The parallel algorithm is about 2-3 times less efficient than the serial algorithm.
- This explains the difference in speed-ups (the pbilby speed-ups were measured per-likelihood).
- It is worth stating: while it is less efficient, the parallel algorithm does let you scale!
- This suggests the parallel algorithm could be improved, yielding up to a factor of ~3 in speed gains.
Note: for the first run of the update, the ratio of likelihood evaluations between the serial and parallel jobs was ~2.8, while for the second run it was ~2.7.