Runs not resuming from checkpoint if aborted on cluster
If a bilby_pipe run gets aborted for some reason, relaunching it with condor_submit_dag outdir/submit/dag_*.submit fails to resume from the previous checkpoint and instead starts sampling from the beginning.
The .err
file then shows a message like:
00:14 bilby INFO : Resume file ks_xhm_out/result/test_ks_xhm_run04_data0_0-0_analysis_H1L1V1_dynesty_par0_resume.pickle does not exist.
00:14 bilby INFO : Generating initial points from the prior
even though the above-mentioned *par0_resume.pickle file exists in the correct location.
An example of this can be seen with the par0 pipeline in this directory on CIT: /home/divyajyoti.nln/SIQM_HM/test_run_kappa_bilby_4Jan2022_env/run04/ks_xhm/ks_xhm_out/log_data_analysis.
The pipeline was aborted at dlogz=1.099, but on resuming it started sampling from the beginning again, as the tail of the .out file shows (dlogz jumps back to 230.721):
tail *out
==> test_ks_xhm_run04_data0_0-0_analysis_H1L1V1_dynesty_par0.out <==
6819it [4:38:09, 2.45s/it, bound:1943 nc: 1 ncall:6.4e+05 eff:1.1% logz-ratio=707.57+/-0.68 dlogz:1.099>0.1]
11935it [1:16:49, 2.73s/it, bound:3729 nc:101 ncall:1.2e+06 eff:1.0% logz-ratio=593.80+/-0.33 dlogz:230.721>0.1]
Also note that, since the other 3 pipelines had finished before the job was aborted, they were unaffected when the run was relaunched.