Runs not resuming from checkpoint if aborted on cluster

If a bilby_pipe run gets aborted for some reason, relaunching it with condor_submit_dag outdir/submit/dag_*.submit fails to resume from the previous checkpoint and instead starts sampling from the beginning. The .err file then shows a message like:

00:14 bilby INFO    : Resume file ks_xhm_out/result/test_ks_xhm_run04_data0_0-0_analysis_H1L1V1_dynesty_par0_resume.pickle does not exist.
00:14 bilby INFO    : Generating initial points from the prior

even though the above-mentioned *par0_resume.pickle exists in the correct location.
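One way to rule out a corrupted or unreadable checkpoint is to confirm that the resume pickle both exists and unpickles cleanly from the directory the job runs in. The sketch below is illustrative only: the helper name and the dummy file it checks are hypothetical stand-ins, and in practice the path would point at the real *par0_resume.pickle under ks_xhm_out/result/.

```python
import pickle
import pathlib
import tempfile

# Hypothetical stand-in for the real resume file; in practice, point this at
# the actual ks_xhm_out/result/*_par0_resume.pickle path.
tmp = pathlib.Path(tempfile.mkdtemp()) / "par0_resume.pickle"
tmp.write_bytes(pickle.dumps({"it": 6819, "dlogz": 1.099}))

def resume_file_is_readable(path):
    """Return True if the checkpoint file exists and unpickles cleanly."""
    p = pathlib.Path(path)
    if not p.is_file():
        return False
    try:
        with open(p, "rb") as f:
            pickle.load(f)
        return True
    except (pickle.UnpicklingError, EOFError):
        return False

print(resume_file_is_readable(tmp))  # True for the dummy file above
```

Note that the check is sensitive to the working directory: if the relaunched job resolves the relative outdir path from a different location than the original run, is_file() will return False even though the pickle exists on disk.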

An example of this can be seen with the par0 pipeline in this directory on CIT: /home/divyajyoti.nln/SIQM_HM/test_run_kappa_bilby_4Jan2022_env/run04/ks_xhm/ks_xhm_out/log_data_analysis. The pipeline was aborted at dlogz=1.099, but on resuming it again started sampling from the beginning.

tail *out
==> test_ks_xhm_run04_data0_0-0_analysis_H1L1V1_dynesty_par0.out <==
6819it [4:38:09,  2.45s/it, bound:1943 nc:  1 ncall:6.4e+05 eff:1.1% logz-ratio=707.57+/-0.68 dlogz:1.099>0.1]  
11935it [1:16:49,  2.73s/it, bound:3729 nc:101 ncall:1.2e+06 eff:1.0% logz-ratio=593.80+/-0.33 dlogz:230.721>0.1]

Also note that, since the other three pipelines had finished before the job was aborted, they were unaffected when the run was relaunched.

Edited by Divyajyoti NLN