Pickle dump entire sampler in dynesty
We've noticed some pretty horrendous issues with restarting after a checkpoint recently.
I think that this is due to not saving all of the relevant state information.
This MR ensures the whole sampler will be saved.
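For illustration, here is a minimal sketch of the idea of pickling the whole sampler object rather than a hand-picked subset of its attributes. It is not the code in this MR: `make_toy_sampler`, `write_checkpoint` and `read_checkpoint` are hypothetical names, and the toy sampler is a stand-in for the real dynesty object. The point is that hidden state (here, the random number generator) survives the round trip.

```python
import os
import pickle
import random
import tempfile
import types


def make_toy_sampler(seed=42):
    """Hypothetical stand-in for a sampler: visible state plus hidden RNG state."""
    return types.SimpleNamespace(iteration=0, scale=1.0, rng=random.Random(seed))


def write_checkpoint(sampler, resume_file):
    """Pickle the entire sampler object to resume_file."""
    # Write to a temporary file in the same directory first, then replace
    # the old checkpoint, so an interrupted write cannot corrupt it.
    directory = os.path.dirname(os.path.abspath(resume_file))
    with tempfile.NamedTemporaryFile("wb", dir=directory, delete=False) as tmp:
        pickle.dump(sampler, tmp)
    os.replace(tmp.name, resume_file)


def read_checkpoint(resume_file):
    """Load the sampler object back from resume_file."""
    with open(resume_file, "rb") as handle:
        return pickle.load(handle)
```

Because the RNG is part of the pickled object, a reloaded sampler continues the same random sequence as the one that was saved.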
I also took the liberty of adding two more plots.
- One is the run plot which dynesty produces, e.g.,
- The other is a little less pretty but shows the bound idx, number of likelihood calls and sampling scale as a function of the nested sampling iteration, e.g.,
The above plot shows the issue we had: the large spike followed by a higher steady-state nc is where the run was interrupted and reloaded from the resume file. (Note that I stopped this run before it completely converged.)
This is what that plot looks like with no interruption
This is what the plot looks like with the new checkpointing
changed milestone to %0.6.6
added Bug High priority Sampling labels
The failure of the test seems to be related to https://www.gitmemory.com/issue/uqfoundation/dill/329/515638620.
The test passed as
$ python test/sampler_test.py
but not as
$ pytest test/sampler_test.py
Edited by Colm Talbot
added 1 commit
- 99dddb52 - Check the sampler is picklable before saving to make test run.
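A check of this kind could look roughly like the sketch below. It uses the standard-library `pickle` as an assumption; the commit itself may use `dill` or a different test, and `is_picklable` is a hypothetical helper name.

```python
import pickle


def is_picklable(obj):
    """Return True if obj can be serialised with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        # These are the usual failure modes for unpicklable objects
        # (e.g. generators, locks, locally defined functions).
        return False
    return True
```

A caller would then skip the checkpoint write, with a warning, whenever `is_picklable(sampler)` is False, instead of crashing in the middle of a run.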
added 8 commits
- 99dddb52...2c7dd519 - 6 commits from branch master
- 8c8a77bf - Merge remote-tracking branch 'origin' into improve-dynesty-checkpointing
- 823cd9e6 - Fix docstring
- Resolved by Colm Talbot
@colm.talbot the lower plot looks great, is it checkpointing in there I guess?
I've started a PP test using this branch, I'll let you know how it fares.
- Resolved by Colm Talbot
When the jobs first kick off I get this message
14:04 bilby INFO : Reading resume file outdir_pp_test_high_mass_dynesty_distance-phase-time/result/pp_test_high_mass_dynesty_distance-phase-time_data5_0_analysis_H1L1_dynesty_resume.pickle
14:04 bilby WARNING : Failed to read resume file outdir_pp_test_high_mass_dynesty_distance-phase-time/result/pp_test_high_mass_dynesty_distance-phase-time_data5_0_analysis_H1L1_dynesty_resume.pickle
I think it just needs a check for whether the file exists, so that a missing file on first start isn't reported as a failure.
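Something along these lines would separate "no resume file yet" from "resume file is broken". This is a sketch only: `read_resume_file` and the logger name are hypothetical, and the real bilby code is not shown here.

```python
import logging
import os
import pickle

logger = logging.getLogger("resume_sketch")  # hypothetical logger name


def read_resume_file(resume_file):
    """Return the unpickled object, or None if there is nothing to resume from."""
    if not os.path.isfile(resume_file):
        # First start of a job: no checkpoint exists yet, which is not a failure.
        logger.info("Resume file %s does not exist.", resume_file)
        return None
    logger.info("Reading resume file %s", resume_file)
    try:
        with open(resume_file, "rb") as handle:
            return pickle.load(handle)
    except (pickle.UnpicklingError, EOFError, OSError) as error:
        # The file exists but cannot be read: this is worth a warning.
        logger.warning("Failed to read resume file %s: %s", resume_file, error)
        return None
```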
- Resolved by Colm Talbot
@colm.talbot I stopped the jobs that I had running and resubmitted the dag, it seemed to fail to read in:
- Resolved by Gregory Ashton
@colm.talbot I ran it locally and hit a KeyError; adding this line resolved it:
if "external_sampler" in state: del state['external_sampler']
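The same guard can be written with `dict.pop` and a default, which never raises if the key is absent. A small sketch, assuming `state` is a plain dict; `strip_keys` is a hypothetical helper, not something in bilby:

```python
def strip_keys(state, keys=("external_sampler",)):
    """Return a copy of state without the given keys; missing keys are ignored."""
    cleaned = dict(state)
    for key in keys:
        cleaned.pop(key, None)  # no KeyError when the key is not present
    return cleaned
```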
@colm.talbot do you also think it is worth wrapping the "iteration" plot in a try except?
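A possible shape for that wrapper, as a sketch: the plot is only a diagnostic, so a failure is logged and the run carries on. `try_plot` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("plot_sketch")  # hypothetical logger name


def try_plot(plot_function, *args, **kwargs):
    """Call plot_function; log and swallow any failure. Returns True on success."""
    try:
        plot_function(*args, **kwargs)
    except Exception as error:  # diagnostics only, so catch broadly
        logger.warning("Failed to create diagnostic plot: %s", error)
        return False
    return True
```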
@colm.talbot the PP tests are struggling. Looking at this log it seems the job reached the stage of computing SNRs and then was kicked by HTCondor. It started again fine, but didn't have a resume file to start up from (at least it didn't print anything to that effect).
It could be that bilby_pipe isn't transferring the resume file around properly, but this worked with the old version and I didn't think the resume file name had changed. I'll look into it, but I thought I'd post in case you had any insight.
- Resolved by Gregory Ashton
Okay resubmitted here. Sorry for spamming this MR.
I've created a MR for something I've been thinking about a while:
This basically checks every n_check_point iterations, and if the resume file is less than check_point_delta_t old it doesn't checkpoint. What do you think?
Okay, I'm pretty happy that this is working now. And, better yet, it works with proper HTCondor checkpointing!
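For reference, the time-gated checkpointing proposed above could be sketched as follows. This is my reading of the description (consider a checkpoint every `n_check_point` iterations, but skip it if the resume file was written less than `check_point_delta_t` seconds ago), not the code in the linked MR. `should_checkpoint` and the `now` argument are hypothetical.

```python
import os
import time


def should_checkpoint(iteration, resume_file, n_check_point,
                      check_point_delta_t, now=None):
    """Decide whether to write a checkpoint at this iteration."""
    # Only consider checkpointing every n_check_point iterations.
    if iteration % n_check_point != 0:
        return False
    # If no checkpoint exists yet, always write one.
    if not os.path.isfile(resume_file):
        return True
    # Otherwise write only if the existing file is old enough.
    now = time.time() if now is None else now
    age = now - os.path.getmtime(resume_file)
    return age >= check_point_delta_t
```

The cheap iteration test runs first, so the file system is only touched once every `n_check_point` iterations.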
Here is a rundir for GW170608. Here is the log message when it self evicts
20:08 bilby INFO : Written checkpoint file outdir_GW170608/result/GW170608_data0_1180922494-5_analysis_H1L1_dynesty_par1_resume.pickle
20:15 bilby INFO : Run interrupted by alarm signal 14: checkpoint and exit on 77
20:15 bilby INFO : Written checkpoint file outdir_GW170608/result/GW170608_data0_1180922494-5_analysis_H1L1_dynesty_par1_resume.pickle
20:15 bilby_pipe INFO : Running bilby_pipe version: 0.3.10: (CLEAN) 00bf110 2020-03-19 19:15:15 -0500
...
...
20:18 bilby INFO : Reading resume file outdir_GW170608/result/GW170608_data0_1180922494-5_analysis_H1L1_dynesty_par1_resume.pickle
20:18 bilby INFO : Resume file successfully loaded.
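The "checkpoint and exit on 77" behaviour in that log can be sketched with a signal handler like the one below. The exit code 77 and the alarm signal come from the log; everything else (the function names, which signals get the handler) is an assumption for illustration, not bilby's actual implementation.

```python
import signal
import sys


def make_exit_handler(write_checkpoint, exit_code=77):
    """Build a handler that writes a checkpoint and then exits with exit_code."""
    def handler(signum, frame):
        write_checkpoint()
        sys.exit(exit_code)
    return handler


def install_handlers(write_checkpoint, exit_code=77):
    """Install the handler for common interrupt signals (assumed set)."""
    handler = make_exit_handler(write_checkpoint, exit_code)
    for name in ("SIGTERM", "SIGINT", "SIGALRM"):
        # SIGALRM does not exist on every platform, so check first.
        if hasattr(signal, name):
            signal.signal(getattr(signal, name), handler)
    return handler
```

With something like `signal.alarm(seconds)` set at start-up, the job evicts itself after a fixed time, and the scheduler can be told to treat the chosen exit code as "checkpointed, please restart".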
I also have a 4s pp test running to check it running at scale. I'll update when things finish.
Note that for these to work, one needs the latest master of bilby_pipe which implements some changes to the checkpointing. Previous versions should still work, using the previous behaviour (HTCondor kicks the jobs after 4hrs and it gets resubmitted).
mentioned in commit parallel_bilby@1852a4de
mentioned in merge request parallel_bilby!49 (merged)
mentioned in commit parallel_bilby@c3415661
It is checkpointing regularly, being evicted and restarting several times, and the plots look good:
One thing to note here: at the end, it takes several attempts to "Reconstruct the posterior". This is because I set the "periodic-restart-time" to 2hrs, and it turns out the reconstruction can take up to 4hrs or so. I restarted the jobs with a longer restart time and everything was fine.
added 12 commits
- 846fd377...aebe24d5 - 11 commits from branch master
- 67142b36 - Merge branch 'master' into 'improve-dynesty-checkpointing'
- Resolved by Colm Talbot
@colm.talbot did you add changes to that effect?
added 1 commit
- 6a297cdb - Ignore flake rule 503 which clashes with black
added 1 commit
- a90e4e95 - Test whether bilby or dynesty versions have changed in resume file
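One way such a version test could work, as a sketch: store the package versions next to the sampler in the resume file and refuse to resume if they differ from the running versions. The payload layout and function names here are hypothetical; the commit may do this differently.

```python
import io
import pickle


def dump_with_versions(sampler, handle, versions):
    """Pickle the sampler together with the versions that produced it."""
    pickle.dump({"versions": dict(versions), "sampler": sampler}, handle)


def load_if_versions_match(handle, current_versions):
    """Return the stored sampler, or None if the recorded versions differ."""
    payload = pickle.load(handle)
    if payload.get("versions") != dict(current_versions):
        # A sampler pickled by different code may not be safe to resume.
        return None
    return payload["sampler"]
```

For example, `versions` could be a dict such as `{"bilby": bilby.__version__, "dynesty": dynesty.__version__}`, and a `None` return would mean starting the run from scratch.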
mentioned in commit parallel_bilby@0ed96c91
mentioned in commit parallel_bilby@db4ba5cc