Skip to content
Snippets Groups Projects

Pickle dump entire sampler in dynesty

Merged Colm Talbot requested to merge improve-dynesty-checkpointing into master

We've noticed some pretty horrendous issues with restarting after a checkpoint recently.

I think that this is due to not saving all of the relevant state information.

This MR ensures the whole sampler will be saved.

I also took the liberty of adding two more plots.

  • One is the run plot which dynesty produces, e.g., image
  • The other is a little less pretty but shows the bound idx, number of likelihood calls and sampling scale as a function of the nested sampling iteration, e.g.,

image

The above plot shows the issue we had, the large spike followed by a higher nc steady state is when the run was interrupted and reloaded from the resume file (note that I stopped this run before it completely converged).

This is what that plot looks like with no interruption

image

This is what the plot looks like with the new checkpointing

image

Edited by Colm Talbot

Merge request reports

Pipeline #114293 passed

Pipeline passed for 9d26b7a7 on improve-dynesty-checkpointing

Test coverage 60.00% (0.00%) from 1 job
Approved by

Merged by Gregory AshtonGregory Ashton 4 years ago (Mar 30, 2020 8:18am UTC)

Merge details

  • Changes merged into master with b6c63b55 (commits were squashed).
  • Deleted the source branch.

Pipeline #114697 passed with warnings

Pipeline passed with warnings for b6c63b55 on master

Test coverage 60.00% (0.00%) from 1 job

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Colm Talbot resolved all threads

    resolved all threads

  • Colm Talbot added 1 commit

    added 1 commit

    • c847c753 - Improve log messages for dynesty resume

    Compare with previous version

  • Gregory Ashton resolved all threads

    resolved all threads

  • Gregory Ashton resolved all threads

    resolved all threads

  • @colm.talbot I ran it locally and hit a KerError, adding this line resolved it

            if "external_sampler" in state:
                del state['external_sampler']
  • @colm.talbot do you also think it is worth wrapping the "iteration" plot in a try except?

  • Colm Talbot added 1 commit

    added 1 commit

    Compare with previous version

  • Author Maintainer

    I've made those changes. The iteration plot would probably have been fairly safe, but I guess there's no harm in being cautious.

  • Gregory Ashton approved this merge request

    approved this merge request

  • Gregory Ashton unapproved this merge request

    unapproved this merge request

  • @colm.talbot the PP tests are struggling. Looking at this log it seems the job reached the stage of computing SNRs and then was kicked by HTCondor. It started again fine, but didn't have a resume file to start up from (at least it didn't print anything to that effect).

    It could be that bilby_pipe isn't transferring the resume file around properly, but this worked with the old version and I didn't think the resume file name had changed. I'll look into it, but I thought I'd post incase you had any insight.

  • Okay I think that it was because at somepoint n_check_point got hard wired to 10000 in bilby_pipe_review. So it just wasnted check pointing in time *doh!

  • I've created a MR for something I've been thinking about a while:

    !748 (closed)

    This basically checks every n_check_point and if the file is less than check_point_delta_t it doesn't check point. What do you think?

  • Gregory Ashton added 3 commits

    added 3 commits

    • 97004c15 - Implement a time-since-last-write for the checkpoint criteria
    • a6608f38 - Add safe pickle/dill file dump
    • 406e17bc - Allow user-settable exit code

    Compare with previous version

  • Gregory Ashton added 1 commit

    added 1 commit

    Compare with previous version

  • Okay, I'm pretty happy that this is working now. And, better yet it works with proper HTCondor checkpointing!

    Here is a rundir for GW170608. Here is the log message when it self evicts

    20:08 bilby INFO    : Written checkpoint file outdir_GW170608/result/GW170608_data0_1180922494-5_analysis_H1L1_dynesty_par1_resume.pickle
    20:15 bilby INFO    : Run interrupted by alarm signal 14: checkpoint and exit on 77
    20:15 bilby INFO    : Written checkpoint file outdir_GW170608/result/GW170608_data0_1180922494-5_analysis_H1L1_dynesty_par1_resume.pickle
    20:15 bilby_pipe INFO    : Running bilby_pipe version: 0.3.10: (CLEAN) 00bf110 2020-03-19 19:15:15 -0500
    ...
    ...
    20:18 bilby INFO    : Reading resume file outdir_GW170608/result/GW170608_data0_1180922494-5_analysis_H1L1_dynesty_par1_resume.pickle
    20:18 bilby INFO    : Resume file successfully loaded.

    I also have a 4s pp test running to check it running at scale. I'll update when things finish.

    Note that for these to work, one needs the latest master of bilby_pipe which implements some changes to the checkpointing. Previous versions should still work, using the previous behaviour (HTCondor kicks the jobs after 4hrs and it gets resubmitted).

  • mentioned in merge request parallel_bilby!49 (merged)

  • The PP test is passing

    It is checkpointing regularly, being evicted and restarting several times, and the plots look good: image One thing to ntoe here: in the end, it takes several times to "Reconstruct the posterior", this is because I set the "periodic-restart-time' to 2hrs. It turns out the reconstruction can take up to 4hrs or so. I restarted the jobs with a longer restart time and everything was fine.

  • Colm Talbot added 12 commits

    added 12 commits

    Compare with previous version

  • Author Maintainer

    We should make sure that we neatly handle reading in an old resume file. I suspect that the current version in this branch will crash. We should just log a warning and restart the run.

  • Gregory Ashton added 2 commits

    added 2 commits

    • 0ced2b62 - Add warning for corrupr resume files
    • c2b1a591 - Merge branch 'improve-dynesty-checkpointing' of git.ligo.org:Monash/tupak into...

    Compare with previous version

  • Gregory Ashton approved this merge request

    approved this merge request

  • Colm Talbot added 1 commit

    added 1 commit

    • 6a297cdb - Ignore flake rule 503 which clashes with black

    Compare with previous version

  • Author Maintainer

    I added a commit to the wrong MR (the one changing the setup). I don't think it's harmful though.

  • Colm Talbot added 1 commit

    added 1 commit

    • a90e4e95 - Test whether bilby or dynesty versions have changed in resume file

    Compare with previous version

  • Colm Talbot added 1 commit

    added 1 commit

    Compare with previous version

  • Colm Talbot added 1 commit

    added 1 commit

    Compare with previous version

  • Colm Talbot added 1 commit

    added 1 commit

    Compare with previous version

  • Gregory Ashton approved this merge request

    approved this merge request

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Please register or sign in to reply
    Loading