
Restarting on exits and evictions

Sophie Hourihane requested to merge sophie.hourihane/bayeswave:CBC_master into CBC

NOTE: All of these updates relate to running under condor.

When BayesWave (BW) is restarted by submitting a DAG on condor, it no longer starts from the very beginning. Instead, it checks whether there are checkpoints it should resume from. To do this, it looks at two directories: the one available on the remote node (empty if BW exited, non-empty if BW was evicted) and the one available locally (on CIT).
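The exit-versus-eviction distinction above can be sketched as follows. This is only an illustration of the rule as described (empty remote directory means exit, non-empty means eviction); the function name `classify_restart` is hypothetical and is not taken from the scripts in this merge request.

```python
from pathlib import Path


def classify_restart(remote_dir):
    """Guess how the previous run ended from the remote directory's contents.

    Hypothetical helper: an empty (or missing) remote directory is treated
    as a plain exit, a non-empty one as an eviction.
    """
    remote = Path(remote_dir)
    # any() stops at the first entry, so large directories are not fully listed
    has_files = remote.is_dir() and any(remote.iterdir())
    return "evicted" if has_files else "exited"
```

A missing remote directory is treated the same as an empty one here, which is an assumption of this sketch rather than documented behaviour.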

cp_files.py looks inside the checkpoint files to decide which of the two directories is more advanced, and BW then runs on that one. Afterwards, delete_corruption.py deletes any excess lines in the chain files that may have been left behind when BW exited with a signal error.
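A minimal sketch of the two steps above, under stated assumptions. It is not the actual code of cp_files.py or delete_corruption.py: the file name `state.dat`, its format (an iteration count as the first whitespace-separated token), and all function names are hypothetical, and the sketch assumes the number of valid chain lines is known from the checkpoint.

```python
from pathlib import Path


def read_iteration(checkpoint_dir, state_name="state.dat"):
    """Return the iteration stored in a checkpoint, or -1 if it is unusable.

    Assumption: the (hypothetical) state file starts with an integer.
    """
    state = Path(checkpoint_dir) / state_name
    try:
        return int(state.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return -1


def pick_more_advanced(local_dir, remote_dir):
    """Return whichever directory holds the later checkpoint (ties go to local)."""
    if read_iteration(remote_dir) > read_iteration(local_dir):
        return remote_dir
    return local_dir


def trim_chain(chain_path, n_keep):
    """Keep at most the first n_keep complete lines of a chain file.

    Lines beyond the checkpoint, and a partially written line with no
    trailing newline, are dropped. Returns the number of lines removed.
    """
    path = Path(chain_path)
    lines = path.read_text().splitlines(keepends=True)
    complete = [ln for ln in lines[:n_keep] if ln.endswith("\n")]
    path.write_text("".join(complete))
    return len(lines) - len(complete)
```

Trimming the chain back to the checkpointed length keeps the chain files consistent with the state the sampler resumes from, which is the purpose the description gives for removing the excess lines.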

This changes the default exit strategy, which was simply to start from the beginning, and should save a substantial amount of computation time.

This also allows the user to resubmit DAG files that may have failed because of cluster errors rather than BayesWave errors.
