Restarting on exits and evictions
NOTE: All of these updates apply only to runs submitted through Condor.
When BayesWave (BW) is restarted by resubmitting a DAG on Condor, it no longer starts from the very beginning. Instead, it checks whether there are checkpoints it should resume from. To do this, it looks at the directory available on the remote node (empty if BW exited, non-empty if BW was evicted) and the directory available locally (on CIT).
By looking inside the checkpoint files, cp_files.py decides which directory is more advanced. BW then resumes from the more advanced one.
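The comparison can be sketched roughly as follows. This is not the code in cp_files.py, only an illustration of the idea. It assumes, hypothetically, that each checkpoint directory holds a file named state.dat whose first token is the last completed iteration. The file name, its layout, and the function names are all assumptions made for this example.

```python
from pathlib import Path


def checkpoint_iteration(cp_dir, state_name="state.dat"):
    """Return the iteration recorded in cp_dir, or -1 if there is none.

    state.dat and its layout are hypothetical; the real checkpoint
    files may store progress differently.
    """
    state_file = Path(cp_dir) / state_name
    if not state_file.is_file():
        return -1  # empty or missing directory: no progress recorded
    try:
        return int(state_file.read_text().split()[0])
    except (IndexError, ValueError):
        return -1  # unreadable checkpoint counts as no progress


def more_advanced(remote_dir, local_dir):
    """Pick the directory whose checkpoint is further along.

    Ties go to the local directory.
    """
    remote_iter = checkpoint_iteration(remote_dir)
    local_iter = checkpoint_iteration(local_dir)
    return remote_dir if remote_iter > local_iter else local_dir
```

With this sketch, an empty remote directory (the case where BW exited rather than being evicted) always loses to a local directory that contains a readable checkpoint.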
Then delete_corruption.py deletes any excess lines in the chain files that might be left over after BW exited with a signal error.
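The cleanup step can be illustrated with a short sketch. This is not the code in delete_corruption.py. It assumes, hypothetically, that a chain file is plain text with one sample per line and the iteration number in the first column, and that any line past the checkpointed iteration (or any malformed line) is excess. The function name and the file layout are assumptions made for this example.

```python
from pathlib import Path


def trim_chain(chain_path, last_iteration):
    """Drop lines written after the last checkpointed iteration.

    Assumes (hypothetically) that the first whitespace-separated field
    on each line is an integer iteration number. Returns the number of
    lines kept.
    """
    path = Path(chain_path)
    kept = []
    for line in path.read_text().splitlines():
        fields = line.split()
        try:
            iteration = int(fields[0])
        except (IndexError, ValueError):
            break  # blank or malformed line: treat the rest as corrupt
        if iteration > last_iteration:
            break  # written after the checkpoint, so it is excess
        kept.append(line)
    # Rewrite the file with only the lines the checkpoint accounts for.
    path.write_text("".join(kept_line + "\n" for kept_line in kept))
    return len(kept)
```

Truncating at the first bad line, rather than skipping it, keeps the chain file consistent with the checkpoint so the resumed run appends from a known state.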
This changes the default exit strategy (which was simply to start from the beginning) and should save a substantial amount of computation time.
It also allows the user to resubmit DAG files that failed because of cluster errors rather than BayesWave errors.