Adding files delete_corruption.py and cp_files.py
This updates how condor runs and restarts bayeswave. To enable the new behavior, add --smart-restart to the bayeswave_pipe call.
Silly example:
bayeswave_pipe --trigger-time 1000 --smart-restart config.ini
Implementation Details
setupdirs.py, in the condor pre-command, now runs cp_files.py, which copies the trigdir directory on the submit node (usually CIT) and compares it to the trigdir on the remote node. Whichever directory is further along (judged from the checkpoint files, primarily temperature.dat and state.dat) is kept, and the less-progressed directory is deleted. Note that this only works on machines with shared file systems; it will fail on the OSG.
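As a rough illustration of the "keep whichever directory is further along" step, here is a minimal sketch. It is not the code in cp_files.py: the function names are hypothetical, and the progress measure (newest modification time of any temperature.dat or state.dat under the directory) is an assumption standing in for whatever the real script reads from the checkpoint files.

```python
import shutil
from pathlib import Path

# Checkpoint file names taken from the description above.
CHECKPOINT_NAMES = ("temperature.dat", "state.dat")


def progress(trigdir):
    """Stand-in progress measure (assumption): newest mtime of any checkpoint file."""
    trigdir = Path(trigdir)
    if not trigdir.is_dir():
        return float("-inf")
    times = [
        p.stat().st_mtime
        for name in CHECKPOINT_NAMES
        for p in trigdir.rglob(name)
    ]
    return max(times, default=float("-inf"))


def keep_further_along(submit_copy, remote_copy):
    """Delete the less-progressed directory and return the one that was kept."""
    submit_copy, remote_copy = Path(submit_copy), Path(remote_copy)
    # The remote copy wins only if it is strictly further along.
    if progress(remote_copy) > progress(submit_copy):
        keep, drop = remote_copy, submit_copy
    else:
        keep, drop = submit_copy, remote_copy
    if drop.exists():
        shutil.rmtree(drop)
    return keep
```

On a tie the sketch keeps the submit-node copy, which is a design choice of this example only. Both paths must be visible from the machine running the script, which is why a shared file system is required.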
After cp_files.py has run, delete_corruption.py checks how long each trigdir/chains/MODEL_*.dat file should be and crops it to its correct length (i.e. the length the checkpoint says it should have). This is necessary because, when runs crash with signal errors, the printed lines are often corrupted.
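The cropping step can be sketched as follows. This is an illustration, not the code in delete_corruption.py: crop_to_length is a hypothetical helper, and the expected line count is passed in by the caller because how the checkpoint records it is not described here.

```python
from pathlib import Path


def crop_to_length(path, n_lines):
    """Keep only the first n_lines lines of a file; return the resulting line count."""
    path = Path(path)
    # Read as bytes: corrupted trailing output may not be valid text.
    with path.open("rb") as f:
        lines = f.readlines()
    if len(lines) > n_lines:
        with path.open("wb") as f:
            f.writelines(lines[:n_lines])
    return min(len(lines), n_lines)
```

A file that is already at or below the expected length is left untouched, so the helper can be called on every chain file without first checking whether it was damaged.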
These changes mean that on failed runs, bayeswave dag files can be resubmitted without starting from the very beginning.