Periodic hold for long-running jobs
Add the following HTCondor submit commands to the workflow generation:
periodic_hold = (JobStatus == 2) && (time() - EnteredCurrentStatus > 81000)
periodic_hold_subcode = 12345
periodic_release = (JobStatus == 5) && (time() - EnteredCurrentStatus > 5*60) && (HoldReasonSubCode =?= 12345)
want_graceful_removal = true
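This puts each job on hold once it has been running for 22.5 hours (81000 seconds) and releases it 5 minutes later, so that checkpoint files are transferred even if the pilot dies after 24 hours. The release condition checks the hold subcode so that only jobs held by this rule are released, not jobs held for other reasons.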
Make sure the user can specify how long until jobs get held.
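One possible way to make the hold interval configurable is sketched below in Python. The function name periodic_hold_lines and its parameters (hold_after_s, release_after_s, subcode) are hypothetical and not part of any existing workflow generator; the defaults simply mirror the values above. The sketch matches on HoldReasonSubCode in the release expression, on the assumption that this is the attribute set when the hold takes effect.

```python
def periodic_hold_lines(hold_after_s=81000, release_after_s=5 * 60, subcode=12345):
    """Return submit-description lines for a periodic hold/release cycle.

    All names here are illustrative; defaults mirror the values in this ticket.
    """
    if hold_after_s <= 0 or release_after_s <= 0:
        raise ValueError("intervals must be positive numbers of seconds")
    return [
        # Hold a running job (JobStatus == 2) after hold_after_s seconds.
        f"periodic_hold = (JobStatus == 2) && (time() - EnteredCurrentStatus > {hold_after_s})",
        f"periodic_hold_subcode = {subcode}",
        # Release a held job (JobStatus == 5) only if this rule held it.
        "periodic_release = (JobStatus == 5) && "
        f"(time() - EnteredCurrentStatus > {release_after_s}) && "
        f"(HoldReasonSubCode =?= {subcode})",
        "want_graceful_removal = true",
    ]


if __name__ == "__main__":
    # Example: a user-supplied interval given in hours, converted to seconds.
    print("\n".join(periodic_hold_lines(hold_after_s=int(22.5 * 3600))))
```

The user-facing option could be expressed in hours and converted to seconds before the call, as in the example at the bottom of the block.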