Stopping runs prematurely
Hi folks,
I noticed that a run only ends in one of two ways: it finishes sampling, or it gets killed with a TIMEOUT error once the requested wall time has elapsed. I imagine the second case comes up for very long runs, which would require multiple job submissions?
I originally discovered this when profiling the code because I was looking for a parameter to prematurely stop a run.
Beyond my own use case, relying on the timeout also seems like bad practice: the jobs don't exit cleanly, and they lose whatever work was done since the last checkpoint, which can be up to a full checkpoint interval. That is a lot of work when the interval is long -- on the other hand, setting a short interval seems to cause unnecessary slowdown.
Also, if the code happens to be writing a checkpoint when it gets killed, that checkpoint will be corrupt. Backup checkpoints would solve this, but I'm guessing those are meant to protect against hardware failures. Relying on them here seems like a roundabout way of solving the problem (and may not be bulletproof).
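For the corrupt-checkpoint case specifically, one common pattern (I don't know whether the code already does something like this) is to write the checkpoint to a temporary file and then rename it over the old one, so that a kill mid-write leaves the previous checkpoint untouched. A minimal sketch in Python, where `write_checkpoint_atomically` is a name I made up and `pickle` is only a stand-in for the real checkpoint format:

```python
import os
import pickle
import tempfile


def write_checkpoint_atomically(state, path):
    """Write `state` to `path` without ever leaving `path` half-written.

    Hypothetical helper for illustration; not part of any existing code.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # If the process dies before this line, the old checkpoint is intact;
        # after it, the new checkpoint is complete.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file on any failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This only protects against a corrupt file. It does not recover the work done since the previous checkpoint.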
My question is: what is the reason for not including an option to stop the run early and write a checkpoint, rather than relying on the job killer?
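To make the request concrete, here is a rough sketch of the kind of option I have in mind, written in Python purely for illustration. All names (`StopRequested`, `run`, `do_step`, `save_checkpoint`, `max_seconds`) are hypothetical and not taken from the code. It assumes the main loop can check a flag between steps, and that the scheduler sends a catchable signal such as SIGTERM before it kills the job.

```python
import signal
import time


class StopRequested:
    """True once a time budget is used up or a stop signal has arrived."""

    def __init__(self, max_seconds=None):
        # max_seconds is a hypothetical user option: the run's wall-time budget.
        self.deadline = None if max_seconds is None else time.monotonic() + max_seconds
        self.signalled = False

    def install(self, signum=signal.SIGTERM):
        # Must be called from the main thread. The handler only sets a flag,
        # so the step in progress is allowed to finish.
        signal.signal(signum, self._handler)

    def _handler(self, signum, frame):
        self.signalled = True

    def __call__(self):
        timed_out = self.deadline is not None and time.monotonic() >= self.deadline
        return self.signalled or timed_out


def run(n_steps, do_step, save_checkpoint, stop):
    """Run up to n_steps; on a stop request, checkpoint and return early."""
    for step in range(n_steps):
        do_step(step)
        if stop():
            save_checkpoint(step + 1)  # no work is lost: state is saved right here
            return step + 1
    save_checkpoint(n_steps)
    return n_steps
```

With something like this, I would set the budget a little below the wall time requested from the scheduler, so the run always exits on its own with a fresh checkpoint and the next submission can resume from there.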