Stopping runs prematurely

Hi folks,

I noticed that a run ends in one of two ways: either it finishes sampling, or the job is killed with a TIMEOUT error once the requested wall time has elapsed. I imagine the second case comes up for very long runs that need multiple job submissions?

I originally discovered this when profiling the code because I was looking for a parameter to prematurely stop a run.

However, relying on the timeout seems like bad practice: the job doesn't exit cleanly, and all the work done since the last checkpoint is wasted. That can be a lot of work if the checkpoint interval is long; on the other hand, a short interval seems to cause unnecessary slowdown.

Also, if the code happens to be writing a checkpoint when it gets killed, the checkpoint will be corrupt. Backup checkpoints would solve this, but I'm guessing they are there to protect against hardware failures. Using them to cover for the job killer seems like a roundabout way of solving the problem (and may not be bulletproof).
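
For what it's worth, one common way to avoid a half-written checkpoint is to write to a temporary file and then rename it over the old one. Below is a minimal Python sketch of that idea. I don't know how the code actually writes its checkpoints, so the function name, the use of pickle, and the file layout are all my own assumptions for illustration.

```python
import os
import pickle
import tempfile


def write_checkpoint_atomically(state, path):
    """Hypothetical helper: replace `path` with a pickle of `state` in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target so that
    # the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # os.replace swaps the new file in as a single operation on POSIX
        # filesystems, so a kill leaves either the old or the new checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temporary file; the old checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the job is killed during the write, only the temporary file is affected, and the previous checkpoint stays readable.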

My question is: what is the reason for not including an option to stop the run early and write a checkpoint, rather than relying on the job killer?
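
To make the request concrete, here is a rough Python sketch of the kind of option I have in mind: the sampling loop checks a stop condition between steps (a wall-time budget, or a termination signal if the scheduler sends one before killing the job) and writes a checkpoint before exiting. None of these names come from the actual code; `StopRequest`, `run_sampling`, `do_step` and `write_checkpoint` are placeholders I made up.

```python
import signal
import time


class StopRequest:
    """Hypothetical stop condition: a wall-time budget and/or a signal."""

    def __init__(self, max_walltime_s=None):
        self._deadline = (
            None if max_walltime_s is None
            else time.monotonic() + max_walltime_s
        )
        self._signalled = False

    def install_signal_handler(self, signum=signal.SIGTERM):
        # Python only allows this from the main thread.
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only set a flag here; the checkpoint is written by the main loop.
        self._signalled = True

    def requested(self):
        if self._signalled:
            return True
        return self._deadline is not None and time.monotonic() >= self._deadline


def run_sampling(n_steps, do_step, write_checkpoint, stop):
    """Run up to n_steps; checkpoint and return early if a stop is requested.

    Returns the number of completed steps.
    """
    for i in range(n_steps):
        if stop.requested():
            write_checkpoint(i)  # i steps are done, so nothing is lost
            return i
        do_step(i)
    write_checkpoint(n_steps)
    return n_steps
```

With something like this, the budget could be set a little below the requested job time so that the run stops itself cleanly between steps, and the next submission resumes from a checkpoint that reflects all completed work.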