rotating checkpoints
As most people will have encountered at some point, checkpoint files can get corrupted if the run crashes while they are being written, potentially losing all progress up to that point. A standard solution to this problem is to keep two rotating checkpoints:
- at checkpoint time, `cp` or `mv` the existing file to `.pickle.bk` or `pickle.0` or something like that.
- write the new checkpoint
However, this will increase the disk footprint of the runs significantly, which may itself be a problem on some clusters depending on their allocation policy, so it would be best to have a user option to turn this feature on/off.
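
To make the proposal concrete, here is a minimal sketch of what this could look like, not a patch against the actual code. The function names `save_checkpoint` and `load_checkpoint`, the `keep_backup` option, and the `.bk` suffix are all placeholders I made up for illustration; the real checkpoint code may be organized quite differently.

```python
import os
import pickle


def save_checkpoint(state, path, keep_backup=True):
    """Write `state` to `path`; optionally keep the previous file as `path + '.bk'`.

    `keep_backup` is the hypothetical user option: True doubles the disk
    footprint, False gives the plain single-file behaviour.
    """
    if keep_backup and os.path.exists(path):
        # Rename the previous checkpoint out of the way before writing.
        # os.replace overwrites an older backup if one is already there.
        os.replace(path, path + ".bk")
    with open(path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk


def load_checkpoint(path):
    """Load the newest readable checkpoint, falling back to the backup."""
    for candidate in (path, path + ".bk"):
        try:
            with open(candidate, "rb") as f:
                return pickle.load(f)
        except (OSError, EOFError, pickle.UnpicklingError):
            # Missing, empty, or truncated file: try the next candidate.
            continue
    raise FileNotFoundError(f"no readable checkpoint at {path} or {path}.bk")
```

With `keep_backup=True`, a crash during the write leaves the previous checkpoint intact under the `.bk` name, so at most one checkpoint interval is lost. This sketch uses a rename (`mv`) rather than a copy, which avoids a second full write of the old file. A corrupted pickle can in principle raise other exceptions than the ones caught here, so the fallback logic would need checking against real failure cases.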