Checkpointing too often?
I have a pbilby job running with the following settings:
nact=20
nlive=2000
walks=100
nodes=28
ntasks_per_node=10
The table below shows events from the logs, with some extra timestamps:
Event | Time |
---|---|
Start PE | 8:40:38PM |
Stop Sampling | 8:40:38PM |
"Written checkpoint file" | 8:41:28PM |
"Written checkpoint file" | 8:58:20PM |
"Written checkpoint file" | 9:14:34PM |
"Written checkpoint file" | 9:27:13PM |
Each time sampling stops to write a checkpoint, about 2-3 minutes are used up. Checkpoints are currently being written roughly every 15 minutes. Would it be better to checkpoint every 30 or even 60 minutes instead?
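For a rough sense of what is at stake, here is a small sketch that takes the checkpoint timestamps from the table and estimates the fraction of wall time spent writing checkpoints at different intervals. It assumes the write takes about 2.5 minutes (the midpoint of what I observed) no matter how much sampling happened in between, which I have not verified. `WRITE_MIN` and `overhead_fraction` are names made up for this illustration, not pbilby settings.

```python
from datetime import datetime

# "Written checkpoint file" times copied from the log table.
checkpoint_times = ["8:41:28PM", "8:58:20PM", "9:14:34PM", "9:27:13PM"]
parsed = [datetime.strptime(t, "%I:%M:%S%p") for t in checkpoint_times]

# Minutes between consecutive checkpoint messages.
gaps_min = [(b - a).total_seconds() / 60 for a, b in zip(parsed, parsed[1:])]
mean_gap_min = sum(gaps_min) / len(gaps_min)

# Assumption: midpoint of the ~2-3 minutes observed per checkpoint write.
WRITE_MIN = 2.5


def overhead_fraction(interval_min, write_min=WRITE_MIN):
    """Fraction of one checkpoint-to-checkpoint cycle spent writing."""
    return write_min / interval_min


print(f"mean gap between checkpoints: {mean_gap_min:.2f} min")
for interval in (mean_gap_min, 30, 60):
    print(f"interval {interval:5.1f} min -> overhead {overhead_fraction(interval):.1%}")
```

On these numbers the checkpoints are about 15 minutes apart on average, and writing them costs roughly 16% of wall time, compared with about 8% at 30 minutes and 4% at 60 minutes. The trade-off is that a longer interval means more sampling is lost if the job is killed between checkpoints.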