Checkpointing too often?

I have a pbilby job running with

nact=20
nlive=2000
walks=100
nodes=28
ntasks_per_node=10

The following image shows the logs with some extra timestamps (image: default_checkpointing).

| Event | Time |
| --- | --- |
| Start PE | 8:40:38PM |
| Stop Sampling | 8:40:38PM |
| "Written checkpoint file" | 8:41:28PM |
| "Written checkpoint file" | 8:58:20PM |
| "Written checkpoint file" | 9:14:34PM |
| "Written checkpoint file" | 9:27:13PM |

Each time sampling stops to write a checkpoint, roughly 2-3 minutes are used up. Instead of checkpointing every 15 min, would it be better to checkpoint every 30 or even 60 min?
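
To put rough numbers on the question, here is a minimal sketch that estimates the fraction of wall time spent writing checkpoints. It uses only the "Written checkpoint file" timestamps listed above and assumes (this is not measured) that a write costs the same 2-3 minutes whether checkpoints are 15, 30, or 60 minutes apart. The function names are illustrative and are not part of pbilby.

```python
from datetime import datetime

# Timestamps of the "Written checkpoint file" log lines quoted above.
checkpoint_times = ["8:41:28PM", "8:58:20PM", "9:14:34PM", "9:27:13PM"]


def parse(ts):
    """Parse a 12-hour clock time such as '8:41:28PM' (date is ignored)."""
    return datetime.strptime(ts, "%I:%M:%S%p")


def intervals_minutes(times):
    """Minutes between consecutive checkpoint writes (same-day times assumed)."""
    parsed = [parse(t) for t in times]
    return [(b - a).total_seconds() / 60 for a, b in zip(parsed, parsed[1:])]


def overhead_fraction(write_minutes, interval_minutes):
    """Fraction of wall time spent writing, with one write per interval.

    Assumption: the interval is measured from one finished write to the
    next, so it already includes the time spent on the write itself.
    """
    return write_minutes / interval_minutes


gaps = intervals_minutes(checkpoint_times)
mean_gap = sum(gaps) / len(gaps)
print(f"observed gaps (min): {[round(g, 2) for g in gaps]}")
print(f"mean gap (min): {mean_gap:.2f}")

# Compare the observed cadence with the proposed 30 and 60 minute intervals.
for interval in (mean_gap, 30, 60):
    low = overhead_fraction(2, interval)
    high = overhead_fraction(3, interval)
    print(f"interval {interval:5.1f} min -> overhead {low:.0%} to {high:.0%}")
```

Under these assumptions the overhead scales inversely with the interval, so doubling the interval roughly halves the time lost to writes. The trade-off is that a longer interval means more sampling is redone if the job is killed between checkpoints.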