Checkpointing BROKEN due to gsl_rng_fwrite
Resuming a checkpointed job (e.g. after you bump the memory requirement because of the suspected memory leak ;) ) fails with a segfault.
The culprit seems to be the temperature.dat state file in the checkpoint directory. Manual inspection shows a mix of human-readable and binary data. The problem lies here:
https://git.ligo.org/lscsoft/bayeswave/blob/master/src/BayesWaveMCMC.c#L732
Specifically, the GSL documentation for gsl_rng_fwrite says:
"This function writes the random number state of the random number generator r to the stream stream in binary format"
So at line 732, BayesWaveMCMC is writing binary data to an ASCII file. Hence the hot mess of binary and ASCII in temperature.dat of jobs that got held.
Solutions:
- Figure out how to make GSL write (and read!) the state as an integer OR:
- Write the binary state of the RNG to a different file, taking care to make sure the corresponding read routines are updated, too.
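A rough sketch of the second option, to show the shape of the fix rather than the actual BayesWave code: the text values go to one file and the raw RNG bytes go to a separate file opened in binary mode, with a matching read routine. Everything named here is hypothetical (toy_rng_state, save_checkpoint, load_checkpoint and the path arguments); toy_rng_state just stands in for the opaque GSL generator so the example builds without GSL. In the real code the fwrite/fread pair would presumably become gsl_rng_fwrite(stream, r) and gsl_rng_fread(stream, r).

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the opaque GSL generator state. */
typedef struct {
    uint64_t words[4];
} toy_rng_state;

/* Write n temperatures as text to text_path and the RNG state as raw
 * bytes to a separate bin_path. Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *text_path, const char *bin_path,
                    const double *temperature, int n,
                    const toy_rng_state *rng)
{
    FILE *ft = fopen(text_path, "w");
    if (!ft) return -1;
    for (int i = 0; i < n; i++) {
        /* %.17g is enough digits to round-trip a double exactly */
        fprintf(ft, "%.17g\n", temperature[i]);
    }
    if (fclose(ft) != 0) return -1;

    /* Binary state lives in its own file, opened in binary mode. */
    FILE *fb = fopen(bin_path, "wb");
    if (!fb) return -1;
    size_t nw = fwrite(rng, sizeof *rng, 1, fb);
    if (fclose(fb) != 0 || nw != 1) return -1;
    return 0;
}

/* Mirror of save_checkpoint: text values from text_path, RNG bytes
 * from bin_path. Returns 0 on success, -1 on failure. */
int load_checkpoint(const char *text_path, const char *bin_path,
                    double *temperature, int n, toy_rng_state *rng)
{
    FILE *ft = fopen(text_path, "r");
    if (!ft) return -1;
    for (int i = 0; i < n; i++) {
        if (fscanf(ft, "%lg", &temperature[i]) != 1) {
            fclose(ft);
            return -1; /* refuse to resume from a malformed text file */
        }
    }
    fclose(ft);

    FILE *fb = fopen(bin_path, "rb");
    if (!fb) return -1;
    size_t nr = fread(rng, sizeof *rng, 1, fb);
    fclose(fb);
    return nr == 1 ? 0 : -1;
}
```

The point of returning an error from the read side is that a bad or stale checkpoint should make the resume fail cleanly instead of segfaulting.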
Assigning @tyson-littenberg formally but @james-clark and @katerina.chatziioannou will try to take a look asap.
Edited by James Alexander Clark PhD