A user sent this message:
I am running a full parameter estimation of GW170817, akin to the GW150914 example (Using slurm scheduler, dynesty, nlive=1000, nact=5, 124cpus, 1 node, mem=1000GB/node).
At the very end of the H1L1V1.sh job, while computing SNRs, about 7-15 thousand iterations in (out of a total of 31000), I simply get a "job killed" message. I have attached the .err and .out files from 2 different re-runs (using the checkpointed analysis). Since the analysis is checkpointed right before the step that computes per-detector log likelihoods, I re-ran this with multiple memory and CPU configurations, since I thought it was a memory issue.
Having still got the same error at 124cpus & 1000GB per node, with 30cpus & 1000GB/node, and even with 10cpus & 1000GB, I think it might not be memory related after all. In contrast, the analysis of individual detector data, when run with 60GB of memory, ran into trouble (although it clearly stated an oom error), and after raising to 256GB all the individual analyses ended well. The only information I could gather is that, for example, running sacct after the job is dead reports exit code 9:0. I have also attached my config file for both these runs. When modifying mem or cpus, I just modified the .sh submission files and set the npool arg in sample_kwargs in the config.ini file; I added all of these for convenience.
I have also run just the analysis job, without any generation step or plotting to eliminate any potential dependency issue, and I still run into the same problem. In both .out files, in the last step where it is computing SNRs, progress is almost instantaneous (thousands of iterations per second) up to around 8k iterations, then it gets really stuck and slows down dramatically until it eventually dies. In less robust runs on my own machine with smaller data sets, these final steps were usually incredibly quick compared to the main analysis. I wonder what the cause of this behaviour is. Playing around with the available memory throughout different reruns did lead to a somewhat different number of iterations until it dies, as can be seen in the .out files above, but I couldn't establish a clear correlation between more memory and more iterations completed. Either way, isn't 1000GB excessive for a 17 parameter nested sampler with nlive=1000?
Do you know what could be causing this, or have you seen this before? Any help would be greatly appreciated.
Digging into this, there seem to be known memory issues when using Pool.map with a long input.
We had already essentially worked around this problem using the caching method for parameter reconstruction and per-detector likelihoods.
As that already had a lot of duplicated code, I refactored what we had to reduce the duplication.
This doesn't completely remove the excess memory usage, but it makes it more manageable.
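
To illustrate the general idea, here is a minimal sketch of a chunked, cached map. It is not the refactored code itself: map_in_chunks, square, run_with_pool, chunk_size and cache_file are hypothetical names made up for this example. The assumption is that handing Pool.map a bounded slice at a time, and writing the accumulated results to disk after each slice, keeps the peak memory lower and lets a killed job resume instead of starting over.

```python
import pickle
from multiprocessing import Pool
from pathlib import Path


def map_in_chunks(func, items, map_fn=map, chunk_size=1000, cache_file=None):
    """Apply func to items one chunk at a time, optionally caching results.

    map_fn is any map-like callable, e.g. the built-in map or pool.map.
    All names here are illustrative, not part of any library API.
    """
    results = []
    cache = Path(cache_file) if cache_file is not None else None

    # Reload results from an earlier, interrupted run if a cache exists.
    if cache is not None and cache.exists():
        with cache.open("rb") as f:
            results = pickle.load(f)

    # Resume at the first item without a cached result.
    for start in range(len(results), len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        # Only this slice is submitted to the workers at once.
        results.extend(map_fn(func, chunk))
        if cache is not None:
            with cache.open("wb") as f:
                pickle.dump(results, f)
    return results


def square(x):
    # Stand-in for the real per-sample work (e.g. computing an SNR).
    return x * x


def run_with_pool(items, npool=4):
    # Same chunked loop, but each slice is evaluated by a worker pool.
    with Pool(npool) as pool:
        return map_in_chunks(square, items, map_fn=pool.map, chunk_size=1000)
```

Calling run_with_pool(list(range(31000))) would then send 1000 items to the pool at a time rather than all 31000 at once. The chunk size is a trade-off: smaller chunks bound memory more tightly but add scheduling and caching overhead.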